Stanza is a collection of accurate and efficient tools for the linguistic analysis of many human languages. From raw text through syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to the languages of your choosing.
Stanza is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string of human language text into lists of sentences and words, to generate the base forms of those words along with their parts of speech and morphological features, to produce a syntactic dependency parse, and to recognize named entities. The toolkit is designed to work the same way across more than 60 languages, using the Universal Dependencies formalism.
Stanza is built with highly accurate neural network components that also enable efficient training and evaluation with your own annotated data. The modules are built on top of the PyTorch library. You will get much faster performance if you run this system on a GPU-enabled machine.
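As a minimal sketch of how the GPU note above translates into code once Stanza is installed: `stanza.Pipeline` accepts a `use_gpu` flag, which by default lets Stanza use a GPU when one is available. The helper names `pipeline_options` and `build_pipeline` are hypothetical and exist only for this illustration.

```python
def pipeline_options(lang="en", force_cpu=False):
    """Collect keyword arguments for stanza.Pipeline (hypothetical helper)."""
    return {
        "lang": lang,
        # use_gpu=True lets Stanza use a GPU when one is available;
        # use_gpu=False keeps all processing on the CPU.
        "use_gpu": not force_cpu,
    }


def build_pipeline(lang="en", force_cpu=False):
    """Create a pipeline; assumes stanza is installed and models are downloaded."""
    import stanza  # imported lazily so pipeline_options has no dependencies

    return stanza.Pipeline(**pipeline_options(lang, force_cpu))
```

Calling `build_pipeline(force_cpu=True)` is one way to keep Stanza off the GPU on a shared machine; leaving the default lets it pick up a GPU if PyTorch can see one.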
In addition, Stanza includes a Python interface to the CoreNLP Java package and inherits additional functionality from there, such as constituency parsing, coreference resolution, and linguistic pattern matching.
To summarize, Stanza features:
- Native Python implementation requiring minimal effort to set up;
- Full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, dependency parsing, and named entity recognition;
- Pretrained neural models supporting 66 (human) languages;
- A stable, officially maintained Python interface to CoreNLP.
[Figure: overview of Stanza’s neural network NLP pipeline]
We strongly recommend installing Stanza with pip, which is as simple as:
pip install stanza
To see Stanza’s neural pipeline in action, you can launch the Python interactive interpreter, and try the following commands:
>>> import stanza
>>> stanza.download('en')  # download English model
>>> nlp = stanza.Pipeline('en')  # initialize English neural pipeline
>>> doc = nlp("Barack Obama was born in Hawaii.")  # run annotation over a sentence
You should be able to see all the annotations in the example by running the following commands:
>>> print(doc)
>>> print(doc.entities)
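Beyond printing the whole document, the individual annotations can be read field by field. The following is a minimal sketch, not an official recipe: the helpers `word_rows`, `entity_rows`, and `demo` are hypothetical names, while the attributes they read (`doc.sentences`, `sentence.words`, `word.text`, `word.lemma`, `word.upos`, `word.head`, `word.deprel`, `doc.entities`, `ent.text`, `ent.type`) are part of Stanza's document objects.

```python
def word_rows(doc):
    """Return (text, lemma, upos, head_text, deprel) for every word in doc."""
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            # word.head is a 1-based index into sentence.words; 0 marks the root.
            head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
            rows.append((word.text, word.lemma, word.upos, head, word.deprel))
    return rows


def entity_rows(doc):
    """Return (text, type) for every named entity found in doc."""
    return [(ent.text, ent.type) for ent in doc.entities]


def demo():
    """Assumes stanza is installed and the English models are downloaded."""
    import stanza

    nlp = stanza.Pipeline("en")
    doc = nlp("Barack Obama was born in Hawaii.")
    for row in word_rows(doc):
        print(row)
    print(entity_rows(doc))
```

Calling `demo()` runs the full English pipeline on the example sentence and prints one tuple per word followed by the list of entities.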
Aside from the neural pipeline, Stanza also provides the official Python wrapper for accessing the Java Stanford CoreNLP package. For more details, please see Stanford CoreNLP Client.
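A rough sketch of what using the wrapper can look like is shown here. It assumes Java is installed, CoreNLP has been downloaded separately, and the `CORENLP_HOME` environment variable points at it; the annotator list, timeout, and memory values are illustrative choices, and the helper names `sentence_tokens` and `annotate_with_corenlp` are hypothetical.

```python
def sentence_tokens(annotation):
    """Return one list of (word, pos) pairs per sentence of a CoreNLP annotation."""
    return [
        [(token.word, token.pos) for token in sentence.token]
        for sentence in annotation.sentence
    ]


def annotate_with_corenlp(text="Barack Obama was born in Hawaii."):
    """Assumes Java, a local CoreNLP install, and CORENLP_HOME are set up."""
    from stanza.server import CoreNLPClient

    # The context manager starts a background CoreNLP server and stops it on exit.
    with CoreNLPClient(
        annotators=["tokenize", "ssplit", "pos"],
        timeout=30000,  # milliseconds
        memory="4G",
    ) as client:
        annotation = client.annotate(text)
    return sentence_tokens(annotation)
```

Keeping the client inside a `with` block means the Java server is shut down even if annotation raises an error.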
Stanza is licensed under the Apache License, Version 2.0 (the “License”); you may not use the software package except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
We are also grateful to community contributors for their help in improving Stanza.
If you use Stanza in your work, please cite this paper:
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations.
If you use the biomedical and clinical model packages in Stanza, please also cite our biomedical models paper:
Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning and Curtis P. Langlotz. 2020. Biomedical and Clinical English Model Packages in the Stanza Python NLP Library. arXiv preprint arXiv:2004.14530.
If you use Stanford CoreNLP through the Stanza Python client, please also follow the instructions here to cite the proper publications.