Available Models & Languages
Stanza provides pretrained NLP models for a total of 66 human languages. This page gives detailed information on these models.
Pretrained models in Stanza can be divided into two categories, based on the datasets they were trained on:
- Universal Dependencies (UD) models, which are trained on the UD treebanks and cover tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, and dependency parsing;
- NER models, which support named entity tagging for 8 languages, and are trained on various NER datasets.
Available UD Models
The following table lists all UD models supported by Stanza and pretrained on the Universal Dependencies v2.8 datasets. You can find more information about the POS tags, morphological features, and syntactic relations used on the Universal Dependencies website. We recommend always using the latest released models. However, you can still use models from earlier releases by downloading them and placing them in the correct directory. The performance of all available models is reported on the System Performance page.
Table Notes
- marks models which have very low unlabeled attachment score (UAS) when evaluated end-to-end (from tokenization all the way to dependency parsing). Specifically, their UAS is lower than 50% on the Universal Dependencies 2.8 test set. The output of these models is unlikely to be accurate enough for serious syntactic analysis. In some cases, these models are resource constrained. If you are interested in improving the accuracy of these languages, please get in contact with the treebank authors regarding expanding the training data.
- marks the default package for a language, which is the package trained on the largest treebank available for that language.
- The copyright and licensing status of machine learning models is not very clear (to us). We list in the table below the Treebank License of the underlying data from which each language pack (set of machine learning models for a treebank) was trained. To the extent that The Trustees of Leland Stanford Junior University have ownership and rights over these language packs, all these Stanza language packs are made available under the Open Data Commons Attribution License v1.0.
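To make the 50% UAS cutoff in the notes above concrete, here is a minimal sketch of how an unlabeled attachment score can be computed for a single sentence. It assumes the predicted and gold tokenizations are identical, which is a simplification: an end-to-end evaluation also has to align tokens that were segmented differently. The function names and the `LOW_UAS_THRESHOLD` constant are illustrative and are not part of Stanza.

```python
# Heads are 1-based token indices; 0 marks the root, as in CoNLL-U.
LOW_UAS_THRESHOLD = 50.0  # percentage cutoff taken from the table notes


def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Return the percentage of tokens whose predicted head matches the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


def is_low_accuracy(uas):
    """True when a score falls below the cutoff used to flag models in the table."""
    return uas < LOW_UAS_THRESHOLD


gold = [2, 0, 2, 3]
predicted = [2, 0, 2, 2]  # the last token is attached to the wrong head
score = unlabeled_attachment_score(gold, predicted)
print(f"UAS = {score:.1f}%")  # 3 of 4 heads are correct: UAS = 75.0%
```

Labeled attachment score (LAS) would additionally require the dependency relation on each arc to match; this sketch covers only the unlabeled case.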
Language | Language code | Package | Version | Treebank License | Treebank Doc | Notes |
---|---|---|---|---|---|---|
Afrikaans | af | afribooms | 1.0.0 | ![]() | ||
Ancient Greek | grc | proiel | 1.0.0 | ![]() | ||
Ancient Greek | grc | perseus | 1.0.0 | ![]() | ||
Arabic | ar | padt | 1.0.0 | ![]() | ||
Armenian | hy | armtdp | 1.0.0 | ![]() | ||
Basque | eu | bdt | 1.0.0 | ![]() | ||
Belarusian | be | hse | 1.0.0 | ![]() | ||
Bulgarian | bg | btb | 1.0.0 | ![]() | ||
Buryat | bxr | bdt | 1.0.0 | ![]() | ||
Catalan | ca | ancora | 1.0.0 | ![]() | ||
Chinese (simplified) | zh / zh-hans | gsdsimp | 1.0.0 | ![]() | ||
Chinese (traditional) | zh-hant | gsd | 1.0.0 | ![]() | ||
Classical Chinese | lzh | kyoto | 1.0.0 | ![]() | ||
Coptic | cop | scriptorium | 1.0.0 | ![]() | ||
Croatian | hr | set | 1.0.0 | ![]() | ||
Czech | cs | cac | 1.0.0 | ![]() | ||
Czech | cs | cltt | 1.0.0 | ![]() | ||
Czech | cs | fictree | 1.0.0 | ![]() | ||
Czech | cs | pdt | 1.0.0 | ![]() | ||
Danish | da | ddt | 1.0.0 | ![]() | ||
Dutch | nl | alpino | 1.0.0 | ![]() | ||
Dutch | nl | lassysmall | 1.0.0 | ![]() | ||
English | en | ewt | 1.0.0 | ![]() | ||
English | en | gum | 1.0.0 | ![]() | ||
English | en | lines | 1.0.0 | ![]() | ||
English | en | partut | 1.0.0 | ![]() | ||
Estonian | et | edt | 1.0.0 | ![]() | ||
Estonian | et | ewt | 1.0.0 | ![]() | ||
Finnish | fi | ftb | 1.0.0 | ![]() | ||
Finnish | fi | tdt | 1.0.0 | ![]() | ||
French | fr | gsd | 1.0.0 | ![]() | ||
French | fr | partut | 1.0.0 | ![]() | ||
French | fr | sequoia | 1.0.0 | LGPLLR | ||
French | fr | spoken | 1.0.0 | ![]() | ||
Galician | gl | ctg | 1.0.0 | ![]() | ||
Galician | gl | treegal | 1.0.0 | LGPLLR | ||
German | de | gsd | 1.0.0 | ![]() | ||
German | de | hdt | 1.0.0 | ![]() | ||
Gothic | got | proiel | 1.0.0 | ![]() | ||
Greek | el | gdt | 1.0.0 | ![]() | ||
Hebrew | he | htb | 1.0.0 | ![]() | ||
Hindi | hi | hdtb | 1.0.0 | ![]() | ||
Hungarian | hu | szeged | 1.0.0 | ![]() | ||
Indonesian | id | gsd | 1.0.0 | ![]() | ||
Irish | ga | idt | 1.0.0 | ![]() | ||
Italian | it | isdt | 1.0.0 | ![]() | ||
Italian | it | partut | 1.0.0 | ![]() | ||
Italian | it | postwita | 1.0.0 | ![]() | ||
Italian | it | twittiro | 1.0.0 | ![]() | ||
Italian | it | vit | 1.0.0 | ![]() | ||
Japanese | ja | gsd | 1.0.0 | ![]() | ||
Kazakh | kk | ktb | 1.0.0 | ![]() | ||
Korean | ko | gsd | 1.0.0 | ![]() | ||
Korean | ko | kaist | 1.0.0 | ![]() | ||
Kurmanji | kmr | mg | 1.0.0 | ![]() | ||
Latin | la | ittb | 1.0.0 | ![]() | ||
Latin | la | proiel | 1.0.0 | ![]() | ||
Latin | la | perseus | 1.0.0 | ![]() | ||
Latvian | lv | lvtb | 1.0.0 | ![]() | ||
Lithuanian | lt | alksnis | 1.0.0 | ![]() | ||
Lithuanian | lt | hse | 1.0.0 | ![]() | ||
Livvi | olo | kkpp | 1.0.0 | ![]() | ||
Maltese | mt | mudt | 1.0.0 | ![]() | ||
Marathi | mr | ufal | 1.0.0 | ![]() | ||
North Sami | sme | giella | 1.0.0 | ![]() | ||
Norwegian (Bokmaal) | no / nb | bokmaal | 1.0.0 | ![]() | ||
Norwegian (Nynorsk) | nn | nynorsk | 1.0.0 | ![]() | ||
Norwegian (Nynorsk) | nn | nynorsklia | 1.0.0 | ![]() | ||
Old Church Slavonic | cu | proiel | 1.0.0 | ![]() | ||
Old French | fro | srcmf | 1.0.0 | ![]() | ||
Old Russian | orv | torot | 1.0.0 | ![]() | ||
Persian | fa | seraji | 1.0.0 | ![]() | ||
Polish | pl | lfg | 1.0.0 | ![]() | ||
Polish | pl | pdb | 1.0.0 | ![]() | ||
Portuguese | pt | bosque | 1.0.0 | ![]() | ||
Portuguese | pt | gsd | 1.0.0 | ![]() | ||
Romanian | ro | nonstandard | 1.0.0 | ![]() | ||
Romanian | ro | rrt | 1.0.0 | ![]() | ||
Russian | ru | gsd | 1.0.0 | ![]() | ||
Russian | ru | syntagrus | 1.0.0 | ![]() | ||
Russian | ru | taiga | 1.0.0 | ![]() | ||
Scottish Gaelic | gd | arcosg | 1.0.0 | ![]() | ||
Serbian | sr | set | 1.0.0 | ![]() | ||
Slovak | sk | snk | 1.0.0 | ![]() | ||
Slovenian | sl | ssj | 1.0.0 | ![]() | ||
Slovenian | sl | sst | 1.0.0 | ![]() | ||
Spanish | es | ancora | 1.0.0 | ![]() | ||
Spanish | es | gsd | 1.0.0 | ![]() | ||
Swedish | sv | lines | 1.0.0 | ![]() | ||
Swedish | sv | talbanken | 1.0.0 | ![]() | ||
Swedish Sign Language | swl | sslc | 1.0.0 | ![]() | ||
Tamil | ta | ttb | 1.0.0 | ![]() | ||
Telugu | te | mtg | 1.0.0 | ![]() | ||
Turkish | tr | imst | 1.0.0 | ![]() | ||
Ukrainian | uk | iu | 1.0.0 | ![]() | ||
Upper Sorbian | hsb | ufal | 1.0.0 | ![]() | ||
Urdu | ur | udtb | 1.0.0 | ![]() | ||
Uyghur | ug | udt | 1.0.0 | ![]() | ||
Vietnamese | vi | vtb | 1.0.0 | ![]() | ||
Wolof | wo | wtb | 1.0.0 | ![]() | ||
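As an illustration of how the language codes and package names in the table are used, the sketch below validates a code and package pair against a few rows copied from the table and then passes them to `stanza.download` and `stanza.Pipeline`. The `UD_PACKAGES` dictionary and both helper functions are hypothetical conveniences written for this example; only the `lang` and `package` arguments belong to Stanza itself. Stanza is imported lazily so that the validation step can be tried without the library installed.

```python
# A few rows copied from the table above: language code -> package names.
# This is a deliberately partial excerpt, not the full list.
UD_PACKAGES = {
    "en": ["ewt", "gum", "lines", "partut"],
    "fr": ["gsd", "partut", "sequoia", "spoken"],
    "grc": ["proiel", "perseus"],
}


def pipeline_kwargs(lang, package=None):
    """Build keyword arguments for stanza.download and stanza.Pipeline.

    When no package is given, the key is omitted so that Stanza falls back
    to the default package for the language.
    """
    if lang not in UD_PACKAGES:
        raise ValueError(f"language code {lang!r} is not in this excerpt of the table")
    kwargs = {"lang": lang}
    if package is not None:
        if package not in UD_PACKAGES[lang]:
            raise ValueError(f"package {package!r} is not listed for {lang!r}")
        kwargs["package"] = package
    return kwargs


def load_pipeline(lang, package=None):
    """Download the requested models (needs network access) and build a pipeline."""
    import stanza  # imported here so the helpers above work without Stanza installed

    kwargs = pipeline_kwargs(lang, package)
    stanza.download(**kwargs)
    return stanza.Pipeline(**kwargs)


print(pipeline_kwargs("en", "gum"))  # {'lang': 'en', 'package': 'gum'}
```

With Stanza installed, `nlp = load_pipeline("en", "gum")` followed by `doc = nlp("Stanza supports many languages.")` would annotate a sentence using the models trained on the GUM treebank.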
Other Available Models for Tokenization
We have trained a couple of Thai tokenizer models based on publicly available datasets. According to the authors of pythainlp, the Inter-BEST dataset had some strange sentence tokenization, so we used their software to resegment the sentences before training. Because this makes BEST a questionable standard for sentence segmentation, we made the Orchid tokenizer the default.
Dataset | Token Accuracy | Sentence Accuracy | Notes |
---|---|---|---|
Orchid | 87.98 | 70.99 | |
BEST | 95.73 | 77.93 | Sentences are re-split using pythainlp |
Available NER Models
A description of the models available for the NER tool, along with their performance on test datasets, can be found here.
Available Sentiment Models
A description of the sentiment tool and the models available for that tool can be found here.
Available Constituency Parser (Conparse) Models
A description of the constituency parser and the models available for that tool can be found here.
Training New Models
To train new models, please see the documents on training and adding a new language.