Available Models & Languages

Stanza provides pretrained NLP models for a total of 66 human languages. On this page we provide detailed information on these models.

Pretrained models in Stanza can be divided into two categories, based on the datasets they were trained on:

  1. Universal Dependencies (UD) models, which are trained on the UD treebanks, and cover functionalities including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, and dependency parsing;
  2. NER models, which support named entity tagging for 8 languages, and are trained on various NER datasets.
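As a minimal sketch of how the UD functionalities in item 1 are requested in practice, the snippet below builds a pipeline from the corresponding Stanza processor names (`tokenize`, `mwt`, `pos`, `lemma`, `depparse`). The helper names `ud_processor_string` and `run_ud_pipeline` are hypothetical, French is chosen only because it is a language with multi-word tokens, and the call requires the `stanza` package and a network connection for the model download.

```python
# Stanza processor names for the UD functionalities listed above:
# tokenization, MWT expansion, POS/morphological tagging, lemmatization, parsing.
UD_PROCESSORS = ["tokenize", "mwt", "pos", "lemma", "depparse"]


def ud_processor_string(processors=UD_PROCESSORS):
    """Join processor names into the comma-separated form Stanza accepts."""
    return ",".join(processors)


def run_ud_pipeline(text, lang="fr"):
    """Hypothetical helper: download the UD models for `lang` and annotate `text`."""
    import stanza  # imported lazily so the helpers above work without stanza installed

    processors = ud_processor_string()
    stanza.download(lang, processors=processors)
    nlp = stanza.Pipeline(lang, processors=processors)
    doc = nlp(text)
    # One tuple per syntactic word: form, lemma, universal POS, features, head, relation.
    return [
        (word.text, word.lemma, word.upos, word.feats, word.head, word.deprel)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

Calling `run_ud_pipeline("Nous avons atteint la fin du sentier.")` would return one tuple per word, with `du` expanded into two words by the MWT processor if the model behaves as expected.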

Available UD Models

The following table lists all UD models supported by Stanza and pretrained on the Universal Dependencies v2.8 datasets. You can find more information about the POS tags, morphological features, and syntactic relations used on the Universal Dependencies website. We recommend you always use the latest released models. However, you can still use earlier models by downloading them and putting them in the correct directory. You can find the performance of all available models on the System Performance page.

Table Notes

  1. marks models which have very low unlabeled attachment score (UAS) when evaluated end-to-end (from tokenization all the way to dependency parsing). Specifically, their UAS is lower than 50% on the Universal Dependencies 2.8 test set. The output of these models is unlikely to be accurate enough for serious syntactic analysis. In some cases, these models are resource constrained. If you are interested in improving the accuracy of these languages, please get in contact with the treebank authors regarding expanding the training data.
  2. marks the default package for a language, which is the package trained on the largest treebank available for that language.
  3. The copyright and licensing status of machine learning models is not very clear (to us). We list in the table below the Treebank License of the underlying data from which each language pack (set of machine learning models for a treebank) was trained. To the extent that The Trustees of Leland Stanford Junior University have ownership and rights over these language packs, all these Stanza language packs are made available under the Open Data Commons Attribution License v1.0.
| Language | Language code | Package | Version | Treebank License | Treebank Doc | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Afrikaans | af | afribooms | 1.0.0 | Creative Commons License | | |
| Ancient Greek | grc | proiel | 1.0.0 | Creative Commons License | | |
| | grc | perseus | 1.0.0 | Creative Commons License | | |
| Arabic | ar | padt | 1.0.0 | Creative Commons License | | |
| Armenian | hy | armtdp | 1.0.0 | Creative Commons License | | |
| Basque | eu | bdt | 1.0.0 | Creative Commons License | | |
| Belarusian | be | hse | 1.0.0 | Creative Commons License | | |
| Bulgarian | bg | btb | 1.0.0 | Creative Commons License | | |
| Buryat | bxr | bdt | 1.0.0 | Creative Commons License | | |
| Catalan | ca | ancora | 1.0.0 | GNU License | | |
| Chinese (simplified) | zh / zh-hans | gsdsimp | 1.0.0 | Creative Commons License | | |
| Chinese (traditional) | zh-hant | gsd | 1.0.0 | Creative Commons License | | |
| Classical Chinese | lzh | kyoto | 1.0.0 | Creative Commons License | | |
| Coptic | cop | scriptorium | 1.0.0 | Creative Commons License | | |
| Croatian | hr | set | 1.0.0 | Creative Commons License | | |
| Czech | cs | cac | 1.0.0 | Creative Commons License | | |
| | cs | cltt | 1.0.0 | Creative Commons License | | |
| | cs | fictree | 1.0.0 | Creative Commons License | | |
| | cs | pdt | 1.0.0 | Creative Commons License | | |
| Danish | da | ddt | 1.0.0 | Creative Commons License | | |
| Dutch | nl | alpino | 1.0.0 | Creative Commons License | | |
| | nl | lassysmall | 1.0.0 | Creative Commons License | | |
| English | en | ewt | 1.0.0 | Creative Commons License | | |
| | en | gum | 1.0.0 | Creative Commons License | | |
| | en | lines | 1.0.0 | Creative Commons License | | |
| | en | partut | 1.0.0 | Creative Commons License | | |
| Estonian | et | edt | 1.0.0 | Creative Commons License | | |
| | et | ewt | 1.0.0 | Creative Commons License | | |
| Finnish | fi | ftb | 1.0.0 | Creative Commons License | | |
| | fi | tdt | 1.0.0 | Creative Commons License | | |
| French | fr | gsd | 1.0.0 | Creative Commons License | | |
| | fr | partut | 1.0.0 | Creative Commons License | | |
| | fr | spoken | 1.0.0 | Creative Commons License | | |
| Galician | gl | ctg | 1.0.0 | Creative Commons License | | |
| German | de | gsd | 1.0.0 | Creative Commons License | | |
| | de | hdt | 1.0.0 | Creative Commons License | | |
| Gothic | got | proiel | 1.0.0 | Creative Commons License | | |
| Greek | el | gdt | 1.0.0 | Creative Commons License | | |
| Hebrew | he | htb | 1.0.0 | Creative Commons License | | |
| Hindi | hi | hdtb | 1.0.0 | Creative Commons License | | |
| Hungarian | hu | szeged | 1.0.0 | Creative Commons License | | |
| Indonesian | id | gsd | 1.0.0 | Creative Commons License | | |
| Irish | ga | idt | 1.0.0 | Creative Commons License | | |
| Italian | it | isdt | 1.0.0 | Creative Commons License | | |
| | it | partut | 1.0.0 | Creative Commons License | | |
| | it | postwita | 1.0.0 | Creative Commons License | | |
| | it | twittiro | 1.0.0 | Creative Commons License | | |
| | it | vit | 1.0.0 | Creative Commons License | | |
| Japanese | ja | gsd | 1.0.0 | Creative Commons License | | |
| Kazakh | kk | ktb | 1.0.0 | Creative Commons License | | |
| Korean | ko | gsd | 1.0.0 | Creative Commons License | | |
| | ko | kaist | 1.0.0 | Creative Commons License | | |
| Kurmanji | kmr | mg | 1.0.0 | Creative Commons License | | |
| Latin | la | ittb | 1.0.0 | Creative Commons License | | |
| | la | proiel | 1.0.0 | Creative Commons License | | |
| | la | perseus | 1.0.0 | Creative Commons License | | |
| Latvian | lv | lvtb | 1.0.0 | Creative Commons License | | |
| Lithuanian | lt | alksnis | 1.0.0 | Creative Commons License | | |
| | lt | hse | 1.0.0 | Creative Commons License | | |
| Livvi | olo | kkpp | 1.0.0 | Creative Commons License | | |
| Maltese | mt | mudt | 1.0.0 | Creative Commons License | | |
| Marathi | mr | ufal | 1.0.0 | Creative Commons License | | |
| North Sami | sme | giella | 1.0.0 | Creative Commons License | | |
| Norwegian (Bokmaal) | no / nb | bokmaal | 1.0.0 | Creative Commons License | | |
| Norwegian (Nynorsk) | nn | nynorsk | 1.0.0 | Creative Commons License | | |
| | nn | nynorsklia | 1.0.0 | Creative Commons License | | |
| Old Church Slavonic | cu | proiel | 1.0.0 | Creative Commons License | | |
| Old French | fro | srcmf | 1.0.0 | Creative Commons License | | |
| Old Russian | orv | torot | 1.0.0 | Creative Commons License | | |
| Persian | fa | seraji | 1.0.0 | Creative Commons License | | |
| Polish | pl | lfg | 1.0.0 | GNU License | | |
| | pl | pdb | 1.0.0 | Creative Commons License | | |
| Portuguese | pt | bosque | 1.0.0 | Creative Commons License | | |
| | pt | gsd | 1.0.0 | Creative Commons License | | |
| Romanian | ro | nonstandard | 1.0.0 | Creative Commons License | | |
| | ro | rrt | 1.0.0 | Creative Commons License | | |
| Russian | ru | gsd | 1.0.0 | Creative Commons License | | |
| | ru | syntagrus | 1.0.0 | Creative Commons License | | |
| | ru | taiga | 1.0.0 | Creative Commons License | | |
| Scottish Gaelic | gd | arcosg | 1.0.0 | Creative Commons License | | |
| Serbian | sr | set | 1.0.0 | Creative Commons License | | |
| Slovak | sk | snk | 1.0.0 | Creative Commons License | | |
| Slovenian | sl | ssj | 1.0.0 | Creative Commons License | | |
| | sl | sst | 1.0.0 | Creative Commons License | | |
| Spanish | es | ancora | 1.0.0 | GNU License | | |
| | es | gsd | 1.0.0 | Creative Commons License | | |
| Swedish | sv | lines | 1.0.0 | Creative Commons License | | |
| | sv | talbanken | 1.0.0 | Creative Commons License | | |
| Swedish Sign Language | swl | sslc | 1.0.0 | Creative Commons License | | |
| Tamil | ta | ttb | 1.0.0 | Creative Commons License | | |
| Telugu | te | mtg | 1.0.0 | Creative Commons License | | |
| Turkish | tr | imst | 1.0.0 | Creative Commons License | | |
| Ukrainian | uk | iu | 1.0.0 | Creative Commons License | | |
| Upper Sorbian | hsb | ufal | 1.0.0 | Creative Commons License | | |
| Urdu | ur | udtb | 1.0.0 | Creative Commons License | | |
| Uyghur | ug | udt | 1.0.0 | Creative Commons License | | |
| Vietnamese | vi | vtb | 1.0.0 | Creative Commons License | | |
| Wolof | wo | wtb | 1.0.0 | Creative Commons License | | |
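
The language codes and package names in the table are the values passed to Stanza when downloading and loading a model. The following is a rough sketch, assuming the standard `stanza.download` and `stanza.Pipeline` entry points with their `lang` and `package` arguments. The `PACKAGES` excerpt and the helper names `pipeline_kwargs` and `load_pipeline` are hypothetical and exist only for illustration. Omitting `package` lets Stanza pick the default package for the language.

```python
# Small excerpt of the table above: language code -> packages listed for it.
PACKAGES = {
    "en": ["ewt", "gum", "lines", "partut"],
    "fi": ["ftb", "tdt"],
    "zh-hans": ["gsdsimp"],
}


def pipeline_kwargs(lang, package=None):
    """Validate a (language code, package) pair against the excerpt and
    return the keyword arguments to pass to Stanza."""
    if lang not in PACKAGES:
        raise ValueError(f"language code {lang!r} is not in the excerpt")
    kwargs = {"lang": lang}
    if package is not None:
        if package not in PACKAGES[lang]:
            raise ValueError(f"package {package!r} is not listed for {lang!r}")
        kwargs["package"] = package
    return kwargs


def load_pipeline(lang, package=None):
    """Hypothetical helper: download the chosen package and build a pipeline."""
    import stanza  # imported lazily; requires stanza and network access

    kwargs = pipeline_kwargs(lang, package)
    stanza.download(**kwargs)
    return stanza.Pipeline(**kwargs)
```

For example, `load_pipeline("en", "gum")` would load the English models trained on the GUM treebank, while `load_pipeline("en")` would fall back to the default English package.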

Other Available Models for Tokenization

We have trained a couple of Thai tokenizer models based on publicly available datasets. The Inter-BEST dataset had some strange sentence tokenization according to the authors of pythainlp, so we used their software to resegment the sentences before training. As this is a questionable standard to use, we made the Orchid tokenizer the default.

| Dataset | Token Accuracy | Sentence Accuracy | Notes |
| --- | --- | --- | --- |
| BEST | 95.73 | 77.93 | Sentences are re-split using pythainlp |

Available NER Models

A description of the models available for the NER tool, along with their performance on test datasets, can be found here.

Available Sentiment Models

A description of the sentiment tool and the models available for that tool can be found here.

Available Conparse Models

A description of the constituency parser and the models available for that tool can be found here.

Training New Models

To train new models, please see the documents on training and adding a new language.