Stanza provides pretrained NLP models for a total of 66 human languages. This page provides detailed information on these models.

Pretrained models in Stanza can be divided into two categories, based on the datasets they were trained on:

  1. Universal Dependencies (UD) models, which are trained on the UD treebanks and cover tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, and dependency parsing;
  2. NER models, which support named entity tagging for 8 languages, and are trained on various NER datasets.

Downloading Models

Downloading models is as simple as calling the stanza.download() method. We provide detailed examples of how to use the download interface on the Getting Started page. All available options (i.e., arguments) of the download method are listed below:

Option Type Default Description
lang str 'en' Language code (e.g., "en") or language name (e.g., "English") for the language to process with the Pipeline. See the tables of available models below for a complete list of supported languages.
dir str '~/stanza_resources' Directory for storing the models downloaded for Stanza. By default, Stanza stores its models in a folder in your home directory.
package str 'default' Package to download for processors, where each package typically specifies what data the models are trained on. We provide a 'default' package for all languages that contains the NLP models most users will find useful; it is used when the package argument isn't specified. See the tables below for a complete list of available packages.
processors dict or str dict() Processors to download models for. This can be specified either as a comma-separated list of processor names to use (e.g., 'tokenize,pos'), or as a Python dictionary with processor names as keys and package names as corresponding values (e.g., {'tokenize': 'ewt', 'pos': 'ewt'}). All unspecified processors will fall back to using the package specified by the package argument. A list of all Processors supported can be found here.
logging_level str 'INFO' Controls the level of logging information to display during download. Can be one of 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL', or 'FATAL'. Less information is displayed going from 'DEBUG' to 'FATAL'.
verbose bool None Simplified option for logging level. If True, logging level will be set to 'INFO'. If False, logging level will be set to 'ERROR' (i.e., only show errors).
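
The options above can be combined in a single call. The following is a minimal sketch, assuming the stanza package is installed and the model server is reachable; the import is deferred so the helper can be read and run without it. The names effective_logging_level and download_examples are hypothetical and not part of Stanza, and letting verbose take precedence over logging_level is an assumption based on the descriptions in the table.

```python
# Levels listed in the options table, from most to least verbose.
LEVELS = ("DEBUG", "INFO", "WARN", "ERROR", "CRITICAL", "FATAL")


def effective_logging_level(logging_level="INFO", verbose=None):
    """Illustrative helper (not part of Stanza): resolve the level to use.

    Assumption: when verbose is given, it overrides logging_level.
    """
    if verbose is True:
        return "INFO"
    if verbose is False:
        return "ERROR"
    level = logging_level.upper()
    if level not in LEVELS:
        raise ValueError(f"unknown logging level: {logging_level!r}")
    return level


def download_examples():
    """Example calls; each one downloads models over the network."""
    import stanza  # deferred import; requires `pip install stanza`

    # Default package for English.
    stanza.download("en")
    # A specific package for every processor.
    stanza.download("en", package="gum")
    # Only some processors, given as a comma-separated string.
    stanza.download("en", processors="tokenize,pos")
    # Per-processor packages as a dictionary, showing errors only.
    stanza.download("en", processors={"tokenize": "ewt", "pos": "ewt"}, verbose=False)
    # Explicit logging level.
    stanza.download("fr", logging_level="WARN")
```

Calling download_examples() runs all five downloads in sequence; in practice you would keep only the call that matches your use case.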

Available UD Models

The following table lists all UD models supported by Stanza and pretrained on the Universal Dependencies v2.5 datasets. You can find more information about the POS tags, morphological features, and syntactic relations used on the Universal Dependencies website. We recommend that you always use the latest released models. However, you can still use earlier models by downloading them and putting them in the correct directory. You can find the performance of all available models on the System Performance page.

Table Notes

  1. marks models which have very low unlabeled attachment score (UAS) when evaluated end-to-end (from tokenization all the way to dependency parsing). Specifically, their UAS is lower than 50% on the Universal Dependencies 2.5 test set. Users should be very cautious in using the output of these models for serious syntactic analysis.
  2. marks the default package for a language, which is the package trained on the largest treebank available for that language.
  3. The copyright and licensing status of machine learning models is not very clear (to us). We list in the table below the Treebank License of the underlying data from which each language pack (set of machine learning models for a treebank) was trained. To the extent that The Trustees of Leland Stanford Junior University have ownership and rights over these language packs, all these Stanza language packs are made available under the Open Data Commons Attribution License v1.0.
Language Language code Package Version Treebank License Treebank Doc Notes
Afrikaans af afribooms 1.0.0 Creative Commons License
Ancient Greek grc proiel 1.0.0 Creative Commons License
  grc perseus 1.0.0 Creative Commons License  
Arabic ar padt 1.0.0 Creative Commons License
Armenian hy armtdp 1.0.0 Creative Commons License
Basque eu bdt 1.0.0 Creative Commons License
Belarusian be hse 1.0.0 Creative Commons License
Bulgarian bg btb 1.0.0 Creative Commons License
Buryat bxr bdt 1.0.0 Creative Commons License
Catalan ca ancora 1.0.0 GNU License
Chinese (simplified) zh / zh-hans gsdsimp 1.0.0 Creative Commons License
Chinese (traditional) zh-hant gsd 1.0.0 Creative Commons License
Classical Chinese lzh kyoto 1.0.0 Creative Commons License
Coptic cop scriptorium 1.0.0 Creative Commons License
Croatian hr set 1.0.0 Creative Commons License
Czech cs cac 1.0.0 Creative Commons License  
  cs cltt 1.0.0 Creative Commons License  
  cs fictree 1.0.0 Creative Commons License  
  cs pdt 1.0.0 Creative Commons License
Danish da ddt 1.0.0 Creative Commons License
Dutch nl alpino 1.0.0 Creative Commons License
  nl lassysmall 1.0.0 Creative Commons License  
English en ewt 1.0.0 Creative Commons License
  en gum 1.0.0 Creative Commons License  
  en lines 1.0.0 Creative Commons License  
  en partut 1.0.0 Creative Commons License  
Estonian et edt 1.0.0 Creative Commons License
  et ewt 1.0.0 Creative Commons License  
Finnish fi ftb 1.0.0 Creative Commons License  
  fi tdt 1.0.0 Creative Commons License
French fr gsd 1.0.0 Creative Commons License
  fr partut 1.0.0 Creative Commons License  
  fr sequoia 1.0.0 LGPLLR  
  fr spoken 1.0.0 Creative Commons License  
Galician gl ctg 1.0.0 Creative Commons License
  gl treegal 1.0.0 LGPLLR  
German de gsd 1.0.0 Creative Commons License
  de hdt 1.0.0 Creative Commons License  
Gothic got proiel 1.0.0 Creative Commons License
Greek el gdt 1.0.0 Creative Commons License
Hebrew he htb 1.0.0 Creative Commons License
Hindi hi hdtb 1.0.0 Creative Commons License
Hungarian hu szeged 1.0.0 Creative Commons License
Indonesian id gsd 1.0.0 Creative Commons License
Irish ga idt 1.0.0 Creative Commons License
Italian it isdt 1.0.0 Creative Commons License
  it partut 1.0.0 Creative Commons License  
  it postwita 1.0.0 Creative Commons License  
  it twittiro 1.0.0 Creative Commons License  
  it vit 1.0.0 Creative Commons License  
Japanese ja gsd 1.0.0 Creative Commons License
Kazakh kk ktb 1.0.0 Creative Commons License
Korean ko gsd 1.0.0 Creative Commons License  
  ko kaist 1.0.0 Creative Commons License
Kurmanji kmr mg 1.0.0 Creative Commons License
Latin la ittb 1.0.0 Creative Commons License
  la proiel 1.0.0 Creative Commons License  
  la perseus 1.0.0 Creative Commons License  
Latvian lv lvtb 1.0.0 Creative Commons License
Lithuanian lt alksnis 1.0.0 Creative Commons License  
  lt hse 1.0.0 Creative Commons License
Livvi olo kkpp 1.0.0 Creative Commons License
Maltese mt mudt 1.0.0 Creative Commons License
Marathi mr ufal 1.0.0 Creative Commons License
North Sami sme giella 1.0.0 Creative Commons License
Norwegian (Bokmaal) no / nb bokmaal 1.0.0 Creative Commons License
Norwegian (Nynorsk) nn nynorsk 1.0.0 Creative Commons License
  nn nynorsklia 1.0.0 Creative Commons License  
Old Church Slavonic cu proiel 1.0.0 Creative Commons License
Old French fro srcmf 1.0.0 Creative Commons License
Old Russian orv torot 1.0.0 Creative Commons License
Persian fa seraji 1.0.0 Creative Commons License
Polish pl lfg 1.0.0 GNU License
  pl pdb 1.0.0 Creative Commons License  
Portuguese pt bosque 1.0.0 Creative Commons License
  pt gsd 1.0.0 Creative Commons License  
Romanian ro nonstandard 1.0.0 Creative Commons License  
  ro rrt 1.0.0 Creative Commons License
Russian ru gsd 1.0.0 Creative Commons License  
  ru syntagrus 1.0.0 Creative Commons License
  ru taiga 1.0.0 Creative Commons License  
Scottish Gaelic gd arcosg 1.0.0 Creative Commons License
Serbian sr set 1.0.0 Creative Commons License
Slovak sk snk 1.0.0 Creative Commons License
Slovenian sl ssj 1.0.0 Creative Commons License
  sl sst 1.0.0 Creative Commons License  
Spanish es ancora 1.0.0 GNU License
  es gsd 1.0.0 Creative Commons License  
Swedish sv lines 1.0.0 Creative Commons License  
  sv talbanken 1.0.0 Creative Commons License
Swedish Sign Language swl sslc 1.0.0 Creative Commons License
Tamil ta ttb 1.0.0 Creative Commons License
Telugu te mtg 1.0.0 Creative Commons License
Turkish tr imst 1.0.0 Creative Commons License
Ukrainian uk iu 1.0.0 Creative Commons License
Upper Sorbian hsb ufal 1.0.0 Creative Commons License
Urdu ur udtb 1.0.0 Creative Commons License
Uyghur ug udt 1.0.0 Creative Commons License
Vietnamese vi vtb 1.0.0 Creative Commons License
Wolof wo wtb 1.0.0 Creative Commons License
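
To use a package other than the default, pass its name from the Package column as the package argument. The sketch below is a minimal illustration, assuming stanza is installed and that stanza.Pipeline accepts the same lang and package arguments as stanza.download (the Pipeline interface is not described on this page). UD_PACKAGES is only an excerpt of the table above, and check_ud_package and load_ud_pipeline are hypothetical helper names.

```python
# Excerpt of the UD table above: language code -> available packages.
# Extend with further rows from the table as needed.
UD_PACKAGES = {
    "en": ("ewt", "gum", "lines", "partut"),
    "fr": ("gsd", "partut", "sequoia", "spoken"),
    "de": ("gsd", "hdt"),
}


def check_ud_package(lang, package):
    """Return True if (lang, package) appears in the excerpt above."""
    return package in UD_PACKAGES.get(lang, ())


def load_ud_pipeline(lang, package):
    """Download one UD package and build a pipeline that uses it."""
    if not check_ud_package(lang, package):
        raise ValueError(f"unknown package {package!r} for language {lang!r}")
    import stanza  # deferred import; requires `pip install stanza`

    stanza.download(lang, package=package)
    return stanza.Pipeline(lang, package=package)
```

For example, load_ud_pipeline("fr", "sequoia") would fetch the French Sequoia models and return a pipeline built on them, while an unlisted pair is rejected before anything is downloaded.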

Available NER Models

The following table lists all NER models supported by Stanza, pretrained on various NER datasets. Again, you can find the performance of all available models on the System Performance page.

Table Notes

  1. marks the default package for a language.
  2. For packages with 4 named entity types, supported types include PER (Person), LOC (Location), ORG (Organization) and MISC (Miscellaneous); for packages with 18 named entity types, supported types include PERSON, NORP (Nationalities/religious/political group), FAC (Facility), ORG (Organization), GPE (Countries/cities/states), LOC (Location), PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL (details can be found on page 21 of this OntoNotes documentation).
Language Language code Package # Types Corpus Doc Notes
Arabic ar AQMAR 4
Chinese zh OntoNotes 18
Dutch nl CoNLL02 4
Dutch nl WikiNER 4  
English en CoNLL03 4  
English en OntoNotes 18
French fr WikiNER 4
German de CoNLL03 4
German de GermEval14 4  
Russian ru WikiNER 4
Spanish es CoNLL02 4
Spanish es AnCora 4
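
The relationship between the NER table and the entity types in table note 2 can be written down as a small lookup. This is an illustrative sketch built only from the rows and the note above; entity_types is a hypothetical helper, not a Stanza function, and matching package names case-insensitively is an assumption made for convenience.

```python
# Entity types from table note 2.
FOUR_TYPES = ("PER", "LOC", "ORG", "MISC")
EIGHTEEN_TYPES = (
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT", "MONEY",
    "QUANTITY", "ORDINAL", "CARDINAL",
)

# Rows of the NER table: (language code, package) -> number of types.
NER_PACKAGES = {
    ("ar", "aqmar"): 4,
    ("zh", "ontonotes"): 18,
    ("nl", "conll02"): 4,
    ("nl", "wikiner"): 4,
    ("en", "conll03"): 4,
    ("en", "ontonotes"): 18,
    ("fr", "wikiner"): 4,
    ("de", "conll03"): 4,
    ("de", "germeval14"): 4,
    ("ru", "wikiner"): 4,
    ("es", "conll02"): 4,
    ("es", "ancora"): 4,
}


def entity_types(lang, package):
    """Return the entity types a given NER package can produce."""
    count = NER_PACKAGES[(lang.lower(), package.lower())]
    return FOUR_TYPES if count == 4 else EIGHTEEN_TYPES
```

For instance, entity_types("en", "OntoNotes") returns the 18-type set, so code that consumes English OntoNotes output should expect labels such as GPE and DATE rather than MISC.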