Models for Human Languages

Downloading and Using Models

Downloading a language pack (a set of machine learning models for a human language that you wish to use in the StanfordNLP pipeline) is as simple as

>>> import stanfordnlp
>>> stanfordnlp.download('ar')    # replace "ar" with the language or treebank code you need, see below

The language code or treebank code can be looked up in the next section. If only the language code is specified, we will download the default models for that language. If you are seeking the language pack built from a specific treebank, you can download the corresponding models with the appropriate treebank code. By default, language packs are stored in a stanfordnlp_resources folder inside your home directory.
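
The storage location described above can be sketched in a few lines. This is a minimal illustration, not part of the library: `default_resources_dir` and `fetch_pack` are hypothetical helper names, and the `resource_dir` keyword is assumed to be the argument of `stanfordnlp.download` that overrides the storage folder (check the signature in your installed release).

```python
from pathlib import Path
from typing import Optional


def default_resources_dir() -> Path:
    """Return the default location named above: stanfordnlp_resources in the home directory."""
    return Path.home() / "stanfordnlp_resources"


def fetch_pack(code: str, resources_dir: Optional[Path] = None) -> Path:
    """Download the pack for a language or treebank code and return the folder used.

    Hypothetical helper. stanfordnlp is imported lazily so that the path
    helper above works even when the package is not installed.
    """
    import stanfordnlp

    target = resources_dir or default_resources_dir()
    # Assumption: `resource_dir` is the keyword that overrides the storage folder.
    stanfordnlp.download(code, resource_dir=str(target))
    return target
```

Calling `fetch_pack("ar")` would request the default Arabic models, while `fetch_pack("it_postwita")` would request the pack built from one specific treebank.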

To use the default language pack for any language, simply build the pipeline as follows:

>>> nlp = stanfordnlp.Pipeline(lang="es")    # replace "es" with the language of interest

If you are using a non-default treebank for the language, make sure to also specify the treebank code, for example:

>>> nlp = stanfordnlp.Pipeline(lang="it", treebank="it_postwita")
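
The two ways of building a pipeline can be wrapped in one small helper. This is a sketch under stated assumptions: `pipeline_kwargs` and `build_pipeline` are hypothetical names, and the prefix check reflects a pattern visible in the treebank codes listed on this page (each treebank code starts with its language code), not a rule enforced by the library.

```python
def pipeline_kwargs(lang, treebank=None):
    """Assemble keyword arguments for stanfordnlp.Pipeline (hypothetical helper)."""
    if not lang:
        raise ValueError("a language code is required")
    kwargs = {"lang": lang}
    if treebank is not None:
        # Observed convention only: treebank codes on this page begin with
        # the language code, e.g. "it" and "it_postwita".
        if not treebank.startswith(lang):
            raise ValueError(f"{treebank!r} does not look like a treebank for {lang!r}")
        kwargs["treebank"] = treebank
    return kwargs


def build_pipeline(lang, treebank=None):
    """Build a pipeline for the default pack, or for a specific treebank if given."""
    import stanfordnlp  # lazy import: only needed when a pipeline is actually built

    return stanfordnlp.Pipeline(**pipeline_kwargs(lang, treebank))
```

With the models already downloaded, `build_pipeline("es")` corresponds to the default-pack call above and `build_pipeline("it", "it_postwita")` to the non-default one.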

Human Languages Supported by StanfordNLP

Below is a list of all the (human) languages supported by StanfordNLP (through this Python neural pipeline). The models for every language are trained on data from Universal Dependencies v2 and produce annotations that follow its guidelines. You can find more information about the POS tags, morphological features, and syntactic relations used on the Universal Dependencies website. The performance of these systems on the CoNLL 2018 Shared Task official test set (in our unofficial evaluation) can be found here.

Notes

  1. marks models which have very low unlabeled attachment score (UAS) when evaluated end-to-end (from tokenization all the way to dependency parsing). Specifically, their UAS is lower than 50% on the CoNLL 2018 Shared Task test set. Users should be very cautious in using the output of these models for serious syntactic analysis.
  2. marks models that are at least 1% absolute UAS worse than the full neural pipeline presented in our paper (which uses the TensorFlow counterparts for the tagger and the parser). This may be a yellow flag for people wishing to do parser comparison experiments, but in general these models are fine to use for syntactic analysis.
  3. marks the default language pack for a language, which is the language pack trained on the largest treebank available for that language.
  4. The copyright and licensing status of machine learning models is not very clear (to us). We list in the table below the Treebank License of the underlying data from which each language pack (set of machine learning models for a treebank) was trained. To the extent that The Trustees of Leland Stanford Junior University have ownership and rights over these language packs, all these StanfordNLP language packs are made available under the Open Data Commons Attribution License v1.0.
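
The criteria in notes 1 to 3 amount to simple threshold and maximum rules, sketched below. The function names and flag strings are hypothetical, the inputs are whatever UAS figures and treebank sizes you have at hand, and the notes do not say whether the first two marks can apply to the same model, so this sketch simply reports both when both conditions hold.

```python
LOW_UAS_THRESHOLD = 50.0  # note 1: end-to-end UAS below 50%
REFERENCE_GAP = 1.0       # note 2: at least 1% absolute UAS below the reference pipeline


def note_flags(uas, reference_uas):
    """Return the marks (hypothetical labels) that notes 1 and 2 would assign."""
    flags = []
    if uas < LOW_UAS_THRESHOLD:
        flags.append("low_uas")
    if reference_uas - uas >= REFERENCE_GAP:
        flags.append("below_reference")
    return flags


def default_pack(treebank_sizes):
    """Note 3: the default pack is the one trained on the largest treebank.

    `treebank_sizes` maps treebank codes to a size measure of your choice.
    """
    return max(treebank_sizes, key=treebank_sizes.get)
```

For instance, a model scoring 80.0 against a reference of 81.0 would receive only the second mark, since the gap is exactly one point.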

Language | Treebank | Language code | Treebank code | Models | Version | Treebank License | Treebank Doc | Notes
Afrikaans | AfriBooms | af | af_afribooms | download | 0.2.0 | Creative Commons License | |
Ancient Greek | Perseus | grc | grc_perseus | download | 0.2.0 | Creative Commons License | |
Ancient Greek | PROIEL | grc | grc_proiel | download | 0.2.0 | Creative Commons License | |
Arabic | PADT | ar | ar_padt | download | 0.2.0 | Creative Commons License | |
Armenian | ArmTDP | hy | hy_armtdp | download | 0.2.0 | Creative Commons License | |
Basque | BDT | eu | eu_bdt | download | 0.2.0 | Creative Commons License | |
Bulgarian | BTB | bg | bg_btb | download | 0.2.0 | Creative Commons License | |
Buryat | BDT | bxr | bxr_bdt | download | 0.2.0 | Creative Commons License | |
Catalan | AnCora | ca | ca_ancora | download | 0.2.0 | GNU License | |
Chinese (traditional) | GSD | zh | zh_gsd | download | 0.2.0 | Creative Commons License | |
Croatian | SET | hr | hr_set | download | 0.2.0 | Creative Commons License | |
Czech | CAC | cs | cs_cac | download | 0.2.0 | Creative Commons License | |
Czech | FicTree | cs | cs_fictree | download | 0.2.0 | Creative Commons License | |
Czech | PDT | cs | cs_pdt | download | 0.2.0 | Creative Commons License | |
Danish | DDT | da | da_ddt | download | 0.2.0 | Creative Commons License | |
Dutch | Alpino | nl | nl_alpino | download | 0.2.0 | Creative Commons License | |
Dutch | LassySmall | nl | nl_lassysmall | download | 0.2.0 | Creative Commons License | |
English | EWT | en | en_ewt | download | 0.2.0 | Creative Commons License | |
English | GUM | en | en_gum | download | 0.2.0 | Creative Commons License | |
English | LinES | en | en_lines | download | 0.2.0 | Creative Commons License | |
Estonian | EDT | et | et_edt | download | 0.2.0 | Creative Commons License | |
Finnish | FTB | fi | fi_ftb | download | 0.2.0 | GNU License | |
Finnish | TDT | fi | fi_tdt | download | 0.2.0 | Creative Commons License | |
French | GSD | fr | fr_gsd | download | 0.2.0 | Creative Commons License | |
French | Sequoia | fr | fr_sequoia | download | 0.2.0 | LGPLLR | |
French | Spoken | fr | fr_spoken | download | 0.2.0 | Creative Commons License | |
Galician | CTG | gl | gl_ctg | download | 0.2.0 | Creative Commons License | |
Galician | TreeGal | gl | gl_treegal | download | 0.2.0 | LGPLLR | |
German | GSD | de | de_gsd | download | 0.2.0 | Creative Commons License | |
Gothic | PROIEL | got | got_proiel | download | 0.2.0 | Creative Commons License | |
Greek | GDT | el | el_gdt | download | 0.2.0 | Creative Commons License | |
Hebrew | HTB | he | he_htb | download | 0.2.0 | Creative Commons License | |
Hindi | HDTB | hi | hi_hdtb | download | 0.2.0 | Creative Commons License | |
Hungarian | Szeged | hu | hu_szeged | download | 0.2.0 | Creative Commons License | |
Indonesian | GSD | id | id_gsd | download | 0.2.0 | Creative Commons License | |
Irish | IDT | ga | ga_idt | download | 0.2.0 | Creative Commons License | |
Italian | ISDT | it | it_isdt | download | 0.2.0 | Creative Commons License | |
Italian | PoSTWITA | it | it_postwita | download | 0.2.0 | Creative Commons License | |
Japanese | GSD | ja | ja_gsd | download | 0.2.0 | Creative Commons License | |
Kazakh | KTB | kk | kk_ktb | download | 0.2.0 | Creative Commons License | |
Korean | GSD | ko | ko_gsd | download | 0.2.0 | Creative Commons License | |
Korean | Kaist | ko | ko_kaist | download | 0.2.0 | Creative Commons License | |
Kurmanji | MG | kmr | kmr_mg | download | 0.2.0 | Creative Commons License | |
Latin | ITTB | la | la_ittb | download | 0.2.0 | Creative Commons License | |
Latin | Perseus | la | la_perseus | download | 0.2.0 | Creative Commons License | |
Latin | PROIEL | la | la_proiel | download | 0.2.0 | Creative Commons License | |
Latvian | LVTB | lv | lv_lvtb | download | 0.2.0 | Creative Commons License | |
North Sami | Giella | sme | sme_giella | download | 0.2.0 | Creative Commons License | |
Norwegian | Bokmaal | no_bokmaal | no_bokmaal | download | 0.2.0 | Creative Commons License | |
Norwegian | Nynorsk | no_nynorsk | no_nynorsk | download | 0.2.0 | Creative Commons License | |
Norwegian | NynorskLIA | no_nynorsk | no_nynorsklia | download | 0.2.0 | Creative Commons License | |
Old Church Slavonic | PROIEL | cu | cu_proiel | download | 0.2.0 | Creative Commons License | |
Old French | SRCMF | fro | fro_srcmf | download | 0.2.0 | Creative Commons License | |
Persian | Seraji | fa | fa_seraji | download | 0.2.0 | Creative Commons License | |
Polish | LFG | pl | pl_lfg | download | 0.2.0 | GNU License | |
Polish | SZ | pl | pl_sz | download | 0.2.0 | GNU License | |
Portuguese | Bosque | pt | pt_bosque | download | 0.2.0 | Creative Commons License | |
Romanian | RRT | ro | ro_rrt | download | 0.2.0 | Creative Commons License | |
Russian | SynTagRus | ru | ru_syntagrus | download | 0.2.0 | Creative Commons License | |
Russian | Taiga | ru | ru_taiga | download | 0.2.0 | Creative Commons License | |
Serbian | SET | sr | sr_set | download | 0.2.0 | Creative Commons License | |
Slovak | SNK | sk | sk_snk | download | 0.2.0 | Creative Commons License | |
Slovenian | SSJ | sl | sl_ssj | download | 0.2.0 | Creative Commons License | |
Slovenian | SST | sl | sl_sst | download | 0.2.0 | Creative Commons License | |
Spanish | AnCora | es | es_ancora | download | 0.2.0 | GNU License | |
Swedish | LinES | sv | sv_lines | download | 0.2.0 | Creative Commons License | |
Swedish | Talbanken | sv | sv_talbanken | download | 0.2.0 | Creative Commons License | |
Turkish | IMST | tr | tr_imst | download | 0.2.0 | Creative Commons License | |
Ukrainian | IU | uk | uk_iu | download | 0.2.0 | Creative Commons License | |
Upper Sorbian | UFAL | hsb | hsb_ufal | download | 0.2.0 | Creative Commons License | |
Urdu | UDTB | ur | ur_udtb | download | 0.2.0 | Creative Commons License | |
Uyghur | UDT | ug | ug_udt | download | 0.2.0 | Creative Commons License | |
Vietnamese | VTB | vi | vi_vtb | download | 0.2.0 | Creative Commons License | |
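
When working with these codes programmatically, a lookup table is safer than string splitting, because some language codes themselves contain an underscore (for example `no_nynorsk`). The sketch below copies a handful of rows from the table above into a dictionary; `TREEBANKS` and `language_of` are hypothetical names, and the excerpt is deliberately incomplete.

```python
# Excerpt of the table above: language code -> treebank codes.
TREEBANKS = {
    "cs": ["cs_cac", "cs_fictree", "cs_pdt"],
    "en": ["en_ewt", "en_gum", "en_lines"],
    "it": ["it_isdt", "it_postwita"],
    "no_nynorsk": ["no_nynorsk", "no_nynorsklia"],
}


def language_of(treebank_code):
    """Return the language code that a treebank code belongs to."""
    for lang, codes in TREEBANKS.items():
        if treebank_code in codes:
            return lang
    raise KeyError(f"unknown treebank code: {treebank_code!r}")
```

Splitting `no_nynorsklia` at the first underscore would give `no`, which is not a language code in the table, whereas the lookup returns `no_nynorsk`.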

Models History

Models from earlier releases can be downloaded using the version argument. Note that not every release has a distinct model set.

>>> import stanfordnlp
>>> stanfordnlp.download('ar', version='0.1.0')  
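
To pin several language packs to the same release, the call above can be wrapped in a loop. This is a minimal sketch: `download_packs` is a hypothetical helper, and the `downloader` parameter exists only so the loop can be exercised without network access. By default it falls back to `stanfordnlp.download`, using the `version` argument shown above.

```python
def download_packs(codes, version, downloader=None):
    """Download every pack in `codes` at one pinned models version.

    Returns the (code, version) pairs that were requested, in order.
    """
    if downloader is None:
        import stanfordnlp  # lazy import: only needed for real downloads

        downloader = stanfordnlp.download
    requested = []
    for code in codes:
        downloader(code, version=version)
        requested.append((code, version))
    return requested
```

For example, `download_packs(["ar", "fr_gsd"], "0.1.0")` would request the 0.1.0 models for Arabic and for the French GSD treebank.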

Models from earlier releases can also be found in the table below.

Language | Treebank | Language code | Treebank code | Models
Afrikaans | AfriBooms | af | af_afribooms | 0.1.0
Ancient Greek | Perseus | grc | grc_perseus | 0.1.0
Ancient Greek | PROIEL | grc | grc_proiel | 0.1.0
Arabic | PADT | ar | ar_padt | 0.1.0
Armenian | ArmTDP | hy | hy_armtdp | 0.1.0
Basque | BDT | eu | eu_bdt | 0.1.0
Bulgarian | BTB | bg | bg_btb | 0.1.0
Buryat | BDT | bxr | bxr_bdt | 0.1.0
Catalan | AnCora | ca | ca_ancora | 0.1.0
Chinese (traditional) | GSD | zh | zh_gsd | 0.1.0
Croatian | SET | hr | hr_set | 0.1.0
Czech | CAC | cs | cs_cac | 0.1.0
Czech | FicTree | cs | cs_fictree | 0.1.0
Czech | PDT | cs | cs_pdt | 0.1.0
Danish | DDT | da | da_ddt | 0.1.0
Dutch | Alpino | nl | nl_alpino | 0.1.0
Dutch | LassySmall | nl | nl_lassysmall | 0.1.0
English | EWT | en | en_ewt | 0.1.0
English | GUM | en | en_gum | 0.1.0
English | LinES | en | en_lines | 0.1.0
Estonian | EDT | et | et_edt | 0.1.0
Finnish | FTB | fi | fi_ftb | 0.1.0
Finnish | TDT | fi | fi_tdt | 0.1.0
French | GSD | fr | fr_gsd | 0.1.0
French | Sequoia | fr | fr_sequoia | 0.1.0
French | Spoken | fr | fr_spoken | 0.1.0
Galician | CTG | gl | gl_ctg | 0.1.0
Galician | TreeGal | gl | gl_treegal | 0.1.0
German | GSD | de | de_gsd | 0.1.0
Gothic | PROIEL | got | got_proiel | 0.1.0
Greek | GDT | el | el_gdt | 0.1.0
Hebrew | HTB | he | he_htb | 0.1.0
Hindi | HDTB | hi | hi_hdtb | 0.1.0
Hungarian | Szeged | hu | hu_szeged | 0.1.0
Indonesian | GSD | id | id_gsd | 0.1.0
Irish | IDT | ga | ga_idt | 0.1.0
Italian | ISDT | it | it_isdt | 0.1.0
Italian | PoSTWITA | it | it_postwita | 0.1.0
Japanese | GSD | ja | ja_gsd | 0.1.0
Kazakh | KTB | kk | kk_ktb | 0.1.0
Korean | GSD | ko | ko_gsd | 0.1.0
Korean | Kaist | ko | ko_kaist | 0.1.0
Kurmanji | MG | kmr | kmr_mg | 0.1.0
Latin | ITTB | la | la_ittb | 0.1.0
Latin | Perseus | la | la_perseus | 0.1.0
Latin | PROIEL | la | la_proiel | 0.1.0
Latvian | LVTB | lv | lv_lvtb | 0.1.0
North Sami | Giella | sme | sme_giella | 0.1.0
Norwegian | Bokmaal | no_bokmaal | no_bokmaal | 0.1.0
Norwegian | Nynorsk | no_nynorsk | no_nynorsk | 0.1.0
Norwegian | NynorskLIA | no_nynorsk | no_nynorsklia | 0.1.0
Old Church Slavonic | PROIEL | cu | cu_proiel | 0.1.0
Old French | SRCMF | fro | fro_srcmf | 0.1.0
Persian | Seraji | fa | fa_seraji | 0.1.0
Polish | LFG | pl | pl_lfg | 0.1.0
Polish | SZ | pl | pl_sz | 0.1.0
Portuguese | Bosque | pt | pt_bosque | 0.1.0
Romanian | RRT | ro | ro_rrt | 0.1.0
Russian | SynTagRus | ru | ru_syntagrus | 0.1.0
Russian | Taiga | ru | ru_taiga | 0.1.0
Serbian | SET | sr | sr_set | 0.1.0
Slovak | SNK | sk | sk_snk | 0.1.0
Slovenian | SSJ | sl | sl_ssj | 0.1.0
Slovenian | SST | sl | sl_sst | 0.1.0
Spanish | AnCora | es | es_ancora | 0.1.0
Swedish | LinES | sv | sv_lines | 0.1.0
Swedish | Talbanken | sv | sv_talbanken | 0.1.0
Turkish | IMST | tr | tr_imst | 0.1.0
Ukrainian | IU | uk | uk_iu | 0.1.0
Upper Sorbian | UFAL | hsb | hsb_ufal | 0.1.0
Urdu | UDTB | ur | ur_udtb | 0.1.0
Uyghur | UDT | ug | ug_udt | 0.1.0
Vietnamese | VTB | vi | vi_vtb | 0.1.0