Models for Human Languages
Downloading and Using Models
Downloading a language pack (a set of machine learning models for a human language that you wish to use in the StanfordNLP pipeline) is as simple as
>>> import stanfordnlp
>>> stanfordnlp.download('ar') # replace "ar" with the language or treebank code you need, see below
The language code or treebank code can be looked up in the next section. If only the language code is specified, we will download the default models for that language. If you are seeking the language pack built from a specific treebank, you can download the corresponding models with the appropriate treebank code. By default, language packs are stored in a stanfordnlp_resources
folder inside your home directory.
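For example, to get the Ancient Greek models trained specifically on the PROIEL treebank, pass the treebank code instead of the language code (treebank codes are listed in the table further down this page):
>>> import stanfordnlp
>>> stanfordnlp.download('grc_proiel') # downloads the Ancient Greek language pack built from the PROIEL treebank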
To use the default language pack for any language, simply build the pipeline as follows:
>>> nlp = stanfordnlp.Pipeline(lang="es") # replace "es" with the language of interest
If you are using a non-default treebank for the language, make sure to also specify the treebank code, for example:
>>> nlp = stanfordnlp.Pipeline(lang="it", treebank="it_postwita")
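Putting downloading and pipeline construction together, a minimal end-to-end run looks roughly like this (the sample sentence is illustrative; print_dependencies() prints the dependency parse of a sentence in the annotated document):
>>> import stanfordnlp
>>> stanfordnlp.download('en')                     # download the default English language pack
>>> nlp = stanfordnlp.Pipeline(lang="en")          # build the pipeline from the default pack
>>> doc = nlp("Barack Obama was born in Hawaii.")  # run the neural pipeline on some text
>>> doc.sentences[0].print_dependencies()          # inspect the dependency parse of the first sentence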
Human Languages Supported by StanfordNLP
Below is a list of all the (human) languages supported by StanfordNLP (through this Python neural pipeline). All language packs are trained on data from, and annotated according to, Universal Dependencies v2. You can find more information about the POS tags, morphological features, and syntactic relations used on the Universal Dependencies website.
The performance of these systems on the CoNLL 2018 Shared Task official test set (in our unofficial evaluation) can be found here.
Notes
- marks models which have very low unlabeled attachment score (UAS) when evaluated end-to-end (from tokenization all the way to dependency parsing). Specifically, their UAS is lower than 50% on the CoNLL 2018 Shared Task test set. Users should be very cautious in using the output of these models for serious syntactic analysis.
- marks models that are at least 1% absolute UAS worse than the full neural pipeline presented in our paper (which uses the Tensorflow counterparts of the tagger and the parser). This may be a concern for users running parser comparison experiments, but in general these models are fine to use for syntactic analysis.
- marks the default language pack for a language, which is the language pack trained on the largest treebank available for that language.
- The copyright and licensing status of machine learning models is not very clear (to us). We list in the table below the Treebank License of the underlying data from which each language pack (set of machine learning models for a treebank) was trained. To the extent that The Trustees of Leland Stanford Junior University have ownership and rights over these language packs, all these StanfordNLP language packs are made available under the Open Data Commons Attribution License v1.0.
Models History
Models from earlier releases can be downloaded using the version argument. Note that not every release
has a distinct model set.
>>> import stanfordnlp
>>> stanfordnlp.download('ar', version='0.1.0')
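If you want to keep models from an earlier release separate from your current ones, one option is to download them into a dedicated folder and point the pipeline at it. The sketch below assumes the resource_dir argument of download() and the models_dir option of Pipeline, both of which otherwise default to the stanfordnlp_resources folder in your home directory:
>>> import stanfordnlp
>>> # resource_dir / models_dir are assumed keyword arguments for choosing the model folder
>>> stanfordnlp.download('ar', resource_dir='stanfordnlp_resources_0.1.0', version='0.1.0')
>>> nlp = stanfordnlp.Pipeline(lang='ar', models_dir='stanfordnlp_resources_0.1.0')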
Models from earlier releases can also be found in the table below.
| Language | Treebank | Language code | Treebank code | Models |
| --- | --- | --- | --- | --- |
| Afrikaans | AfriBooms | af | af_afribooms | 0.1.0 |
| Ancient Greek | Perseus | grc | grc_perseus | 0.1.0 |
| | PROIEL | grc | grc_proiel | 0.1.0 |
| Arabic | PADT | ar | ar_padt | 0.1.0 |
| Armenian | ArmTDP | hy | hy_armtdp | 0.1.0 |
| Basque | BDT | eu | eu_bdt | 0.1.0 |
| Bulgarian | BTB | bg | bg_btb | 0.1.0 |
| Buryat | BDT | bxr | bxr_bdt | 0.1.0 |
| Catalan | AnCora | ca | ca_ancora | 0.1.0 |
| Chinese (traditional) | GSD | zh | zh_gsd | 0.1.0 |
| Croatian | SET | hr | hr_set | 0.1.0 |
| Czech | CAC | cs | cs_cac | 0.1.0 |
| | FicTree | cs | cs_fictree | 0.1.0 |
| | PDT | cs | cs_pdt | 0.1.0 |
| Danish | DDT | da | da_ddt | 0.1.0 |
| Dutch | Alpino | nl | nl_alpino | 0.1.0 |
| | LassySmall | nl | nl_lassysmall | 0.1.0 |
| English | EWT | en | en_ewt | 0.1.0 |
| | GUM | en | en_gum | 0.1.0 |
| | LinES | en | en_lines | 0.1.0 |
| Estonian | EDT | et | et_edt | 0.1.0 |
| Finnish | FTB | fi | fi_ftb | 0.1.0 |
| | TDT | fi | fi_tdt | 0.1.0 |
| French | GSD | fr | fr_gsd | 0.1.0 |
| | Sequoia | fr | fr_sequoia | 0.1.0 |
| | Spoken | fr | fr_spoken | 0.1.0 |
| Galician | CTG | gl | gl_ctg | 0.1.0 |
| | TreeGal | gl | gl_treegal | 0.1.0 |
| German | GSD | de | de_gsd | 0.1.0 |
| Gothic | PROIEL | got | got_proiel | 0.1.0 |
| Greek | GDT | el | el_gdt | 0.1.0 |
| Hebrew | HTB | he | he_htb | 0.1.0 |
| Hindi | HDTB | hi | hi_hdtb | 0.1.0 |
| Hungarian | Szeged | hu | hu_szeged | 0.1.0 |
| Indonesian | GSD | id | id_gsd | 0.1.0 |
| Irish | IDT | ga | ga_idt | 0.1.0 |
| Italian | ISDT | it | it_isdt | 0.1.0 |
| | PoSTWITA | it | it_postwita | 0.1.0 |
| Japanese | GSD | ja | ja_gsd | 0.1.0 |
| Kazakh | KTB | kk | kk_ktb | 0.1.0 |
| Korean | GSD | ko | ko_gsd | 0.1.0 |
| | Kaist | ko | ko_kaist | 0.1.0 |
| Kurmanji | MG | kmr | kmr_mg | 0.1.0 |
| Latin | ITTB | la | la_ittb | 0.1.0 |
| | Perseus | la | la_perseus | 0.1.0 |
| | PROIEL | la | la_proiel | 0.1.0 |
| Latvian | LVTB | lv | lv_lvtb | 0.1.0 |
| North Sami | Giella | sme | sme_giella | 0.1.0 |
| Norwegian | Bokmaal | no_bokmaal | no_bokmaal | 0.1.0 |
| | Nynorsk | no_nynorsk | no_nynorsk | 0.1.0 |
| | NynorskLIA | no_nynorsk | no_nynorsklia | 0.1.0 |
| Old Church Slavonic | PROIEL | cu | cu_proiel | 0.1.0 |
| Old French | SRCMF | fro | fro_srcmf | 0.1.0 |
| Persian | Seraji | fa | fa_seraji | 0.1.0 |
| Polish | LFG | pl | pl_lfg | 0.1.0 |
| | SZ | pl | pl_sz | 0.1.0 |
| Portuguese | Bosque | pt | pt_bosque | 0.1.0 |
| Romanian | RRT | ro | ro_rrt | 0.1.0 |
| Russian | SynTagRus | ru | ru_syntagrus | 0.1.0 |
| | Taiga | ru | ru_taiga | 0.1.0 |
| Serbian | SET | sr | sr_set | 0.1.0 |
| Slovak | SNK | sk | sk_snk | 0.1.0 |
| Slovenian | SSJ | sl | sl_ssj | 0.1.0 |
| | SST | sl | sl_sst | 0.1.0 |
| Spanish | AnCora | es | es_ancora | 0.1.0 |
| Swedish | LinES | sv | sv_lines | 0.1.0 |
| | Talbanken | sv | sv_talbanken | 0.1.0 |
| Turkish | IMST | tr | tr_imst | 0.1.0 |
| Ukrainian | IU | uk | uk_iu | 0.1.0 |
| Upper Sorbian | UFAL | hsb | hsb_ufal | 0.1.0 |
| Urdu | UDTB | ur | ur_udtb | 0.1.0 |
| Uyghur | UDT | ug | ug_udt | 0.1.0 |
| Vietnamese | VTB | vi | vi_vtb | 0.1.0 |