Getting started

Installation

To get started with StanfordNLP, we strongly recommend that you install it through PyPI. Once you have pip installed, run the following on the command line:

pip install stanfordnlp

This will take care of all of the dependencies necessary to run StanfordNLP. The neural pipeline of StanfordNLP depends on PyTorch 1.0.0 or a later version with compatible APIs.

Quick Example

To try out StanfordNLP, you can simply follow these steps in the interactive Python interpreter:

>>> import stanfordnlp
>>> stanfordnlp.download('en')   # This downloads the English models for the neural pipeline
>>> nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()

The last command prints each word in the first sentence of the input string (or Document, as it is represented in StanfordNLP), together with the index of the word that governs it in the Universal Dependencies parse of that sentence (its “head”) and the dependency relation between the two words. Head indices are 1-based, and an index of 0 marks the root of the sentence. The output should look like:

('Barack', '4', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '4', 'aux:pass')
('born', '0', 'root')
('in', '6', 'case')
('Hawaii', '4', 'obl')
('.', '4', 'punct')
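
To make the head indices easier to read, the sketch below replaces each index with the word it points to. It works on plain tuples copied from the output above rather than on StanfordNLP objects; the helper name resolve_heads and the "ROOT" label are illustrative choices, not part of the StanfordNLP API.

```python
# Tuples copied from the example output above: (word, head index, relation).
deps = [
    ("Barack", "4", "nsubj:pass"),
    ("Obama", "1", "flat"),
    ("was", "4", "aux:pass"),
    ("born", "0", "root"),
    ("in", "6", "case"),
    ("Hawaii", "4", "obl"),
    (".", "4", "punct"),
]


def resolve_heads(deps):
    """Replace each 1-based head index with the head word (hypothetical helper).

    An index of 0 means the word is the root, so it gets the label "ROOT".
    """
    words = [word for word, _, _ in deps]
    resolved = []
    for word, head, relation in deps:
        head_index = int(head)
        head_word = "ROOT" if head_index == 0 else words[head_index - 1]
        resolved.append((word, head_word, relation))
    return resolved


for word, head_word, relation in resolve_heads(deps):
    print(f"{word} --{relation}--> {head_word}")
```

Read this way, "Barack" attaches to "born" as its passive subject, and "in" attaches to "Hawaii" as a case marker.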

To build a pipeline for another language, pass the language code to the constructor, for example stanfordnlp.Pipeline(lang="fr"). For a full list of languages (and their corresponding language codes) supported by StanfordNLP, please see this section.

We also provide a demo script in our GitHub repository that demonstrates how to use StanfordNLP in languages other than English, for example Chinese (traditional):

python demo/pipeline_demo.py -l zh

You should see output like the following:

---
tokens of first sentence:
達沃斯	達沃斯	PROPN
世界	世界	NOUN
經濟	經濟	NOUN
論壇	論壇	NOUN
是	是	AUX
每年	每年	DET
全球	全球	NOUN
政	政	PART
商界	商界	NOUN
領袖	領袖	NOUN
聚	聚	VERB
在	在	VERB
一起	一起	NOUN
的	的	PART
年度	年度	NOUN
盛事	盛事	NOUN
。	。	PUNCT

---
dependency parse of first sentence:
('達沃斯', '4', 'nmod')
('世界', '4', 'nmod')
('經濟', '4', 'nmod')
('論壇', '16', 'nsubj')
('是', '16', 'cop')
('每年', '10', 'nmod')
('全球', '10', 'nmod')
('政', '9', 'case:pref')
('商界', '10', 'nmod')
('領袖', '11', 'nsubj')
('聚', '16', 'acl:relcl')
('在', '11', 'mark')
('一起', '11', 'obj')
('的', '11', 'mark:relcl')
('年度', '16', 'nmod')
('盛事', '0', 'root')
('。', '16', 'punct')

Troubleshooting

  • Why do I keep getting a SyntaxError: invalid syntax error message while trying to import stanfordnlp?

    StanfordNLP does not work with Python 3.5 or below. If you have trouble importing the package, try upgrading to Python 3.6 or later.

  • Why am I getting an OSError: [Errno 22] Invalid argument error and therefore a Vector file is not provided exception while the model is being loaded?

    If you are getting this error, it is very likely that you are running macOS with Python <= 3.6.7 (in the 3.6 series) or <= 3.7.1 (in the 3.7 series). If this is the case, you are affected by a known Python bug on macOS, and upgrading your Python to >= 3.6.8 or >= 3.7.2 should solve the issue. If you are not running macOS, or you already have one of the specified Python versions and are still seeing this issue, please report it to us via the GitHub issue tracker.
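
The version ranges above can be checked programmatically. The following is a minimal sketch using only the standard library; the function name affected_by_macos_load_bug is hypothetical, and the check encodes nothing beyond the version ranges quoted in the answer above.

```python
import platform
import sys


def affected_by_macos_load_bug(version=None, system=None):
    """Return True if the interpreter falls in the affected ranges listed above.

    Hypothetical helper: `version` is a (major, minor, micro) tuple and
    `system` is a platform name; both default to the running interpreter.
    """
    version = tuple(version or sys.version_info[:3])
    system = system or platform.system()
    if system != "Darwin":  # platform.system() reports "Darwin" on macOS
        return False
    if version[:2] == (3, 6):
        return version <= (3, 6, 7)
    if version[:2] == (3, 7):
        return version <= (3, 7, 1)
    return False


if affected_by_macos_load_bug():
    print("Upgrade Python to >= 3.6.8 or >= 3.7.2 before loading models.")
```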

Models for Human Languages

Downloading and Using Models

Downloading models for the human languages you are interested in, for use in the StanfordNLP pipeline, is as simple as

>>> import stanfordnlp
>>> stanfordnlp.download('ar')    # replace "ar" with the language code or treebank code you need, see below

The language code or treebank code can be looked up in the next section. If only the language code is specified, we will download the default models for that language (indicated in the table), which are the models trained on the largest treebank available in that language. If you need the models for a specific treebank, you can download them with the corresponding treebank code instead.

To use the default model for any language, build the pipeline as follows:

>>> nlp = stanfordnlp.Pipeline(lang="es")    # replace "es" with the language of interest

If you are using a non-default treebank for the language, make sure to also specify the treebank code, for example

>>> nlp = stanfordnlp.Pipeline(lang="it", treebank="it_postwita")
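
Since a non-default treebank needs both arguments, it can be convenient to derive the language code from the treebank code. The sketch below is one way to do that under a stated assumption: the language code is the part before the first underscore, except for the two Norwegian language codes in the table below, which themselves contain an underscore. The helper pipeline_kwargs and the constant MULTI_PART_LANG_CODES are illustrative names, not part of StanfordNLP.

```python
# Language codes from the table below that themselves contain an underscore.
MULTI_PART_LANG_CODES = ("no_bokmaal", "no_nynorsk")


def pipeline_kwargs(treebank_code):
    """Build lang/treebank keyword arguments from a treebank code (hypothetical helper)."""
    for lang in MULTI_PART_LANG_CODES:
        if treebank_code.startswith(lang):
            return {"lang": lang, "treebank": treebank_code}
    # Otherwise assume the language code is everything before the first underscore.
    lang, _, _ = treebank_code.partition("_")
    return {"lang": lang, "treebank": treebank_code}


print(pipeline_kwargs("it_postwita"))
# With the models downloaded, the result could be passed on as:
#   nlp = stanfordnlp.Pipeline(**pipeline_kwargs("it_postwita"))
```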

Human Languages Supported by StanfordNLP

Below is a list of all of the (human) languages supported by StanfordNLP (through its neural pipeline). The performance of these systems on the CoNLL 2018 Shared Task official test set (in our unofficial evaluation) can be found here.

Note

  1. Some models are flagged in the Notes column as having a very low unlabeled attachment score (UAS) when evaluated end-to-end (from tokenization all the way to dependency parsing). Specifically, their UAS is lower than 50% on the CoNLL 2018 Shared Task test set. Any use of these models for serious syntactic analysis is strongly discouraged.
  2. A second flag in the Notes column marks models that are at least 1% absolute UAS worse than the full neural pipeline presented in our paper (which uses the TensorFlow counterparts for the tagger and the parser), so these models should be used with some caution.

Language               Treebank    Language code  Treebank code  Models    Version  Treebank License          Treebank Doc  Notes
Afrikaans              AfriBooms   af             af_afribooms   download  0.1.0    Creative Commons License
Ancient Greek          Perseus     grc            grc_perseus    download  0.1.0    Creative Commons License
                       PROIEL      grc            grc_proiel     download  0.1.0    Creative Commons License
Arabic                 PADT        ar             ar_padt        download  0.1.0    Creative Commons License
Armenian               ArmTDP      hy             hy_armtdp      download  0.1.0    Creative Commons License
Basque                 BDT         eu             eu_bdt         download  0.1.0    Creative Commons License
Bulgarian              BTB         bg             bg_btb         download  0.1.0    Creative Commons License
Buryat                 BDT         bxr            bxr_bdt        download  0.1.0    Creative Commons License
Catalan                AnCora      ca             ca_ancora      download  0.1.0    GNU License
Chinese (traditional)  GSD         zh             zh_gsd         download  0.1.0    Creative Commons License
Croatian               SET         hr             hr_set         download  0.1.0    Creative Commons License
Czech                  CAC         cs             cs_cac         download  0.1.0    Creative Commons License
                       FicTree     cs             cs_fictree     download  0.1.0    Creative Commons License
                       PDT         cs             cs_pdt         download  0.1.0    Creative Commons License
Danish                 DDT         da             da_ddt         download  0.1.0    Creative Commons License
Dutch                  Alpino      nl             nl_alpino      download  0.1.0    Creative Commons License
                       LassySmall  nl             nl_lassysmall  download  0.1.0    Creative Commons License
English                EWT         en             en_ewt         download  0.1.0    Creative Commons License
                       GUM         en             en_gum         download  0.1.0    Creative Commons License
                       LinES       en             en_lines       download  0.1.0    Creative Commons License
Estonian               EDT         et             et_edt         download  0.1.0    Creative Commons License
Finnish                FTB         fi             fi_ftb         download  0.1.0    GNU License
                       TDT         fi             fi_tdt         download  0.1.0    Creative Commons License
French                 GSD         fr             fr_gsd         download  0.1.0    Creative Commons License
                       Sequoia     fr             fr_sequoia     download  0.1.0    LGPLLR
                       Spoken      fr             fr_spoken      download  0.1.0    Creative Commons License
Galician               CTG         gl             gl_ctg         download  0.1.0    Creative Commons License
                       TreeGal     gl             gl_treegal     download  0.1.0    LGPLLR
German                 GSD         de             de_gsd         download  0.1.0    Creative Commons License
Gothic                 PROIEL      got            got_proiel     download  0.1.0    Creative Commons License
Greek                  GDT         el             el_gdt         download  0.1.0    Creative Commons License
Hebrew                 HTB         he             he_htb         download  0.1.0    Creative Commons License
Hindi                  HDTB        hi             hi_hdtb        download  0.1.0    Creative Commons License
Hungarian              Szeged      hu             hu_szeged      download  0.1.0    Creative Commons License
Indonesian             GSD         id             id_gsd         download  0.1.0    Creative Commons License
Irish                  IDT         ga             ga_idt         download  0.1.0    Creative Commons License
Italian                ISDT        it             it_isdt        download  0.1.0    Creative Commons License
                       PoSTWITA    it             it_postwita    download  0.1.0    Creative Commons License
Japanese               GSD         ja             ja_gsd         download  0.1.0    Creative Commons License
Kazakh                 KTB         kk             kk_ktb         download  0.1.0    Creative Commons License
Korean                 GSD         ko             ko_gsd         download  0.1.0    Creative Commons License
                       Kaist       ko             ko_kaist       download  0.1.0    Creative Commons License
Kurmanji               MG          kmr            kmr_mg         download  0.1.0    Creative Commons License
Latin                  ITTB        la             la_ittb        download  0.1.0    Creative Commons License
                       Perseus     la             la_perseus     download  0.1.0    Creative Commons License
                       PROIEL      la             la_proiel      download  0.1.0    Creative Commons License
Latvian                LVTB        lv             lv_lvtb        download  0.1.0    Creative Commons License
North Sami             Giella      sme            sme_giella     download  0.1.0    Creative Commons License
Norwegian              Bokmaal     no_bokmaal     no_bokmaal     download  0.1.0    Creative Commons License
                       Nynorsk     no_nynorsk     no_nynorsk     download  0.1.0    Creative Commons License
                       NynorskLIA  no_nynorsk     no_nynorsklia  download  0.1.0    Creative Commons License
Old Church Slavonic    PROIEL      cu             cu_proiel      download  0.1.0    Creative Commons License
Old French             SRCMF       fro            fro_srcmf      download  0.1.0    Creative Commons License
Persian                Seraji      fa             fa_seraji      download  0.1.0    Creative Commons License
Polish                 LFG         pl             pl_lfg         download  0.1.0    GNU License
                       SZ          pl             pl_sz          download  0.1.0    GNU License
Portuguese             Bosque      pt             pt_bosque      download  0.1.0    Creative Commons License
Romanian               RRT         ro             ro_rrt         download  0.1.0    Creative Commons License
Russian                SynTagRus   ru             ru_syntagrus   download  0.1.0    Creative Commons License
                       Taiga       ru             ru_taiga       download  0.1.0    Creative Commons License
Serbian                SET         sr             sr_set         download  0.1.0    Creative Commons License
Slovak                 SNK         sk             sk_snk         download  0.1.0    Creative Commons License
Slovenian              SSJ         sl             sl_ssj         download  0.1.0    Creative Commons License
                       SST         sl             sl_sst         download  0.1.0    Creative Commons License
Spanish                AnCora      es             es_ancora      download  0.1.0    GNU License
Swedish                LinES       sv             sv_lines       download  0.1.0    Creative Commons License
                       Talbanken   sv             sv_talbanken   download  0.1.0    Creative Commons License
Turkish                IMST        tr             tr_imst        download  0.1.0    Creative Commons License
Ukrainian              IU          uk             uk_iu          download  0.1.0    Creative Commons License
Upper Sorbian          UFAL        hsb            hsb_ufal       download  0.1.0    Creative Commons License
Urdu                   UDTB        ur             ur_udtb        download  0.1.0    Creative Commons License
Uyghur                 UDT         ug             ug_udt         download  0.1.0    Creative Commons License
Vietnamese             VTB         vi             vi_vtb         download  0.1.0    Creative Commons License