Available Models & Languages

Stanza provides pretrained NLP models for a total of 80 human languages. On this page we provide detailed information on these models.

Pretrained models in Stanza can be divided into two categories, based on the datasets they were trained on:

  1. Universal Dependencies (UD) models, which are trained on the UD treebanks and cover tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, and dependency parsing;
  2. NER models, which support named entity tagging for 8 languages, and are trained on various NER datasets.

Available UD Models

Tokenization, MWT expansion (where applicable), POS tagging, lemmatization, and dependency parsing are provided by models trained on data from Universal Dependencies v2.12. Results for these models are on the performance page. There are also past results from previous versions of the models.
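As a minimal sketch of how the UD processors listed above might be combined into one pipeline, the following uses Stanza's `stanza.download` and `stanza.Pipeline` calls. The helper names (`ud_processors`, `build_ud_pipeline`, `print_ud_analysis`) and the `has_mwt` flag are illustrative assumptions, not part of Stanza.

```python
# Processor names for the UD models, in the order the pipeline runs them.
UD_PROCESSORS = ("tokenize", "mwt", "pos", "lemma", "depparse")


def ud_processors(has_mwt: bool = True) -> str:
    """Return the comma-separated processor list, dropping MWT if not applicable."""
    return ",".join(p for p in UD_PROCESSORS if has_mwt or p != "mwt")


def build_ud_pipeline(lang: str, has_mwt: bool = True):
    """Download the default models for `lang` and build a UD pipeline."""
    import stanza  # imported here so ud_processors() works without Stanza installed

    stanza.download(lang)
    return stanza.Pipeline(lang, processors=ud_processors(has_mwt))


def print_ud_analysis(lang: str, text: str, has_mwt: bool = True) -> None:
    """Print one line per word: text, lemma, UPOS, features, head index, relation."""
    nlp = build_ud_pipeline(lang, has_mwt)
    for sentence in nlp(text).sentences:
        for word in sentence.words:
            print(word.text, word.lemma, word.upos, word.feats, word.head, word.deprel)
```

Calling `print_ud_analysis("fr", "Nous avons atteint la fin du sentier.")` would download the French models on first use. For a language whose treebank has no multi-word tokens, pass `has_mwt=False`.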

Other Available Models for Tokenization

Myanmar

The Asian Language Treebank Project has a Myanmar dataset. We used this dataset to build a tokenizer for Myanmar.

Stanza includes a script which converts the trees and the officially proposed train/dev/test split to an ersatz UD dataset:

https://github.com/stanfordnlp/stanza/blob/v1.5.1/stanza/utils/datasets/tokenization/convert_my_alt.py

Sindhi

The NLP team at ISRA graciously provided us with several passages of tokenized Sindhi text. We used this to add a tokenizer for Sindhi to Stanza. This is particularly useful in that it allows us to incorporate a Sindhi NER model.

With permission, we are currently hosting the Sindhi tokenization data on StanfordNLP’s GitHub.

Thai

We have trained a couple of Thai tokenizer models based on publicly available datasets. According to the authors of pythainlp, the Inter-BEST dataset had some strange sentence tokenization, so we used their software to resegment the sentences before training. Because the resegmented sentences are a questionable standard to evaluate against, we made the Orchid tokenizer the default.

| Dataset | Token Accuracy | Sentence Accuracy | Notes |
|---------|----------------|-------------------|-------|
| Orchid  | 87.98          | 70.99             |       |
| BEST    | 95.73          | 77.93             | Sentences are re-split using pythainlp |
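Since Orchid is the default, using the BEST tokenizer means requesting it by name. The sketch below shows one way this might look, passing a processor-to-package dictionary to `stanza.download` and `stanza.Pipeline`. The package names `"orchid"` and `"best"` are assumptions based on the dataset names in the table above, and the helper functions are hypothetical.

```python
def thai_tokenizer_processors(package: str = "orchid") -> dict:
    """Map the tokenize processor to a named model package (name is an assumption)."""
    return {"tokenize": package}


def build_thai_tokenizer(package: str = "orchid"):
    """Download and load only the requested Thai tokenizer package."""
    import stanza  # imported lazily so the pure helpers work without Stanza

    processors = thai_tokenizer_processors(package)
    # package=None asks Stanza to load only the processors named in the dict.
    stanza.download("th", package=None, processors=processors)
    return stanza.Pipeline("th", package=None, processors=processors)


def tokenize(nlp, text: str) -> list:
    """Return one list of token strings per sentence found in `text`."""
    return [[token.text for token in sentence.tokens] for sentence in nlp(text).sentences]
```

With this in place, `tokenize(build_thai_tokenizer("best"), text)` would segment `text` with the BEST model, while omitting the argument keeps the Orchid default.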

Available NER Models

A description of the models available for the NER tool, along with their performance on test datasets, can be found here.
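For orientation, a minimal sketch of running the NER tool: a pipeline built with the `tokenize` and `ner` processors exposes recognized entities through `doc.ents`, each with a `text` and a `type`. The helper names here are hypothetical, and English is chosen only as an example language.

```python
def build_ner_pipeline(lang: str = "en"):
    """Download and load a tokenizer plus the default NER model for `lang`."""
    import stanza  # imported lazily so collect_entities() works without Stanza

    stanza.download(lang, processors="tokenize,ner")
    return stanza.Pipeline(lang, processors="tokenize,ner")


def collect_entities(doc) -> list:
    """Return (entity text, entity type) pairs from a processed document."""
    return [(ent.text, ent.type) for ent in doc.ents]
```

A call such as `collect_entities(build_ner_pipeline()("Chris Manning teaches at Stanford University."))` would return the entity spans the model finds. The labels depend on the dataset the model was trained on.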

Available Sentiment Models

A description of the sentiment tool and the models available for that tool can be found here.
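As a rough sketch of how the sentiment tool is used: a pipeline with the `tokenize` and `sentiment` processors attaches an integer `sentiment` to each sentence. The mapping of 0, 1, 2 to negative, neutral, positive below follows the three-class convention of the existing models and should be treated as an assumption for any other model. The helper names are hypothetical.

```python
# Assumed three-class label scheme: 0 = negative, 1 = neutral, 2 = positive.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def build_sentiment_pipeline(lang: str = "en"):
    """Download and load a tokenizer plus the default sentiment model for `lang`."""
    import stanza  # imported lazily so the helper below works without Stanza

    stanza.download(lang, processors="tokenize,sentiment")
    return stanza.Pipeline(lang, processors="tokenize,sentiment")


def sentence_sentiments(doc) -> list:
    """Return (sentence text, label) pairs for every sentence in the document."""
    return [(s.text, SENTIMENT_LABELS[s.sentiment]) for s in doc.sentences]
```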

Available Constituency Parser Models

A description of the constituency parser and the models available for that tool can be found here.

Training New Models

To train new models, please see the documents on training and adding a new language.