Download Models


Stanza provides pretrained NLP models for a total of 70 human languages. This page explains how to download these models so that you can process text in a language of your choosing.

Automatic download

Pretrained models in Stanza can be divided into four categories, based on the datasets they were trained on:

  • Universal Dependencies (UD) models, which are trained on the UD treebanks and cover tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, and dependency parsing;
  • NER models, which support named entity tagging for 8 languages and are trained on various NER datasets;
  • Constituency models, each trained on a specific constituency parsing dataset;
  • Sentiment models, similarly trained on a dataset for that specific language and task.

Downloading Stanza models is as simple as calling the stanza.download() method. We provide detailed examples on how to use the download interface on the Getting Started page. Detailed descriptions of all available options (i.e., arguments) of the download method are listed below:

Option | Type | Default | Description
------ | ---- | ------- | -----------
lang | str | 'en' | Language code (e.g., "en") or language name (e.g., "English") for the language to process with the Pipeline. See the tables of available models below for a complete list of supported languages.
model_dir | str | '~/stanza_resources' | Directory for storing the models downloaded for Stanza. By default, Stanza stores its models in a folder in your home directory.
package | str | 'default' | Package to download for processors, where each package typically specifies what data the models are trained on. We provide a “default” package for all languages that contains NLP models most users will find useful, which will be used when the package argument isn’t specified. See table below for a complete list of available packages.
processors | dict or str | dict() | Processors to download models for. This can either be specified as a comma-separated list of processor names to use (e.g., 'tokenize,pos'), or a Python dictionary with processor names as keys and package names as corresponding values (e.g., {'tokenize': 'ewt', 'pos': 'ewt'}). All unspecified processors will fall back to using the package specified by the package argument. A list of all Processors supported can be found here.
logging_level | str | 'INFO' | Controls the level of logging information to display during download. Can be one of 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL', or 'FATAL'. Less information will be displayed going from 'DEBUG' to 'FATAL'.
verbose | bool | None | Simplified option for logging level. If True, logging level will be set to 'INFO'. If False, logging level will be set to 'ERROR' (i.e., only show errors).
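
As an illustration of how these options combine, here is a minimal sketch of a download call. The option names come from the table above; the wrapper function `download_english_models` and the constants are hypothetical names introduced only for this example, and the call assumes Stanza is installed and the network is reachable.

```python
import os

# Example values only; both constants are hypothetical helpers for this sketch.
MODEL_DIR = "~/stanza_resources"
PROCESSORS = {"tokenize": "ewt", "pos": "ewt"}


def download_english_models(model_dir=MODEL_DIR, processors=PROCESSORS):
    """Download the English models selected by `processors` into `model_dir`."""
    # Imported lazily so that defining this function does not require Stanza.
    import stanza

    stanza.download(
        lang="en",
        model_dir=os.path.expanduser(model_dir),  # expand "~" explicitly
        package="default",       # fallback package for unspecified processors
        processors=processors,   # per-processor package choice
        logging_level="WARN",    # show less output than the default 'INFO'
    )
```

Calling `download_english_models()` would then fetch the models; passing a string such as `'tokenize,pos'` for `processors` is the other form described in the table.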

Manual download

In some cases, it is not convenient, practical, or possible to download models automatically. In such cases, it will be necessary to manually download the models and populate the resources directory.

By default, the resources are downloaded to ~/stanza_resources. There are a couple of ways to change this. One is to define the environment variable $STANZA_RESOURCES_DIR. The other is to specify a different directory via the model_dir parameter when creating a Pipeline.

If you are in a situation where you cannot download resources while creating the Pipeline, it will also be necessary to provide the argument download_method=None, as by default the Pipeline checks for updated models at creation time.
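
The two ways of choosing the directory, together with download_method=None, can be sketched as follows. The helper names `resolve_resources_dir` and `build_offline_pipeline` are hypothetical, and the precedence used here (explicit argument, then environment variable, then default) is a choice made for this sketch rather than a description of Stanza's internals.

```python
import os

DEFAULT_RESOURCES_DIR = "~/stanza_resources"


def resolve_resources_dir(explicit_dir=None, env=os.environ):
    """Pick a models directory: explicit argument, then
    $STANZA_RESOURCES_DIR, then the default location."""
    chosen = explicit_dir or env.get("STANZA_RESOURCES_DIR") or DEFAULT_RESOURCES_DIR
    return os.path.expanduser(chosen)


def build_offline_pipeline(lang="en", explicit_dir=None):
    """Create a Pipeline that only uses models already on disk."""
    # Imported lazily so that the path helper works without Stanza installed.
    import stanza

    return stanza.Pipeline(
        lang=lang,
        model_dir=resolve_resources_dir(explicit_dir),
        download_method=None,  # do not check for updated models
    )
```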

Once you have chosen the location for the downloaded models, if you are able to download models programmatically, the easiest way to populate that directory is to use the stanza.download() method described above.

Otherwise, models are available for download in separate HuggingFace repos, one for each language. Each repo is specified by the short language code used by Stanza, so the models for English are in the stanza-en repo.

You will also need to download the resources.json file appropriate for the Stanza version you are using. This is available in a Stanford git repo. The resources file goes in $STANZA_RESOURCES_DIR/resources.json (without the version number).

Each language has the default models packaged in a default.zip file. The English one, for example, is in this subdirectory of the English models tree. This goes in a language-specific directory of $STANZA_RESOURCES_DIR, so, for example, the English package goes in $STANZA_RESOURCES_DIR/en/default.zip. From there, unzip the models package.
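
Once both files have been fetched by hand, the placement steps above can be scripted. The following is a minimal sketch using only the Python standard library; the function name `install_default_package` is hypothetical, and it assumes the paths inside default.zip are relative to the language directory (for example tokenize/combined.pt).

```python
import shutil
import zipfile
from pathlib import Path


def install_default_package(resources_json, default_zip, resources_dir, lang="en"):
    """Copy already-downloaded files into the resources layout and unzip them."""
    resources_dir = Path(resources_dir).expanduser()
    lang_dir = resources_dir / lang
    lang_dir.mkdir(parents=True, exist_ok=True)

    # The resources file is stored without the version number in its name.
    shutil.copyfile(resources_json, resources_dir / "resources.json")

    # default.zip goes in the language-specific directory and is unzipped there.
    zip_target = lang_dir / "default.zip"
    shutil.copyfile(default_zip, zip_target)
    with zipfile.ZipFile(zip_target) as archive:
        archive.extractall(lang_dir)
    return lang_dir
```

For example, `install_default_package("resources_1.6.1.json", "default.zip", "~/stanza_resources")` would populate the English directory, where the input file names are whatever you saved the downloads as.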

As of version 1.6.1, downloading resources.json, downloading the default.zip file for English, and unzipping default.zip results in the following:

$STANZA_RESOURCES_DIR/
$STANZA_RESOURCES_DIR/resources.json
$STANZA_RESOURCES_DIR/en
$STANZA_RESOURCES_DIR/en/default.zip
$STANZA_RESOURCES_DIR/en/backward_charlm/1billion.pt
$STANZA_RESOURCES_DIR/en/constituency/ptb3-revised_charlm.pt
$STANZA_RESOURCES_DIR/en/depparse/combined_charlm.pt
$STANZA_RESOURCES_DIR/en/forward_charlm/1billion.pt
$STANZA_RESOURCES_DIR/en/lemma/combined_nocharlm.pt
$STANZA_RESOURCES_DIR/en/ner/ontonotes_charlm.pt
$STANZA_RESOURCES_DIR/en/pos/combined_charlm.pt
$STANZA_RESOURCES_DIR/en/pretrain/conll17.pt
$STANZA_RESOURCES_DIR/en/pretrain/fasttextcrawl.pt
$STANZA_RESOURCES_DIR/en/sentiment/sstplus.pt
$STANZA_RESOURCES_DIR/en/tokenize/combined.pt

This is the expected layout if manually downloading model files.
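
A quick way to confirm that a manual download matches this layout is to check for each file. The sketch below copies the English file list from the 1.6.1 listing above; the helper name `missing_files` is hypothetical, and the list would need adjusting for other languages or Stanza versions.

```python
from pathlib import Path

# Paths relative to $STANZA_RESOURCES_DIR, copied from the listing above.
EXPECTED_EN_FILES = [
    "resources.json",
    "en/backward_charlm/1billion.pt",
    "en/constituency/ptb3-revised_charlm.pt",
    "en/depparse/combined_charlm.pt",
    "en/forward_charlm/1billion.pt",
    "en/lemma/combined_nocharlm.pt",
    "en/ner/ontonotes_charlm.pt",
    "en/pos/combined_charlm.pt",
    "en/pretrain/conll17.pt",
    "en/pretrain/fasttextcrawl.pt",
    "en/sentiment/sstplus.pt",
    "en/tokenize/combined.pt",
]


def missing_files(resources_dir, expected=EXPECTED_EN_FILES):
    """Return the expected relative paths that are not present on disk."""
    root = Path(resources_dir).expanduser()
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list from `missing_files("~/stanza_resources")` indicates that every file in the listing is in place.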

To manually download a different package for the same language, individual model files can be downloaded and added to the appropriate place. For example, to use the ncbi_disease.pt NER model for English, download it from https://huggingface.co/stanfordnlp/stanza-en/tree/main/models/ner and save it as $STANZA_RESOURCES_DIR/en/ner/ncbi_disease.pt. You can explore the stanza-en tree to find the specific model you are looking for.
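
The mapping from a model file to its place on disk can be written down as a small helper. This is a sketch only: the function name `individual_model_location` is hypothetical, and the direct-download URL uses HuggingFace's resolve/main form in place of the tree/main browsing URL given above, which is an assumption not stated on this page.

```python
from pathlib import Path


def individual_model_location(lang, processor, filename,
                              resources_dir="~/stanza_resources"):
    """Return (download URL, destination path) for a single model file."""
    # Assumed direct-download form of the repo URL (resolve instead of tree).
    url = (
        f"https://huggingface.co/stanfordnlp/stanza-{lang}"
        f"/resolve/main/models/{processor}/{filename}"
    )
    # Destination follows the layout <resources_dir>/<lang>/<processor>/<file>.
    destination = Path(resources_dir).expanduser() / lang / processor / filename
    return url, destination
```

For the example above, `individual_model_location("en", "ner", "ncbi_disease.pt")` gives the URL to fetch and the path under the resources directory where the file belongs.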

We are aware that downloading several individual models can be tedious, especially when checking for updates. Please let us know if you need a package similar to default.zip for a different combination of models.