Model Training and Evaluation

All neural modules, including the tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer, the dependency parser, and the named entity tagger, can be trained with your own data.

To train your own models, you will need to clone the source code from the stanza git repository and follow the procedures below. We also provide runnable scripts coupled with toy data that make it much easier to get started with model training; you can find them in the stanza-train git repository.

If you only want to run the processors with the pretrained models, please skip this and go to the Pipeline page.

Setting Environment Variables

We provide scripts that are useful for model training and evaluation in the scripts folder and in the stanza/utils/datasets and stanza/utils/training directories. The first thing to do before training is to set up the environment variables by editing the configuration script under scripts/. The most important variables include:

  • UDBASE: This should point to the root directory of your training/dev/test universal dependencies data in CoNLL-U format. The directory is organized as {UDBASE}/{corpus}/{corpus_short}-ud-{train,dev,test}.conllu. For examples of this folder, you can download the raw data from the CoNLL 2018 UD Shared Task website.
  • NERBASE: This should point to the root directory of your training/dev/test named entity recognition data in BIO format. The directory is organized as {NERBASE}/{corpus}/{train,dev,test}.bio. For examples of the data, you can read the CoNLL 2003 Shared Task paper and download the data.
  • DATA_ROOT: This is the root directory for storing intermediate training files generated by the scripts.
  • {module}_DATA_DIR: The subdirectory for storing intermediate files used by each module.
  • WORDVEC_DIR: The directory to store all word vector files (see below).
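As a minimal sketch of the directory layouts described above, the expected file locations can be constructed like this (the base paths and the `English-CoNLL03` corpus name are hypothetical; in practice `UDBASE` and `NERBASE` come from your environment):

```python
import os

# Hypothetical base directories; in practice these come from the
# UDBASE and NERBASE environment variables described above.
UDBASE = "/data/ud"
NERBASE = "/data/ner"

def ud_split_path(corpus, corpus_short, split):
    """Path of a CoNLL-U split: {UDBASE}/{corpus}/{corpus_short}-ud-{split}.conllu"""
    return os.path.join(UDBASE, corpus, f"{corpus_short}-ud-{split}.conllu")

def ner_split_path(corpus, split):
    """Path of a BIO split: {NERBASE}/{corpus}/{split}.bio"""
    return os.path.join(NERBASE, corpus, f"{split}.bio")

print(ud_split_path("UD_English-EWT", "en_ewt", "train"))
# /data/ud/UD_English-EWT/en_ewt-ud-train.conllu
print(ner_split_path("English-CoNLL03", "dev"))
# /data/ner/English-CoNLL03/dev.bio
```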

These variables are used by the .sh files in the scripts folder. They are also used by the Python scripts, so they need to be part of your environment. You can achieve this by running source on the script before using the Python tools, or by adding the variables to your shell profile, e.g. ~/.bashrc or wherever appropriate.

Note that Windows users will not be able to run this .sh script, but they can still use the Python tools after setting these variables via the Control Panel.

Preparing Word Vector Data

To train modules that make use of word representations, such as the POS/morphological features tagger and the dependency parser, it is highly recommended that you use pretrained embedding vectors. To replicate the system performance on the CoNLL 2018 shared task, we have prepared a script for you to download all word vector files. Simply run from the source directory:

bash scripts/ ${wordvec_dir}

where ${wordvec_dir} is the target directory for the word vector files; it should be the same directory that the environment variable WORDVEC_DIR points to.

The above script will first download the pretrained word2vec embeddings released for the CoNLL 2017 Shared Task. For languages not covered by those embeddings, it will download the FastText embeddings from Facebook. Note that the total size of all downloaded vector files is about 30GB, so please use this script with caution.

After running the script, your embedding vector files will be organized in the following way: ${WORDVEC_DIR}/{language}/{language_code}.vectors.xz. For example, the word2vec file for English should be put into $WORDVEC_DIR/English/en.vectors.xz. If you use your own vector files, please make sure they are arranged in the same fashion as described above.
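To illustrate the layout, the following sketch writes and reads back a tiny xz-compressed file in word2vec text format (a header line with the vocabulary size and dimension, then one word per line followed by its floats). The `WORDVEC_DIR` value and the toy vectors here are hypothetical, purely for demonstration:

```python
import lzma
import os

# Hypothetical location; WORDVEC_DIR would normally come from the environment.
WORDVEC_DIR = "/tmp/wordvec_demo"
path = os.path.join(WORDVEC_DIR, "English", "en.vectors.xz")

# Create a tiny xz-compressed file in word2vec text format for the demo:
# a header line "<vocab_size> <dim>", then one "<word> <floats...>" per line.
os.makedirs(os.path.dirname(path), exist_ok=True)
with lzma.open(path, "wt", encoding="utf-8") as f:
    f.write("2 3\n")
    f.write("the 0.1 0.2 0.3\n")
    f.write("cat 0.4 0.5 0.6\n")

# Read it back the way a simple loader might.
vectors = {}
with lzma.open(path, "rt", encoding="utf-8") as f:
    vocab_size, dim = map(int, f.readline().split())
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]

print(vocab_size, dim)   # 2 3
print(vectors["cat"])    # [0.4, 0.5, 0.6]
```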

Converting UD data

A large repository of Universal Dependencies data is available online; most of our models are trained using this data. We provide Python scripts for converting this data to the format used by our models at training time:

python stanza/utils/datasets/prepare_${module} ${corpus} ${other_args}

where ${module} is one of tokenize, mwt, pos, lemma, or depparse; ${corpus} is the full name of the corpus; ${other_args} are other arguments allowed by the training script.

Note that for the dependency parser, you also need to specify whether gold or predicted POS tags are used in the training/dev data.

python stanza/utils/datasets/ UD_English-EWT --gold

If predicted is used, the trained tagger model will first be run on the training/dev data to generate the predicted tags. predicted is the default.

Training with Scripts

We provide various scripts to ease the training process in the scripts and stanza/utils/training directories. To train a model, you can run the following command from the code root directory:

python stanza/utils/training/run_${module}.py ${corpus} ${other_args}

where ${module} is one of tokenize, mwt, pos, lemma, or depparse; ${corpus} is the full name of the corpus; ${other_args} are other arguments allowed by the training script.

NER is trained differently:

bash scripts/ ${corpus} ${other_args}

For example, you can use the following command to train a tokenizer with batch size 32 and a dropout rate of 0.33 on the UD_English-EWT corpus:

python stanza/utils/training/ UD_English-EWT --batch_size 32 --dropout 0.33

For a full list of available training arguments, please refer to the specific entry point of that module. By default, model files are saved to the saved_models directory during training; this can be changed with the save_dir argument.


Model evaluation will be run automatically after each training run. Additionally, after you finish training all modules, you can evaluate the full universal dependency parsing pipeline with this command:

bash scripts/ ${corpus} ${split}

where ${split} is one of train, dev or test.


We strongly encourage you to train all modules on a GPU. When a CUDA device is available and detected by the script, it will be used automatically; otherwise, the CPU will be used. You can force training to run on the CPU by specifying --cpu when calling the script.
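The device-selection behavior described above can be sketched as follows. This is a hypothetical helper, not the scripts' actual implementation; it simply mirrors the logic of "use CUDA when available, unless --cpu is given":

```python
import argparse

def select_device(force_cpu):
    """Pick "cuda" when a CUDA device is detected, unless the user forces CPU."""
    if force_cpu:
        return "cpu"
    try:
        import torch  # falls back to CPU when torch is not installed
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

parser = argparse.ArgumentParser()
parser.add_argument("--cpu", action="store_true", help="force training on CPU")
args = parser.parse_args([])  # empty list for this demo; normally reads sys.argv

print(select_device(args.cpu))
```

Passing `--cpu` on the command line would make `select_device` return "cpu" regardless of whether a GPU is present.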