Model Training and Evaluation

Overview

All neural modules, including the tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer, the dependency parser, and the named entity tagger, can be trained with your own data.

To train your own models, you will need to clone the source code from the stanza git repository and follow the procedures below. We also provide runnable scripts coupled with toy data that make it much easier for users to get started with model training; you can find them in the stanza-train git repository.

If you only want to run the processors with the pretrained models, please skip this and go to the Pipeline page.

Setting Environment Variables

Run source config.sh to set the following environment variables. You will need to edit config.sh to set values appropriate for your local system.

You may want to put these variables in your shell startup file (e.g., .bashrc) to save time.

Note that Windows users will not be able to run this .sh script, but they can still use the python tools after adding these variables via the Control Panel: open Control Panel -> Edit environment variables and create entries similar to the values in config.sh.

We provide scripts that are useful for model training and evaluation in the scripts, stanza/utils/datasets, and stanza/utils/training directories. The first thing to do before training is to set up the environment variables by editing scripts/config.sh. The most important variables include:

  • UDBASE: This should point to the root directory of your training/dev/test Universal Dependencies data in CoNLL-U format. The directory is organized as {UDBASE}/{corpus}/{corpus_short}-ud-{train,dev,test}.conllu. For an example of this layout, you can download the raw data from the CoNLL 2018 UD Shared Task website. This environment variable should point to the root directory for ALL of the datasets, so for example you should have $UDBASE/UD_English-EWT, $UDBASE/UD_Vietnamese-VTB, etc.
  • NERBASE: This should point to the root directory of your training/dev/test named entity recognition data in BIO format. The directory is organized as {NERBASE}/{corpus}/{train,dev,test}.bio. For examples of the data, you can read the CoNLL 2003 Shared Task paper and download the data.
  • DATA_ROOT: This is the root directory for storing intermediate training files generated by the scripts.
  • {module}_DATA_DIR: The subdirectory for storing intermediate files used by each module.
  • WORDVEC_DIR: The directory to store all word vector files (see below).

These variables are used by the .sh files in the scripts folder. They are also used by the python scripts, and so they need to be part of your environment.
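As a sketch, a filled-in scripts/config.sh might look like the following. All paths and the specific {module}_DATA_DIR names shown here are illustrative placeholders; substitute values that exist on your system:

```shell
# Illustrative values only -- adjust for your local system
export UDBASE=/data/ud_treebanks          # root containing UD_English-EWT, UD_Vietnamese-VTB, ...
export NERBASE=/data/ner                  # root containing BIO-format NER corpora
export DATA_ROOT=./data                   # intermediate training files go here
export TOKENIZE_DATA_DIR=$DATA_ROOT/tokenize
export POS_DATA_DIR=$DATA_ROOT/pos
export WORDVEC_DIR=./extern_data/wordvec  # pretrained word vector files
```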

Preparing Word Vector Data

To train modules that make use of word representations, such as the POS/morphological features tagger and the dependency parser, it is highly recommended that you use pretrained embedding vectors. The simplest method is to use the .pt file in the $STANZA_RESOURCES/lang/pretrain directory after downloading the existing models using stanza.download(lang).

>>> import stanza
>>> stanza.download("vi")   # or whichever language you are using

Once this is downloaded, each model has a flag telling it where to find the .pt file. The pos, depparse, ner, and sentiment models all support the --wordvec_pretrain_file flag for specifying the exact path. If you don’t supply this flag, the model will attempt to guess the location; in general, it will look in saved_models/{model}/{dataset}.pretrain.pt.

For more information and other options, see here.
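As a quick sketch, the following locates any downloaded .pt files for a language, assuming the default resources directory (~/stanza_resources, which the STANZA_RESOURCES_DIR environment variable overrides):

```python
import os

# Default stanza resources location; STANZA_RESOURCES_DIR overrides it
resources = os.environ.get("STANZA_RESOURCES_DIR",
                           os.path.expanduser("~/stanza_resources"))
pretrain_dir = os.path.join(resources, "vi", "pretrain")

# List any pretrained embedding files that have been downloaded
pt_files = []
if os.path.isdir(pretrain_dir):
    pt_files = [os.path.join(pretrain_dir, f)
                for f in os.listdir(pretrain_dir) if f.endswith(".pt")]
print(pt_files)
```

Any path printed here can then be passed directly to --wordvec_pretrain_file.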

Converting UD data

A large repository of data is available at www.universaldependencies.org. Most of our models are trained using this data. We provide python scripts for converting this data to the format used by our models at training time:

python stanza/utils/datasets/prepare_${module}_treebank.py ${corpus} ${other_args}

where ${module} is one of tokenize, mwt, pos, lemma, or depparse; ${corpus} is the full name of the corpus; ${other_args} are other arguments allowed by the training script.

Dependency Parser Data

By default, the prepare_depparse_treebank.py script uses the POS tagger to retag the training data. You can turn this off with the --gold flag.

# This will attempt to load the tagger and retag the data
# You will need to train the tagger or use the existing tagger for that to work
python stanza/utils/datasets/prepare_depparse_treebank.py UD_English-EWT
# You can use the original gold tags as follows
python stanza/utils/datasets/prepare_depparse_treebank.py UD_English-EWT --gold

The reasoning is that since the models will be using predicted tags when run in a pipeline, it is better to train them on predicted tags in the first place. To get the best results when retraining the dependency parser for use in a pipeline, first retrain the tagger (if relevant) and then use the new tagger model to produce the predicted tags.
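Putting that together, one possible order of operations (corpus name illustrative) is:

```shell
# 1. Retrain the tagger first
python stanza/utils/training/run_pos.py UD_English-EWT
# 2. Retag the treebank using the new tagger
python stanza/utils/datasets/prepare_depparse_treebank.py UD_English-EWT
# 3. Train the parser on the predicted tags
python stanza/utils/training/run_depparse.py UD_English-EWT
```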

NER Data

NER models have a different data format from other models. There is an existing script which covers a few different publicly available datasets:

stanza/utils/datasets/ner/prepare_ner_dataset.py

Note that for the various datasets supported, you first need to download the data (possibly after agreeing to a license) from the sources listed in the file.

For a new dataset not already supported, there is a specific .json format expected by our models. prepare_ner_dataset.py calls a conversion script several times to convert IOB format to our internal NER format:

prepare_ner_file.process_dataset(input_filename, output_filename)
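To illustrate the conversion, here is a minimal standalone sketch of turning BIO-format text into a token-per-dict JSON layout. The exact internal format is defined by prepare_ner_file; this version only shows the general shape of the data (one dict per token, one list per sentence):

```python
import json

def bio_to_json(bio_text):
    """Convert BIO text (one "word<TAB>tag" per line, blank line
    between sentences) into a list of sentences of token dicts."""
    sentences, current = [], []
    for line in bio_text.splitlines():
        line = line.strip()
        if not line:
            # Blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag = line.split()
        current.append({"text": word, "ner": tag})
    if current:
        sentences.append(current)
    return sentences

sample = "John\tB-PER\nsmiled\tO\n\nParis\tB-LOC\n"
print(json.dumps(bio_to_json(sample)))
```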

Training with Scripts

We provide various scripts to ease the training process in the scripts and stanza/utils/training directories. To train a model, you can run the following command from the code root directory:

python stanza/utils/training/run_${module}.py ${corpus} ${other_args}

where ${module} is one of tokenize, mwt, pos, lemma, or depparse; ${corpus} is the full name of the corpus; ${other_args} are other arguments allowed by the training script.

NER is trained differently:

bash scripts/run_ner.sh ${corpus} ${other_args}

For example, you can use the following command to train a tokenizer with batch size 32 and a dropout rate of 0.33 on the UD_English-EWT corpus:

python stanza/utils/training/run_tokenize.py UD_English-EWT --batch_size 32 --dropout 0.33

You can also run ner_tagger.py directly:

python3 -m stanza.models.ner_tagger \
    --wordvec_pretrain_file saved_models/pos/fi_ftb.pretrain.pt \
    --train_file data/ner/fi_turku.train.json \
    --eval_file data/ner/fi_turku.dev.json \
    --charlm --charlm_shorthand fi_conll17 \
    --char_hidden_dim 1024 \
    --lang fi --shorthand fi_turku \
    --mode train

To get the prediction scores of an existing NER model, use --mode eval instead.

If you are training NER for a new language, you may want to use a character language model (charlm). You can also leave out the charlm arguments in the above command and train without one.

For a full list of available training arguments, please refer to the specific entry point of that module. By default, model files will be saved to the saved_models directory during training (this can be changed with the --save_dir argument).

Evaluation

Model evaluation will be run automatically after each training run.

You can also score a single model on its dev or test set by running the run_${module}.py script with the --score_dev or --score_test flag.
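For instance, to score an already-trained POS tagger on the UD_English-EWT dev set:

```shell
python stanza/utils/training/run_pos.py UD_English-EWT --score_dev
```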

Additionally, after you finish training all modules, you can evaluate the full universal dependency parsing pipeline with this command:

python stanza/utils/training/run_ete.py ${corpus} --score_${split}

where ${split} is one of dev or test. Running with no --score_ flag will give scores for the train data.

Devices

We strongly encourage you to train all modules with a GPU. When a CUDA device is available and detected by the script, it will be used automatically; otherwise, the CPU will be used. However, you can force training to happen on the CPU by specifying --cpu when calling the script.
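For example, to force the tokenizer to train on the CPU (corpus name illustrative):

```shell
python stanza/utils/training/run_tokenize.py UD_English-EWT --cpu
```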