Model Training and Evaluation
Overview
All neural modules, including the tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer, the dependency parser, and the named entity tagger, can be trained with your own data.
To train your own models, you will need to clone the source code from the stanza git repository and follow the procedures below. We also provide runnable scripts coupled with toy data that make it much easier to get started with model training; you can find them in the stanza-train git repository.
If you only want to run the processors with the pretrained models, please skip this and go to the Getting Started page.
Setting Environment Variables
Run `source config.sh` to set the following environment variables. You will need to edit this file to set values appropriate for your local system. Make sure to use `source`, or the environment variables will not persist. To save time, you can instead put these variables in your shell startup file, such as `.bashrc`; in that case, you will not need to run the config script at all.

Note that Windows users will not be able to run this `.sh` script, but they can still use the Python tools after adding these variables via the control panel: Control Panel -> Edit environment variables, creating entries similar to the values in `config.sh`.
We provide scripts that are useful for model training and evaluation in the `scripts`, `stanza/utils/datasets`, and `stanza/utils/training` directories. The first thing to do before training is to set up the environment variables by editing `scripts/config.sh`. The most important variables include:
- `UDBASE`: This should point to the root directory of your training/dev/test universal dependencies data in CoNLL-U format. The directory is organized as `{UDBASE}/{corpus}/{corpus_short}-ud-{train,dev,test}.conllu`. For examples of this layout, you can download the raw data from the CoNLL 2018 UD Shared Task website. This environment variable should point to the root directory for ALL of the datasets, so for example you should have `$UDBASE/UD_English-EWT`, `$UDBASE/UD_Vietnamese-VTB`, etc.
- `NERBASE`: This should point to the root directory where you have collected various NER datasets in raw form. For example, if you want to retrain the Finnish Turku model, you would download that dataset and put it in `$NERBASE/fi_turku`. The expected locations of the various datasets we already support are explained in the documentation of `prepare_ner_dataset.py`.
- `DATA_ROOT`: This is the root directory for storing intermediate training files generated by the scripts.
- `{module}_DATA_DIR`: The subdirectory for storing intermediate files used by each module. For example, `NER_DATA_DIR` stores the intermediate files used by NER, which may include BIO files and/or files in the `.json` format used internally by Stanza's NER module. The directory is organized as `{NER_DATA_DIR}/{corpus}_{train,dev,test}.bio`. See NER Data for more info.
- `WORDVEC_DIR`: The directory to store all word vector files (see below).
- `HANDPARSED_DIR`: A fork of handparsed-treebank, only needed if retraining “combined” models.
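As a concrete illustration, an edited `config.sh` might export values like the following; all paths here are hypothetical, so substitute your own:

```bash
# Hypothetical values -- adjust every path for your own system
export UDBASE=/home/john/ud                   # contains UD_English-EWT, UD_Vietnamese-VTB, ...
export NERBASE=/home/john/ner                 # contains fi_turku and other raw NER datasets
export DATA_ROOT=/home/john/stanza_data       # intermediate training files go here
export NER_DATA_DIR=$DATA_ROOT/ner
export WORDVEC_DIR=/home/john/word_vectors
export HANDPARSED_DIR=/home/john/handparsed-treebank
```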
These variables are used by the `.sh` files in the `scripts` folder. They are also used by the Python scripts, so they need to be part of your environment.
Preparing Word Vector Data
To train modules that make use of word representations, such as the POS/morphological features tagger and the dependency parser, it is highly recommended that you use pretrained embedding vectors. The simplest method is to use the `.pt` file in the `$STANZA_RESOURCES/lang/pretrain` directory after downloading the existing models using `stanza.download(lang)`.
```python
>>> import stanza
>>> stanza.download("vi")  # or whichever language you are using
```
Once this is downloaded, each model has a flag which tells it where to find the `.pt` file. The pos, depparse, ner, conparse, and sentiment models all support the `--wordvec_pretrain_file` flag for specifying the exact path. If you don't supply this flag, the model will attempt to guess the location; in general, it will look in `saved_models/{model}/{dataset}.pretrain.pt`.
For more information and other options, see here.
`run_pos.py` and `run_depparse.py` support the `--no_pretrain` flag if you cannot find word embeddings for your language, but performance will degrade significantly. NER, conparse, and sentiment do not support that flag.
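For example, the following invocations show both options; the dataset and pretrain path are hypothetical, so substitute your own:

```bash
# Point the tagger at an explicit pretrain file (hypothetical path)
python -m stanza.utils.training.run_pos UD_Vietnamese-VTB \
    --wordvec_pretrain_file ~/stanza_resources/vi/pretrain/vtb.pt

# Or train without pretrained embeddings -- expect noticeably lower scores
python -m stanza.utils.training.run_pos UD_Vietnamese-VTB --no_pretrain
```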
Input Files
In general, we use UD datasets for the tokenizer, MWT, lemmatizer, pos, and depparse models. The NER, constituency parser, and sentiment models all have different input formats. The conversion scripts turn the raw data of various forms into the formats expected by the tools, and the `run` scripts expect the standardized format produced by the data conversion scripts.
Converting UD data
A large repository of data is available at www.universaldependencies.org. Most of our models are trained using this data. We provide python scripts for converting this data to the format used by our models at training time:
```bash
python -m stanza.utils.datasets.prepare_${module}_treebank ${corpus} ${other_args}
```
where `${module}` is one of `tokenize`, `mwt`, `pos`, `lemma`, or `depparse`; `${corpus}` is the full name of the corpus; and `${other_args}` are other arguments allowed by the training script.
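For example, to prepare the UD_English-EWT data for the tokenizer, following the template above (the same pattern applies to the other modules):

```bash
python -m stanza.utils.datasets.prepare_tokenize_treebank UD_English-EWT
```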
Train, dev, test splits
The larger datasets on UD have data for each of `train`, `dev`, and `test`, with the corresponding words in the names of the files. For example, from EWT:
```
en_ewt-ud-train.conllu
en_ewt-ud-dev.conllu
en_ewt-ud-test.conllu
en_ewt-ud-train.txt
en_ewt-ud-dev.txt
en_ewt-ud-test.txt
```
Some smaller datasets only have `train` and `test`, but the `train` dataset is reasonably large. The preparation scripts for those languages will automatically randomly split the `train` files into `train` and `dev`.
Some datasets are quite small and only have `test` data. In general, we skip those datasets and do not build models for them.
In general, each of the processors needs all three splits to function. If a dataset comes as one large contiguous file, as some NER datasets do, you will need to split it into three pieces yourself, as sketched below.
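Here is a minimal sketch of such a split. This is not part of Stanza's own tooling; it assumes a BIO file with blank lines between sentences and uses an 80/10/10 split:

```python
import random

def split_bio(path, train_path, dev_path, test_path, seed=1234):
    """Split one contiguous BIO file into train/dev/test by sentence."""
    with open(path, encoding="utf-8") as fin:
        # sentences are blocks of token lines separated by blank lines
        sentences = [block.strip() for block in fin.read().split("\n\n") if block.strip()]
    random.Random(seed).shuffle(sentences)
    n = len(sentences)
    pieces = {train_path: sentences[:int(0.8 * n)],
              dev_path:   sentences[int(0.8 * n):int(0.9 * n)],
              test_path:  sentences[int(0.9 * n):]}
    for out_path, sents in pieces.items():
        with open(out_path, "w", encoding="utf-8") as fout:
            fout.write("\n\n".join(sents) + "\n")
```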
Dependency Parser Data
By default, the `prepare_depparse_treebank.py` script uses the POS tagger to retag the training data. You can turn this off with the `--gold` flag.
```bash
# This will attempt to load the tagger and retag the data.
# You will need to train the tagger or use the existing tagger for that to work.
python -m stanza.utils.datasets.prepare_depparse_treebank UD_English-EWT

# You can use the original gold tags as follows:
python -m stanza.utils.datasets.prepare_depparse_treebank UD_English-EWT --gold
```
The reasoning is that since the models will see predicted tags when used in a pipeline, it is better to train them on predicted tags in the first place. To get the best results when retraining the dependency parser for use in a pipeline, you should therefore first retrain the tagger if relevant, and then use the new tagger model to produce the predicted tags.
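Concretely, that workflow might look like this, using the training scripts described below (UD_English-EWT is just an example):

```bash
# 1. Retrain the tagger
python -m stanza.utils.training.run_pos UD_English-EWT
# 2. Prepare the depparse data, retagging it with the new tagger
python -m stanza.utils.datasets.prepare_depparse_treebank UD_English-EWT
# 3. Train the parser on the retagged data
python -m stanza.utils.training.run_depparse UD_English-EWT
```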
NER Data
NER models have a different data format from the other models. In many cases, publicly available datasets come in a non-BIO format. There is a script which covers a few different publicly available datasets; the comments at the top of that script should give a good idea of how to download the data and use the script.
Note that the script uses `NERBASE` quite often. This defaults to `extern_data/ner`, relative to wherever you are running the script. If you organize the datasets in a different directory, you will want to set the `NERBASE` environment variable accordingly:
```bash
export NERBASE=/home/john/ner
python3 -m stanza.utils.datasets.ner.prepare_ner_dataset
```
For example:
```bash
python -m stanza.utils.datasets.ner.prepare_ner_dataset fi_turku
```
On Windows, you can set the `NERBASE` environment variable via the control panel.
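Alternatively, from a Windows command prompt, `setx` makes a variable persistent for future sessions (the path here is hypothetical):

```
setx NERBASE C:\data\ner
```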
Note that for the various datasets supported, you first need to download the data (possibly after agreeing to a license) from the sources listed in the file.
For a new dataset not already supported, there is a specific `.json` format expected by our models. A conversion script, called several times in `prepare_ner_dataset.py`, converts IOB format to our internal NER format:
```python
import stanza.utils.datasets.ner.prepare_ner_file as prepare_ner_file

prepare_ner_file.process_dataset(input_iob, output_json)
```
To add a new dataset, the easiest approach is to first write a function which reads the raw dataset in whatever format it is in, converts it to BIO, and writes three intermediate train/dev/test files to `NER_DATA_DIR`, then call `process_dataset` to turn each of them into `.json`. There are several examples of this in `prepare_ner_dataset.py`.
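A minimal sketch of such a conversion function follows. Everything here is hypothetical except `process_dataset`: it assumes the raw data is already split into three tab-separated token/tag files with blank lines between sentences.

```python
import os
import stanza.utils.datasets.ner.prepare_ner_file as prepare_ner_file

def convert_my_dataset(raw_dir, ner_data_dir, short_name="xx_mydata"):
    """Convert a hypothetical token<TAB>tag dataset to Stanza's internal .json format."""
    os.makedirs(ner_data_dir, exist_ok=True)
    for split in ("train", "dev", "test"):
        raw_path = os.path.join(raw_dir, "%s.tsv" % split)
        bio_path = os.path.join(ner_data_dir, "%s_%s.bio" % (short_name, split))
        with open(raw_path, encoding="utf-8") as fin, \
             open(bio_path, "w", encoding="utf-8") as fout:
            for line in fin:
                line = line.strip()
                if line:
                    token, tag = line.split("\t")
                    fout.write("%s %s\n" % (token, tag))
                else:
                    fout.write("\n")  # blank line marks a sentence boundary
        json_path = os.path.join(ner_data_dir, "%s.%s.json" % (short_name, split))
        prepare_ner_file.process_dataset(bio_path, json_path)
```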
If you add a new dataset to the conversion script, we will be happy to get a PR on our GitHub to incorporate it into future releases of Stanza.
The program will look for the `.json` files in the `data/ner` directory, which you may need to create if this is your first time training a Stanza NER model. You can change the expected path by setting the `$NER_DATA_DIR` environment variable.
At least one of the datasets is based on a raw constituency dataset, and the `prepare_ner_dataset` script will look in `$CONSTITUENCY_BASE` for that data.
If you are adding a new NER model not yet supported by Stanza, you may find it helpful to review a couple of examples of PRs which added new models, such as the Japanese GSD dataset or the Marathi L3Cube dataset.
Constituency Data
The constituency data is also organized differently from the UD datasets. There is a script for preparing the constituency datasets:
```bash
python3 -m stanza.utils.datasets.constituency.prepare_con_dataset
python3 -m stanza.utils.datasets.constituency.prepare_con_dataset ja_alt
```
This script expects the raw constituency datasets to be in `$CONSTITUENCY_BASE`:

```bash
export CONSTITUENCY_BASE=/home/john/constituency
```
The expected end result is bracketed trees, such as PTB-style brackets:

```
(ROOT (S (NP (NN Stuff)) (VP (VBZ goes) (ADVP (RB here)))))
```
The final data will go to `$CONSTITUENCY_DATA_DIR`, which defaults to `data/constituency` but can be set to something else via environment variables.
Filenames & short names
In general, the training files for the UD datasets follow the pattern
```
lang_dataset-ud-train|dev|test.conllu
```
For example, see the English EWT dataset.
We try to replicate this pattern wherever possible. All of the UD based models (tokenizer, lemmatizer, etc.) expect names to be in this format. The constituency parser files are converted to `lang_dataset_train|dev|test.mrg`, and the NER model input files are `lang_dataset.train|dev|test.json`. (Note that we generally keep the files in separate directories, so the similar names do not get confusing.)
You can see that the first part of the names, `lang_dataset`, is common to all of the models. The code frequently refers to this as the `short_name`: `en_ewt` is the short name for UD_English-EWT, `en_wsj` is the short name for the Penn Treebank, `en_ontonotes` is the short name for the OntoNotes NER dataset, etc. The training scripts will look for the converted input files in `${module}_DATA_DIR` and expect that you give them the appropriate short name for the dataset you are training on.
For example, `run_ner.py en_ontonotes` will expect a converted OntoNotes dataset in `$NER_DATA_DIR`.
Stanza knows about all of the language codes used by UD, along with a few others, but there may be some missing ones if you are working on a new language. You can add missing language codes if needed.
Training with Scripts
We provide various scripts to ease the training process in the `scripts` and `stanza/utils/training` directories. To train a model, you can run the following command from the code root directory:
```bash
python -m stanza.utils.training.run_${module} ${corpus} ${other_args}
```
where `${module}` is one of `tokenize`, `mwt`, `pos`, `lemma`, `depparse`, or `ner`; `${corpus}` is the full name of the corpus; and `${other_args}` are other arguments allowed by the training script.
For example, you can use the following command to train a tokenizer with batch size 32 and a dropout rate of 0.33 on the `UD_English-EWT` corpus:
```bash
python -m stanza.utils.training.run_tokenize UD_English-EWT --batch_size 32 --dropout 0.33
```
NER is not trained from the UD datasets, but from other external datasets specific to each language. Nevertheless, there is a `run_ner.py` script:
```bash
python -m stanza.utils.training.run_ner ${corpus} ${other_args}
python -m stanza.utils.training.run_ner fi_turku
```
You can also run `ner_tagger.py` directly:

```bash
python3 -m stanza.models.ner_tagger --wordvec_pretrain_file saved_models/pos/fi_ftb.pretrain.pt \
    --train_file data/ner/fi_turku.train.json --eval_file data/ner/fi_turku.dev.json \
    --charlm --charlm_shorthand fi_conll17 --char_hidden_dim 1024 \
    --lang fi --shorthand fi_turku --mode train
```
To get the prediction scores of an existing NER model when running `ner_tagger.py`, use `--mode eval` instead of `--mode train`.
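For example, here is a sketch of scoring the trained Turku model on its test set, assuming the same paths as the training command above:

```bash
python3 -m stanza.models.ner_tagger --wordvec_pretrain_file saved_models/pos/fi_ftb.pretrain.pt \
    --eval_file data/ner/fi_turku.test.json \
    --charlm --charlm_shorthand fi_conll17 --char_hidden_dim 1024 \
    --lang fi --shorthand fi_turku --mode eval
```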
If you are training NER or constituency models for a new language, you may want to use a charlm (character language model). You can also leave out the charlm arguments in the above command line and train without one.
For a full list of available training arguments, please refer to the specific entry point of that module. By default, model files will be saved to the `saved_models` directory during training (this can be changed with the `save_dir` argument).
Evaluation
Model evaluation will be run automatically after each training run.
You can also score a single model on its dev or test set by using the `run_${module}.py` script with the `--score_dev` or `--score_test` flag.
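For example, to score an already-trained English EWT tagger on the test set:

```bash
python -m stanza.utils.training.run_pos UD_English-EWT --score_test
```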
Additionally, after you finish training all modules, you can evaluate the full universal dependency parsing pipeline with this command:
```bash
python -m stanza.utils.training.run_ete ${corpus} --score_${split}
```
where `${split}` is one of `dev` or `test`. Running with no `--score_` flag will give scores for the train data.
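For example, using the same corpus as above:

```bash
python -m stanza.utils.training.run_ete UD_English-EWT --score_test
```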
Devices
We strongly encourage you to train all modules on a GPU. When a CUDA device is available and detected by the script, it will be used automatically; otherwise, the CPU will be used. You can force training to happen on a CPU, however, by specifying `--cpu` when calling the script.
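For example, any of the run scripts accepts this flag:

```bash
python -m stanza.utils.training.run_tokenize UD_English-EWT --cpu
```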