Retrain models for a UD dataset
Retrain UD models
There could be several reasons to retrain models for an existing UD dataset. For example:
- code changes which affect the quality of the model or the compatibility with existing serialized models
- new version of UD released
- updated word vectors
- proposed changes to UD dataset available in git or elsewhere
Here we will present an end-to-end example of how to retrain the UD-based models: tokenizer, MWT, lemmatizer, POS, and depparse.
The guide below uses some changes to the codebase which will be available in Stanza 1.4.1. In the meantime, you can git clone the dev branch of our repo:
https://github.com/stanfordnlp/stanza
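For example, assuming you want the checkout in a directory named stanza:

git clone https://github.com/stanfordnlp/stanza
cd stanza
git checkout dev   # the branch with the changes described in this guide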
OS
We will work on Linux for this. It is possible to recreate most of these steps on Windows or another OS, with the exception that the environment variables need to be set differently.
As a reminder, Stanza only supports Python 3.6 or later. If you are using an earlier version of Python, it will not work.
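A quick way to check which interpreter version you have:

python3 --version   # should print 3.6 or later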
Environment variables
To start, we will retrain the tokenizer module for UD_English-EWT. The default environment variables work well (especially if you’re me!) but may not be applicable to your system.
To rebuild the tokenizer, there are a few relevant environment variables:
- `PYTHONPATH` - if using a git clone of Stanza, you will want to set your `PYTHONPATH` to the home directory of that checkout. You can also use `.` if you are running the scripts from that directory.
- `UDBASE` - the home path of a Universal Dependencies download. For example, if you have a complete download of 2.10, you can set it to that directory: `UDBASE=/u/scr/corpora/Universal_Dependencies/Universal_Dependencies_2.10/ud-treebanks-v2.10/` You will have this set correctly if `ls $UDBASE` displays all of the UD packages, such as `UD_Afrikaans-AfriBooms`, `UD_Yupik-SLI`, and many more in between. You can also set up a specific directory for git checkouts if you want to track the latest changes, such as `UDBASE=/home/$USER/ud/git`.
- `TOKENIZE_DATA_DIR` - by default, Stanza will write preprocessed datasets to `$DATA_ROOT/tokenize`, which defaults to `data/tokenize`.
- `STANZA_RESOURCES_DIR` - supplemental models such as default word vectors will be downloaded here (although the tokenizer in particular does not use word vectors). You can change where they go by changing this variable.
This might look like the following in a `.bashrc` file:
export PYTHONPATH=.
export STANZA_RESOURCES_DIR=/nlp/scr/$USER/stanza_resources
export TOKENIZE_DATA_DIR=/nlp/scr/$USER/data/tokenize
# if using a complete installation of UD
export UDBASE=/u/scr/corpora/Universal_Dependencies/Universal_Dependencies_2.10/ud-treebanks-v2.10/
# if using a personal git install of some of the datasets
export UDBASE=/nlp/scr/$USER/Universal_Dependencies/git
Obtaining data
All of the UD-based models use data available from Universal Dependencies (https://universaldependencies.org/).
Individual language/dataset pairs each live in their own GitHub repo. For this tutorial, we will use the UD English EWT repo. In July 2022, for example, this repo was updated with morphological feature changes which would not be released until UD 2.11 in November 2022.
The official release contains just the final data files, not the git repos themselves, so changes newer than the latest release are only available from the individual repos.
If you want to use data more recent than the most recent UD release, the first step is to git clone UD_English-EWT:
cd $UDBASE
git clone git@github.com:UniversalDependencies/UD_English-EWT.git
cd UD_English-EWT
git checkout dev # because we want the dev set updates, of course
Links for each of the git repos can be found on the UD homepage as part of the treebank description.
Alternatively, instead of cloning individual repos, you can download the complete release from the UD homepage and unpack it in `$UDBASE`.
Updating data
If you wish to make changes to the data itself, this is the moment to do so. You can edit the `{train,dev,test}.conllu` files in the dataset directory to make those changes. In the case of `UD_English-EWT`, for example, you would edit
en_ewt-ud-train.conllu
en_ewt-ud-dev.conllu
en_ewt-ud-test.conllu
Please refer to the UD documentation for the expected format. It is not necessary to update the `.txt` files, as the prepare scripts will rebuild them in your local data directory.
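If you do edit the `.conllu` files, it can be worth validating them before training. One option (not part of Stanza, and the exact flags may differ by version) is the `validate.py` script from the UniversalDependencies tools repo:

git clone https://github.com/UniversalDependencies/tools.git
# --lang selects the language-specific checks; check your tools version for other flags
python3 tools/validate.py --lang en $UDBASE/UD_English-EWT/en_ewt-ud-train.conllu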
In the UD files, the tokenizer separates the raw text into sentences and “forms”. The MWT processor marks which tokens are composed of multiple words, which is represented in the datasets as:
1-2 Don't _ _ _ _ _ _ _ _
1 Do do AUX VBP Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 3 aux 3:aux _
2 n't not PART RB _ 3 advmod 3:advmod _
The lemmatizer trains on the lemmas, the POS model predicts upos, xpos, and features jointly, and depparse predicts HEAD and DEPREL.
Preparing data
A script included with Stanza will read the dataset from `$UDBASE` and write the processed version to `$TOKENIZE_DATA_DIR`:
python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-EWT
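A quick sanity check is to list the output directory. The exact filenames depend on the Stanza version, but you should see train/dev/test files for `en_ewt`:

# expect train/dev/test files prefixed with en_ewt
ls $TOKENIZE_DATA_DIR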
Running the model
There is also a script which trains the model:
python3 -m stanza.utils.training.run_tokenizer UD_English-EWT
This will train the model and put the result in `saved_models/tokenize/en_ewt_tokenizer.pt`.
You can change the destination directory with `--save_dir`:
python3 -m stanza.utils.training.run_tokenizer UD_English-EWT --save_dir somewhere/else
If you already have a model saved, the training script will not overwrite that model. You can force an overwrite with `--force`:
python3 -m stanza.utils.training.run_tokenizer UD_English-EWT --force
If you want a different save name, use `--save_name`:
python3 -m stanza.utils.training.run_tokenizer UD_English-EWT --save_name en_ewt_variant_tokenizer.pt
Testing the model
The training script will report the dev score of the final model when it finishes.
After a model is built, you can test it on the dev and test sets:
python3 -m stanza.utils.training.run_tokenizer UD_English-EWT --score_dev
python3 -m stanza.utils.training.run_tokenizer UD_English-EWT --score_test
Other processors
The process is identical for MWT, the lemmatizer, and POS. Depparse will redo the tags by default, as described below.
Note that MWT does not apply to datasets with no multi-word tokens. If you attempt to run MWT on such a dataset, you will get a message such as:
2022-08-01 18:11:03 INFO: No training MWTS found for vi_vtb. Skipping
| Model | Prepare script | Run script | Data dir env variable | Default save dir |
|---|---|---|---|---|
| MWT | `stanza.utils.datasets.prepare_mwt_treebank` | `stanza.utils.training.run_mwt` | `MWT_DATA_DIR` | `saved_models/mwt` |
| Lemma | `stanza.utils.datasets.prepare_lemma_treebank` | `stanza.utils.training.run_lemma` | `LEMMA_DATA_DIR` | `saved_models/lemma` |
| POS | `stanza.utils.datasets.prepare_pos_treebank` | `stanza.utils.training.run_pos` | `POS_DATA_DIR` | `saved_models/pos` |
| Depparse | `stanza.utils.datasets.prepare_depparse_treebank` | `stanza.utils.training.run_depparse` | `DEPPARSE_DATA_DIR` | `saved_models/depparse` |
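Putting the table together, an end-to-end retraining of the remaining processors for UD_English-EWT chains each prepare script with its run script:

python3 -m stanza.utils.datasets.prepare_mwt_treebank UD_English-EWT
python3 -m stanza.utils.training.run_mwt UD_English-EWT
python3 -m stanza.utils.datasets.prepare_lemma_treebank UD_English-EWT
python3 -m stanza.utils.training.run_lemma UD_English-EWT
python3 -m stanza.utils.datasets.prepare_pos_treebank UD_English-EWT
python3 -m stanza.utils.training.run_pos UD_English-EWT
python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_English-EWT
python3 -m stanza.utils.training.run_depparse UD_English-EWT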
Word Vectors
The POS models use word vectors and the charlm. We provide default word vectors for existing models, which the `run_pos` script will download. You can also provide your own. You will first need to convert the word vectors to a `.pt` file. Once that is done, you can specify the path to the new vectors file with the `--wordvec_pretrain_file` argument.
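As a sketch (the exact conversion entry point may differ between Stanza versions; recent versions include a `stanza.models.common.convert_pretrain` module, and the vocabulary cap of 150000 here is only an example):

# convert text-format vectors to a .pt file: arguments are output, input, max vocab size
python3 -m stanza.models.common.convert_pretrain my_vectors.pt my_vectors.txt 150000
# train POS with the converted vectors
python3 -m stanza.utils.training.run_pos UD_English-EWT --wordvec_pretrain_file my_vectors.pt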
The POS models also use the charlm for languages where that is supported. (On the TODO list is adding that feature to depparse.) If you want to test the effect of a new set of word vectors, you may want to use the `--no_charlm` flag to turn off the charlm models in the POS.
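For example, a vectors-only training run (with `my_vectors.pt` standing in for your own converted file) might look like:

python3 -m stanza.utils.training.run_pos UD_English-EWT --no_charlm --wordvec_pretrain_file my_vectors.pt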
Depparse retagging
At runtime, the dependency parser will not have gold tags, but rather predicted tags from the POS tagger. Accordingly, the `prepare_depparse_treebank` script will run the tagger to put predicted tags on the dependency dataset.
This behavior can be turned off with `--gold`, but that is not recommended. One situation where you might want to turn off retagging is when testing the effects of different word embeddings. In such a situation, you can isolate the effect of the word vectors on the dependency parsing by using gold tags instead of tags predicted by POS.
If a tagger model is present in the POS save directory, that model will be used. In other words, a retrained model will be the preferred model to use. Otherwise, `prepare_depparse_treebank` will download the default model for the given treebank. For some languages, that will include the charlm, which will also be automatically downloaded.
The path to the tagger model can also be manually specified with the `--tagger_model` flag.
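For example, to retag with a specific tagger checkpoint (the path here is hypothetical; point it at wherever your retrained tagger was saved):

python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_English-EWT --tagger_model saved_models/pos/en_ewt_tagger.pt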
`prepare_depparse_treebank` flags:
| Flag | Behavior |
|---|---|
| `--predicted` | Turn on retagging for the depparse dataset (default) |
| `--gold` | Turn off retagging for the depparse dataset |
| `--tagger_model` | Where to find the POS tagger. An attempt will be made to find it if not specified |
| `--wordvec_pretrain_file` | Word vectors to use for the POS tagger when retagging. A default will be downloaded if not specified |
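Putting this together, the word-embedding ablation described above might look like the following sketch. `my_vectors.pt` is a placeholder for your own converted vectors file, and this assumes `run_depparse` forwards `--wordvec_pretrain_file` through to the parser, as the run scripts generally pass extra arguments to the underlying model:

# prepare the depparse data with gold tags to isolate the word vector effect
python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_English-EWT --gold
# train the parser with the custom vectors
python3 -m stanza.utils.training.run_depparse UD_English-EWT --wordvec_pretrain_file my_vectors.pt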