PyTorch Dependency Parser

Overview

CoreNLP dependency parser models are trained with a PyTorch system for speed reasons. The PyTorch models can then be converted to the format CoreNLP’s dependency parser expects.

The purpose of this library is to train models for the Java code base. If you want a full-featured Python dependency parser, you should look into using Stanza instead.
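For reference, here is a minimal sketch of dependency parsing with Stanza; the language code and example sentence are illustrative:

import stanza

stanza.download('it')  # download the Italian models (run once)
nlp = stanza.Pipeline('it', processors='tokenize,mwt,pos,lemma,depparse')

doc = nlp('Il gatto dorme.')
for sentence in doc.sentences:
    for word in sentence.words:
        # head is the 1-based index of the governing word (0 = root)
        print(word.id, word.text, word.head, word.deprel)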

The code repo can be found here.

Example Usage

First, train a model. Make sure you have recent versions of PyTorch and Stanza installed.
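If needed, both can be installed from PyPI:

pip install torch stanza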

Here is an example command for training an Italian model (run from the code directory):

python train.py -l universal -d /path/to/data --train_file it-train.conllu --dev_file it-dev.conllu --embedding_file /path/to/it-embeddings.txt --embedding_size 100 --random_seed 21 --learning_rate .005 --l2_reg .01 --epsilon .001 --optimizer adamw --save_path /path/to/experiment-dir --job_id experiment-name --corenlp_tags --corenlp_tag_lang italian --n_epoches 2000

The data files should be in CoNLL-U (*.conllu) format.
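Each sentence in a CoNLL-U file is a block of tab-separated lines with ten columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The short Italian example below is illustrative:

# text = Il gatto dorme.
1	Il	il	DET	RD	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	2	det	_	_
2	gatto	gatto	NOUN	S	Gender=Masc|Number=Sing	3	nsubj	_	_
3	dorme	dormire	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
4	.	.	PUNCT	FS	_	3	punct	_	_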

Note that the above command automatically tags the input data with the CoreNLP tagger. This means you need CoreNLP and the Italian models (for this example) on your CLASSPATH, and you need the latest version of Stanza installed.
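For example, assuming the CoreNLP distribution is unpacked under /path/to/corenlp (the path and jar version here are illustrative):

export CLASSPATH=/path/to/corenlp/stanford-corenlp-4.5.4.jar:/path/to/corenlp/stanford-corenlp-4.5.4-models-italian.jar:$CLASSPATH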

Why is this done? When CoreNLP runs a dependency parser, it relies on part-of-speech tags, so for optimal performance the training and development data need to carry the same predicted tags CoreNLP will produce at parse time.

After the model is trained, it can be converted to a format usable by CoreNLP:

python gen_model.py -o /path/to/italian-corenlp-parser.txt /path/to/experiment-dir/experiment-name

This will save a model usable by CoreNLP at /path/to/italian-corenlp-parser.txt.
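The converted model can then be loaded through the depparse.model property. A sketch of running the standard Italian pipeline with the new model (the paths are illustrative, and the CoreNLP jars must be on the classpath as above):

java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-italian.properties -depparse.model /path/to/italian-corenlp-parser.txt -file input.txt -outputFormat conllu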

Where To Get Data

You can find data for training models at the official Universal Dependencies site (universaldependencies.org).
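Individual treebanks can also be cloned directly from the Universal Dependencies GitHub organization; the Italian ISDT treebank below is one illustrative choice:

git clone https://github.com/UniversalDependencies/UD_Italian-ISDT.git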