Migration

4.0.0

Dependency parsing for English, French, German, and Spanish now uses UD 2.0 dependencies

The neural models for English, French, German, and Spanish have been retrained with UD 2.0 dependencies, which changes several labels in the dependency parses. Info about UD 2.0 can be found here.

Tokenization has been upgraded to UD 2.0 for English, French, German, and Spanish.

Annotators, models, and rules for English, French, German, and Spanish now work with UD 2.0 tokenization by default. This includes models for tagging, parsing, named entity recognition (with an important exception), and KBP relation extraction. For example, the English tokenizer now splits most hyphenated tokens, does not normalize parentheses (e.g. it no longer turns ( into -LRB-), and does not normalize quotation marks.

A specialized tokenization that is mostly the UD 2.0 version is used for named entity recognition (see below).

Custom models trained with version 3.9.2 or earlier may perform worse, since they expect a different tokenization standard. It is advised to retrain such models with tokenization that follows the UD 2.0 standard. If retraining an NER model, note that the training data should not split tokens on hyphens.

The tokenization process for these languages has been designed to maximize F1 on dev/test sets from the CoNLL 2018 shared task, similar to Stanza.

Examples of UD 2.0 tokenization for these languages can be found here.

NER Specific Tokenization

A complication is that the UD 2.0 standard for English and German says to split tokens on hyphens, but this can hurt named entity recognition. Consider double-barrelled names such as Daniel Day-Lewis or hyphenated place names such as Bergen-Enkheim. Splitting on hyphens was found to lower the F1 score, so hyphen splitting is mostly deactivated for named entity recognition. The only exceptions are the following keywords: based, area, registered, headquartered, native, born, raised, backed, controlled, owned, resident, trained, educated. So Chicago-based WILL be split into Chicago - based, which allows the token Chicago to be recognized as a CITY.

By default, the NERAnnotator takes in UD 2.0 tokens and merges all tokens that were originally joined by a hyphen in the text (except for cases like Chicago-based). The model is run on the modified token list, and the resulting labels are then applied to the original UD 2.0 tokens. This behavior can be turned off by setting ner.useNERSpecificTokenization to false.
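
The merge step described above can be sketched as follows. This is an illustrative sketch only, not the CoreNLP implementation: the class and method names are hypothetical, it assumes every "-" token was attached to its neighbours in the original text, it assumes the keyword is the word right after the hyphen, and it compares keywords case-insensitively. It leaves out the final step of copying labels back onto the original UD 2.0 tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of CoreNLP.
class NerHyphenMergeSketch {

    // Keywords listed above: a hyphen in front of one of these stays split.
    static final Set<String> SPLIT_KEYWORDS = new HashSet<>(Arrays.asList(
        "based", "area", "registered", "headquartered", "native", "born", "raised",
        "backed", "controlled", "owned", "resident", "trained", "educated"));

    // Re-joins tokens around a "-" token unless the following word is a keyword.
    // Assumption: every "-" in the input was attached to both neighbours in the text.
    static List<String> mergeForNer(List<String> udTokens) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < udTokens.size()) {
            String token = udTokens.get(i);
            boolean innerHyphen =
                token.equals("-") && !merged.isEmpty() && i + 1 < udTokens.size();
            if (innerHyphen
                    && !SPLIT_KEYWORDS.contains(udTokens.get(i + 1).toLowerCase())) {
                // Glue "previous" + "-" + "next" back into a single token.
                String previous = merged.remove(merged.size() - 1);
                merged.add(previous + "-" + udTokens.get(i + 1));
                i += 2;
            } else {
                merged.add(token);
                i += 1;
            }
        }
        return merged;
    }
}
```

With this sketch, the tokens Day, -, Lewis are re-joined into Day-Lewis, while Chicago, -, based are left as three tokens because based is in the keyword list.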

MWT annotator required for French, German, and Spanish

Related to the tokenization change, French, German, and Spanish now require the MWTAnnotator, which splits some tokens into multiple words using rules and statistical models. For instance, the French token “des” is sometimes split into the words “de” and “les”.

Some multi-word token splitting for these languages used to occur in the tokenize annotator, but now this annotator focuses on creating tokens, and the mwt annotator is used to make token splitting decisions, sometimes via a dictionary and other times via a statistical model.

These languages require that the mwt annotator be run immediately after the ssplit annotator.

For example, the German default annotators list has changed from

tokenize, ssplit, pos, ner, depparse

to

tokenize, ssplit, mwt, pos, ner, depparse
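
If annotator lists are built in code, the change above can be applied mechanically. The following is a minimal sketch using only java.util; the class and method names are hypothetical, and it touches only the annotators property, so any language-specific model properties still need to be set separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical migration helper for illustration; not part of CoreNLP.
class MwtMigrationSketch {

    // Returns the annotators list with "mwt" placed directly after "ssplit".
    // An "mwt" entry found elsewhere in the list is moved to that position.
    static String addMwt(String annotators) {
        List<String> names = new ArrayList<>();
        for (String name : annotators.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty() && !trimmed.equals("mwt")) {
                names.add(trimmed);
            }
        }
        int ssplit = names.indexOf("ssplit");
        if (ssplit < 0) {
            throw new IllegalArgumentException("no ssplit annotator in: " + annotators);
        }
        names.add(ssplit + 1, "mwt");
        return String.join(",", names);
    }

    // Builds a Properties object holding the migrated list.
    // The result would then be handed to the pipeline constructor
    // (not executed here, since that requires the CoreNLP jars and models).
    static Properties migratedProps(String oldAnnotators) {
        Properties props = new Properties();
        props.setProperty("annotators", addMwt(oldAnnotators));
        return props;
    }
}
```

Calling addMwt on the old German list yields the new one shown above, and calling it again on the result leaves the list unchanged.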

3.9.0

Annotator renaming

| Original name | New name |
| --- | --- |
| mention | coref.mention |
| quote_attribution | quote.attribution |

Several annotators have been enhanced to run other annotators.

This original annotators list:

tokenize,ssplit,pos,lemma,ner,regexner,entitymentions,parse,mention,coref,quote,quote_attribution

can now be expressed as:

tokenize,ssplit,pos,lemma,ner,parse,coref,quote

The ner, coref, and quote annotators will run some of the annotators themselves as sub-annotators. This means, for instance, that the ner annotator will run a combination of CRF classifiers (adding ner tags to tokens), then the TokensRegex-based regexner to produce fine-grained annotations (“LOCATION” -> “COUNTRY”), and finally it will annotate the full entity mentions (“Joe”, “Smith” -> “Joe Smith”) with its internal entitymentions annotator.

| Annotator | Sub-annotators |
| --- | --- |
| ner | regexner, entitymentions |
| coref | coref.mention |
| quote | quote.attribution |
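
For code that still holds pre-3.9.0 annotator lists, the mapping above can be applied as a simple rewrite. This is a sketch under assumptions: the class and method names are hypothetical, a sub-annotator is dropped only when its parent annotator is in the same list, and old names without a parent are renamed rather than removed. It assumes the standalone entries did only what the built-in sub-annotators now do, so any custom regexner settings would need to move to ner.fine.regexner properties.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical migration helper for illustration; not part of CoreNLP.
class AnnotatorListMigrationSketch {

    // Old standalone annotator -> annotator that now runs it internally.
    static final Map<String, String> PARENT = new HashMap<>();
    // Old annotator name -> new annotator name.
    static final Map<String, String> RENAMED = new HashMap<>();
    static {
        PARENT.put("regexner", "ner");
        PARENT.put("entitymentions", "ner");
        PARENT.put("mention", "coref");
        PARENT.put("quote_attribution", "quote");
        RENAMED.put("mention", "coref.mention");
        RENAMED.put("quote_attribution", "quote.attribution");
    }

    static String migrate(String oldAnnotators) {
        List<String> names = new ArrayList<>();
        for (String name : oldAnnotators.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        List<String> result = new ArrayList<>();
        for (String name : names) {
            String parent = PARENT.get(name);
            if (parent != null && names.contains(parent)) {
                continue; // the parent annotator now runs this one itself
            }
            result.add(RENAMED.getOrDefault(name, name));
        }
        return String.join(",", result);
    }
}
```

Applied to the original list shown earlier, migrate returns the shorter list tokenize,ssplit,pos,lemma,ner,parse,coref,quote.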

You can run the ner annotator without the additional annotators by setting these options:

ner.applyFineGrained = false
ner.buildEntityMentions = false

If you wish to set parameters for the ner annotator’s internal regexner annotator, set ner.fine.regexner properties. For instance:

ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/spanish/kbp_regexner_mapping_sp.tag

Likewise, to configure the ner annotator’s internal entitymentions annotator, set ner.entitymentions properties. For instance:

ner.entitymentions.acronyms = true

Likewise, for coref annotation you can shut off the built-in coref mention detection (if you want to use a custom coref mention annotator):

coref.useCustomMentionDetection = true

And for quote annotation, quote attribution can be deactivated with:

quote.attributeQuotes = false
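
The same options can also be set programmatically. The sketch below collects the properties from this section in a java.util.Properties object; the class and method names are hypothetical, and the property keys are the ones listed above. It only builds the configuration and does not start a pipeline.

```java
import java.util.Properties;

// Hypothetical configuration helper for illustration; not part of CoreNLP.
class SubAnnotatorOptionsSketch {

    // Configuration that switches off the sub-annotators described above.
    static Properties withoutSubAnnotators() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref,quote");
        // ner: skip the fine-grained regexner pass and entity mention building.
        props.setProperty("ner.applyFineGrained", "false");
        props.setProperty("ner.buildEntityMentions", "false");
        // coref: skip built-in mention detection. This only makes sense if a
        // custom mention annotator is added to the annotators list as well.
        props.setProperty("coref.useCustomMentionDetection", "true");
        // quote: skip quote attribution.
        props.setProperty("quote.attributeQuotes", "false");
        return props;
    }
}
```

Keeping the options in one helper like this makes it easier to see which sub-annotators a given pipeline has disabled.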