Table of contents
The neural models for English, French, German, and Spanish have been retrained with UD 2.0 dependencies, this will change several labels for the dependency parses. Info about UD 2.0 can be found here.
Annotators, models, and rules for English, French, German, and Spanish now work with UD 2.0 tokenization by default. This includes models for tagging, parsing, named entity recognition (with an important exception), and KBP relation extraction. For example, the English tokenizer now splits most hyphenated tokens, does not normalize parentheses (e.g. turn
-LRB-), and does not normalize quotation marks.
A specialized tokenization that is mostly the UD 2.0 version is used for named entity recognition (see below).
Custom models trained with version 3.9.2 or earlier may suffer performance issues since they expect a different tokenization standard. It is advised to retrain models with tokenization that follows the UD 2.0 standard. If retraining an NER model, note that the training data should not split tokens on hyphen.
The tokenization process for these languages has been designed to maximize F1 on dev/test sets from the CoNLL 2018 shared task, similar to Stanza.
Examples of UD 2.0 tokenization for these languages can be found here.
A complication is that the UD 2.0 standard for English and German says to split tokens on hyphen, but this can lead to diminished performance. Consider the example of double barrel names such as
Daniel Day-Lewis or hyphenated place names such as
Bergen-Enkheim. It was found that splitting on hyphen dropped F1 score, so the hyphen splitting is mostly deactivated for named entity recognition. The only exceptions are the following key words:
based, area, registered, headquartered, native, born, raised, backed, controlled, owned, resident, trained, educated. So
Chicago-based WILL be split into
based to allow for the token
Chicago to be recognized as a
The NERAnnotator by default takes in UD 2.0 tokens, and then merges all tokens that were originally joined by a hyphen in the text (except for cases like Chicago-based). The model is run on the modified tokens list, and the labels are finally applied to the original UD 2.0 tokens. This behavior can be turned off by setting
Related to the tokenization change, French, German, and Spanish now require the use of the MWTAnnotator which splits some tokens into multiple words with rules and statistical models. For instance the French token “des” is sometimes split into the words “de” and “les”.
Some multi-word token splitting for these languages used to occur in the
tokenize annotator, but now this annotator focuses on creating tokens, and the
mwt annotator is used to make token splitting decisions, sometimes via a dictionary and other times via a statistical model.
These languages require the
mwt annotator be run immediately after the
For example, the German default annotators list has changed from
tokenize, ssplit, pos, ner, depparse
tokenize, ssplit, mwt, pos, ner, depparse
|Original name||New name|
This original annotators list:
can now be expressed as:
quote annotators will run some of the annotators themselves as sub-annotators. This means for instance that the
ner annotator will run a combination of CRF classifiers (adding ner tags to tokens), then the TokensRegex based
regexner to produce fine-grained annotations (“LOCATION” -> “COUNTRY”), and then finally it will annotate the full entity mentions (“Joe”, “Smith” –> “Joe Smith”) with its internal
You can run the
ner annotator without the additional annotators with these options
ner.applyFineGrained = false ner.buildEntityMentions = false
If you wish to set parameters for the
ner annotator’s internal
regexner annotator set
ner.fine.regexner properties. For instance:
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/spanish/kbp_regexner_mapping_sp.tag
Likewise to set the
ner annotator’s internal
entitymentions annotator, set
ner.entitymentions properties. For instance:
ner.entitymentions.acronyms = true
coref annotation you can shut off the coref mention detection (if you want to use a custom coref mention annotator)
coref.useCustomMentionDetection = true
quote annotation, quote attribution can be deactivated with
quote.attributeQuotes = false