TokensRegexNERAnnotator

NERCombinerAnnotator Runs TokensRegexNERAnnotator As A Sub-Annotator
Description
Options
Mapping files

NERCombinerAnnotator Runs TokensRegexNERAnnotator As A Sub-Annotator

Some annotators are actually more commonly used as sub-annotators. That is one annotator runs another as a sub-component.

In the current version of Stanford CoreNLP a TokensRegexNERAnnotator is run as a sub-annotator of a comprehensive named entity recognition process by the NERCombinerAnnotator. It is advised to review how the overall ner annotator works by reviewing the documentation for the NERCombinerAnnotator. In most cases you will probably just want to modify settings for the ner annotator to integrate your own rules into a pipeline rather than build a separate regexner annotator. For instance, the ner annotator handles building entity mentions from named entity tags, so you would want your custom rules to be run by the ner annotator.

That being said, it is possible to run this annotator in isolation by specifying regexner in your list of annotators.

This documentation remains as a detailed description of how a TokensRegexNERAnnotator works.

Description

RegexNER Implements a simple, rule-based NER system over token sequences using an extension of Java regular expressions. The original goal of this Annotator was to provide a simple framework to incorporate named entities and named entity labels that are not annotated in traditional NER corpora, and hence not recoginized by our statistical NER classifiers. However, you can also use this annotator to simply do rule-based NER. Here is a simple example of how to use RegexNER. RegexNER is implemented using TokensRegex. For more complex applications, you might consider using TokensRegex directly.

For English, we distribute CoreNLP with two files containing a default list of regular expressions, many just gazette entries, which label more fine-grained LOCATION-related subcategories (COUNTRY, STATE_OR_PROVINCE, CITY, NATIONALITY), the commonest online identifiers (URL, EMAIL, HANDLE), and a few miscellaneous categories originating from the TAC KBP evaluations (TITLE, IDEOLOGY, RELIGION, CRIMINAL_CHARGE, CAUSE_OF_DEATH). Here, TITLE refers to job titles.

Property name	Annotator class name	Generated Annotation
regexner	TokensRegexNERAnnotator	NamedEntityTagAnnotation

Options

Option name	Type	Default	Description
`regexner.ignorecase`	boolean	`false`	If `true`, case is ignored for all patterns in all files.
`regexner.mapping`	String	`see below`	Comma separated list of mapping files to use. Each mapping file is a tab-delimited file.
`regexner.mapping.header`	String	`pattern,ner,overwrite,priority,group`	Comma separated list of header fields (or `true` if header is specified in the mapping file).
`regexner.mapping.field.<fieldname>`	String	null	Class name for CoreLabel key for annotation fields other than NER.
`regexner.commonWords`	String	null	Comma separated list of files for common words to not annotate (in case your mapping isn’t very clean).
`regexner.backgroundSymbol`	String	O,MISC	Comma separated list of NER labels that can always be replaced.
`regexner.posmatchtype`	enum	`MATCH_AT_LEAST_ONE_TOKEN`	How should `validpospattern` be used to match the POS of the tokens. `MATCH_ALL_TOKENS` - All tokens have to match. `MATCH_AT_LEAST_ONE_TOKEN` - At least one token has to match. `MATCH_ONE_TOKEN_PHRASE_ONLY` - Only has to match for one token phrases.
`regexner.validpospattern`	regex	null	Regular expression pattern for matching POS tags.
`regexner.noDefaultOverwriteLabels`	String	null	Comma-separated list of output types for which default NER labels are not overwritten. For these types, only if the matched expression has an NER type matching the specified `overwriteableType` for the regex will the NER type be overwritten.
`regexner.verbose`	boolean	`false`	If `true`, turns on extra debugging messages.

Mapping files

The mapping file is a tab-delimited file.

The format of the default mapping file used by RegexNER is described in more detail on the Stanford NLP website.

The format and the output fields can be changed by specifying a different regexner.mapping.header.

For instance, if you wanted to mark “Stanford University” as having NER tag of “SCHOOL” and linked to “https://en.wikipedia.org/wiki/Stanford_University”, you can do so by adding normalized to the headers:

regexner.mapping.header=pattern,ner,normalized,overwrite,priority,group
# Not needed, but illustrate how to link a field to an annotation
regexner.mapping.field.normalized=edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation

Then you would have in your mapping file have entries such as:

Stanford University\tSCHOOL\thttps://en.wikipedia.org/wiki/Stanford_University

(In the above example, \t is used to indicate where a tab should occur.)

Note that the pattern field can be either

a sequence of regex over the token text, each separated by whitespace (matching “\s+”).
Example: University of .*\tSCHOOL

a TokensRegex expression (marked by starting with “( “ and ending with “ )”.
Example: ( /University/ /of/ [ {ner:LOCATION} ] )\tSCHOOL
Using TokensRegex patterns allows for matching on other annotated fields such as POS or NER.

TokensRegexNERAnnotator

Table of contents

NERCombinerAnnotator Runs TokensRegexNERAnnotator As A Sub-Annotator

Description

Options

Mapping files