Named Entity Recognition
Description
Recognizes named entities (person and company names, etc.) in text. Principally, this annotator uses one or more machine learning sequence models to label entities, but it may also call specialist rule-based components, such as for labeling and interpreting times and dates. Numerical entities that require normalization, e.g., dates, have their normalized value stored in NormalizedNamedEntityTagAnnotation. For more extensive support for rule-based NER, you may also want to look at the RegexNER annotator. The set of entities recognized is language-dependent, and the recognized set of entities is frequently more limited for other languages than what is described below for English. As the name “NERClassifierCombiner” implies, this annotator commonly runs several named entity recognizers and then combines their results, but it can also run just a single recognizer or only rule-based quantity NER.
For English, by default, this annotator recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities (12 classes). Adding the regexner annotator and using the supplied RegexNER pattern files adds support for the fine-grained and additional entity classes EMAIL, URL, CITY, STATE_OR_PROVINCE, COUNTRY, NATIONALITY, RELIGION, (job) TITLE, IDEOLOGY, CRIMINAL_CHARGE, CAUSE_OF_DEATH, (Twitter, etc.) HANDLE (12 classes) for a total of 24 classes. Named entities are recognized using a combination of three CRF sequence taggers trained on various corpora, including CoNLL, ACE, MUC, and ERE corpora. Numerical entities are recognized using a rule-based system.
Property name | Annotator class name | Generated Annotation |
---|---|---|
ner | NERClassifierCombiner | NamedEntityTagAnnotation and NormalizedNamedEntityTagAnnotation |
Options
Option name | Type | Default | Description |
---|---|---|---|
ner.model | List(String) | null | A comma-separated list of NER model names (a single name is also fine). If none are specified, a default list of English models is used (the 3-class, 7-class, and 4-class (MISC) models, in that order). The names will be looked for as classpath resources, filenames, or URLs. |
ner.rulesOnly | boolean | false | Whether or not to only run rule-based NER. |
ner.statisticalOnly | boolean | false | Whether or not to only run statistical NER. |
ner.applyNumericClassifiers | boolean | true | Whether or not to use numeric classifiers, for money, percent, numbers, including SUTime. These are hardcoded for English, so if using a different language, this should be set to false. |
ner.applyFineGrained | boolean | true | Whether or not to apply fine-grained NER tags (e.g. LOCATION -> CITY); this slows down tagging. |
ner.buildEntityMentions | boolean | true | Whether or not to build entity mentions from token NER tags. |
ner.combinationMode | String | NORMAL | When set to NORMAL, each tag can only be applied by the first CRF classifier that applies that tag; when set to HIGH_RECALL, all CRF classifiers can apply all of their tags. |
ner.useNERSpecificTokenization | boolean | true | Whether or not to use NER-specific tokenization which merges tokens separated by hyphens. Models released with Stanford CoreNLP 4.0.0 expect a tokenization standard that does NOT split on hyphens. |
ner.useSUTime | boolean | true | Whether or not to use SUTime. SUTime at present only supports English; if not processing English, make sure to set this to false. |
sutime.markTimeRanges | boolean | false | Tells SUTime whether to mark phrases such as “From January to March” as a range, instead of marking “January” and “March” separately. |
sutime.includeRange | boolean | false | If marking time ranges, set the time range in the TIMEX output from SUTime. |
maxAdditionalKnownLCWords | int | - | Limit the size of the known lower case words set. Set this to 0 to prevent ordering issues (i.e., when this is nonzero, running on document1 then document2 can give different results than running on document2 then document1). |
NER Pipeline Overview
The full named entity recognition pipeline has become fairly complex and involves a set of distinct phases integrating statistical and rule-based approaches. Here is a breakdown of those phases.
The main class that runs this process is edu.stanford.nlp.pipeline.NERCombinerAnnotator.
Statistical Models
During this phase a series of trained CRFs is run on each sentence. These CRFs are trained on large tagged data sets. They evaluate the entire sequence and pick the optimal tag sequence.
These are the default models that are run:
# tags: LOCATION, ORGANIZATION, PERSON
edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz
# tags: DATE, LOCATION, MONEY, ORGANIZATION, PERCENT, PERSON, TIME
edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz
# tags: LOCATION, MISC, ORGANIZATION, PERSON
edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz
Tags written by one model cannot be overwritten by subsequent models in the series.
There are two options for how the models are combined. These are selected with the ner.combinationMode property.
- NORMAL - any given tag can only be applied by one model (the first model that applies a tag)
- HIGH_RECALL - all models can apply all tags
So for example, if ner.combinationMode is set to NORMAL, only the 3-class model’s ORGANIZATION tags will be applied. If it is set to HIGH_RECALL, the 7-class and 4-class models’ ORGANIZATION tags will also be applied.
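The difference between the two modes can be sketched in a few lines of plain Java. This is a simplified illustration, not the CoreNLP implementation: the class and method names are hypothetical, merging is done token by token, and a tag counts as claimed as soon as an earlier model is able to produce it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CombinationModeSketch {
    enum Mode { NORMAL, HIGH_RECALL }

    /**
     * Merges per-token tags from models listed in priority order.
     * labelSets.get(m) is the set of tags model m can produce (assumption:
     * this is what decides which model "owns" a tag in NORMAL mode).
     */
    static List<String> combine(List<Set<String>> labelSets,
                                List<List<String>> predictions,
                                Mode mode) {
        List<String> merged = new ArrayList<>(predictions.get(0));
        Set<String> claimed = new HashSet<>(labelSets.get(0));
        for (int m = 1; m < predictions.size(); m++) {
            List<String> tags = predictions.get(m);
            for (int i = 0; i < merged.size(); i++) {
                String tag = tags.get(i);
                boolean allowed = mode == Mode.HIGH_RECALL || !claimed.contains(tag);
                // Earlier models win: only tokens still tagged O can be filled in.
                if (allowed && !tag.equals("O") && merged.get(i).equals("O")) {
                    merged.set(i, tag);
                }
            }
            claimed.addAll(labelSets.get(m));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Set<String>> labelSets = List.of(
            Set.of("LOCATION", "ORGANIZATION", "PERSON"),
            Set.of("DATE", "LOCATION", "MONEY", "ORGANIZATION", "PERCENT", "PERSON", "TIME"));
        // Made-up predictions for the tokens: Joe works at Acme Corp
        List<List<String>> predictions = List.of(
            List.of("PERSON", "O", "O", "O", "O"),
            List.of("PERSON", "O", "O", "ORGANIZATION", "ORGANIZATION"));
        System.out.println(combine(labelSets, predictions, Mode.NORMAL));      // [PERSON, O, O, O, O]
        System.out.println(combine(labelSets, predictions, Mode.HIGH_RECALL)); // [PERSON, O, O, ORGANIZATION, ORGANIZATION]
    }
}
```

In this toy input the 3-class model misses the organization, so NORMAL leaves it untagged while HIGH_RECALL accepts the 7-class model's ORGANIZATION tags.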
If you do not want to run any statistical models, set ner.model to the empty string.
Numeric Sequences and SUTime
Next, a series of rule-based systems is run to recognize and tag numeric and time-related sequences.
This phase runs by default, but can be deactivated by setting ner.applyNumericClassifiers to false.
This produces tags such as NUMBER, ORDINAL, MONEY, DATE, and TIME.
The class that runs this phase is edu.stanford.nlp.ie.regexp.NumberSequenceClassifier.
SUTime (described in more detail below) is also used by default. You can deactivate this by setting ner.useSUTime to false.
Fine Grained NER
At this point, a series of rules used for the KBP 2017 competition will be run to create more fine-grained NER tags. These rules are applied using a TokensRegexNERAnnotator sub-annotator. That is, the main NERCombinerAnnotator builds a TokensRegexNERAnnotator as a sub-annotator and runs it on all sentences as part of its overall tagging process. The purpose of these rules is to give tokens more specific tags. So for instance California would be tagged as a STATE_OR_PROVINCE rather than just a LOCATION.
The TokensRegexNERAnnotator runs TokensRegex rules. All of its settings are described in the TokensRegexNERAnnotator documentation.
NOTE: applying these rules will significantly slow down the tagging process.
The tags set by this phase include:
CAUSE_OF_DEATH, CITY, COUNTRY, CRIMINAL_CHARGE, EMAIL, HANDLE,
IDEOLOGY, NATIONALITY, RELIGION, STATE_OR_PROVINCE, TITLE, URL
If you do not want to run the fine-grained rules, set ner.applyFineGrained to false.
RegexNER Rules Format
There is a more detailed write-up in the RegexNER documentation.
The format is a series of tab-delimited columns.
The first column is the tokens pattern, the second column is the NER tag to apply, the third is the types of NER tags that can be overwritten, and the fourth is a priority used for tie-breaking if two rules match a sequence.
Each space-delimited entry in the first column is a regex that must match one token.
The rule (remember these are tab-delimited columns):
Los Angeles CITY LOCATION,MISC 1.0
means to match the token “Los” followed by the token “Angeles”, and label them both as CITY, provided they have a current NER tag of O, LOCATION, or MISC.
The rule:
Bachelor of (Arts|Science) DEGREE MISC 1.0
means to match the token “Bachelor”, then the token “of”, and finally either the token “Arts” or “Science”.
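To make the column semantics concrete, here is a minimal sketch in plain Java that parses one rule line and checks whether it matches at a given token position. It is only an illustration of the format described above, not the RegexNER implementation: the class name is hypothetical, all four columns are assumed to be present, and priorities are parsed but not used for tie-breaking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class RegexNerRuleSketch {
    final List<Pattern> tokenPatterns = new ArrayList<>();
    final String tag;
    final Set<String> overwritable = new HashSet<>();
    final double priority;

    /** Parses a tab-delimited rule: pattern, tag, overwritable tags, priority. */
    RegexNerRuleSketch(String line) {
        String[] cols = line.split("\t");
        // Column 1: one regex per token, separated by spaces.
        for (String tokenRegex : cols[0].split(" ")) {
            tokenPatterns.add(Pattern.compile(tokenRegex));
        }
        tag = cols[1];
        // Tokens tagged O can always be relabeled; column 3 adds further tags.
        overwritable.add("O");
        overwritable.addAll(Arrays.asList(cols[2].split(",")));
        priority = Double.parseDouble(cols[3]);
    }

    /** True if every token regex matches and every current tag may be overwritten. */
    boolean matchesAt(List<String> words, List<String> currentTags, int start) {
        if (start + tokenPatterns.size() > words.size()) {
            return false;
        }
        for (int i = 0; i < tokenPatterns.size(); i++) {
            boolean wordOk = tokenPatterns.get(i).matcher(words.get(start + i)).matches();
            boolean tagOk = overwritable.contains(currentTags.get(start + i));
            if (!wordOk || !tagOk) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RegexNerRuleSketch rule = new RegexNerRuleSketch("Los Angeles\tCITY\tLOCATION,MISC\t1.0");
        System.out.println(rule.matchesAt(
            List.of("Los", "Angeles"), List.of("LOCATION", "LOCATION"), 0)); // true
        System.out.println(rule.matchesAt(
            List.of("Los", "Angeles"), List.of("PERSON", "PERSON"), 0));     // false
    }
}
```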
Customizing The Fine-Grained NER
Here is a breakdown of how to customize the fine-grained NER. The overall ner annotator creates a sub-annotator called ner.fine.regexner, which is an instance of a TokensRegexNERAnnotator.
The ner.fine.regexner.mapping property allows one to specify a set of rules files and additional properties for each rules file.
The format is as follows:
- For each rules file there is a comma-delimited list of options, ending in the path of the rules file
- Each entry for a rules file is separated by a ;
As an example, this is the default ner.fine.regexner.mapping setting:
ignorecase=true,validpospattern=^(NN|JJ).*,edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab;edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab
The two rules files are:
edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab
edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab
The options for edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab are:
ignorecase=true,validpospattern=^(NN|JJ).*
while there are no options set for edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab in this example.
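The structure of the mapping string can be illustrated with a small parser in plain Java. This is a sketch of the format as described above, not CoreNLP's own parsing code: the class names are hypothetical, and it assumes that option values contain no commas or semicolons, which holds for the default setting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RegexNerMappingSketch {
    /** One rules file plus the options that precede it in the mapping string. */
    static final class Entry {
        final String path;
        final Map<String, String> options = new LinkedHashMap<>();

        Entry(String path) {
            this.path = path;
        }
    }

    static List<Entry> parse(String mapping) {
        List<Entry> entries = new ArrayList<>();
        // Entries for different rules files are separated by ';'.
        for (String part : mapping.split(";")) {
            String[] fields = part.split(",");
            // The last comma-separated field is the rules file path.
            Entry entry = new Entry(fields[fields.length - 1]);
            // Everything before it is a key=value option for that file.
            for (int i = 0; i < fields.length - 1; i++) {
                String[] keyValue = fields[i].split("=", 2);
                entry.options.put(keyValue[0], keyValue[1]);
            }
            entries.add(entry);
        }
        return entries;
    }

    public static void main(String[] args) {
        String mapping = "ignorecase=true,validpospattern=^(NN|JJ).*,"
            + "edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab;"
            + "edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab";
        for (Entry entry : parse(mapping)) {
            System.out.println(entry.path + " -> " + entry.options);
        }
    }
}
```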
Here is a description of some common options for the TokensRegexNERAnnotator sub-annotator used by ner. You can find more details on the page for the TokensRegexNERAnnotator.
Option name | Type | Default | Description |
---|---|---|---|
ignorecase | boolean | false | Whether or not to make patterns case-insensitive. |
validpospattern | regex | null | Part-of-speech tag pattern that has to be matched. |
If you want to set global settings that will apply for all rules files, remember to use ner.fine.regexner.ignorecase and ner.fine.regexner.validpospattern. If you are setting options for a specific rules file with the ner.fine.regexner.mapping option, follow the pattern from above.
Additional TokensRegexNER Rules
After the fine-grained rules are run, users can optionally have their own additional rules run.
This second TokensRegexNERAnnotator sub-annotator has the name ner.additional.regexner and is customized in the same manner. This is for the case when users want to run their own rules after the standard rules we provide.
For instance, suppose you want to match sports teams after the previous NER steps have been run.
Your rules file, /path/to/sports_teams.rules, might look like this:
Boston Red Sox SPORTS_TEAM ORGANIZATION,MISC 1
Denver Broncos SPORTS_TEAM ORGANIZATION,MISC 1
Detroit Red Wings SPORTS_TEAM ORGANIZATION,MISC 1
Los Angeles Lakers SPORTS_TEAM ORGANIZATION,MISC 1
You could integrate this into the entire NER process by setting ner.additional.regexner.mapping to /path/to/sports_teams.rules.
By default no additional rules are run, so leaving ner.additional.regexner.mapping blank means this phase is not run at all.
Additional TokensRegex Rules
If you want to run a series of TokensRegex rules before entity building, you can specify a rules file for them. A TokensRegexAnnotator sub-annotator will be called. It has the name ner.additional.tokensregex.
Example command:
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.additional.tokensregex.rules example.rules -file example.txt -outputFormat text
You can learn more about TokensRegex rules in the TokensRegex documentation.
Entity Mention Detection
After all of the previous steps have been run, entity mention detection will be run to combine the tagged tokens into entities. The entity mention detection is based on the tagging scheme. This is accomplished with an EntityMentionsAnnotator sub-annotator.
You can find a more detailed description in the EntityMentionsAnnotator documentation.
If a basic IO tagging scheme (example: PERSON, ORGANIZATION, LOCATION) is used, all contiguous sequences of tokens with the same tag will be marked as an entity.
If a more advanced tagging scheme (such as BIO with tags like B-PERSON and I-PERSON) is used, sequences with the same tag split by a B-tag will be turned into multiple entities.
All of our models and rule files use a basic tagging scheme, but you could create your own models and rules that use BIO.
For instance, (Joe PERSON) (Smith PERSON) (Jane PERSON) (Smith PERSON) will create the entity Joe Smith Jane Smith.
On the other hand, (Joe B-PERSON) (Smith I-PERSON) (Jane B-PERSON) (Smith I-PERSON) will create two entities: Joe Smith and Jane Smith.
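The grouping behaviour for the two tagging schemes can be sketched as follows. This is a simplified stand-alone illustration in plain Java, not the EntityMentionsAnnotator code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class EntityMentionSketch {
    /** Groups contiguous tokens of the same type; a B- tag always starts a new mention. */
    static List<String> buildMentions(List<String> words, List<String> tags) {
        List<String> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentType = null;
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            boolean begins = tag.startsWith("B-");
            // Strip the B-/I- prefix if present; O means "no entity".
            String type = tag.equals("O") ? null
                : (begins || tag.startsWith("I-")) ? tag.substring(2) : tag;
            boolean continues = type != null && type.equals(currentType) && !begins;
            if (!continues) {
                // Close the mention in progress, if any, and start afresh.
                if (currentType != null) {
                    mentions.add(current.toString());
                }
                current.setLength(0);
                currentType = type;
            }
            if (type != null) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        if (currentType != null) {
            mentions.add(current.toString());
        }
        return mentions;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Joe", "Smith", "Jane", "Smith");
        System.out.println(buildMentions(words,
            List.of("PERSON", "PERSON", "PERSON", "PERSON")));         // [Joe Smith Jane Smith]
        System.out.println(buildMentions(words,
            List.of("B-PERSON", "I-PERSON", "B-PERSON", "I-PERSON"))); // [Joe Smith, Jane Smith]
    }
}
```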
You can deactivate this by setting ner.buildEntityMentions to false.
At this point the NER process will be finished, having tagged tokens with NER tags and created entities.
Command Line Examples
There are a variety of ways to customize an NER pipeline. Below are some example commands.
# run default NER
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -file example.txt -outputFormat text
# only run rule-based NER (numeric classifiers, SUTime, TokensRegexNER, TokensRegex)
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.rulesOnly -file example.txt
# only run statistical NER
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.statisticalOnly -file example.txt
# shut off numeric classifiers
# note that in this case ner no longer requires pos or lemma
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ner -ner.applyNumericClassifiers false -file example.txt -outputFormat text
# shut off SUTime
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.useSUTime false -file example.txt -outputFormat text
# specify doc date for each document to be 2019-01-01
# other options for setting doc date specified below
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.docdate.useFixedDate 2019-01-01 -file example.txt
# shut off fine grained NER
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.applyFineGrained false -file example.txt -outputFormat text
# run fine-grained NER with a custom rules file
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.fine.regexner.mapping custom.rules -file example.txt -outputFormat text
# run fine-grained NER with two custom rules files
# the first rules file caseless.rules should be case-insensitive, the second rules file uses default options
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.fine.regexner.mapping "ignorecase=true,caseless.rules;cased.rules" -file example.txt -outputFormat text
# add additional rules to run after fine-grained NER
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.additional.regexner.mapping additional.rules -file example.txt -outputFormat text
# run tokens regex rules
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.additional.tokensregex.rules example.rules -file example.txt -outputFormat text
# don't build entity mentions
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner -ner.buildEntityMentions false -file example.txt -outputFormat text
Java API Example
package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;

import java.util.Properties;
import java.util.stream.Collectors;

public class NERPipelineDemo {

  public static void main(String[] args) {
    // set up pipeline properties
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,pos,lemma,ner");
    // example customizations (these are commented out but you can uncomment them to see the results)
    // disable fine grained ner
    // props.setProperty("ner.applyFineGrained", "false");
    // customize fine grained ner
    // props.setProperty("ner.fine.regexner.mapping", "example.rules");
    // props.setProperty("ner.fine.regexner.ignorecase", "true");
    // add additional rules, customize TokensRegexNER annotator
    // props.setProperty("ner.additional.regexner.mapping", "example.rules");
    // props.setProperty("ner.additional.regexner.ignorecase", "true");
    // add 2 additional rules files ; set the first one to be case-insensitive
    // props.setProperty("ner.additional.regexner.mapping", "ignorecase=true,example_one.rules;example_two.rules");
    // set document date to be a specific date (other options are explained in the document date section)
    // props.setProperty("ner.docdate.useFixedDate", "2019-01-01");
    // only run rule-based NER
    // props.setProperty("ner.rulesOnly", "true");
    // only run statistical NER
    // props.setProperty("ner.statisticalOnly", "true");
    // set up pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // make an example document
    CoreDocument doc = new CoreDocument("Joe Smith is from Seattle.");
    // annotate the document
    pipeline.annotate(doc);
    // view results
    System.out.println("---");
    System.out.println("entities found");
    for (CoreEntityMention em : doc.entityMentions()) {
      System.out.println("\tdetected entity: \t" + em.text() + "\t" + em.entityType());
    }
    System.out.println("---");
    System.out.println("tokens and ner tags");
    String tokensAndNERTags = doc.tokens().stream()
        .map(token -> "(" + token.word() + "," + token.ner() + ")")
        .collect(Collectors.joining(" "));
    System.out.println(tokensAndNERTags);
  }
}
SUTime
Stanford CoreNLP includes SUTime, Stanford’s temporal expression recognizer. SUTime is transparently called from the “ner” annotator, so no configuration is necessary. Furthermore, the “cleanxml” annotator can extract the reference date for a given XML document, so relative dates, e.g., “yesterday”, are transparently normalized with no configuration necessary.
SUTime supports the same annotations as before, i.e., NamedEntityTagAnnotation is set with the label of the numeric entity (DATE, TIME, DURATION, MONEY, PERCENT, or NUMBER) and NormalizedNamedEntityTagAnnotation is set to the value of the normalized temporal expression.
Also, SUTime sets the TimexAnnotation key to an edu.stanford.nlp.time.Timex object, which contains the complete list of TIMEX3 fields for the corresponding expressions, such as “val”, “alt_val”, “type”, “tid”. This might be useful to developers interested in recovering complete TIMEX3 expressions.
Reference dates are by default extracted from the “datetime” and “date” tags in an XML document. To use a different set of tags, set the clean.datetags property. When using the API, reference dates can be added to an Annotation via edu.stanford.nlp.ling.CoreAnnotations.DocDateAnnotation, although note that when processing an XML document, the cleanxml annotator will overwrite the DocDateAnnotation if “datetime” or “date” are specified in the document.
Setting Document Date
The DocDateAnnotator provides a variety of options for setting the document date. The ner annotator runs it as a sub-annotator, so these options can be specified by setting properties for the ner.docdate sub-annotator.
Option | Example | Description |
---|---|---|
useFixedDate | 2019-01-01 | Provide a fixed date for each document. |
useMappingFile | dates.txt | Use a tab-delimited file to specify doc dates. First column is document ID, second column is date. |
usePresent | - | Give every document the present date. |
useRegex | NYT-([0-9]{4}-[0-9]{2}-[0-9]{2}).xml | Specify a regular expression matching file names. The first group will be extracted as the date. |
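As an illustration of the useRegex option, the following plain-Java sketch extracts the first capture group from a file name. It only demonstrates the regular-expression idea from the table: the class and method names are hypothetical, and requiring the pattern to match the whole file name is an assumption rather than documented DocDateAnnotator behaviour.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DocDateRegexSketch {
    /** Returns the first capture group if the regex matches the whole file name. */
    static Optional<String> extractDate(String regex, String fileName) {
        Matcher matcher = Pattern.compile(regex).matcher(fileName);
        return matcher.matches() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String regex = "NYT-([0-9]{4}-[0-9]{2}-[0-9]{2}).xml";
        System.out.println(extractDate(regex, "NYT-2019-01-01.xml")); // Optional[2019-01-01]
        System.out.println(extractDate(regex, "notes.txt"));          // Optional.empty
    }
}
```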
Accessing Entity Confidences
The following example shows how to access label confidences for tokens and entities. Each token stores the probability of its NER label, as given by the CRF that assigned the label, in CoreAnnotations.NamedEntityTagProbsAnnotation.class. Each entity mention contains the probability of the token with the lowest label probability in its span. For example, if Los Angeles had the following probabilities:
{'word': 'Los', 'tag': 'LOCATION', 'prob': .992}
{'word': 'Angeles', 'tag': 'LOCATION', 'prob': .999}
the entity Los Angeles would be assigned the LOCATION tag with a confidence of .992.
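The rule that a mention takes the lowest probability among its tokens amounts to a minimum over the span. A stand-alone sketch in plain Java (the class and method names are hypothetical and not part of CoreNLP):

```java
import java.util.Collections;
import java.util.List;

final class MentionConfidenceSketch {
    /** A mention's confidence is the smallest label probability among its tokens. */
    static double mentionConfidence(List<Double> tokenProbabilities) {
        return Collections.min(tokenProbabilities);
    }

    public static void main(String[] args) {
        // Probabilities for "Los" and "Angeles" from the example above.
        System.out.println(mentionConfidence(List.of(0.992, 0.999))); // 0.992
    }
}
```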
Below is code for accessing these confidences.
package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;

import java.util.*;

public class NERConfidenceExample {

  public static void main(String[] args) {
    String exampleText = "Joe Smith lives in California.";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,pos,lemma,ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    CoreDocument document = new CoreDocument(exampleText);
    pipeline.annotate(document);
    // get confidences for entities
    for (CoreEntityMention em : document.entityMentions()) {
      System.out.println(em.text() + "\t" + em.entityTypeConfidences());
    }
    // get confidences for tokens
    for (CoreLabel token : document.tokens()) {
      System.out.println(token.word() + "\t" + token.get(CoreAnnotations.NamedEntityTagProbsAnnotation.class));
    }
  }
}
Caseless models
It is possible to run Stanford CoreNLP with NER models that ignore capitalization. We have trained models like this for English. You can find details on the Caseless models page.
Training or retraining new models
The train/dev/test data files should be in the following format:
Joe PERSON
Smith PERSON
lives O
in O
California LOCATION
. O

He O
used O
to O
live O
in O
Oregon LOCATION
. O
In this example, each line is a token, followed by a tab, followed by the NER tag. A blank line represents a sentence break. The model that we release is trained on over a million tokens. The more training data you have, the more accurate your model should be.
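Before training, it can be useful to check that a data file really follows this layout. The sketch below, in plain Java, reads the two-column format into sentences. It is a hypothetical helper for illustration and is not the reader CoreNLP uses during training.

```java
import java.util.ArrayList;
import java.util.List;

final class NerTrainingDataSketch {
    /** Splits the data into sentences; each token is a {word, tag} pair. */
    static List<List<String[]>> parse(String data) {
        List<List<String[]>> sentences = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        for (String line : data.split("\n", -1)) {
            if (line.trim().isEmpty()) {
                // A blank line closes the sentence in progress.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                // Token and NER tag are separated by a tab.
                String[] columns = line.split("\t");
                current.add(new String[] {columns[0], columns[1]});
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String data = "Joe\tPERSON\nSmith\tPERSON\nlives\tO\nin\tO\nCalifornia\tLOCATION\n.\tO\n"
            + "\n"
            + "He\tO\nused\tO\nto\tO\nlive\tO\nin\tO\nOregon\tLOCATION\n.\tO\n";
        List<List<String[]>> sentences = parse(data);
        System.out.println(sentences.size() + " sentences"); // 2 sentences
        for (List<String[]> sentence : sentences) {
            System.out.println(sentence.size() + " tokens");
        }
    }
}
```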
The standard training data sets used for PERSON/LOCATION/ORGANIZATION/MISC must be purchased from the LDC; we do not distribute them.
Here is the command for starting the training process (make sure your CLASSPATH is set up to include all of the Stanford CoreNLP jars):
java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -prop ner.model.props
The training process can be customized using a properties file. Here is an example properties file for training an English model (ner.model.props):
# location of training data
trainFileList = /path/to/conll.3class.train
# location of test data
testFile = /path/to/all.3class.test
# where to store the saved model
serializeTo = ner.model.ser.gz
type = crf
wordFunction = edu.stanford.nlp.process.AmericanizeFunction
useDistSim = false
# establish the data file format
map = word=0,answer=1
saveFeatureIndexToDisk = true
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useLongSequences=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
useOccurrencePatterns=true
useLastRealWord=true
useNextRealWord=true
normalize=true
wordShape=chris2useLC
useDisjunctive=true
disjunctionWidth=5
readerAndWriter=edu.stanford.nlp.sequences.ColumnDocumentReaderAndWriter
useObservedSequencesOnly=true
useQN = true
QNsize = 25
# makes it go faster
featureDiffThresh=0.05
There is more info about training a CRF model here.
You can learn more about what the various properties above mean here.
SUTime rules can be changed by modifying its included TokensRegex rule files. Changing other rule-based components (money, etc.) requires changes to the Java source code.
More information
For more details on the CRF tagger see this page.