Named Entity Recognition

Description

The named entity recognition (NER) module recognizes mention spans of a particular entity type (e.g., Person or Organization) in the input sentence. NER is widely used in many NLP applications such as information extraction or question answering systems. In Stanza, NER is performed by the NERProcessor and can be invoked by the name ner.

Note:

The NERProcessor currently supports 23 languages. All supported languages along with their training datasets can be found here.

Name	Annotator class name	Requirement	Generated Annotation	Description
ner	NERProcessor	tokenize, mwt	Named entities accessible through `Document` or `Sentence`’s properties `entities` or `ents`. Token-level NER tags accessible through `Token`’s property `ner`.	Recognize named entities for all token spans in the corpus.

Options

Option name	Type	Default	Description
ner_batch_size	int	32	When annotating, this argument specifies the maximum number of sentences to process as a minibatch for efficient processing. Caveat: the larger this number is, the more working memory is required (main RAM or GPU RAM, depending on the computing device).
ner_pretrain_path	str	model-specific	Where to get the pretrained word embedding. If you trained your own NER model with a different pretrain from the default, you will need to set this flag to use the model.
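
To make the effect of ner_batch_size concrete, here is a minimal sketch of how a list of sentences could be split into minibatches of at most that size. This is an illustration only, not Stanza's internal batching code; the helper name `minibatches` is hypothetical.

```python
from typing import Iterator, List, Sequence


def minibatches(sentences: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` sentences."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sentences), batch_size):
        yield list(sentences[start:start + batch_size])


# 70 sentences with the default batch size of 32 give two full batches
# and one smaller final batch.
sentences = [f"Sentence number {i}." for i in range(70)]
batch_sizes = [len(batch) for batch in minibatches(sentences, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

A larger batch size means fewer batches overall, but each batch holds more sentences in memory at once, which is the trade-off described in the caveat above.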

Example Usage

Running the NERProcessor simply requires the TokenizeProcessor. After the pipeline is run, the Document will contain a list of Sentences, and the Sentences will contain lists of Tokens. Named entities can be accessed through Document or Sentence’s properties entities or ents. Alternatively, token-level NER tags can be accessed via the ner fields of Token.

Accessing Named Entities for Sentence and Document

Here is an example of performing named entity recognition for a piece of text and accessing the named entities in the entire document:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

Instead of accessing entities in the entire document, you can also access the named entities in each sentence of the document. The following example produces the same result as the one above, by accessing entities from sentences instead of the entire document:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')

As can be seen in the output, Stanza correctly identifies that Chris Manning is a person, Stanford University is an organization, and the Bay Area is a location.

entity: Chris Manning	type: PERSON
entity: Stanford University	type: ORG
entity: the Bay Area	type: LOC

Accessing Named Entity Recognition (NER) Tags for Token

It is sometimes useful to access the BIOES NER tag for each token. Here is an example of how to do so:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\n')

The result is the BIOES representation of the entities we saw above:

token: Chris	ner: B-PERSON
token: Manning	ner: E-PERSON
token: teaches	ner: O
token: at	ner: O
token: Stanford	ner: B-ORG
token: University	ner: E-ORG
token: .	ner: O
token: He	ner: O
token: lives	ner: O
token: in	ner: O
token: the	ner: B-LOC
token: Bay	ner: I-LOC
token: Area	ner: E-LOC
token: .	ner: O
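
In the BIOES scheme, B, I, and E mark the beginning, inside, and end of a multi-token entity, S marks a single-token entity, and O marks tokens outside any entity. The sketch below shows how such tags map back to entity spans, using the tokens and tags from the output above. It is plain Python that does not call Stanza, and the helper name `bioes_to_spans` is hypothetical, not part of the Stanza API.

```python
from typing import List, Tuple


def bioes_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIOES tags into (start, end, type) spans; `end` is exclusive."""
    spans = []
    start, etype = None, None  # state of the currently open entity, if any
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            spans.append((i, i + 1, label))
            start, etype = None, None
        elif prefix == "B":
            # Open a new multi-token entity.
            start, etype = i, label
        elif prefix == "I":
            # Continue the open entity; drop it if the tags are inconsistent.
            if start is None or label != etype:
                start, etype = None, None
        elif prefix == "E":
            # Close the open entity if it matches.
            if start is not None and label == etype:
                spans.append((start, i + 1, label))
            start, etype = None, None
        else:
            # "O" or anything unrecognized ends any open entity.
            start, etype = None, None
    return spans


tokens = ["Chris", "Manning", "teaches", "at", "Stanford", "University", ".",
          "He", "lives", "in", "the", "Bay", "Area", "."]
tags = ["B-PERSON", "E-PERSON", "O", "O", "B-ORG", "E-ORG", "O",
        "O", "O", "O", "B-LOC", "I-LOC", "E-LOC", "O"]

spans = bioes_to_spans(tags)
entities = [(" ".join(tokens[s:e]), label) for s, e, label in spans]
print(entities)
# [('Chris Manning', 'PERSON'), ('Stanford University', 'ORG'), ('the Bay Area', 'LOC')]
```

Joining tokens with a single space is a simplification for display. The entity objects returned by the pipeline carry the original text, so this kind of decoding is only needed when working from token-level tags directly.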

Using multiple models

New in v1.4.0

When creating the pipeline, it is possible to use multiple NER models at once by specifying a list in the package dict. Here is a brief example:

import stanza
pipe = stanza.Pipeline("en", processors="tokenize,ner", package={"ner": ["ncbi_disease", "ontonotes"]})
doc = pipe("John Bauer works at Stanford and has hip arthritis.  He works for Chris Manning")
print(doc.ents)

Output:

[{
  "text": "John Bauer",
  "type": "PERSON",
  "start_char": 0,
  "end_char": 10
}, {
  "text": "Stanford",
  "type": "ORG",
  "start_char": 20,
  "end_char": 28
}, {
  "text": "hip arthritis",
  "type": "DISEASE",
  "start_char": 37,
  "end_char": 50
}, {
  "text": "Chris Manning",
  "type": "PERSON",
  "start_char": 66,
  "end_char": 79
}]
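
The start_char and end_char values are character offsets into the input text, with end_char exclusive, so slicing the text with them recovers each entity string. The following sketch checks this against the output above. It uses plain dictionaries copied from that output as stand-ins for the entity objects, so it runs without Stanza.

```python
# The input text from the example above (note the two spaces after the first sentence).
text = "John Bauer works at Stanford and has hip arthritis.  He works for Chris Manning"

# Plain dicts copied from the printed output, standing in for the entity objects.
entities = [
    {"text": "John Bauer", "type": "PERSON", "start_char": 0, "end_char": 10},
    {"text": "Stanford", "type": "ORG", "start_char": 20, "end_char": 28},
    {"text": "hip arthritis", "type": "DISEASE", "start_char": 37, "end_char": 50},
    {"text": "Chris Manning", "type": "PERSON", "start_char": 66, "end_char": 79},
]

# Slice the original text with each entity's offsets.
recovered = [text[ent["start_char"]:ent["end_char"]] for ent in entities]
print(recovered)
# ['John Bauer', 'Stanford', 'hip arthritis', 'Chris Manning']

# Every slice matches the entity text reported in the output.
all_match = all(r == ent["text"] for r, ent in zip(recovered, entities))
print(all_match)  # True
```

Because the offsets refer to the original string, they remain valid for highlighting or replacing entities in the raw text even when the tokenization differs from a simple whitespace split.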

Furthermore, the multi_ner field will have the outputs of each NER model in order.

# results truncated for legibility
# note that the token ids start from 1, not 0
print(doc.sentences[0].tokens[0:2])
print(doc.sentences[0].tokens[8:10])