
Named Entity Recognition

Description

The named entity recognition (NER) module recognizes mention spans of a particular entity type (e.g., Person or Organization) in the input sentence. NER is widely used in many NLP applications such as information extraction or question answering systems. In Stanza, NER is performed by the NERProcessor and can be invoked by the name ner.

| Name | Annotator class name | Requirement | Generated Annotation | Description |
| --- | --- | --- | --- | --- |
| ner | NERProcessor | tokenize, mwt | Named entities accessible through Document or Sentence's properties entities or ents. Token-level NER tags accessible through Token's properties ner. | Recognize named entities for all token spans in the corpus. |

Options

| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| ner_batch_size | int | 32 | When annotating, this argument specifies the maximum number of sentences to process as a minibatch for efficient processing. Caveat: the larger this number is, the more working memory is required (main RAM or GPU RAM, depending on the computing device). |
| ner_pretrain_path | str | model-specific | Where to get the pretrained word embedding. If you trained your own NER model with a different pretrain from the default, you will need to set this flag to use the model. |
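
These options can be passed as keyword arguments when constructing the pipeline. The snippet below is a minimal sketch; the batch size and the pretrain path are placeholder values to adapt to your own setup:

import stanza

# Sketch of overriding NER options at pipeline construction time.
# The pretrain path is a placeholder; point it at the .pt pretrain file
# that matches the embeddings your custom NER model was trained with.
nlp = stanza.Pipeline(
    lang='en',
    processors='tokenize,ner',
    ner_batch_size=16,
    ner_pretrain_path='/path/to/custom_pretrain.pt',
)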

Example Usage

Running the NERProcessor simply requires the TokenizeProcessor. After the pipeline is run, the Document will contain a list of Sentences, and the Sentences will contain lists of Tokens. Named entities can be accessed through Document or Sentence’s properties entities or ents. Alternatively, token-level NER tags can be accessed via the ner fields of Token.

Accessing Named Entities for Sentence and Document

Here is an example of performing named entity recognition for a piece of text and accessing the named entities in the entire document:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

Instead of accessing entities in the entire document, you can also access the named entities in each sentence of the document. The following example produces the same result as the one above by accessing entities from sentences instead of the entire document:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')

As can be seen in the output, Stanza correctly identifies that Chris Manning is a person, Stanford University is an organization, and the Bay Area is a location.

entity: Chris Manning	type: PERSON
entity: Stanford University	type: ORG
entity: the Bay Area	type: LOC
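
Each entity is a span object that also records character offsets into the original text (the same start_char and end_char fields that appear in the span output further down this page), which is handy when entities need to be mapped back to the raw input. A brief sketch, reusing the doc from the example above:

for ent in doc.ents:
    # start_char / end_char index into the original string passed to the pipeline
    print(f'{ent.text}\t{ent.type}\t{ent.start_char}:{ent.end_char}')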

Accessing Named Entity Recognition (NER) Tags for Token

It can sometimes be useful to access the BIOES NER tags for each token. Here is an example of how to do so:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp("Chris Manning teaches at Stanford University. He lives in the Bay Area.")
print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\n')

The result is the BIOES representation of the entities we saw above:

token: Chris	ner: B-PERSON
token: Manning	ner: E-PERSON
token: teaches	ner: O
token: at	ner: O
token: Stanford	ner: B-ORG
token: University	ner: E-ORG
token: .	ner: O
token: He	ner: O
token: lives	ner: O
token: in	ner: O
token: the	ner: B-LOC
token: Bay	ner: I-LOC
token: Area	ner: E-LOC
token: .	ner: O
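
If you work with the token-level tags directly, the BIOES scheme (Begin, Inside, Outside, End, Single) can be decoded back into entity spans with a little bookkeeping. The helper below is an illustrative sketch, not part of the Stanza API; it assumes well-formed tags and joins token texts with single spaces rather than using character offsets:

def bioes_to_spans(token_tag_pairs):
    # Group (text, tag) pairs into (entity_text, entity_type) tuples.
    spans, current, current_type = [], [], None
    for text, tag in token_tag_pairs:
        if tag == 'O':
            current, current_type = [], None
            continue
        prefix, ent_type = tag.split('-', 1)
        if prefix in ('B', 'S'):
            # B starts a multi-token entity, S is a single-token entity
            current, current_type = [text], ent_type
        else:
            # I continues and E ends the current entity
            current.append(text)
        if prefix in ('S', 'E'):
            spans.append((' '.join(current), current_type))
            current, current_type = [], None
    return spans

pairs = [(token.text, token.ner) for sent in doc.sentences for token in sent.tokens]
print(bioes_to_spans(pairs))
# [('Chris Manning', 'PERSON'), ('Stanford University', 'ORG'), ('the Bay Area', 'LOC')]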

Using multiple models

New in v1.4.0

When creating the pipeline, it is possible to use multiple NER models at once by specifying a list in the package dict. Here is a brief example:

import stanza
pipe = stanza.Pipeline("en", processors="tokenize,ner", package={"ner": ["ncbi_disease", "ontonotes"]})
doc = pipe("John Bauer works at Stanford and has hip arthritis.  He works for Chris Manning")
print(doc.ents)

Output:

[{
  "text": "John Bauer",
  "type": "PERSON",
  "start_char": 0,
  "end_char": 10
}, {
  "text": "Stanford",
  "type": "ORG",
  "start_char": 20,
  "end_char": 28
}, {
  "text": "hip arthritis",
  "type": "DISEASE",
  "start_char": 37,
  "end_char": 50
}, {
  "text": "Chris Manning",
  "type": "PERSON",
  "start_char": 66,
  "end_char": 79
}]

Furthermore, the multi_ner field of each token contains the output of each NER model, in the order the models were listed in the package dict.

# results truncated for legibility
# note that the token ids start from 1, not 0
print(doc.sentences[0].tokens[0:2])
print(doc.sentences[0].tokens[8:10])

Output:

[[
  {
    "id": 1,
    "text": "John",
    "start_char": 0,
    "end_char": 4,
    "ner": "B-PERSON",
    "multi_ner": [
      "O",
      "B-PERSON"
    ]
  }
], [
  {
    "id": 2,
    "text": "Bauer",
    "start_char": 5,
    "end_char": 10,
    "ner": "E-PERSON",
    "multi_ner": [
      "O",
      "E-PERSON"
    ]
  }
]]
[[
  {
    "id": 8,
    "text": "hip",
    "start_char": 37,
    "end_char": 40,
    "ner": "B-DISEASE",
    "multi_ner": [
      "B-DISEASE",
      "O"
    ]
  }
], [
  {
    "id": 9,
    "text": "arthritis",
    "start_char": 41,
    "end_char": 50,
    "ner": "E-DISEASE",
    "multi_ner": [
      "E-DISEASE",
      "O"
    ]
  }
]]
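
Because multi_ner follows the order of the models in the package dict, you can inspect a single model's output by indexing into it. A short sketch under that assumption, where index 0 corresponds to the ncbi_disease model in the pipeline above:

# Tokens the first model (ncbi_disease) tagged as part of an entity
disease_tokens = [token.text
                  for sent in doc.sentences
                  for token in sent.tokens
                  if token.multi_ner[0] != 'O']
print(disease_tokens)  # expected: ['hip', 'arthritis']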

Training-Only Options

Most training-only options are documented in the argument parser of the NER tagger.
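
One way to see them is to ask the training script for its help text. Assuming Stanza is installed as a package, a command along these lines should print the full set of flags (the exact module path may vary between versions):

python -m stanza.models.ner_tagger --help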