Part-of-Speech & Morphological Features
Table of contents
Description
The Part-of-Speech (POS) & morphological features tagging module labels words with their universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological features (UFeats). This is jointly performed by the POSProcessor
in Stanza, and can be invoked with the name pos
.
Name | Annotator class name | Requirement | Generated Annotation | Description |
---|---|---|---|---|
pos | POSProcessor | tokenize, mwt | UPOS, XPOS, and UFeats annotations are accessible through Word ’s properties pos , xpos , and ufeats . | Labels tokens with their universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological features (UFeats). |
Options
Option name | Type | Default | Description |
---|---|---|---|
pos_batch_size | int | 5000 | When annotating, this argument specifies the maximum number of words to process as a minibatch for efficient processing. Caveat: the larger this number is, the more working memory is required (main RAM or GPU RAM, depending on the computating device). This parameter should be set larger than the number of words in the longest sentence in your input document, or you might run into unexpected behaviors. |
Example Usage
Running the POSProcessor requires the TokenizeProcessor and MWTProcessor. After the pipeline is run, the Document
will contain a list of Sentence
s, and the Sentence
s will contain lists of Word
s. The part-of-speech tags can be accessed via the upos
(pos
) and xpos
fields of each Word
, while the universal morphological features can be accessed via the feats
field.
Accessing POS and Morphological Feature for Word
Here is an example of tagging a piece of text and accessing part-of-speech and morphological features for each word:
import stanza
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos')
doc = nlp('Barack Obama was born in Hawaii.')
print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')
As can be seen in the result, we can tell that the word was
is a third-person auxiliary verb in the past tense from Stanza’s analysis.
word: Barack upos: PROPN xpos: NNP feats: Number=Sing
word: Obama upos: PROPN xpos: NNP feats: Number=Sing
word: was upos: AUX xpos: VBD feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: born upos: VERB xpos: VBN feats: Tense=Past|VerbForm=Part|Voice=Pass
word: in upos: ADP xpos: IN feats: _
word: Hawaii upos: PROPN xpos: NNP feats: Number=Sing
word: . upos: PUNCT xpos: . feats: _
Training-Only Options
Most training-only options are documented in the argument parser of the POS/UFeats tagger.