## Description
Generates the word lemmas for all tokens in the corpus.
| Property name | Processor class name | Generated Annotation |
| --- | --- | --- |
| lemma | LemmaProcessor | Perform lemmatization on a `Word` using the `Word.text` and `Word.upos` values. The result can be accessed in `Word.lemma`. |
## Options
| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| lemma_use_identity | bool | False | When this flag is set, an identity lemmatizer (see `models.identity_lemmatizer`) is used instead of a statistical lemmatizer. This is useful when `Word.lemma` is required for languages such as Vietnamese, where the lemma is identical to the original word form. |
| lemma_batch_size | int | 50 | When annotating, this argument specifies the maximum number of words to batch for efficient processing. |
| lemma_ensemble_dict | bool | True | If set to True, the lemmatizer ensembles a seq2seq model with the output from a dictionary-based lemmatizer, which yields improvements on many languages (see the system description paper for more details). |
| lemma_dict_only | bool | False | If set to True, only a dictionary-based lemmatizer is used. For languages such as Chinese, a dictionary-based lemmatizer is sufficient. |
| lemma_edit | bool | True | If set to True, an edit classifier is used alongside the seq2seq lemmatizer. The edit classifier predicts "shortcut" operations such as "identical" or "lowercase", making the lemmatization of long sequences more stable. |
| lemma_beam_size | int | 1 | Controls the beam size used during decoding in the seq2seq lemmatizer. |
| lemma_max_dec_len | int | 50 | Controls the maximum decoding character length in the seq2seq lemmatizer. The decoder will stop if this length is reached and the end-of-sequence character has still not been seen. |
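These options can be supplied as extra keyword arguments when constructing the pipeline, using the prefixed option names above. The snippet below is a minimal sketch rather than an official recipe; it assumes the default English models have already been downloaded and simply adjusts the batch and beam sizes:

```python
import stanfordnlp

# Minimal sketch: processor options are passed to the Pipeline as keyword
# arguments using the prefixed names from the table above.
# (Assumes the default English models have already been downloaded.)
nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos,lemma',
                           lemma_batch_size=100,  # batch up to 100 words at a time
                           lemma_beam_size=3)     # wider beam during seq2seq decoding

doc = nlp("The children were running.")
print([(word.text, word.lemma) for sent in doc.sentences for word in sent.words])
```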
## Example Usage
If your main interest is lemmatization, you can supply a smaller processors list containing just lemma and its prerequisites.
After the pipeline is run, the document will contain a list of sentences, and each sentence will contain a list of words.
The lemma information can be found in the `lemma` field of each word.
```python
import stanfordnlp

nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos,lemma')
doc = nlp("Barack Obama was born in Hawaii.")
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')
```
This code will generate the following output:
```
word: Barack 	lemma: Barack
word: Obama 	lemma: Obama
word: was 	lemma: be
word: born 	lemma: bear
word: in 	lemma: in
word: Hawaii 	lemma: Hawaii
word: . 	lemma: .
```
## Training-Only Options
Most training-only options are documented in the argument parser of the lemmatizer.
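One way to inspect them (assuming a standard installation; the module path follows the `models.*` naming used above and may vary between versions) is to print the argument parser's help, e.g. `python -m stanfordnlp.models.lemmatizer --help`.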