## Description
Generates the word lemmas for all tokens in the corpus.
| Property name | Processor class name | Generated Annotation |
| --- | --- | --- |
| lemma | LemmaProcessor | Performs lemmatization on a `Word` using the `Word.text` and `Word.upos` values. The result can be accessed in `Word.lemma`. |
## Options
| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| lemma_use_identity | bool | False | When this flag is set, an identity lemmatizer (see `models.identity_lemmatizer`) is used instead of a statistical lemmatizer. This is useful when `Word.lemma` is required for languages such as Vietnamese, where the lemma is identical to the original word form. |
| lemma_batch_size | int | 50 | When annotating, this argument specifies the maximum number of words to batch for efficient processing. |
| lemma_ensemble_dict | bool | True | If set to True, the lemmatizer ensembles a seq2seq model with the output of a dictionary-based lemmatizer, which yields improvements on many languages (see the system description paper for more details). |
| lemma_dict_only | bool | False | If set to True, only a dictionary-based lemmatizer is used. For languages such as Chinese, a dictionary-based lemmatizer is sufficient. |
| lemma_edit | bool | True | If set to True, an edit classifier is used alongside the seq2seq lemmatizer. The edit classifier predicts "shortcut" operations such as "identical" or "lowercase", which makes the lemmatization of long sequences more stable. |
| lemma_beam_size | int | 1 | Controls the beam size used during decoding in the seq2seq lemmatizer. |
| lemma_max_dec_len | int | 50 | Controls the maximum decoding length (in characters) in the seq2seq lemmatizer. The decoder stops once this length is reached, even if the end-of-sequence character has not yet been produced. |
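These options are supplied when the pipeline is constructed. The following is a minimal sketch, assuming the usual option-passing mechanism where processor options from the table above are given as keyword arguments to `Pipeline`; the option values shown are illustrative only.

```python
import stanfordnlp

# Sketch (assumption): lemma options from the table above are passed as
# keyword arguments when the pipeline is built.
nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos,lemma',
                           lemma_beam_size=3,          # wider beam during seq2seq decoding
                           lemma_ensemble_dict=False)  # disable the dictionary ensemble

doc = nlp("Barack Obama was born in Hawaii.")
print(doc.sentences[0].words[2].lemma)  # 'was' -> 'be'
```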
## Example Usage
If your main interest is lemmatization, you can supply a smaller processors list containing just the prerequisites for lemma.
After the pipeline is run, the document will contain a list of sentences, and each sentence will contain a list of words.
The lemma information can be found in the `lemma` field of each word.
```python
import stanfordnlp

# Build a pipeline with the lemma processor and its prerequisites
nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos,lemma')
doc = nlp("Barack Obama was born in Hawaii.")

# Print each word alongside its predicted lemma
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')
```
This code will generate the following output:
```
word: Barack 	lemma: Barack
word: Obama 	lemma: Obama
word: was 	lemma: be
word: born 	lemma: bear
word: in 	lemma: in
word: Hawaii 	lemma: Hawaii
word: . 	lemma: .
```
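Beyond printing, the same structure can be traversed programmatically. The snippet below is a small sketch that builds on the `doc` object from the example above; it simply collects each word's text and lemma into a list:

```python
# Collect (text, lemma) pairs from the annotated document above
pairs = [(word.text, word.lemma)
         for sent in doc.sentences
         for word in sent.words]

# Reassemble a lemmatized version of the input
print(' '.join(lemma for _, lemma in pairs))  # Barack Obama be bear in Hawaii .
```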
## Training-Only Options
Most training-only options are documented in the argument parser of the lemmatizer.