Expands multi-word tokens (MWT) predicted by the tokenizer.

Property name | Processor class name | Generated Annotation
mwt | MWTProcessor | Expands multi-word tokens into multiple words when they are predicted by the tokenizer.


Option name | Type | Default | Description
mwt_batch_size | int | 50 | Maximum number of words to process per minibatch during annotation. Caveat: the larger this number is, the more working memory is required (main RAM or GPU RAM, depending on the computing device).
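
As a rough sketch of how this option could be set, processor options can be passed to the Pipeline as keyword arguments named after the option. The value 25 and the helper build_pipeline are illustrative assumptions, not recommendations. The import is deferred so the settings can be inspected without loading any models.

```python
# Illustrative settings; 25 is an arbitrary value chosen for this sketch.
config = {
    'processors': 'tokenize,mwt',  # mwt needs tokenize to run first
    'lang': 'fr',
    'mwt_batch_size': 25,          # smaller batches need less working memory
}

def build_pipeline(cfg):
    """Hypothetical helper: build a pipeline from a settings dict."""
    # Deferred import: requires the stanfordnlp package and downloaded models.
    import stanfordnlp
    return stanfordnlp.Pipeline(**cfg)
```

Calling build_pipeline(config) would then construct the pipeline. Lowering the batch size trades some throughput for a smaller memory footprint.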

Example Usage

The mwt processor requires only that the tokenize processor run before it. After these two processors have run, each Sentence holds a list of tokens and a corresponding list of words produced by the multi-word-token expander model. For a sentence sent, the tokens are available as sent.tokens and the words as sent.words. For a token token, the words it expands to are available as token.words. The code below shows an example of accessing tokens and words.

import stanfordnlp

nlp = stanfordnlp.Pipeline(processors='tokenize,mwt', lang='fr')
doc = nlp("Alors encore inconnu du grand public, Emmanuel Macron devient en 2014 ministre de l'Économie, de l'Industrie et du Numérique.")
# Print each token together with the words it expands to.
print(*[f'token: {token.text.ljust(9)}\t\twords: {token.words}' for sent in doc.sentences for token in sent.tokens], sep='\n')
# Print each word together with the index and text of its parent token.
print(*[f'word: {word.text.ljust(9)}\t\ttoken parent:{word.parent_token.index+"-"+word.parent_token.text}' for sent in doc.sentences for word in sent.words], sep='\n')

This code will generate the following output:

token: Alors    		words: [<Word index=1;text=Alors>]
token: encore   		words: [<Word index=2;text=encore>]
token: inconnu  		words: [<Word index=3;text=inconnu>]
token: du       		words: [<Word index=4;text=de>, <Word index=5;text=le>]
token: grand    		words: [<Word index=6;text=grand>]
token: public   		words: [<Word index=7;text=public>]
token: ,        		words: [<Word index=8;text=,>]
token: Emmanuel 		words: [<Word index=9;text=Emmanuel>]
token: Macron   		words: [<Word index=10;text=Macron>]
token: devient  		words: [<Word index=11;text=devient>]
token: en       		words: [<Word index=12;text=en>]
token: 2014     		words: [<Word index=13;text=2014>]
token: ministre 		words: [<Word index=14;text=ministre>]
token: de       		words: [<Word index=15;text=de>]
token: l'       		words: [<Word index=16;text=l'>]
token: Économie 		words: [<Word index=17;text=Économie>]
token: ,        		words: [<Word index=18;text=,>]
token: de       		words: [<Word index=19;text=de>]
token: l'       		words: [<Word index=20;text=l'>]
token: Industrie		words: [<Word index=21;text=Industrie>]
token: et       		words: [<Word index=22;text=et>]
token: du       		words: [<Word index=23;text=de>, <Word index=24;text=le>]
token: Numérique		words: [<Word index=25;text=Numérique>]
token: .        		words: [<Word index=26;text=.>]
word: Alors    		token parent:1-Alors
word: encore   		token parent:2-encore
word: inconnu  		token parent:3-inconnu
word: de       		token parent:4-5-du
word: le       		token parent:4-5-du
word: grand    		token parent:6-grand
word: public   		token parent:7-public
word: ,        		token parent:8-,
word: Emmanuel 		token parent:9-Emmanuel
word: Macron   		token parent:10-Macron
word: devient  		token parent:11-devient
word: en       		token parent:12-en
word: 2014     		token parent:13-2014
word: ministre 		token parent:14-ministre
word: de       		token parent:15-de
word: l'       		token parent:16-l'
word: Économie 		token parent:17-Économie
word: ,        		token parent:18-,
word: de       		token parent:19-de
word: l'       		token parent:20-l'
word: Industrie		token parent:21-Industrie
word: et       		token parent:22-et
word: de       		token parent:23-24-du
word: le       		token parent:23-24-du
word: Numérique		token parent:25-Numérique
word: .        		token parent:26-.
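
To pick out only the tokens that were actually expanded, one could filter on the number of words per token. The sketch below is an illustration under stated assumptions: multiword_tokens and fake_token are hypothetical helpers, and the SimpleNamespace objects are stand-ins that mimic only the sentences, tokens, words, and text attributes used above, so the sketch runs without loading a model.

```python
from types import SimpleNamespace

def multiword_tokens(doc):
    """Return (token text, [word texts]) for every token with more than one word."""
    return [
        (token.text, [word.text for word in token.words])
        for sent in doc.sentences
        for token in sent.tokens
        if len(token.words) > 1
    ]

def fake_token(text, word_texts):
    """Stand-in for a token; not a stanfordnlp class."""
    words = [SimpleNamespace(text=w) for w in word_texts]
    return SimpleNamespace(text=text, words=words)

# A fragment of the example sentence, built by hand to mirror the output above.
tokens = [
    fake_token('inconnu', ['inconnu']),
    fake_token('du', ['de', 'le']),
    fake_token('grand', ['grand']),
]
doc = SimpleNamespace(sentences=[SimpleNamespace(tokens=tokens)])

print(multiword_tokens(doc))  # [('du', ['de', 'le'])]
```

Passing the doc returned by the pipeline instead of the stand-in should behave the same way, provided its tokens expose text and words as in the example above.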

Training-Only Options

Most training-only options are documented in the argument parser of the MWT expander.