Description
Expands multi-word tokens (MWT) predicted by the tokenizer.
| Property name | Processor class name | Generated Annotation |
| --- | --- | --- |
| mwt | MWTProcessor | Expands multi-word tokens into multiple words when they are predicted by the tokenizer. |
Options
| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| mwt_batch_size | int | 50 | Maximum number of words to process in one minibatch during annotation. Caveat: the larger this number, the more working memory is required (main RAM or GPU RAM, depending on the computing device). |
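To make the batching behavior concrete, the sketch below splits a flat list of words into minibatches of at most a given size. It is an illustration only, not the processor's actual implementation; the helper name `chunk_words` is hypothetical, and only the default of 50 is taken from the table above.

```python
from typing import Iterator, List


def chunk_words(words: List[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive minibatches holding at most `batch_size` words each.

    Hypothetical helper for illustration; not part of the library.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(words), batch_size):
        yield words[start:start + batch_size]


# 120 placeholder words split with the default size of 50.
batches = list(chunk_words([f"w{i}" for i in range(120)]))
batch_lengths = [len(batch) for batch in batches]
print(batch_lengths)  # [50, 50, 20]
```

A larger batch size produces fewer batches, but each batch holds more words in memory at once, which is the trade-off the caveat describes.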
Example Usage
The `mwt` processor only requires `tokenize`. After these two processors have run, the `Sentence`s will have lists of tokens and corresponding words based on the multi-word-token expander model. The list of tokens for sentence `sent` can be accessed with `sent.tokens`. The list of words for sentence `sent` can be accessed with `sent.words`. The list of words for a token `token` can be accessed with `token.words`. The code below shows an example of accessing tokens and words.
import stanfordnlp
# Build a French pipeline that runs only the tokenizer and the MWT expander.
nlp = stanfordnlp.Pipeline(processors='tokenize,mwt', lang='fr')
doc = nlp("Alors encore inconnu du grand public, Emmanuel Macron devient en 2014 ministre de l'Économie, de l'Industrie et du Numérique.")
# For each token, print the words it expands to.
print(*[f'token: {token.text.ljust(9)}\t\twords: {token.words}' for sent in doc.sentences for token in sent.tokens], sep='\n')
print('')
# For each word, print the index and text of the token it came from.
print(*[f'word: {word.text.ljust(9)}\t\ttoken parent:{word.parent_token.index+"-"+word.parent_token.text}' for sent in doc.sentences for word in sent.words], sep='\n')
This code will generate the following output:
token: Alors words: [<Word index=1;text=Alors>]
token: encore words: [<Word index=2;text=encore>]
token: inconnu words: [<Word index=3;text=inconnu>]
token: du words: [<Word index=4;text=de>, <Word index=5;text=le>]
token: grand words: [<Word index=6;text=grand>]
token: public words: [<Word index=7;text=public>]
token: , words: [<Word index=8;text=,>]
token: Emmanuel words: [<Word index=9;text=Emmanuel>]
token: Macron words: [<Word index=10;text=Macron>]
token: devient words: [<Word index=11;text=devient>]
token: en words: [<Word index=12;text=en>]
token: 2014 words: [<Word index=13;text=2014>]
token: ministre words: [<Word index=14;text=ministre>]
token: de words: [<Word index=15;text=de>]
token: l' words: [<Word index=16;text=l'>]
token: Économie words: [<Word index=17;text=Économie>]
token: , words: [<Word index=18;text=,>]
token: de words: [<Word index=19;text=de>]
token: l' words: [<Word index=20;text=l'>]
token: Industrie words: [<Word index=21;text=Industrie>]
token: et words: [<Word index=22;text=et>]
token: du words: [<Word index=23;text=de>, <Word index=24;text=le>]
token: Numérique words: [<Word index=25;text=Numérique>]
token: . words: [<Word index=26;text=.>]
word: Alors token parent:1-Alors
word: encore token parent:2-encore
word: inconnu token parent:3-inconnu
word: de token parent:4-5-du
word: le token parent:4-5-du
word: grand token parent:6-grand
word: public token parent:7-public
word: , token parent:8-,
word: Emmanuel token parent:9-Emmanuel
word: Macron token parent:10-Macron
word: devient token parent:11-devient
word: en token parent:12-en
word: 2014 token parent:13-2014
word: ministre token parent:14-ministre
word: de token parent:15-de
word: l' token parent:16-l'
word: Économie token parent:17-Économie
word: , token parent:18-,
word: de token parent:19-de
word: l' token parent:20-l'
word: Industrie token parent:21-Industrie
word: et token parent:22-et
word: de token parent:23-24-du
word: le token parent:23-24-du
word: Numérique token parent:25-Numérique
word: . token parent:26-.
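The token-to-word relationship in the output above can be used to pick out only the tokens that were expanded. The sketch below is a minimal illustration under stated assumptions: `Word` and `Token` are small stand-in classes (hypothetical, not the library's own) that mimic only the `index`, `text`, and `words` attributes seen in the example, so the snippet runs without downloading a model. The helper name `expanded_tokens` is also hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Word:
    """Stand-in for a syntactic word (not the library class)."""
    index: str
    text: str


@dataclass
class Token:
    """Stand-in for a token and the words it expands to (not the library class)."""
    index: str
    text: str
    words: List[Word] = field(default_factory=list)


def expanded_tokens(tokens: List[Token]) -> List[Tuple[str, List[str]]]:
    """Return (token text, word texts) for every token with more than one word."""
    return [(t.text, [w.text for w in t.words]) for t in tokens if len(t.words) > 1]


# The first four tokens of the French example, copied from the output above.
tokens = [
    Token("1", "Alors", [Word("1", "Alors")]),
    Token("2", "encore", [Word("2", "encore")]),
    Token("3", "inconnu", [Word("3", "inconnu")]),
    Token("4-5", "du", [Word("4", "de"), Word("5", "le")]),
]
result = expanded_tokens(tokens)
print(result)  # [('du', ['de', 'le'])]
```

With a real pipeline, the same filter could presumably be applied to `sent.tokens` for each sentence in `doc.sentences`, since the example output shows that those tokens expose `text` and `words` in the same way.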
Training-Only Options
Most training-only options are documented in the argument parser of the MWT expander.