Multi-Word Token (MWT) Expansion
Description
The Multi-Word Token (MWT) expansion module expands a raw token into multiple syntactic words, which makes it easier to carry out Universal Dependencies analysis in some languages. This is handled by the `MWTProcessor` in Stanza, and can be invoked with the name `mwt`. The tokens to be expanded are predicted by the `TokenizeProcessor` before the `MWTProcessor` is invoked.
For more details on why MWT is necessary for Universal Dependencies analysis, please visit the UD tokenization page.
Name | Annotator class name | Requirement | Generated Annotation | Description |
---|---|---|---|---|
mwt | MWTProcessor | tokenize | Expands multi-word tokens (MWTs) into multiple words when they are predicted by the tokenizer. Each Token will correspond to one or more Words after tokenization and MWT expansion. | Expands multi-word tokens (MWTs) predicted by the TokenizeProcessor. This is only applicable to some languages. |
Options
Option name | Type | Default | Description |
---|---|---|---|
mwt_batch_size | int | 50 | When annotating, this argument specifies the maximum number of words to process as a minibatch for efficient processing. Caveat: the larger this number is, the more working memory is required (main RAM or GPU RAM, depending on the computing device). |
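The option above is passed to the pipeline as a keyword argument. Below is a minimal sketch of how that might look; the helper name `mwt_pipeline_kwargs` and the batch size of 25 are illustrative assumptions, not part of Stanza, and the commented-out call assumes Stanza and the French models are installed.

```python
def mwt_pipeline_kwargs(lang="fr", mwt_batch_size=50):
    """Collect keyword arguments for a tokenize+mwt pipeline (hypothetical helper)."""
    return {
        "lang": lang,
        "processors": "tokenize,mwt",
        # Smaller batches need less working memory, at some cost in throughput.
        "mwt_batch_size": mwt_batch_size,
    }


kwargs = mwt_pipeline_kwargs(mwt_batch_size=25)

# With Stanza installed, the pipeline would then be built as:
#   import stanza
#   nlp = stanza.Pipeline(**kwargs)
```

Lowering the value is the first thing to try if annotation runs out of memory on long documents.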
Example Usage
The `MWTProcessor` only requires the `TokenizeProcessor` to be run before it. After these two processors have processed the text, each `Sentence` will have a list of `Token`s and the corresponding syntactic `Word`s produced by the multi-word token expander model. The list of tokens for a sentence `sent` can be accessed with `sent.tokens`, and its list of words with `sent.words`. Similarly, the list of words for a token `token` can be accessed with `token.words`.
Accessing Syntactic Words for Multi-Word Token
Here is an example of a piece of text in French that requires multi-word token expansion, and how to access the underlying words of these multi-word tokens:
import stanza
nlp = stanza.Pipeline(lang='fr', processors='tokenize,mwt')
doc = nlp('Nous avons atteint la fin du sentier.')
for token in doc.sentences[0].tokens:
    print(f'token: {token.text}\twords: {", ".join([word.text for word in token.words])}')
As a result of running this code, we see that the token du is expanded into its underlying syntactic words, de and le.
token: Nous words: Nous
token: avons words: avons
token: atteint words: atteint
token: la words: la
token: fin words: fin
token: du words: de, le
token: sentier words: sentier
token: . words: .
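Building on the output above, a token that was expanded can be recognized by having more than one word. The following is a small sketch that collects such tokens; the function name `multi_word_tokens` is hypothetical, and it relies only on the `sentences`, `tokens`, `words`, and `text` attributes used in the example.

```python
def multi_word_tokens(doc):
    """Return (token text, [word texts]) for every token that expands to more than one word."""
    expanded = []
    for sentence in doc.sentences:
        for token in sentence.tokens:
            # A token that was not expanded has exactly one word.
            if len(token.words) > 1:
                expanded.append((token.text, [word.text for word in token.words]))
    return expanded
```

Applied to the `doc` from the French example, this should pick out only du with its words de and le, since every other token maps to a single word.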
Accessing Parent Token for Word
When performing word-level annotation and processing, it is sometimes useful to access the token a given word is derived from, for example to get the character offsets associated with that token. Here is an example of how to do that with the `parent` property of `Word`, using the same sentence we just saw:
import stanza
nlp = stanza.Pipeline(lang='fr', processors='tokenize,mwt')
doc = nlp('Nous avons atteint la fin du sentier.')
for word in doc.sentences[0].words:
    print(f'word: {word.text}\tparent token: {word.parent.text}')
As the result below shows, the words `de` and `le` have the same parent token `du`.
word: Nous parent token: Nous
word: avons parent token: avons
word: atteint parent token: atteint
word: la parent token: la
word: fin parent token: fin
word: de parent token: du
word: le parent token: du
word: sentier parent token: sentier
word: . parent token: .
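As a sketch of the character-offset use case mentioned above, each word can be paired with the span of its parent token. The function name `word_offsets` is hypothetical; the code assumes the parent `Token` exposes `start_char` and `end_char`.

```python
def word_offsets(sentence):
    """Pair each word with the character span of the token it was derived from."""
    return [
        (word.text, word.parent.start_char, word.parent.end_char)
        for word in sentence.words
    ]
```

Words that come from the same multi-word token, such as de and le, would share the span of that token, because the offsets belong to the surface token rather than to the individual words.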
Training-Only Options
Most training-only options are documented in the argument parser of the MWT expander.
Resplitting tokens with MWT
In some circumstances, you may want to use the MWT processor to resplit known tokens. For example, we had an instance where an Italian dataset included token boundaries, but the tokens were not split into clitics. For this, we provide a utility function, [resplit_mwt](https://github.com/stanfordnlp/stanza/blob/2fb99e379b0b5c97c01795d53ebd52f38e34e97b/stanza/models/mwt/utils.py#L7).
This function takes a list of lists of strings, representing the known token boundaries, and a `Pipeline` with at minimum a tokenizer and an MWT processor. It retokenizes the text and returns a Stanza `Document`. An example usage is in the unit test for the resplit function.
If further processing is needed, this `Document` can be passed to another pipeline which uses the `tokenize_pretokenized` flag, in which case the second pipeline will respect the tokenization and MWT boundaries in the `Document`.
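A minimal sketch of calling the resplit utility, assuming Stanza and its Italian models are installed. The wrapper name `resplit_italian` and the sample sentence are illustrative assumptions; only the two arguments described above (the token lists and the pipeline) are passed.

```python
def resplit_italian(tokens):
    """Resplit pretokenized Italian text into MWTs (sketch, not a definitive recipe)."""
    # Imported lazily so that defining this function does not require Stanza.
    import stanza
    from stanza.models.mwt.utils import resplit_mwt

    pipeline = stanza.Pipeline(lang="it", processors="tokenize,mwt")
    # tokens: one inner list of token strings per sentence
    return resplit_mwt(tokens, pipeline)


# Hypothetical input: one sentence whose token boundaries are already known.
example_tokens = [["Voglio", "vederlo", "domani", "."]]

# With the models available, the result could be inspected like this:
#   doc = resplit_italian(example_tokens)
#   for token in doc.sentences[0].tokens:
#       print(token.text, [word.text for word in token.words])
```

Building the pipeline once and reusing it for many calls would be preferable in practice, since loading the models is the expensive step.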
Character Classifier
As of Stanza 1.9.0, there is a character classifier for certain MWT datasets in which the words of every token concatenate to exactly the original token. For example, in English, `cannot` is composed of `can` and `not`, and this holds for all tokens in English (given that the standard is to split `won't` into `wo` and `n't`). In French, this is not necessarily true, as the token `des` splits into `de les`, among others.
When training a new MWT model, whether the dataset has this property is automatically detected and logged at training time.
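The property in question can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not Stanza's detection code; the function names and the tiny example lists are assumptions made for illustration.

```python
def words_add_up_to_token(token, words):
    """True if the words concatenate to exactly the original token text."""
    return "".join(words) == token


def dataset_is_concatenative(examples):
    """True if every (token, words) pair in the dataset has the property."""
    return all(words_add_up_to_token(token, words) for token, words in examples)


english = [("cannot", ["can", "not"]), ("won't", ["wo", "n't"])]
french = [("du", ["de", "le"]), ("des", ["de", "les"])]

print(dataset_is_concatenative(english))  # True
print(dataset_is_concatenative(french))   # False
```

A single counterexample, such as du splitting into de and le, is enough to rule a dataset out.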