Tokenization & Sentence Segmentation

Description

Tokenization and sentence segmentation in Stanza are jointly performed by the TokenizeProcessor. This processor splits the raw input text into tokens and sentences, so that downstream annotation can happen at the sentence level. This processor can be invoked by the name tokenize.

| Name | Annotator class name | Requirement | Generated Annotation | Description |
| --- | --- | --- | --- | --- |
| tokenize | TokenizeProcessor | - | Segments a Document into Sentences, each containing a list of Tokens. This processor also predicts which tokens are multi-word tokens, but leaves expanding them to the MWTProcessor. | Tokenizes the text and performs sentence segmentation. |
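
As a quick illustration of the multi-word token prediction mentioned above, the sketch below runs the tokenizer together with the MWTProcessor on French, where contractions such as "du" are marked as multi-word tokens and expanded into their underlying words. The example sentence is our own illustrative choice, not taken from the official documentation.

import stanza

# French pipeline with tokenization and multi-word token expansion.
# The example sentence is purely illustrative.
nlp = stanza.Pipeline(lang='fr', processors='tokenize,mwt')
doc = nlp('Je parle du problème.')
for token in doc.sentences[0].tokens:
    # A multi-word token such as "du" expands into several words ("de", "le").
    print(f'token: {token.text}\twords: {[word.text for word in token.words]}')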

Options

The following options are available to configure the TokenizeProcessor when instantiating the Pipeline:

| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| tokenize_batch_size | int | 32 | When annotating, this argument specifies the maximum number of paragraphs to process as a minibatch for efficient processing. Caveat: the larger this number is, the more working memory is required (main RAM or GPU RAM, depending on the computing device). |
| tokenize_pretokenized | bool | False | Assume the text is tokenized by whitespace and sentence-split by newlines. Do not run a model. |
| tokenize_no_ssplit | bool | False | Assume the sentences are split by two consecutive newlines (\n\n). Only run tokenization and disable sentence segmentation. |
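
These options are passed as keyword arguments when constructing the Pipeline, as the examples later in this page also show. The snippet below is a minimal sketch; the batch size of 64 is an arbitrary illustrative value, not a recommendation.

import stanza

# Construct an English tokenization pipeline with a larger minibatch size.
# tokenize_batch_size=64 is an arbitrary example value.
nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_batch_size=64)
doc = nlp('This is a test sentence for stanza. This is another sentence.')
print(len(doc.sentences))  # number of sentences produced by the tokenizer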

Example Usage

The TokenizeProcessor is usually the first processor used in the pipeline. It performs tokenization and sentence segmentation at the same time. After this processor is run, the input document will become a list of Sentences. Each Sentence contains a list of Tokens, which can be accessed with the property tokens.

Tokenization and Sentence Segmentation

Here is a simple example of performing tokenization and sentence segmentation on a piece of plaintext:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('This is a test sentence for stanza. This is another sentence.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

This code will generate the following output, which shows that the text is segmented into two sentences, each containing a few tokens:

====== Sentence 1 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: a
id: 4   text: test
id: 5   text: sentence
id: 6   text: for
id: 7   text: stanza
id: 8   text: .
====== Sentence 2 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: another
id: 4   text: sentence
id: 5   text: .

You can also use the tokenizer just for sentence segmentation. To access segmented sentences, simply use

print([sentence.text for sentence in doc.sentences])
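
For the example above, this should print something like:

['This is a test sentence for stanza.', 'This is another sentence.']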

Tokenization without Sentence Segmentation

Sometimes you might want to tokenize your text given existing sentences (e.g., in machine translation). You can perform tokenization without sentence segmentation, as long as the sentences are separated by two consecutive newlines (\n\n) in the raw text. Just set tokenize_no_ssplit to True to disable sentence segmentation. Here is an example:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_no_ssplit=True)
doc = nlp('This is a sentence.\n\nThis is a second. This is a third.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

As you can see in the output below, this produces only two Sentences; the second contains all the tokens that would have fallen into the second and third sentences had sentence segmentation been performed.

====== Sentence 1 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: a
id: 4   text: sentence
id: 5   text: .
====== Sentence 2 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: a
id: 4   text: second
id: 5   text: .
id: 6   text: This
id: 7   text: is
id: 8   text: a
id: 9   text: third
id: 10  text: .

Contrast this with Stanza’s output when tokenize_no_ssplit is set to False (its default value):

====== Sentence 1 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: a
id: 4   text: sentence
id: 5   text: .
====== Sentence 2 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: a
id: 4   text: second
id: 5   text: .
====== Sentence 3 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: a
id: 4   text: third
id: 5   text: .

Note that sentence segmentation is performed here, as is normally the case.

Start with Pretokenized Text

In some cases, you might have already tokenized your text and just want to use Stanza for downstream processing. In these cases, you can feed pretokenized (and sentence-split) text into the pipeline as newline (\n) separated sentences, where each sentence consists of space-separated tokens. Just set tokenize_pretokenized to True to bypass the neural tokenizer.

The code below shows an example of bypassing the neural tokenizer:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_pretokenized=True)
doc = nlp('This is token.ization done my way!\nSentence split, too!')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

An alternative to passing in a string is to pass in a list of lists of strings, representing a document with sentences, each sentence a list of tokens.

The equivalent of our example above would be:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_pretokenized=True)
doc = nlp([['This', 'is', 'token.ization', 'done', 'my', 'way!'], ['Sentence', 'split,', 'too!']])
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

As you can see in the output below, no further tokenization or sentence segmentation is performed (note how punctuation marks remain attached to the ends of tokens, and even appear inside tokens).

====== Sentence 1 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: token.ization
id: 4   text: done
id: 5   text: my
id: 6   text: way!
====== Sentence 2 tokens =======
id: 1   text: Sentence
id: 2   text: split,
id: 3   text: too!

Contrast this with Stanza’s output when tokenize_pretokenized is set to False (its default value), in which case Stanza performs tokenization and sentence segmentation as it sees fit:

====== Sentence 1 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: token
id: 4   text: .
id: 5   text: ization
id: 6   text: done
id: 7   text: my
id: 8   text: way
id: 9   text: !
====== Sentence 2 tokens =======
id: 1   text: Sentence
id: 2   text: split
id: 3   text: ,
id: 4   text: too
id: 5   text: !

Stanza will also accept a pretokenized Document for further processing with this flag:

import stanza

nlp_tokenized = stanza.Pipeline(lang='en', processors='tokenize', tokenize_pretokenized=True)
nlp_pos = stanza.Pipeline(lang='en', processors='tokenize,pos', tokenize_pretokenized=True)
doc = nlp_tokenized('This is token.ization done my way!\nSentence split, too!')
doc = nlp_pos(doc)
print("{:C}".format(doc))

for i, sentence in enumerate(doc.sentences):
    print(sentence)

This will output the following (results may vary depending on the POS model used):

# text = This is token.ization done my way!
# sent_id = 0
1       This    _       PRON    DT      Number=Sing|PronType=Dem        0       _       _       start_char=0|end_char=4
2       is      _       AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   1       _       _       start_char=5|end_char=7
3       token.ization   _       PROPN   NNP     Number=Sing     2       _       _       start_char=8|end_char=21
4       done    _       VERB    VBN     Tense=Past|VerbForm=Part        3       _       _       start_char=22|end_char=26
5       my      _       PRON    PRP$    Case=Gen|Number=Sing|Person=1|Poss=Yes|PronType=Prs     4       _       _       start_char=27|end_char=29
6       way!    _       NOUN    NN      Number=Sing     5       _       _       start_char=30|end_char=34

# text = Sentence split, too!
# sent_id = 1
1       Sentence        _       NOUN    NN      Number=Sing     0       _       _       start_char=35|end_char=43
2       split,  _       ADJ     JJ      Degree=Pos      1       _       _       start_char=44|end_char=50
3       too!    _       PUNCT   .       _       2       _       _       start_char=51|end_char=55

Use spaCy for Fast Tokenization and Sentence Segmentation

While our neural pipeline can achieve significantly higher accuracy, a rule-based tokenizer such as spaCy runs much faster when processing large-scale text. We provide an interface for using spaCy as the tokenizer for English by simply specifying it in the processors option. Please make sure you have successfully downloaded and installed spaCy and its English models following their usage guide.
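
For example, spaCy and a small English model can typically be installed as follows (check the spaCy documentation for the currently recommended commands and model names):

pip install spacy
python -m spacy download en_core_web_sm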

To perform tokenization and sentence segmentation with spaCy, simply set the package for the TokenizeProcessor to spacy, as in the following example:

import stanza

nlp = stanza.Pipeline(lang='en', processors={'tokenize': 'spacy'}) # spaCy tokenizer is currently only allowed in English pipeline.
doc = nlp('This is a test sentence for stanza. This is another sentence.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

This allows us to tokenize the text with spaCy and use the result in downstream Stanza annotations. The output is:

====== Sentence 1 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: a
id: 4   text: test
id: 5   text: sentence
id: 6   text: for
id: 7   text: stanza
id: 8   text: .
====== Sentence 2 tokens =======
id: 1   text: This
id: 2   text: is
id: 3   text: another
id: 4   text: sentence
id: 5   text: .

Training-Only Options

Most training-only options are documented in the argument parser of the tokenizer.

Note that to train the tokenizer for Vietnamese, one needs to postprocess the character labels generated from the plain text file and the CoNLL-U file to form syllable-level labels, which is automatically handled if you are using the training scripts we provide.
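
For instance, the full list of training arguments can usually be inspected by asking the tokenizer training module for its help text (module path as of recent Stanza versions; check the repository if it has moved):

python -m stanza.models.tokenizer --help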