Tokenizes the text and performs sentence segmentation.
||Processor class name
Sentences, each containing a list of Tokens. This processor also predicts which tokens are multi-word tokens, but leaves expanding them to the MWT expander.
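To make the output shape concrete, here is a minimal sketch of a segmented document (hypothetical classes for illustration only, not the library's actual API): sentences hold lists of tokens, and tokens predicted to be multi-word are merely flagged, with expansion deferred to the MWT expander.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    text: str
    is_mwt: bool = False  # predicted multi-word token; expansion is left to the MWT expander

@dataclass
class Sentence:
    tokens: list = field(default_factory=list)

# A two-sentence French document as the tokenizer might produce it;
# "au" (a + le) is a contraction, so it is flagged as a multi-word token.
doc = [
    Sentence([Token("Nous"), Token("allons"), Token("au", is_mwt=True), Token("parc"), Token(".")]),
    Sentence([Token("Il"), Token("pleut"), Token(".")]),
]

# Iterate sentences and tokens; MWT-flagged tokens stay unexpanded here.
for sent in doc:
    print([t.text + ("*" if t.is_mwt else "") for t in sent.tokens])
```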
||When annotating, this argument specifies the maximum number of paragraphs to process as a minibatch for efficient processing.
Caveat: the larger this number is, the more working memory is required (main RAM or GPU RAM, depending on the computing device).
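The batching behavior can be sketched as follows (a simplified illustration under the assumptions above, not the actual implementation): paragraphs are grouped into minibatches of at most this size, so a larger setting keeps more text resident in memory at once in exchange for better device utilization.

```python
def minibatches(paragraphs, batch_size):
    """Yield successive groups of at most `batch_size` paragraphs."""
    for i in range(0, len(paragraphs), batch_size):
        yield paragraphs[i:i + batch_size]

paras = [f"paragraph {i}" for i in range(7)]
# With batch_size=3, seven paragraphs yield groups of sizes 3, 3, 1.
print([len(b) for b in minibatches(paras, 3)])  # → [3, 3, 1]
```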
||Assume the text is tokenized by whitespace and the sentences are split by newlines; do not run a model.
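With this option the segmentation is purely mechanical and no model is involved; a minimal sketch of the assumed behavior (not the library's code):

```python
def pretokenized_segment(text):
    """Newlines delimit sentences; whitespace delimits tokens. No model is run."""
    return [line.split() for line in text.split("\n") if line.strip()]

print(pretokenized_segment("This is a test .\nAnother sentence ."))
# → [['This', 'is', 'a', 'test', '.'], ['Another', 'sentence', '.']]
```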
Most training-only options are documented in the tokenizer's argument parser.
Note that to train the tokenizer for Vietnamese, the character-level labels generated from the plain text file and the CoNLL-U file need to be postprocessed into syllable-level labels; this is handled automatically if you use the training scripts we provide.