Link

Pipeline and Processors

Table of contents


Pipeline

To start annotating text with Stanza, you would typically start by building a Pipeline that contains Processors, each fulfilling a specific NLP task you desire (e.g., tokenization, part-of-speech tagging, syntactic parsing, etc). The pipeline takes in raw text or a Document object that contains partial annotations, runs the specified processors in succession, and returns an annotated Document (see the documentation on Document for more information on how to extract these annotations).

To build and customize the pipeline, you can specify the options in the table below:

OptionTypeDefaultDescription
langstr'en'Language code (e.g., "en") or language name (e.g., "English") for the language to process with the Pipeline. You can find a complete list of available languages here.
dirstr'~/stanza_resources'Directory for storing the models downloaded for Stanza. By default, Stanza stores its models in a folder in your home directory.
packagestr'default'Package to use for processors, where each package typically specifies what data the models are trained on. We provide a “default” package for all languages that contains NLP models most users will find useful. A complete list of available packages can be found here.
processorsdict or strdict()Processors to use in the Pipeline. This can either be specified as a comma-seperated list of processor names to use (e.g., 'tokenize,pos'), or a Python dictionary with Processor names as keys and packages as corresponding values (e.g., {'tokenize': 'ewt', 'pos': 'ewt'}). All unspecified Processors will fall back to using the package specified by the package argument. A list of all Processors supported can be found here.
logging_levelstr'INFO'Controls the level of logging information to display when the Pipeline is instantiated and run. Can be one of 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CIRTICAL', or 'FATAL'. Less and less information will be displayed from 'DEBUG' to 'FATAL'.
verbosestrNoneSimplified option for logging level. If True, logging level will be set to 'INFO'. If False, logging level will be set to 'ERROR'.
use_gpuboolTrueAttempt to use a GPU if available. Set this to False if you are in a GPU-enabled environment but want to explicitly keep Stanza from using the GPU.
kwargs--Options for each of the individual processors. See the individual processor pages for descriptions.

Processors

Processors are units of the neural pipeline that perform specific NLP functions and create different annotations for a Document. The neural pipeline supports the following processors:

NameProcessor class nameRequirementGenerated AnnotationDescription
tokenizeTokenize­Processor-Segments a Document into Sentences, each containing a list of Tokens. This processor also predicts which tokens are multi-word tokens, but leaves expanding them to the MWTProcessor.Tokenizes the text and performs sentence segmentation.
mwtMWT­ProcessortokenizeExpands multi-word tokens (MWTs) into multiple words when they are predicted by the tokenizer. Each Token will correspond to one or more Words after tokenization and MWT expansion.Expands multi-word tokens (MWT) predicted by the TokenizeProcessor. This is only applicable to some languages.
posPOS­Processortokenize, mwtUPOS, XPOS, and UFeats annotations are accessible through Word’s properties pos, xpos, and ufeats.Labels tokens with their universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological features (UFeats).
lemmaLemma­Processortokenize, mwt, posPerform lemmatization on a Word using the Word.text and Word.upos values. The result can be accessed as Word.lemma.Generates the word lemmas for all words in the Document.
depparseDepparse­Processortokenize, mwt, pos, lemmaDetermines the syntactic head of each word in a sentence and the dependency relation between the two words that are accessible through Word’s head and deprel attributes.Provides an accurate syntactic dependency parsing analysis.
nerNER­Processortokenize, mwtNamed entities accessible through Document or Sentence’s properties entities or ents. Token-level NER tags accessible through Token’s properties ner.Recognize named entities for all token spans in the corpus.

Processor variants

New in v1.1

Sometimes you might want to build your own models to perform the tasks that existing processors already handle in the Stanza neural pipeline, or simply experiment with alternative toolkits for a specific task. Processor variants are here to help with that.

One example use case is using your own tokenizer for tokenization. Previously we have added support for popular tokenizers like spaCy for English and jieba for Chinese, and now we have made it much easier to add your own. You simply need to implement a ProcessorVariant and register it with Stanza using the @register_processor_variant decorator. For instance, this is how our spaCy tokenizer is implemented

from stanza.pipeline.processor import ProcessorVariant, register_processor_variant

@register_processor_variant('tokenize', 'spacy')
class SpacyTokenizer(ProcessorVariant):
    def __init__(self, config):
        # initialize spacy

    def process(self, text):
        # tokenize text with spacy

This allows the user to set tokenize_with_spacy as True (or processors={"tokenize": "spacy"}) when instantiating the pipeline to use it. In the case of the tokenizer, the TokenizeProcessor handles options such as pre-tokenization and pre-sengmentation, and only passes text to the variants when tokenization from raw text is needed.

Alternatively, one can also implement a processor variant as a drop-in replacement for a processor by setting OVERRIDE as True in the ProcessorVariant class, for instance

@register_processor_variant("lemma", "cool")
class CoolLemmatizer(ProcessorVariant):
    ''' An alternative lemmatizer that lemmatizes every word to "cool". '''

    OVERRIDE = True

    def __init__(self, lang):
        pass

    def process(self, document):
        for sentence in document.sentences:
            for word in sentence.words:
                word.lemma = "cool"

        return document

This lemmatizer will replace all of the functionality of the Stanza lemmatizer when it’s used in the pipeline, and lemmatize every single word to “cool”.

Building your own Processors and using them in the neural pipeline

New in v1.1

If you’re looking to implement annotation capabilities that don’t currently exist in Stanza and want to use it in the neural pipeline, it hasn’t been easier. You can simply implement a Processor class and register it with Stanza using the @register_processor decorator, and then it’s easy to use it in your project, and/or publish it for other Stanza users to use.

Here is an example:

@register_processor("lowercase")
class LowercaseProcessor(Processor):
    ''' Processor that lowercases all text '''
    _requires = set(['tokenize'])
    _provides = set(['lowercase'])

    def __init__(self, config, pipeline, use_gpu):
        pass

    def _set_up_model(self, *args):
        pass

    def process(self, doc):
        doc.text = doc.text.lower()
        for sent in doc.sentences:
            for tok in sent.tokens:
                tok.text = tok.text.lower()

            for word in sent.words:
                word.text = word.text.lower()

        return doc

Once registered, you can use this processor in the pipeline as if it were one of Stanza’s standard processors

nlp = stanza.Pipeline(dir=TEST_MODELS_DIR, lang='en', processors='tokenize,lowercase')

and in this case the processor will lowercase all text in the document.

Usage

Using Stanza’s neural Pipeline to annotate your text can be as simple as a few lines of Python code. Here we provide two simple examples, and refer the user to our tutorials for further details on how to use each Processor.

Basic Example

To annotate a piece of text, you can easily build the Stanza Pipeline with options introduced above:

import stanza

nlp = stanza.Pipeline('en', processors='tokenize,pos', use_gpu=True, pos_batch_size=3000) # Build the pipeline, specify part-of-speech processor's batch size
doc = nlp("Barack Obama was born in Hawaii.") # Run the pipeline on the input text
print(doc) # Look at the result

Here, we are building a Pipeline for English that performs tokenization, sentence segmentation, and POS tagging that runs on the GPU, and POS tagging is limited to processing 3000 words at one time to avoid excessive GPU memory consumption.

You can find more examples about how to use these options here.

Build Pipeline from a Config Dictionary

When there are many options you want to configure, or even set programmatically, it might not be convenient to set them one by one using keyword arguments to instantiate the Pipeline. In these cases, alternatively, you can build the desired pipeline with a config dictionary, allowing maximum customization for the pipeline:

import stanza

config = {
	'processors': 'tokenize,mwt,pos', # Comma-separated list of processors to use
	'lang': 'fr', # Language code for the language to build the Pipeline in
	'tokenize_model_path': './fr_gsd_models/fr_gsd_tokenizer.pt', # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
	'mwt_model_path': './fr_gsd_models/fr_gsd_mwt_expander.pt',
	'pos_model_path': './fr_gsd_models/fr_gsd_tagger.pt',
	'pos_pretrain_path': './fr_gsd_models/fr_gsd.pretrain.pt',
	'tokenize_pretokenized': True # Use pretokenized text as input and disable tokenization
}
nlp = stanza.Pipeline(**config) # Initialize the pipeline using a configuration dict
doc = nlp("Van Gogh grandit au sein d'une famille de l'ancienne bourgeoisie .") # Run the pipeline on the pretokenized input text
print(doc) # Look at the result

Here, we can specify the language, processors, and paths for many Processor models all at once, and pass that to the Pipeline initializer. Note that config dictionaries and keyword arguments can be combined as well, to maximize your flexibility in using Stanza’s neural pipeline.