Pipeline and Processors

Pipeline
Processors
- Processor variants
- Building your own Processors and using them in the neural pipeline
Advanced Pipeline Options
- Build Pipeline from a Config Dictionary

Pipeline

For basic end-to-end examples, please see Getting Started.

To start annotating text with Stanza, you typically begin by building a Pipeline that contains Processors, each fulfilling a specific NLP task (e.g., tokenization, part-of-speech tagging, syntactic parsing, etc.). The pipeline takes in raw text or a Document object that contains partial annotations, runs the specified processors in succession, and returns an annotated Document (see the documentation on Document for more information on how to extract these annotations).

To build and customize the pipeline, you can specify the options in the table below:

Option	Type	Default	Description
lang	`str`	`'en'`	Language code (e.g., `"en"`) or language name (e.g., `"English"`) for the language to process with the Pipeline. You can find a complete list of available languages here.
dir	`str`	`'~/stanza_resources'`	Directory for storing the models downloaded for Stanza. By default, Stanza stores its models in a folder in your home directory.
package	`dict` or `str`	`'default'`	Package to use for processors, where each package typically specifies what data the models are trained on. We provide a “default” package for all languages that contains NLP models most users will find useful. A complete list of available packages can be found here. If a `dict` is used, any processor not specified will use the `default` package. The `ner` processor also accepts a list of packages in the `dict`.
processors	`dict` or `str`	`dict()`	Processors to use in the Pipeline. This can be specified either as a comma-separated list of processor names (e.g., `'tokenize,pos'`) or as a Python dictionary with Processor names as keys and packages as corresponding values (e.g., `{'tokenize': 'ewt', 'pos': 'ewt'}`). In the case of a dict, all unspecified Processors will fall back to using the package specified by the `package` argument. To ensure that only the processors you want are loaded when using a dict, set `package=None` as well. A list of all Processors supported can be found here.
logging_level	`str`	`'INFO'`	Controls the level of logging information to display when the Pipeline is instantiated and run. Can be one of `'DEBUG'`, `'INFO'`, `'WARN'`, `'ERROR'`, `'CRITICAL'`, or `'FATAL'`. Progressively less information is displayed from `'DEBUG'` to `'FATAL'`.
verbose	`bool`	`None`	Simplified option for logging level. If `True`, logging level will be set to `'INFO'`. If `False`, logging level will be set to `'ERROR'`.
use_gpu	`bool`	`True`	Attempt to use a GPU if available. Set this to `False` if you are in a GPU-enabled environment but want to explicitly keep Stanza from using the GPU.
device	`str`	`None`	Which device to use for the Pipeline. Can be used to put the pipeline on `cuda:1` instead of `cuda:0`, for example.
kwargs	-	-	Options for each of the individual processors. See the individual processor pages for descriptions.
{processor}_model_path	-	-	Path to load an alternate model. For example, `pos_model_path=xyz.pt` to load `xyz.pt` for the pos processor.
{processor}_pretrain_path	-	-	For processors which use word vectors, path to load an alternate set of word vectors. For example, `pos_pretrain_path=abc.pt` to load the `abc.pt` pretrain for the pos processor.
{processor}_forward_charlm_path	-	-	For processors which use the pretrained charlm, path to load an alternate forward model. For example, `pos_forward_charlm_path=abc.pt` to load the `abc.pt` pretrained forward charlm for the pos processor.
{processor}_backward_charlm_path	-	-	For processors which use the pretrained charlm, path to load an alternate backward model. For example, `pos_backward_charlm_path=abc.pt` to load the `abc.pt` pretrained backward charlm for the pos processor.
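
As a rough illustration of how the options above fit together, the sketch below assembles keyword arguments for the Pipeline, including the `{processor}_model_path` naming pattern. The helper `build_pipeline_kwargs` is hypothetical and not part of Stanza, and `xyz.pt` is just the placeholder path used in the table. The final call to `stanza.Pipeline` is left commented out because it needs Stanza and the model files to be installed.

```python
def build_pipeline_kwargs(lang, processors, model_paths=None, use_gpu=True):
    """Hypothetical helper: collect Pipeline options into one dict."""
    kwargs = {
        "lang": lang,
        # Comma-separated processor names, e.g. "tokenize,pos"
        "processors": ",".join(processors),
        "use_gpu": use_gpu,
    }
    # Per-processor overrides follow the "{processor}_model_path" pattern
    for processor, path in (model_paths or {}).items():
        kwargs[f"{processor}_model_path"] = path
    return kwargs


kwargs = build_pipeline_kwargs(
    lang="en",
    processors=["tokenize", "pos"],
    model_paths={"pos": "xyz.pt"},  # placeholder path from the table above
    use_gpu=False,
)
print(kwargs)

# With Stanza installed and the model file present, this could be passed on as:
# import stanza
# nlp = stanza.Pipeline(**kwargs)
```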

Processors

Processors are units of the neural pipeline that perform specific NLP functions and create different annotations for a Document. The neural pipeline supports the following processors:

Name	Processor class name	Requirement	Generated Annotation	Description
tokenize	TokenizeProcessor	-	Segments a `Document` into `Sentence`s, each containing a list of `Token`s. This processor also predicts which tokens are multi-word tokens, but leaves expanding them to the MWTProcessor.	Tokenizes the text and performs sentence segmentation.
mwt	MWTProcessor	tokenize	Expands multi-word tokens (MWTs) into multiple words when they are predicted by the tokenizer. Each `Token` will correspond to one or more `Word`s after tokenization and MWT expansion.	Expands multi-word tokens (MWTs) predicted by the TokenizeProcessor. For some languages, such as Chinese, which do not have any multi-word tokens at all, this processor is not implemented.
pos	POSProcessor	tokenize, mwt	UPOS, XPOS, and UFeats annotations are accessible through `Word`’s properties `pos`, `xpos`, and `ufeats`.	Labels tokens with their universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological features (UFeats).
lemma	LemmaProcessor	tokenize, mwt, pos	Perform lemmatization on a `Word` using the `Word.text` and `Word.upos` values. The result can be accessed as `Word.lemma`.	Generates the word lemmas for all words in the Document.
depparse	DepparseProcessor	tokenize, mwt, pos, lemma	Determines the syntactic head of each word in a sentence and the dependency relation between the two words that are accessible through `Word`’s `head` and `deprel` attributes.	Provides an accurate syntactic dependency parsing analysis.
ner	NERProcessor	tokenize, mwt	Named entities accessible through `Document` or `Sentence`’s properties `entities` or `ents`. Token-level NER tags accessible through `Token`’s property `ner`.	Recognizes named entities for all token spans in the corpus.
sentiment	SentimentProcessor	tokenize, mwt	Sentiment scores of 0, 1, or 2 (negative, neutral, positive). Accessible using a `Sentence`’s `sentiment` property.	Assigns per-sentence sentiment scores.
constituency	ConstituencyProcessor	tokenize, mwt, pos	Parse trees accessible through a `Sentence`’s `constituency` property.	Parses each sentence in a document using a phrase structure parser.
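
The Requirement column can be read as a dependency table. The sketch below transcribes it into a dict and reports which prerequisites are missing from a requested processor list. `REQUIREMENTS` and `missing_requirements` are names made up for this illustration. This is not how Stanza itself validates a pipeline, and since `mwt` is not implemented for every language, treat the result as a reading aid for the table rather than a strict check.

```python
# Requirement column of the table above, transcribed by hand
REQUIREMENTS = {
    "tokenize": set(),
    "mwt": {"tokenize"},
    "pos": {"tokenize", "mwt"},
    "lemma": {"tokenize", "mwt", "pos"},
    "depparse": {"tokenize", "mwt", "pos", "lemma"},
    "ner": {"tokenize", "mwt"},
    "sentiment": {"tokenize", "mwt"},
    "constituency": {"tokenize", "mwt", "pos"},
}


def missing_requirements(processors):
    """Map each requested processor to the prerequisites absent from the request."""
    requested = [name.strip() for name in processors.split(",")]
    missing = {}
    for name in requested:
        absent = REQUIREMENTS[name] - set(requested)
        if absent:
            missing[name] = sorted(absent)
    return missing


print(missing_requirements("tokenize,mwt,pos,lemma,depparse"))  # {}
print(missing_requirements("tokenize,depparse"))
# {'depparse': ['lemma', 'mwt', 'pos']}
```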

Processor variants

New in v1.1

Sometimes you might want to build your own models to perform the tasks that existing processors already handle in the Stanza neural pipeline, or simply experiment with alternative toolkits for a specific task. Processor variants are here to help with that.

One example use case is using your own tokenizer for tokenization. We previously added support for popular tokenizers like spaCy for English and jieba for Chinese, and now we have made it much easier to add your own. You simply need to implement a ProcessorVariant and register it with Stanza using the @register_processor_variant decorator. For instance, this is the outline of how our spaCy tokenizer is implemented:

from stanza.pipeline.processor import ProcessorVariant, register_processor_variant

@register_processor_variant('tokenize', 'spacy')
class SpacyTokenizer(ProcessorVariant):
    def __init__(self, config):
        # initialize spacy
        pass

    def process(self, text):
        # tokenize text with spacy
        pass

This allows the user to set tokenize_with_spacy to True (or processors={"tokenize": "spacy"}) when instantiating the pipeline to use it. In the case of the tokenizer, the TokenizeProcessor handles options such as pre-tokenization and pre-segmentation, and only passes text to the variants when tokenization from raw text is needed.

Alternatively, you can implement a processor variant as a drop-in replacement for a processor by setting OVERRIDE to True in the ProcessorVariant class, for instance:

from stanza.pipeline.processor import ProcessorVariant, register_processor_variant

@register_processor_variant("lemma", "cool")
class CoolLemmatizer(ProcessorVariant):
    ''' An alternative lemmatizer that lemmatizes every word to "cool". '''

    OVERRIDE = True

    def __init__(self, lang):
        pass

    def process(self, document):
        for sentence in document.sentences:
            for word in sentence.words:
                word.lemma = "cool"

        return document

This lemmatizer will replace all of the functionality of the Stanza lemmatizer when it’s used in the pipeline, and lemmatize every single word to “cool”.

Note:

It is essential to import the file where the variant is defined, so that the @register_processor_variant decorator actually runs and registers the variant.
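
To see why the import matters, here is a minimal sketch of a decorator-based registry in plain Python. It is a generic illustration of the pattern, not Stanza's actual implementation; `VARIANTS`, `register_variant`, and `UpperCaser` are hypothetical names. The class is added to the registry only when the decorated definition is executed, which happens when the module defining it is imported.

```python
VARIANTS = {}  # hypothetical registry: (processor, variant name) -> class


def register_variant(processor, name):
    """Return a class decorator that records the class under (processor, name)."""
    def decorator(cls):
        VARIANTS[(processor, name)] = cls
        return cls
    return decorator


# Nothing is registered until a decorated class definition is executed
print(("tokenize", "upper") in VARIANTS)  # False


@register_variant("tokenize", "upper")
class UpperCaser:
    def process(self, text):
        return text.upper().split()


# Executing the definition above (e.g. by importing its module) registered it
print(("tokenize", "upper") in VARIANTS)  # True
print(VARIANTS[("tokenize", "upper")]().process("hello world"))  # ['HELLO', 'WORLD']
```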

Building your own Processors and using them in the neural pipeline

New in v1.1

If you’re looking to implement annotation capabilities that don’t currently exist in Stanza and want to use them in the neural pipeline, it has never been easier. You can simply implement a Processor class and register it with Stanza using the @register_processor decorator, and then use it in your project and/or publish it for other Stanza users.

Here is an example:

from stanza.pipeline.processor import Processor, register_processor

@register_processor("lowercase")
class LowercaseProcessor(Processor):
    ''' Processor that lowercases all text '''
    _requires = set(['tokenize'])
    _provides = set(['lowercase'])

    def __init__(self, device, config, pipeline):
        pass

    def _set_up_model(self, *args):
        pass

    def process(self, doc):
        doc.text = doc.text.lower()
        for sent in doc.sentences:
            for tok in sent.tokens:
                tok.text = tok.text.lower()

            for word in sent.words:
                word.text = word.text.lower()

        return doc

Once registered, you can use this processor in the pipeline as if it were one of Stanza’s standard processors:

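# TEST_MODELS_DIR is not defined in this snippet; set it to the directory that holds your downloaded models
nlp = stanza.Pipeline(dir=TEST_MODELS_DIR, lang='en', processors='tokenize,lowercase')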

and in this case the processor will lowercase all text in the document.

Advanced Pipeline Options

Build Pipeline from a Config Dictionary

When there are many options you want to configure, or even set programmatically, it might not be convenient to set them one by one using keyword arguments to instantiate the Pipeline. In these cases, alternatively, you can build the desired pipeline with a config dictionary, allowing maximum customization for the pipeline:

import stanza

config = {
    # Comma-separated list of processors to use
    'processors': 'tokenize,mwt,pos',
    # Language code for the language to build the Pipeline in
    'lang': 'fr',
    # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
    # You only need model paths if you have a specific model outside of stanza_resources
    'tokenize_model_path': './fr_gsd_models/fr_gsd_tokenizer.pt',
    'mwt_model_path': './fr_gsd_models/fr_gsd_mwt_expander.pt',
    'pos_model_path': './fr_gsd_models/fr_gsd_tagger.pt',
    'pos_pretrain_path': './fr_gsd_models/fr_gsd.pretrain.pt',
    # Use pretokenized text as input and disable tokenization
    'tokenize_pretokenized': True
}
nlp = stanza.Pipeline(**config) # Initialize the pipeline using a configuration dict
doc = nlp("Van Gogh grandit au sein d'une famille de l'ancienne bourgeoisie .") # Run the pipeline on the pretokenized input text
print(doc) # Look at the result

Here, we can specify the language, processors, and paths for many Processor models all at once, and pass that to the Pipeline initializer. Note that config dictionaries and keyword arguments can be combined as well, to maximize your flexibility in using Stanza’s neural pipeline.
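
One way to combine a config dictionary with keyword arguments is sketched below. `make_pipeline` is a stand-in that only records its arguments, so the example runs without Stanza or any models; with Stanza installed, `stanza.Pipeline` would take its place. The `use_gpu` and `lang` overrides are chosen purely for illustration.

```python
def make_pipeline(**kwargs):
    """Stand-in for stanza.Pipeline: returns the options it received."""
    return kwargs


config = {
    "processors": "tokenize,mwt,pos",
    "lang": "fr",
    "tokenize_pretokenized": True,
}

# Extra keyword arguments can sit alongside the unpacked dict
options = make_pipeline(**config, use_gpu=False)
print(options["lang"], options["use_gpu"])  # fr False

# Passing the same key both ways is a Python error (TypeError), so merge
# the dicts first when an override is intended; later keys win
overrides = {"lang": "en"}
merged = make_pipeline(**{**config, **overrides})
print(merged["lang"])  # en
```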