Pipeline and Processors
Pipeline
For basic end to end examples, please see Getting Started.
To annotate text with Stanza, you would typically begin by building a Pipeline that contains Processors, each fulfilling a specific NLP task you desire (e.g., tokenization, part-of-speech tagging, syntactic parsing, etc.). The pipeline takes in raw text or a Document object that contains partial annotations, runs the specified processors in succession, and returns an annotated Document (see the documentation on Document for more information on how to extract these annotations).
To build and customize the pipeline, you can specify the options in the table below:
Option | Type | Default | Description |
---|---|---|---|
lang | str | 'en' | Language code (e.g., "en") or language name (e.g., "English") for the language to process with the Pipeline. You can find a complete list of available languages here. |
dir | str | '~/stanza_resources' | Directory for storing the models downloaded for Stanza. By default, Stanza stores its models in a folder in your home directory. |
package | dict or str | 'default' | Package to use for processors, where each package typically specifies what data the models are trained on. We provide a “default” package for all languages that contains NLP models most users will find useful. A complete list of available packages can be found here. If a dict is used, any processor not specified will use the default package. The ner entry in the dict may be a list of packages. |
processors | dict or str | dict() | Processors to use in the Pipeline. This can either be specified as a comma-separated list of processor names to use (e.g., 'tokenize,pos'), or a Python dictionary with Processor names as keys and packages as corresponding values (e.g., {'tokenize': 'ewt', 'pos': 'ewt'}). In the case of a dict, all unspecified Processors will fall back to using the package specified by the package argument. To ensure that only the processors you want are loaded when using a dict, set package=None as well. A list of all Processors supported can be found here. |
logging_level | str | 'INFO' | Controls the level of logging information to display when the Pipeline is instantiated and run. Can be one of 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL', or 'FATAL'. Progressively less information is displayed going from 'DEBUG' to 'FATAL'. |
verbose | bool | None | Simplified option for logging level. If True, logging level will be set to 'INFO'. If False, logging level will be set to 'ERROR'. |
use_gpu | bool | True | Attempt to use a GPU if available. Set this to False if you are in a GPU-enabled environment but want to explicitly keep Stanza from using the GPU. |
device | str | None | Which device to use for the Pipeline. Can be used to put the pipeline on cuda:1 instead of cuda:0, for example. |
kwargs | - | - | Options for each of the individual processors. See the individual processor pages for descriptions. |
{processor}_model_path | - | - | Path to load an alternate model. For example, pos_model_path=xyz.pt to load xyz.pt for the pos processor. |
{processor}_pretrain_path | - | - | For processors which use word vectors, path to load an alternate set of word vectors. For example, pos_pretrain_path=abc.pt to load the abc.pt pretrain for the pos processor. |
{processor}_forward_charlm_path | - | - | For processors which use the pretrained charlm, path to load an alternate forward model. For example, pos_forward_charlm_path=abc.pt to load the abc.pt pretrained forward charlm for the pos processor. |
{processor}_backward_charlm_path | - | - | For processors which use the pretrained charlm, path to load an alternate backward model. For example, pos_backward_charlm_path=abc.pt to load the abc.pt pretrained backward charlm for the pos processor. |
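The fallback rule for the processors and package options can be easier to follow in code. The sketch below is a pure-Python illustration of the rule as the table describes it, not Stanza's actual implementation; `resolve_packages` and its `known_processors` parameter are hypothetical names introduced only for this example.

```python
from typing import Dict, Iterable, Optional, Union


def resolve_packages(
    processors: Union[str, Dict[str, str]],
    package: Optional[str] = "default",
    known_processors: Iterable[str] = ("tokenize", "mwt", "pos", "lemma", "depparse", "ner"),
) -> Dict[str, Optional[str]]:
    """Map each processor that would be loaded to the package it would use.

    Hypothetical helper: it mirrors the rule described in the options table,
    it is not part of Stanza.
    """
    if isinstance(processors, str):
        # Comma-separated string: only the listed processors are loaded,
        # each from the package given by the `package` argument.
        names = [name.strip() for name in processors.split(",") if name.strip()]
        return {name: package for name in names}

    # Dict: listed processors keep their own package ...
    resolved: Dict[str, Optional[str]] = dict(processors)
    # ... and every unlisted processor falls back to `package`,
    # unless package=None is used to switch the fallback off.
    if package is not None:
        for name in known_processors:
            resolved.setdefault(name, package)
    return resolved


print(resolve_packages("tokenize,pos"))
print(resolve_packages({"tokenize": "ewt"}, package=None))
```

The second call shows why package=None matters with a dict: without it, every processor in `known_processors` would also be loaded from the fallback package.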
Processors
Processors are units of the neural pipeline that perform specific NLP functions and create different annotations for a Document. The neural pipeline supports the following processors:
Name | Processor class name | Requirement | Generated Annotation | Description |
---|---|---|---|---|
tokenize | TokenizeProcessor | - | Segments a Document into Sentences, each containing a list of Tokens. This processor also predicts which tokens are multi-word tokens, but leaves expanding them to the MWTProcessor. | Tokenizes the text and performs sentence segmentation. |
mwt | MWTProcessor | tokenize | Expands multi-word tokens (MWTs) into multiple words when they are predicted by the tokenizer. Each Token will correspond to one or more Words after tokenization and MWT expansion. | Expands multi-word tokens (MWTs) predicted by the TokenizeProcessor. This is only applicable to some languages. |
pos | POSProcessor | tokenize, mwt | UPOS, XPOS, and UFeats annotations are accessible through Word’s properties pos, xpos, and ufeats. | Labels tokens with their universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological features (UFeats). |
lemma | LemmaProcessor | tokenize, mwt, pos | Performs lemmatization on a Word using the Word.text and Word.upos values. The result can be accessed as Word.lemma. | Generates the word lemmas for all words in the Document. |
depparse | DepparseProcessor | tokenize, mwt, pos, lemma | Determines the syntactic head of each word in a sentence and the dependency relation between the two words, accessible through Word’s head and deprel attributes. | Provides an accurate syntactic dependency parsing analysis. |
ner | NERProcessor | tokenize, mwt | Named entities are accessible through Document or Sentence’s properties entities or ents. Token-level NER tags are accessible through Token’s property ner. | Recognizes named entities for all token spans in the corpus. |
sentiment | SentimentProcessor | tokenize, mwt | Sentiment scores of 0, 1, or 2 (negative, neutral, positive). Accessible using a Sentence’s sentiment property. | Assigns per-sentence sentiment scores. |
constituency | ConstituencyProcessor | tokenize, mwt, pos | Parse trees are accessible through a Sentence’s constituency property. | Parses each sentence in a document using a phrase structure parser. |
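The Requirement column can be read as a small dependency graph. As a minimal sketch, the helper below expands a requested processor list so that every requirement from the table comes first. `REQUIREMENTS` is copied by hand from the table and `with_requirements` is a hypothetical helper, not a Stanza function. Note that the table lists mwt as a requirement throughout, while the mwt row says it only applies to some languages, so the expanded list may include mwt for a language that has no MWT model.

```python
# Requirements copied from the "Requirement" column of the table above.
REQUIREMENTS = {
    "tokenize": [],
    "mwt": ["tokenize"],
    "pos": ["tokenize", "mwt"],
    "lemma": ["tokenize", "mwt", "pos"],
    "depparse": ["tokenize", "mwt", "pos", "lemma"],
    "ner": ["tokenize", "mwt"],
    "sentiment": ["tokenize", "mwt"],
    "constituency": ["tokenize", "mwt", "pos"],
}


def with_requirements(requested: str) -> str:
    """Return a comma-separated processor list with all requirements added.

    Hypothetical helper for illustration; raises KeyError for a processor
    name that is not in the table.
    """
    ordered = []

    def add(name: str) -> None:
        if name in ordered:
            return
        # Add everything this processor depends on before the processor itself.
        for dependency in REQUIREMENTS[name]:
            add(dependency)
        ordered.append(name)

    for name in requested.split(","):
        name = name.strip()
        if name:
            add(name)
    return ",".join(ordered)


print(with_requirements("depparse"))
print(with_requirements("ner,pos"))
```

The resulting string has the same comma-separated form that the processors option accepts.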
Processor variants
New in v1.1
Sometimes you might want to build your own models to perform the tasks that existing processors already handle in the Stanza neural pipeline, or simply experiment with alternative toolkits for a specific task. Processor variants are here to help with that.
One example use case is using your own tokenizer for tokenization. Previously we have added support for popular tokenizers like spaCy for English and jieba for Chinese, and now we have made it much easier to add your own. You simply need to implement a ProcessorVariant and register it with Stanza using the @register_processor_variant decorator. For instance, this is how our spaCy tokenizer is implemented:
from stanza.pipeline.processor import ProcessorVariant, register_processor_variant

@register_processor_variant('tokenize', 'spacy')
class SpacyTokenizer(ProcessorVariant):
    def __init__(self, config):
        # initialize spacy
        pass

    def process(self, text):
        # tokenize text with spacy
        pass
This allows the user to set tokenize_with_spacy as True (or processors={"tokenize": "spacy"}) when instantiating the pipeline to use it. In the case of the tokenizer, the TokenizeProcessor handles options such as pre-tokenization and pre-segmentation, and only passes text to the variants when tokenization from raw text is needed.
Alternatively, one can also implement a processor variant as a drop-in replacement for a processor by setting OVERRIDE as True in the ProcessorVariant class, for instance:
@register_processor_variant("lemma", "cool")
class CoolLemmatizer(ProcessorVariant):
    ''' An alternative lemmatizer that lemmatizes every word to "cool". '''

    OVERRIDE = True

    def __init__(self, lang):
        pass

    def process(self, document):
        for sentence in document.sentences:
            for word in sentence.words:
                word.lemma = "cool"
        return document
This lemmatizer will replace all of the functionality of the Stanza lemmatizer when it’s used in the pipeline, and lemmatize every single word to “cool”.
It is essential to import the file where the variant is defined to trigger the @register_processor_variant decorator.
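The reason the import matters is general Python behavior: a class decorator runs when the class definition is executed, which happens when its module is imported. The toy registry below sketches that mechanism under stated assumptions. `VARIANTS`, `register_variant`, and `get_variant` are hypothetical stand-ins written for this example and do not reproduce Stanza's internals.

```python
# Toy registry: {processor name: {variant name: class}}. Hypothetical, not Stanza's.
VARIANTS = {}


def register_variant(processor: str, variant: str):
    """Return a class decorator that records the class in VARIANTS."""
    def decorator(cls):
        VARIANTS.setdefault(processor, {})[variant] = cls
        return cls
    return decorator


def get_variant(processor: str, variant: str):
    """Look up a registered variant, failing clearly if it was never registered."""
    try:
        return VARIANTS[processor][variant]
    except KeyError:
        raise ValueError(
            f"No variant {variant!r} registered for {processor!r}; "
            "was the module defining it imported?"
        ) from None


# The decorator runs here, at class-definition time. If this class lived in
# another file, that file would have to be imported for this line to execute.
@register_variant("lemma", "cool")
class CoolLemmatizer:
    def process(self, words):
        return ["cool" for _ in words]


print(get_variant("lemma", "cool")().process(["dogs", "ran"]))
```

If the module defining a variant is never imported, its decorator never runs, and a lookup in a registry like this one has nothing to find.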
Building your own Processors and using them in the neural pipeline
New in v1.1
If you’re looking to implement annotation capabilities that don’t currently exist in Stanza and want to use them in the neural pipeline, it has never been easier. You can simply implement a Processor class and register it with Stanza using the @register_processor decorator, and then it’s easy to use it in your project and/or publish it for other Stanza users.
Here is an example:
from stanza.pipeline.processor import Processor, register_processor

@register_processor("lowercase")
class LowercaseProcessor(Processor):
    ''' Processor that lowercases all text '''
    _requires = set(['tokenize'])
    _provides = set(['lowercase'])

    def __init__(self, device, config, pipeline):
        pass

    def _set_up_model(self, *args):
        pass

    def process(self, doc):
        doc.text = doc.text.lower()
        for sent in doc.sentences:
            for tok in sent.tokens:
                tok.text = tok.text.lower()

            for word in sent.words:
                word.text = word.text.lower()

        return doc
Once registered, you can use this processor in the pipeline as if it were one of Stanza’s standard processors
nlp = stanza.Pipeline(dir=TEST_MODELS_DIR, lang='en', processors='tokenize,lowercase')
and in this case the processor will lowercase all text in the document.
Advanced Pipeline Options
Build Pipeline from a Config Dictionary
When there are many options you want to configure, or you want to set them programmatically, passing them one by one as keyword arguments to the Pipeline can be inconvenient. In these cases, you can instead build the desired pipeline from a config dictionary, allowing maximum customization for the pipeline:
import stanza

config = {
    # Comma-separated list of processors to use
    'processors': 'tokenize,mwt,pos',
    # Language code for the language to build the Pipeline in
    'lang': 'fr',
    # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
    # You only need model paths if you have a specific model outside of stanza_resources
    'tokenize_model_path': './fr_gsd_models/fr_gsd_tokenizer.pt',
    'mwt_model_path': './fr_gsd_models/fr_gsd_mwt_expander.pt',
    'pos_model_path': './fr_gsd_models/fr_gsd_tagger.pt',
    'pos_pretrain_path': './fr_gsd_models/fr_gsd.pretrain.pt',
    # Use pretokenized text as input and disable tokenization
    'tokenize_pretokenized': True
}
nlp = stanza.Pipeline(**config)  # Initialize the pipeline using a configuration dict
doc = nlp("Van Gogh grandit au sein d'une famille de l'ancienne bourgeoisie .")  # Run the pipeline on the pretokenized input text
print(doc)  # Look at the result
Here, we can specify the language, processors, and paths for many Processor models all at once, and pass that to the Pipeline initializer. Note that config dictionaries and keyword arguments can be combined as well, to maximize your flexibility in using Stanza’s neural pipeline.
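One detail to watch when combining a config dictionary with keyword arguments: in Python, a call such as `stanza.Pipeline(**config, lang='de')` raises a TypeError if `lang` is also a key in `config`, because the same keyword is supplied twice. A minimal sketch of one way around this is to merge the overrides into a copy of the dictionary first. `merge_config` and `build_pipeline` are hypothetical helpers, and `build_pipeline` assumes Stanza and the required models are installed.

```python
def merge_config(config: dict, **overrides) -> dict:
    """Return a copy of `config` with keyword overrides applied on top."""
    merged = dict(config)      # copy, so the original config is left untouched
    merged.update(overrides)   # keyword arguments win over dictionary entries
    return merged


def build_pipeline(kwargs: dict):
    """Create the pipeline from merged options (requires Stanza and its models)."""
    import stanza  # imported lazily so the merge logic can be used without Stanza
    return stanza.Pipeline(**kwargs)


config = {"lang": "fr", "processors": "tokenize,mwt,pos"}
kwargs = merge_config(config, use_gpu=False, processors="tokenize,mwt")
print(kwargs)
# nlp = build_pipeline(kwargs)  # uncomment once Stanza and the French models are available
```

When the dictionary and the keyword arguments share no keys, the merge step is unnecessary and the two can be passed together directly.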