Pipeline
Users of StanfordNLP can process documents by building a Pipeline with the desired Processor units. The pipeline takes in a Document
object or raw text, runs the processors in succession, and returns an annotated Document.
Options
| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| lang | str | "en" | Use recommended models for this language. |
| models_dir | str | ~/stanfordnlp_resources | Directory for storing the models. |
| processors | str | "tokenize,mwt,pos,lemma,depparse" | List of processors to use. For a list of all processors supported, see Processors Summary. |
| treebank | str | None | Use models for this treebank. If not specified, Pipeline will look up the default treebank for the language requested. |
| use_gpu | bool | True | Attempt to use a GPU if possible. |
Options for each of the individual processors can be specified when building the pipeline. See the individual processor pages for descriptions.
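Processor-specific options are passed as keyword arguments named "{processor_name}_{argument_name}". As a minimal sketch, reusing the pos_batch_size and tokenize_pretokenized options that appear elsewhere on this page:
import stanfordnlp
# Processor-specific options follow the "{processor_name}_{argument_name}" naming pattern
nlp = stanfordnlp.Pipeline(processors='tokenize,pos',
                           tokenize_pretokenized=True,  # option for the tokenize processor
                           pos_batch_size=2000)         # option for the pos processor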
Usage
Basic Example
import stanfordnlp
MODELS_DIR = '.'
stanfordnlp.download('en', MODELS_DIR) # Download the English models
nlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=MODELS_DIR, treebank='en_ewt', use_gpu=True, pos_batch_size=3000) # Build the pipeline, specify part-of-speech processor's batch size
doc = nlp("Barack Obama was born in Hawaii.") # Run the pipeline on input text
doc.sentences[0].print_tokens() # Look at the result
Specifying A Full Config
import stanfordnlp
config = {
'processors': 'tokenize,mwt,pos,lemma,depparse', # Comma-separated list of processors to use
'lang': 'fr', # Language code for the language to build the Pipeline in
'tokenize_model_path': './fr_gsd_models/fr_gsd_tokenizer.pt', # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
'mwt_model_path': './fr_gsd_models/fr_gsd_mwt_expander.pt',
'pos_model_path': './fr_gsd_models/fr_gsd_tagger.pt',
'pos_pretrain_path': './fr_gsd_models/fr_gsd.pretrain.pt',
'lemma_model_path': './fr_gsd_models/fr_gsd_lemmatizer.pt',
'depparse_model_path': './fr_gsd_models/fr_gsd_parser.pt',
'depparse_pretrain_path': './fr_gsd_models/fr_gsd.pretrain.pt'
}
nlp = stanfordnlp.Pipeline(**config) # Initialize the pipeline using a configuration dict
doc = nlp("Van Gogh grandit au sein d'une famille de l'ancienne bourgeoisie.") # Run the pipeline on input text
doc.sentences[0].print_tokens() # Look at the result
After a pipeline is run, a Document object will be created and populated with annotation data.
A Document contains a list of Sentences, and a Sentence contains a list of Tokens and
Words. For the most part Tokens and Words overlap, but some tokens can be divided into
multiple words, for instance the French token aux is divided into the words à and les. The
dependency parses are derived over words.
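As a brief sketch of the token/word distinction (this assumes the French models have already been downloaded, and that each Token exposes its underlying Word objects via token.words):
import stanfordnlp
nlp = stanfordnlp.Pipeline(lang='fr')  # assumes the French models are already downloaded
doc = nlp("Il a parlé aux enfants.")
for token in doc.sentences[0].tokens:
    # a multi-word token such as "aux" maps to several underlying words ("à", "les")
    print(token.text, [word.text for word in token.words])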
In the following example, the Document is named doc. After the text is annotated, the nested comprehension
iterates over each sentence sent in doc.sentences and each word word in sent.words, and information
about each word is printed out, specifically word.text, word.lemma, word.upos, and word.xpos.
import stanfordnlp
nlp = stanfordnlp.Pipeline()
doc = nlp("Barack Obama was born in Hawaii.")
print(*[f'text: {word.text+" "}\tlemma: {word.lemma}\tupos: {word.upos}\txpos: {word.xpos}' for sent in doc.sentences for word in sent.words], sep='\n')
The following output is generated:
text: Barack lemma: Barack upos: PROPN xpos: NNP
text: Obama lemma: Obama upos: PROPN xpos: NNP
text: was lemma: be upos: AUX xpos: VBD
text: born lemma: bear upos: VERB xpos: VBN
text: in lemma: in upos: ADP xpos: IN
text: Hawaii lemma: Hawaii upos: PROPN xpos: NNP
text: . lemma: . upos: PUNCT xpos: .
Running On Pre-Tokenized Text
If you set the tokenize_pretokenized option to True, the text will be interpreted as already tokenized on whitespace and split into sentences on newlines.
The tokenizer model will not be run.
import stanfordnlp
config = {
'processors': 'tokenize,pos',
'tokenize_pretokenized': True,
'pos_model_path': './en_ewt_models/en_ewt_tagger.pt',
'pos_pretrain_path': './en_ewt_models/en_ewt.pretrain.pt',
'pos_batch_size': 1000
}
nlp = stanfordnlp.Pipeline(**config)
doc = nlp('Joe Smith lives in California .\nHe loves pizza .')
print(doc.conll_file.conll_as_string())
You can also provide a list of lists of strings, where each inner list contains the tokens of one sentence. Make sure to still set the tokenize_pretokenized option to True.
pretokenized_text = [['hello', 'world'], ['hello', 'world', 'again']]
doc = nlp(pretokenized_text)
print(doc.conll_file.conll_as_string())