Pipeline
Users of StanfordNLP can process documents by building a `Pipeline` with the desired `Processor` units. The pipeline takes in a `Document` object or raw text, runs the processors in succession, and returns an annotated `Document`.
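For instance, a minimal sketch of this call pattern (assuming the default English models have already been downloaded with `stanfordnlp.download('en')`, as in the basic example below):

import stanfordnlp
# assumes the English models have been downloaded (see the basic example below)
nlp = stanfordnlp.Pipeline()                   # build a pipeline with the default processors
doc = nlp("StanfordNLP annotates raw text.")   # raw text in, annotated Document out
print(len(doc.sentences))                      # the returned Document is populated with sentences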
Options
| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| lang | str | "en" | Use recommended models for this language. |
| models_dir | str | ~/stanfordnlp_resources | Directory for storing the models. |
| processors | str | "tokenize,mwt,pos,lemma,depparse" | List of processors to use. For a list of all processors supported, see Processors Summary. |
| treebank | str | None | Use models for this treebank. If not specified, Pipeline will look up the default treebank for the language requested. |
| use_gpu | bool | True | Attempt to use a GPU if possible. |
Options for each of the individual processors can be specified when building the pipeline, as the basic example below demonstrates with `pos_batch_size`. See the individual processor pages for descriptions.
Usage
Basic Example
import stanfordnlp
MODELS_DIR = '.'
stanfordnlp.download('en', MODELS_DIR) # Download the English models
nlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=MODELS_DIR, treebank='en_ewt', use_gpu=True, pos_batch_size=3000) # Build the pipeline, specify part-of-speech processor's batch size
doc = nlp("Barack Obama was born in Hawaii.") # Run the pipeline on input text
doc.sentences[0].print_tokens() # Look at the result
Specifying A Full Config
import stanfordnlp
config = {
'processors': 'tokenize,mwt,pos,lemma,depparse', # Comma-separated list of processors to use
'lang': 'fr', # Language code for the language to build the Pipeline in
'tokenize_model_path': './fr_gsd_models/fr_gsd_tokenizer.pt', # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
'mwt_model_path': './fr_gsd_models/fr_gsd_mwt_expander.pt',
'pos_model_path': './fr_gsd_models/fr_gsd_tagger.pt',
'pos_pretrain_path': './fr_gsd_models/fr_gsd.pretrain.pt',
'lemma_model_path': './fr_gsd_models/fr_gsd_lemmatizer.pt',
'depparse_model_path': './fr_gsd_models/fr_gsd_parser.pt',
'depparse_pretrain_path': './fr_gsd_models/fr_gsd.pretrain.pt'
}
nlp = stanfordnlp.Pipeline(**config) # Initialize the pipeline using a configuration dict
doc = nlp("Van Gogh grandit au sein d'une famille de l'ancienne bourgeoisie.") # Run the pipeline on input text
doc.sentences[0].print_tokens() # Look at the result
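The model paths above assume the French GSD models are already available under ./fr_gsd_models/. If you do not have them, they can be fetched the same way as the English models in the basic example (the file layout under the chosen directory may differ from the paths shown above):

import stanfordnlp
# download the default French models into the current directory,
# mirroring the English download in the basic example above
stanfordnlp.download('fr', '.')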
After a pipeline is run, a `Document` object will be created and populated with annotation data. A `Document` contains a list of `Sentence`s, and a `Sentence` contains a list of `Token`s and `Word`s. For the most part `Token`s and `Word`s overlap, but some tokens can be divided into multiple words, for instance the French token `aux` is divided into the words `à` and `les` (illustrated in the sketch below). The dependency parses are derived over words.
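A short sketch of this token/word distinction with a French pipeline (this assumes the French models are available, e.g. via `stanfordnlp.download('fr')`; `sent.tokens` and `token.text` are assumed here by analogy with `sent.words` and `word.text`):

import stanfordnlp
# assumes the French models have been downloaded (e.g. stanfordnlp.download('fr'))
nlp = stanfordnlp.Pipeline(lang='fr', processors='tokenize,mwt')
doc = nlp("Il parle aux étudiants.")
sent = doc.sentences[0]
print([token.text for token in sent.tokens])  # "aux" appears as a single token (attributes assumed, see above)
print([word.text for word in sent.words])     # the mwt processor expands it into the words "à" and "les"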
In the code example below, the `Document` is named `doc`. After the text is annotated, the `for` loops in the list comprehension go through each sentence `sent` in `doc.sentences` and each word `word` in `sent.words`, and print out information about each word, specifically `word.text`, `word.lemma`, `word.upos`, and `word.xpos`.
import stanfordnlp
nlp = stanfordnlp.Pipeline()
doc = nlp("Barack Obama was born in Hawaii.")
print(*[f'text: {word.text+" "}\tlemma: {word.lemma}\tupos: {word.upos}\txpos: {word.xpos}' for sent in doc.sentences for word in sent.words], sep='\n')
The following output is generated:
text: Barack lemma: Barack upos: PROPN xpos: NNP
text: Obama lemma: Obama upos: PROPN xpos: NNP
text: was lemma: be upos: AUX xpos: VBD
text: born lemma: bear upos: VERB xpos: VBN
text: in lemma: in upos: ADP xpos: IN
text: Hawaii lemma: Hawaii upos: PROPN xpos: NNP
text: . lemma: . upos: PUNCT xpos: .
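Since the dependency parses are derived over words, the parse for a sentence can also be inspected once the depparse processor has run. A minimal sketch, assuming the default English models are available and the default processors (which include depparse) are used:

import stanfordnlp
nlp = stanfordnlp.Pipeline()  # the default processors include depparse
doc = nlp("Barack Obama was born in Hawaii.")
doc.sentences[0].print_dependencies()  # print each word with its governor and dependency relation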
Running On Pre-Tokenized Text
If you set the `tokenize_pretokenized` option, the text will be interpreted as already tokenized on whitespace and sentence-split by newlines. The tokenizer model will not be run.
import stanfordnlp
config = {
'processors': 'tokenize,pos',
'tokenize_pretokenized': True,
'pos_model_path': './en_ewt_models/en_ewt_tagger.pt',
'pos_pretrain_path': './en_ewt_models/en_ewt.pretrain.pt',
'pos_batch_size': 1000
}
nlp = stanfordnlp.Pipeline(**config)
doc = nlp('Joe Smith lives in California .\nHe loves pizza .')
print(doc.conll_file.conll_as_string())
You can also provide a list of lists representing sentences and tokens. Make sure to still set the `tokenize_pretokenized` option to `True`. Each list will represent the tokens of a sentence.
pretokenized_text = [['hello', 'world'], ['hello', 'world', 'again']]
doc = nlp(pretokenized_text)
print(doc.conll_file.conll_as_string())