StanfordNLP

This page describes the data objects used in StanfordNLP, and how they interact with each other.

Document

A Document object holds the annotation of an entire document and is generated automatically when the Pipeline annotates a string. It holds a collection of Sentences and can be converted to a CoNLL-U file.

Objects of this class expose useful properties such as text, sentences, and conll_file.
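
To make the CoNLL-U conversion concrete, here is a minimal sketch of how per-word annotations map onto the ten tab-separated CoNLL-U columns. It does not use StanfordNLP and is not the library's implementation: the function to_conllu, the constant CONLLU_FIELDS, and the plain-dict word representation are assumptions made for illustration only.

```python
# Hypothetical illustration: render sentences as CoNLL-U text.
# The ten column names follow the CoNLL-U format; everything else here
# (function name, dict-based words) is an assumption, not library code.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def to_conllu(sentences):
    """Render a list of sentences, each a list of per-word dicts."""
    blocks = []
    for words in sentences:
        # Missing annotations are written as an underscore.
        lines = [
            "\t".join(str(word.get(field, "_")) for field in CONLLU_FIELDS)
            for word in words
        ]
        blocks.append("\n".join(lines))
    # Every sentence, including the last, is followed by a blank line.
    return "\n\n".join(blocks) + "\n\n"


sentence = [
    {"id": 1, "form": "Hello", "lemma": "hello",
     "upos": "INTJ", "head": 0, "deprel": "root"},
]
conllu_text = to_conllu([sentence])
print(conllu_text, end="")
```

In the library itself, the equivalent information is reached through the conll_file property mentioned above, so a helper like this would only be needed outside of StanfordNLP.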

Sentence

A Sentence object represents a sentence as predicted by the tokenizer. It holds a list of the Tokens in the sentence as well as a list of all its Words. It also builds the dependency parse predicted by the parser through its member method build_dependencies.

Objects of this class expose useful properties such as words, tokens, and dependencies, as well as methods such as print_tokens, print_words, and print_dependencies.
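
The idea behind building dependencies from word-level annotations can be sketched as follows. This is a rough illustration under stated assumptions, not the library's build_dependencies: the SketchWord class and build_dependency_triples function are hypothetical, and the sketch assumes that a governor of 0 marks the root and that each dependency is a (head, relation, dependent) triple.

```python
from dataclasses import dataclass


@dataclass
class SketchWord:
    # Hypothetical stand-in for a word; not the library's Word class.
    index: int                 # 1-based position in the sentence
    text: str
    governor: int              # index of the head word; 0 is assumed to mean root
    dependency_relation: str


def build_dependency_triples(words):
    """Pair each word with its head as (head_text, relation, word_text)."""
    by_index = {word.index: word for word in words}
    triples = []
    for word in words:
        if word.governor == 0:
            head_text = "ROOT"
        else:
            head_text = by_index[word.governor].text
        triples.append((head_text, word.dependency_relation, word.text))
    return triples


words = [
    SketchWord(1, "She", 2, "nsubj"),
    SketchWord(2, "reads", 0, "root"),
    SketchWord(3, "books", 2, "obj"),
]
triples = build_dependency_triples(words)
for head, relation, dependent in triples:
    print(f"{dependent} --{relation}--> {head}")
```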

Token

A Token object holds a token and a list of its underlying words. If the token is a multi-word token (e.g., French au = à le), it has a range index as described in the CoNLL-U format specifications (e.g., 3-4), and its words property contains the underlying Words. Otherwise, the Token object is a simple wrapper around one Word object, and its words property is a singleton list.

Aside from index, which gives the token's 1-based index within the sentence, and words, which points to the underlying words, Token objects also provide text to access the raw form of the token, which is a substring of the input text.
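
The relationship between a multi-word token and its words can be pictured with a small stand-in data structure. The classes SketchToken and SketchWord and the method is_multiword below are hypothetical and exist only for this illustration; they mirror the index, text, and words properties described above but are not the library's classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchWord:
    # Hypothetical stand-in for a syntactic word.
    index: str
    text: str


@dataclass
class SketchToken:
    # Hypothetical stand-in for a token and its underlying words.
    index: str                 # "5" for a plain token, "3-4" for a multi-word token
    text: str                  # raw form, a substring of the input text
    words: List[SketchWord] = field(default_factory=list)

    def is_multiword(self):
        """A token is multi-word when it expands to more than one word."""
        return len(self.words) > 1


# In "Il va au parc", the token "au" covers the two words "à" and "le".
mwt = SketchToken("3-4", "au", [SketchWord("3", "à"), SketchWord("4", "le")])
simple = SketchToken("5", "parc", [SketchWord("5", "parc")])

for token in (mwt, simple):
    expansion = " ".join(word.text for word in token.words)
    print(token.index, token.text, "->", expansion)
```

Note that the texts of the expanded words (à, le) do not occur in the raw input, while the token text (au) does.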

Word

A Word object holds a syntactic word and all of its word-level annotations. For multi-word tokens (MWTs), Words are generated by multi-word token expansion and are used in all downstream syntactic analyses such as tagging, lemmatization, and parsing. If a Word is the result of an MWT expansion, its text will usually not be found in the raw input text. Aside from multi-word tokens, Words should be similar to the familiar “tokens” one would see elsewhere.

Word objects expose useful properties such as index, text, lemma, pos (which is an alias for xpos, the treebank-specific part-of-speech, e.g., NN), upos (universal part-of-speech, e.g., NOUN), feats (morphological features), governor (governor/head in the dependency parse), dependency_relation (dependency relation between this word and its head), and parent_token (the Token object that this Word is part of).
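
As a sketch of how these properties might be read in practice, the helper below walks a document's sentences and words and collects a few annotations per word. The function word_table is hypothetical, and the example runs against stand-in objects built with SimpleNamespace rather than a real annotated Document, so the values and their types are illustrative only. It relies solely on the property names listed above (sentences, words, index, text, lemma, upos, governor, dependency_relation).

```python
from types import SimpleNamespace


def word_table(doc):
    """Collect one row of annotations per word in the document."""
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            rows.append((
                word.index,
                word.text,
                word.lemma,
                word.upos,
                word.governor,
                word.dependency_relation,
            ))
    return rows


# Stand-in objects with the same attribute names; not real pipeline output.
stand_in_doc = SimpleNamespace(sentences=[
    SimpleNamespace(words=[
        SimpleNamespace(index="1", text="Dogs", lemma="dog", upos="NOUN",
                        governor=2, dependency_relation="nsubj"),
        SimpleNamespace(index="2", text="bark", lemma="bark", upos="VERB",
                        governor=0, dependency_relation="root"),
    ]),
])

rows = word_table(stand_in_doc)
for row in rows:
    print("\t".join(str(value) for value in row))
```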
