Dependency Parsing

Description
Options
Example Usage
- Accessing Syntactic Dependency Information
- Start with Pretagged Document
Training-Only Options

Description

The dependency parsing module builds a tree structure of words from the input sentence, which represents the syntactic dependency relations between words. The resulting tree representations, which follow the Universal Dependencies formalism, are useful in many downstream applications. In Stanza, dependency parsing is performed by the DepparseProcessor, and can be invoked with the name depparse.

Name	Annotator class name	Requirement	Generated Annotation	Description
depparse	DepparseProcessor	tokenize, mwt, pos, lemma	Determines the syntactic head of each word in a sentence and the dependency relation between the two words that are accessible through `Word`’s `head` and `deprel` attributes.	Provides an accurate syntactic dependency parsing analysis.

Options

Option name	Type	Default	Description
depparse_batch_size	int	5000	When annotating, this argument specifies the maximum number of words to process as a minibatch for efficient processing. Caveat: the larger this number is, the more working memory is required (main RAM or GPU RAM, depending on the computating device). This parameter should be set larger than the number of words in the longest sentence in your input document, or you might run into unexpected behaviors.
depparse_pretagged	bool	False	Assume the document is tokenized and pretagged. Only run dependency parsing on the document.

Example Usage

Running the DepparseProcessor requires the TokenizeProcessor, MWTProcessor, POSProcessor, and LemmaProcessor. After all these processors have been run, each Sentence in the output would have been parsed into Universal Dependencies (version 2) structure, where the head index of each Word can be accessed by the property head, and the dependency relation between the words deprel. Note that the head index starts at 1 for actual words, and is 0 only when the word itself is the root of the tree. This index should be offset by 1 when looking for the govenor word in the sentence.

Accessing Syntactic Dependency Information

Here is an example of parsing a sentence and accessing syntactic parse information from each word:

import stanza

nlp = stanza.Pipeline(lang='fr', processors='tokenize,mwt,pos,lemma,depparse')
doc = nlp('Nous avons atteint la fin du sentier.')
print(*[f'id: {word.id}\tword: {word.text}\thead id: {word.head}\thead: {sent.words[word.head-1].text if word.head > 0 else "root"}\tdeprel: {word.deprel}' for sent in doc.sentences for word in sent.words], sep='\n')

As can be seen in the output, the syntactic head of the word Nous is atteint, and the dependency relation between the two words is nsubj (Nous is a nominal subject for atteint).

id: 1   word: Nous      head id: 3      head: atteint   deprel: nsubj
id: 2   word: avons     head id: 3      head: atteint   deprel: aux:tense
id: 3   word: atteint   head id: 0      head: root      deprel: root
id: 4   word: la        head id: 5      head: fin       deprel: det
id: 5   word: fin       head id: 3      head: atteint   deprel: obj
id: 6   word: de        head id: 8      head: sentier   deprel: case
id: 7   word: le        head id: 8      head: sentier   deprel: det
id: 8   word: sentier   head id: 5      head: fin       deprel: nmod
id: 9   word: .         head id: 3      head: atteint   deprel: punct

Start with Pretagged Document

Normally, the depparse processor depends on tokenize, mwt, pos, and lemma processors. However, in cases you wish to use your own tokenization, multi-word token expansion, POS tagging and lemmatization, you can skip the restriction and pass the pretagged document (with upos, xpos, feats, lemma) by setting depparse_pretagged to True.

Here is an example of dependency parsing with pretokenized and pretagged document:

import stanza
from stanza.models.common.doc import Document

nlp = stanza.Pipeline(lang='en', processors='depparse', depparse_pretagged=True)
pretagged_doc = Document([[{'id': 1, 'text': 'Test', 'lemma': 'Test', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing'}, {'id': 2, 'text': 'sentence', 'lemma': 'sentence', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing'}, {'id': 3, 'text': '.', 'lemma': '.', 'upos': 'PUNCT', 'xpos': '.'}]])
doc = nlp(pretagged_doc)

Training-Only Options

Most training-only options are documented in the argument parser of the dependency parser.