Constituency Parser


Description

Constituency parsing is added to the stanza pipeline by using a shift-reduce parser.

| Name | Annotator class name | Requirement | Generated Annotation | Description |
| --- | --- | --- | --- | --- |
| constituency | ConstituencyProcessor | tokenize, mwt, pos | constituency | Adds the constituency annotation to each Sentence in the Document |

Options

| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| model_path | string | depends on the language | Where to load the model. |
| pretrain_path | string | depends on the language | Which set of pretrained word vectors to use. Can be changed for existing models, but this is not recommended, as the models are trained to work specifically with one set of word vectors. |
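
As a minimal sketch of how these options might be supplied, the helper below builds the keyword arguments for `stanza.Pipeline`. It assumes the usual Stanza convention of prefixing an option with its processor name (`constituency_model_path`, `constituency_pretrain_path`); the helper names and the file path are hypothetical and only serve the illustration.

```python
def constituency_kwargs(model_path=None, pretrain_path=None):
    """Build Pipeline keyword arguments for the constituency options.

    Assumption: each option is passed with the processor name as a prefix,
    e.g. constituency_model_path. Options left as None keep their defaults.
    """
    kwargs = {"lang": "en", "processors": "tokenize,pos,constituency"}
    if model_path is not None:
        kwargs["constituency_model_path"] = model_path
    if pretrain_path is not None:
        kwargs["constituency_pretrain_path"] = pretrain_path
    return kwargs


def build_pipeline(**options):
    """Create the pipeline; stanza is imported lazily so the helper above
    can be inspected without stanza installed."""
    import stanza
    return stanza.Pipeline(**constituency_kwargs(**options))


# Hypothetical local path; replace it with a real model file before use.
kwargs = constituency_kwargs(model_path="/path/to/my_constituency_model.pt")
print(kwargs)
```

Leaving `pretrain_path` unset keeps the word vectors the model was trained with, which matches the recommendation in the table.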

Example Usage

The ConstituencyProcessor adds a constituency / phrase structure parse tree to each Sentence.

Bracket types depend on the treebank; for example, the PTB model uses the PTB bracket types. Custom models can support any set of labels as long as you have training data.

Simple code example

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency')
doc = nlp('This is a test')
for sentence in doc.sentences:
    print(sentence.constituency)

The output produced (aside from logging) will be:

(ROOT (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test)))))

The tree can be programmatically accessed. Note that the layer under the root has two children, one for the NP "This" and one for the VP "is a test".

>>> tree = doc.sentences[0].constituency
>>> tree.label
'ROOT'
>>> tree.children
[(S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test))))]
>>> tree.children[0].children
[(NP (DT This)), (VP (VBZ is) (NP (DT a) (NN test)))]
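
The `label` and `children` attributes shown above are enough to walk the tree recursively. The sketch below collects the words under every node with a given label. It relies only on those two attributes and assumes that a leaf is a node with no children whose label is the word itself. `Node`, `leaf_words`, and `find_phrases` are hypothetical names for this illustration; `Node` is a stand-in so the example runs without a pipeline, not Stanza's own tree class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in exposing the `label` / `children` attributes."""
    label: str
    children: List["Node"] = field(default_factory=list)


def leaf_words(tree):
    """Return the words under `tree`, left to right.

    Assumes a leaf is a node without children whose label is the word.
    """
    if not tree.children:
        return [tree.label]
    words = []
    for child in tree.children:
        words.extend(leaf_words(child))
    return words


def find_phrases(tree, label):
    """Return the text of every non-leaf subtree labeled `label`, in pre-order."""
    found = []
    if tree.children and tree.label == label:
        found.append(" ".join(leaf_words(tree)))
    for child in tree.children:
        found.extend(find_phrases(child, label))
    return found


# Hand-built copy of the tree printed above.
example = Node("ROOT", [Node("S", [
    Node("NP", [Node("DT", [Node("This")])]),
    Node("VP", [
        Node("VBZ", [Node("is")]),
        Node("NP", [Node("DT", [Node("a")]), Node("NN", [Node("test")])]),
    ]),
])])

print(find_phrases(example, "NP"))  # ['This', 'a test']
```

Because the functions are duck-typed, passing `doc.sentences[0].constituency` instead of `example` should behave the same way, provided the leaf assumption holds.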

Available models

As of Stanza 1.4.0, charlm is included by default in each of the conparse models. This improves accuracy by around 1.0 F1 when trained for a long time. The currently released models were trained for 250 iterations of 5000 trees each, so for languages with large datasets such as English, there may still be room to improve further.

We also release a set of models which incorporate HuggingFace transformer models such as BERT or RoBERTa. This significantly increases the scores for the constituency parser.

The transformer models can be used by setting the package parameter when creating a pipeline:

pipe = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency', package={'constituency': 'wsj_bert'})

Note that the following scores are slight underestimates. They use the CoreNLP scorer, which gives scores slightly lower than the evalb script.

| Language | Dataset | Base score | Transformer score | Notes |
| --- | --- | --- | --- | --- |
| Chinese | CTB7 | 86.85 | 90.98 | Future work: update to later versions of CTB |
| Danish | Arboretum | 82.7 | 83.45 | Non-projective constituents are rearranged |
| English | PTB | 93.21 | 95.64 | with NMLs and separated dashes |
| English | PTB3 | | 95.72 | |
| Italian | Turin | 89.42 | 92.76 | Test scores are on Evalita |
| Italian | VIT | 78.52 | 82.43 | Split based on UD VIT (some trees dropped) |
| Japanese | ALT | 91.03 | | Transformers were not used - required a separate tokenizer |
| Portuguese | Cintil | 90.98 | 93.61 | |
| Spanish | AnCora + LDC | ??? | ??? | Compared against a combination of the test sets |
| Turkish | Starlang | 73.04 | 75.7 | |
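
To make the scores in the table more concrete, here is a simplified illustration of labeled bracket F1 computed over bracketed tree strings. It is a sketch only: it is neither the CoreNLP scorer nor evalb, it skips the `ROOT` node and part-of-speech nodes, and it applies none of the punctuation or label-equivalence rules those tools use. All function names are hypothetical.

```python
from collections import Counter


def parse_brackets(text):
    """Parse a bracketed tree into nested (label, children) tuples.

    Leaves (words) are returned as plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            label = tokens[pos + 1]
            pos += 2
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1  # consume the closing bracket
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()


def labeled_spans(tree):
    """Count (label, start, end) for each phrase node, skipping ROOT and POS tags."""
    spans = Counter()

    def walk(node, start):
        if isinstance(node, str):
            return start + 1  # a word covers exactly one position
        label, children = node
        end = start
        for child in children:
            end = walk(child, end)
        is_pos_tag = len(children) == 1 and isinstance(children[0], str)
        if not is_pos_tag and label != "ROOT":
            spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def bracket_f1(gold_text, pred_text):
    """Harmonic mean of labeled bracket precision and recall."""
    gold = labeled_spans(parse_brackets(gold_text))
    pred = labeled_spans(parse_brackets(pred_text))
    matched = sum((gold & pred).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred.values())
    recall = matched / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


gold = "(ROOT (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test)))))"
pred = "(ROOT (S (NP (DT This)) (VP (VBZ is) (NP (DT a)) (NP (NN test)))))"
print(round(bracket_f1(gold, pred), 3))  # 0.667
```

In this example the prediction matches three of the four gold brackets and proposes five, giving a precision of 0.6 and a recall of 0.75.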

As of Stanza 1.3.0, there was an English model trained on PTB. It achieved a test score of 91.5 using the inorder transition scheme.