Constituency Parser

Constituency parsing is added to the Stanza pipeline using a shift-reduce parser.

| Name | Annotator class name | Requirement | Generated Annotation | Description |
| --- | --- | --- | --- | --- |
| constituency | ConstituencyProcessor | tokenize, mwt, pos | constituency | Adds the constituency annotation to each Sentence in the Document |


| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| 'model_path' | string | depends on the language | Where to load the model. |
| 'pretrain_path' | string | depends on the language | Which set of pretrained word vectors to use. Can be changed for existing models, but this is not recommended, as the models are trained to work specifically with one set of word vectors. |

Example Usage

The ConstituencyProcessor adds a constituency / phrase structure parse tree to each Sentence.

Bracket types depend on the treebank; for example, the PTB model uses the PTB bracket types. Custom models can support any set of labels, as long as you have training data.

Simple code example

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency')
doc = nlp('This is a test')
for sentence in doc.sentences:
    print(sentence.constituency)

The output produced (aside from logging) will be:

(ROOT (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test)))))

The tree can be accessed programmatically. Note that the layer under the root has two children, one for the NP "This" and one for the VP "is a test".

>>> tree = doc.sentences[0].constituency
>>> tree.label
'ROOT'
>>> tree.children
[(S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test))))]
>>> tree.children[0].children
[(NP (DT This)), (VP (VBZ is) (NP (DT a) (NN test)))]
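The label and children attributes are enough to walk the tree recursively. As a sketch, here is how one might collect all NP subtrees from the parse above; note that this uses a minimal stand-in Node class rather than stanza's own tree objects, since only the two attributes shown above are assumed:

```python
# Minimal stand-in for a parse tree node; stanza's trees expose
# the same `label` and `children` attributes used here.
class Node:
    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)

    def __str__(self):
        # Leaves print as bare tokens, internal nodes in PTB bracket style
        if not self.children:
            return self.label
        return "(%s %s)" % (self.label, " ".join(str(c) for c in self.children))

def find_subtrees(tree, label):
    """Recursively collect every subtree whose label matches."""
    found = []
    if tree.label == label:
        found.append(tree)
    for child in tree.children:
        found.extend(find_subtrees(child, label))
    return found

# The tree for "This is a test", mirroring the output shown above
tree = Node("ROOT",
            Node("S",
                 Node("NP", Node("DT", Node("This"))),
                 Node("VP",
                      Node("VBZ", Node("is")),
                      Node("NP", Node("DT", Node("a")), Node("NN", Node("test"))))))

for np in find_subtrees(tree, "NP"):
    print(np)   # prints (NP (DT This)) then (NP (DT a) (NN test))
```

The same traversal works unchanged on the constituency attribute of a parsed Sentence.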

Available models

As of Stanza 1.4.0, a character language model (charlm) is included by default in each of the constituency parser models. This improves accuracy by around 1.0 F1 when trained for a long time. The currently released models were trained for 250 iterations of 5000 trees each, so for languages with large datasets such as English, there may have been room to improve further.

We also release a set of models which incorporate HuggingFace transformer models such as BERT or RoBERTa. These significantly increase the scores of the constituency parser.

The transformer models can be used by setting the package parameter when creating a pipeline:

pipe = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency', package='default_accurate')

Note that the following scores are slight underestimates. They use the CoreNLP scorer, which gives scores slightly lower than the evalb script.

| Language | Dataset | Base score | Transformer score | Notes |
| --- | --- | --- | --- | --- |
| Chinese | CTB5.1 | 86.8 | 91.44 | Future work: build CTB9 model |
| Danish | Arboretum | 82.96 | 84.4 | Non-projective constituents are rearranged |
| English | PTB3-revised | 93.16 | 96.08 | with NMLs and separated dashes |
| Italian | Turin | 91.83 | 94.57 | Test scores are on Evalita |
| Italian | VIT | 80.41 | 84.41 | Split based on UD VIT (some trees dropped) |
| Japanese | ALT | 91.4 | 91.89 | Transformers were not used - required a separate tokenizer |
| Spanish | AnCora + LDC | ??? | ??? | Compared against a combination of the test sets |
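Both the CoreNLP scorer and evalb compute the same underlying metric: labeled-bracket F1, i.e. precision and recall over (label, start, end) constituent spans, with the small score differences coming from details such as which brackets each scorer deletes or ignores. A simplified sketch of the computation (the gold and predicted span sets here are invented for illustration):

```python
# Simplified labeled-bracket F1, the metric behind the scores above.
# Each parse is reduced to a set of (label, start, end) constituent spans;
# real scorers such as evalb add extra rules (e.g. ignoring punctuation).

def bracket_f1(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    matched = len(gold & pred)          # spans correct in label AND extent
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

# Invented example: the prediction gets one span's label wrong
gold = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)]
pred = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("PP", 2, 4)]
print(round(bracket_f1(gold, pred), 4))  # 3 of 4 spans match -> 0.75
```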

As of Stanza 1.3.0, the released English model was trained on PTB. It achieved a test score of 91.5 using the in-order transition scheme.

Treebank descriptions


Chinese

Currently constructed from CTB 5.1, as this is a frequently used benchmark for constituency parsing tasks. We have the code and data available for CTB 9.0, but have simply not yet produced or released that model.


English

For English, the default model uses the Penn Treebank, with a couple of updates. The first is the addition of NML, which marks a noun-based "adjective phrase" under an NP. There are also tokenization changes from the original, such as "New York-based" becoming 4 tokens, "New York - based". These changes are described in an update to PTB.

We include an original PTB model as well, for reference.


Indonesian

Based on a constituency treebank published at GURT 2023.


Italian

We provide models for two Italian treebanks.

The default dataset is VIT, a larger dataset with more recent edits to ensure accuracy. The constituents used in that treebank are described in the original paper. There were no official train/dev/test splits for VIT, so we aligned the trees with the UD translation of the treebank and used that split when building the model.

There is also the Turin University Parallel Treebank. It is smaller than VIT, but has more human-readable annotations.


Japanese

An annotation guideline for Japanese ALT is available on the ALT homepage.


Vietnamese

Based on the 2022 version of the VLSP constituency bakeoff. A version of this model won the bakeoff.

Portuguese

The Cintil treebank can be purchased from ELRA.


References

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. "Building a Large Annotated Corpus of English: The Penn Treebank". Computational Linguistics, 19(2):313–330.

Manuela Sanguinetti and Cristina Bosco. 2014. "PartTUT: The Turin University Parallel Treebank". In Basili, Bosco, Delmonte, Moschitti, and Simi (eds.), Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, LNCS, Springer Verlag.

Rodolfo Delmonte, Antonella Bristot, and Sara Tonelli (2007). “VIT - Venice Italian Treebank: Syntactic and Quantitative Features”. In Proc. Sixth International Workshop on Treebanks and Linguistic Theories.

João Silva, António Branco, Sérgio Castro, and Ruben Reis. 2010. "Out-of-the-Box Robust Parsing of Portuguese". In Proceedings of the 9th International Conference on the Computational Processing of Portuguese (PROPOR 2010), Lecture Notes in Artificial Intelligence, 6001, Berlin, Springer, pp. 75–85.

Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. "Introducing the Asian Language Treebank (ALT)". LREC.

Ha My Linh, Nguyen Thi Minh Huyen, Ngo The Quyen, Le Tuan Thanh, Dang Tran Thai, Ngo Viet Hoang, Doan Xuan Dung, Nguyen Thi Luong, Le Van Cuong, Phan Thi Hue, and Vu Xuan Luong. 2022. "VLSP 2022 Challenge: Vietnamese Constituency Parsing". To appear in Journal of Computer Science and Cybernetics.

Ee Suan Lim, Wei Qi Leong, Thanh Ngan Nguyen, Dea Adhista, Wei Ming Kng, William Chandra Tjhi, and Ayu Purwarianti. 2023. "ICON: Building a Large-Scale Benchmark Constituency Treebank for the Indonesian Language". GURT 2023.