Link

Tsurgeon

Table of contents


Introduction

Stanza comes with an interface to Tsurgeon, a constituency tree rewriting tool included with CoreNLP.

Here is a brief example of how to use the context window for Tsurgeon

Code sample

from stanza.models.constituency.tree_reader import read_trees, read_treebank
from stanza.server import tsurgeon

TREEBANK = """
( (IP-MAT (NP-SBJ (PRO-N Það-það))
          (BEPI er-vera)
          (ADVP (ADV eiginlega-eiginlega))
          (ADJP (NEG ekki-ekki) (ADJ-N hægt-hægur))
          (IP-INF (TO að-að) (VB lýsa-lýsa))
          (NP-OB1 (N-D tilfinningu$-tilfinning) (D-D $nni-hinn))
          (IP-INF (TO að-að) (VB fá-fá))
          (IP-INF (TO að-að) (VB taka-taka))
          (NP-OB1 (N-A þátt-þáttur))
          (PP (P í-í)
              (NP (D-D þessu-þessi)))
          (, ,-,)
          (VBPI segir-segja)
          (NP-SBJ (NPR-N Sverrir-sverrir) (NPR-N Ingi-ingi))
          (. .-.)))
"""

# expected result
#(ROOT
#  (IP-MAT
#    (NP-SBJ (PRO-N Það))
#    (BEPI er)
#    (ADVP (ADV eiginlega))
#    (ADJP (NEG ekki) (ADJ-N hægt))
#    (IP-INF (TO að) (VB lýsa))
#    (NP-OB1 (N-D tilfinningunni))
#    (IP-INF (TO að) (VB fá))
#    (IP-INF (TO að) (VB taka))
#    (NP-OB1 (N-A þátt))
#    (PP
#      (P í)
#      (NP (D-D þessu)))
#    (, ,)
#    (VBPI segir)
#    (NP-SBJ (NPR-N Sverrir) (NPR-N Ingi))
#    (. .)))


treebank = read_trees(TREEBANK)

with tsurgeon.Tsurgeon(classpath="$CLASSPATH") as tsurgeon_processor:
    form_tregex = "/^(.+)-.+$/#1%form=word !< __"
    form_tsurgeon = "relabel word /^.+$/%{form}/"

    noun_det_tregex = "/^N-/ < /^([^$]+)[$]$/#1%noun=noun $+ (/^D-/ < /^[$]([^$]+)$/#1%det=det)"
    noun_det_relabel = "relabel noun /^.+$/%{noun}%{det}/"
    noun_det_prune = "prune det"

    for tree in treebank:
        updated_tree = tsurgeon_processor.process(tree, (form_tregex, form_tsurgeon))[0]
        print("{:P}".format(updated_tree))
        updated_tree = tsurgeon_processor.process(updated_tree, (noun_det_tregex, noun_det_relabel, noun_det_prune))[0]
        print("{:P}".format(updated_tree))

In this example, we perform two operations on the tree: replace all word nodes with the form, and squish the N and D nodes with $ in them into one node. In this way, we can prepare the Icelandic treebank for use in training a Stanza constituency parser.

Details

Tsurgeon can operation on trees read from a file or text, such as in this example, or on trees produced by the constituency parser.

The context window opens a pipe to a Java executable and then uses a protobuf to communicate. Java and CoreNLP both need to be available on your system for this to work.

Further information

More information on the tregex patterns used is available on the Tregex Javadoc page

There is also more documentation on the Tsurgeon operations available

Standalone Tregex Operations

So far, we haven’t found a use for Tregex by itself without Tsurgeon from Python. If you have such a use case, please contact us via github and we will help work out an interface for Tregex. Likely candidates would be finding specific subtrees or testing whether or not a tree matches an expression.