Tregex and Tsurgeon FAQ
Is there a User Guide?
At present, no. There is:
- The information in the README-tregex.txt, README-tsurgeon.txt, and README-gui.txt files
- The information on patterns and programmatic use in the Javadocs. The TregexPattern and Tsurgeon class javadocs give relatively complete information on the pattern languages of these tools.
- The GUI has help screens, available within the graphical interface by clicking on the Help buttons, which provide similar information on Tregex and Tsurgeon syntax, as does this page.
- There are brief powerpoint tutorials for Tregex (by Galen Andrew) and Tsurgeon (by Marie-Catherine de Marneffe).
- For tregex, you may also usefully consult the user guides for tgrep and tgrep2, with which tregex is mainly compatible.
What command-line options does Tregex have?
Here are some details on the command-line options of Tregex, taken from the javadoc for the main method of TregexPattern:
Usage:
java edu.stanford.nlp.trees.tregex.TregexPattern [[-TCwfosnu] [-filter] [-h <node-name>]]* pattern filepath
Arguments:
- pattern: the tree pattern, which optionally names some set of nodes (i.e., gives it the “handle”) =name (for some arbitrary string “name”)
- filepath: the path to files with trees. If this is a directory, there will be recursive descent and the pattern will be run on all files beneath the specified directory.
Options:
- -C suppresses printing of matches, so only the number of matches is printed.
- -w causes the whole of a tree that matches to be printed.
- -f causes the filename to be printed.
- -i <filename> causes the pattern to be matched to be read from <filename> rather than the command line. Don’t specify a pattern when this option is used.
- -o specifies that each tree node can be reported only once as the root of a match (by default a node will be printed once for every way the pattern matches).
- -s causes trees to be printed all on one line (by default they are pretty printed).
- -n causes the number of the tree in which the match was found to be printed before every match.
- -u causes only the label of each matching node to be printed, not complete subtrees.
- -t causes only the yield (terminal words) of the selected node to be printed (or the yield of the whole tree, if the -w option is used).
- -encoding <charset_encoding> allows specification of the character encoding of the trees.
- -h <node-handle>: if a -h option is given, the root tree node will not be printed. Instead, for each node-handle specified, the node matched and given that handle will be printed. Multiple nodes can be printed by using the -h option multiple times on a single command line.
- -hf <headfinder-class-name> uses the specified HeadFinder class to determine headship relations.
- -hfArg <string> passes a string argument in to the HeadFinder class’s constructor. -hfArg can be used multiple times to pass in multiple arguments.
- -trf <TreeReaderFactory-class-name> uses the specified TreeReaderFactory class to read trees from files.
- -v prints every tree that contains no matches of the specified pattern, but prints no matches to the pattern.
- -x: instead of the matched subtree, prints the matched subtree’s identifying number as defined in tgrep2: a unique identifier for the subtree, of the form s:n, where s is an integer specifying the sentence number in the corpus (starting with 1), and n is an integer giving the order in which the node is encountered in a depth-first search, starting with 1 at the top node in the sentence tree.
- -extract <code> <tree-file> extracts the subtree s:n specified by code from the specified tree-file. Overrides all other behavior of tregex. Can’t specify multiple encodings etc. yet.
- -extractFile <code-file> <tree-file> extracts every subtree specified by the subtree codes in code-file, which must appear exactly one per line, from the specified tree-file. Overrides all other behavior of tregex. Can’t specify multiple encodings etc. yet.
- -filter causes this to act as a filter, reading tree input from stdin.
- -T causes all trees to be printed as processed (for debugging purposes). Otherwise only matching nodes are printed.
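The s:n codes used by -x, -extract, and -extractFile may be easier to picture with a small sketch. The Java below is only an illustration of depth-first (pre-order) numbering as described above; it does not use the Stanford classes. The Node class, the sample tree, and the choice to count terminal words as nodes are all assumptions made for the example, so check real codes against actual -x output.

```java
import java.util.ArrayList;
import java.util.List;

public class SubtreeCodeDemo {
    /** Hypothetical minimal tree node; this is not the Stanford Tree class. */
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Returns one "s:n label" line per node, numbering nodes in pre-order from 1. */
    static List<String> codes(Node root, int sentenceNumber) {
        List<String> out = new ArrayList<>();
        visit(root, sentenceNumber, out);
        return out;
    }

    /** Records the current node first, then descends into its children left to right. */
    static void visit(Node node, int sentenceNumber, List<String> out) {
        int n = out.size() + 1; // order of encounter, starting with 1 at the top node
        out.add(sentenceNumber + ":" + n + " " + node.label);
        for (Node child : node.children) {
            visit(child, sentenceNumber, out);
        }
    }

    /** Assumed sample tree: (S (NP (NNP Kim)) (VP (VBZ has) (VP (VBN left)))) */
    static Node sample() {
        return new Node("S",
            new Node("NP", new Node("NNP", new Node("Kim"))),
            new Node("VP",
                new Node("VBZ", new Node("has")),
                new Node("VP", new Node("VBN", new Node("left")))));
    }

    public static void main(String[] args) {
        // Treat the sample tree as sentence 1 of the corpus.
        for (String line : codes(sample(), 1)) {
            System.out.println(line);
        }
    }
}
```

Under these assumptions the top S node of the first sentence gets the code 1:1 and the VBZ node gets 1:6.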
Tsurgeon has stopped responding/gone into an infinite loop. Is that a bug?
Probably not (though you never know). Normally what this means is that you have written an infinite loop in your tree surgery script.
Tsurgeon script matching and rewriting is applied repeatedly until no further matching and rewriting is possible. This is the behavior that you want in complex cases: large trees may match several patterns in a complex Tsurgeon script, and earlier edits may enable later edits. Unfortunately, it also makes it very easy to write infinite loops in Tsurgeon, and this can confuse beginners. Here’s a very simple example of how you can produce an infinite loop (you can come up with complex examples by yourself!):
/^VB/=haveaux < have|has|having|had
relabel haveaux /^(.*)$/$1-HAVE/
The first poor haveaux node that matches the pattern, say (VBZ has), will get repeatedly relabeled as:
VBZ-HAVE
VBZ-HAVE-HAVE
VBZ-HAVE-HAVE-HAVE
VBZ-HAVE-HAVE-HAVE-HAVE
...
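And you can see where that is heading. It is essential to write edits so that they will not apply to their own output forever. For this example, things can easily be fixed by anchoring the pattern at both ends, so that a relabeled node such as VBZ-HAVE no longer matches it, as in the following changed script: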
/^VB.?$/=haveaux < have|has|having|had
relabel haveaux /^(.*)$/$1-HAVE/
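To see why the second script terminates, it can help to simulate the "apply until nothing matches" behavior on a single label. The sketch below uses plain java.util.regex, not Tsurgeon itself. The class name, the method, and the MAX_PASSES cap are hypothetical and exist only so the demonstration halts; it also assumes that a regex node description matches when the regex is found anywhere in the label, which is consistent with /^VB/ matching VBZ-HAVE above.

```java
import java.util.regex.Pattern;

public class RelabelLoopDemo {
    // Hypothetical cap so the demo stops even when the script would not.
    static final int MAX_PASSES = 5;

    /**
     * Appends "-HAVE" to the label for as long as the label still matches
     * matchRegex, giving up after MAX_PASSES rewrites. Returns the final label.
     */
    static String relabelUntilStable(String label, String matchRegex) {
        Pattern nodePattern = Pattern.compile(matchRegex);
        int passes = 0;
        while (passes < MAX_PASSES && nodePattern.matcher(label).find()) {
            // Same rewrite as: relabel haveaux /^(.*)$/$1-HAVE/
            label = label.replaceAll("^(.*)$", "$1-HAVE");
            passes++;
        }
        return label;
    }

    public static void main(String[] args) {
        // Unanchored pattern: the relabeled node still matches, so only the cap stops it.
        System.out.println(relabelUntilStable("VBZ", "^VB"));
        // Anchored pattern: VBZ-HAVE no longer matches, so one rewrite is enough.
        System.out.println(relabelUntilStable("VBZ", "^VB.?$"));
    }
}
```

With the unanchored pattern the label keeps growing until the cap is reached; with the anchored pattern the result is VBZ-HAVE after a single rewrite.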
For other questions, feedback, extensions, or bugfixes, please join and post to the parser-user@lists.stanford.edu mailing list. Or you can send email to parser-support@lists.stanford.edu.