Semgrex and Ssurgeon
Here we present a complete example of using Semgrex and Ssurgeon to fix some errors in a UD treebank.
In particular, there are two classes of edits we want to make to UD_English-LinES. There are cases where `he's` is labeled as a `to be` verb, even though it is definitely `he has`. Also, other English UD treebanks are marked with MWT on words such as `won't`, and we can do that automatically with Ssurgeon.
To use this, you will need Stanza >= 1.5.1 and CoreNLP >= 4.5.5. (Certain components from the examples below were present in earlier versions, but the tools have been upgraded since their initial release.)
Example search
To search for examples of the first class, we try a couple of things. This error often happens when there are two consecutive `to be` verbs, so we search for exactly that:
python3 stanza/server/semgrex.py --input_file UD_English-LinES/en_lines-ud-train.conllu "{word:/'s/;lemma:be} . {lemma:be}" --matches_only
Here, `--matches_only` makes it print only the matching sentences. Otherwise, Semgrex would print out the whole treebank, even though there are only a few matching sentences.
This prints out sentences such as
# text = He's been waiting for you since eight o'clock.
# semgrex pattern |{word:/'s/;lemma:be} . {lemma:be}| matched at 2:'s
1 He he PRON PERS-P3SG-NOM Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 4 nsubj _ SpaceAfter=No
2 's be AUX PRES Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 aux _ _
3 been be AUX PERF Tense=Past|VerbForm=Part 4 aux _ _
4 waiting wait VERB ING Tense=Pres|VerbForm=Part 0 root _ _
# sent_id = en_lines-ud-train-doc6-2341
# text = Rebecca's been to the Sputnik and she says it's terrific now.
# semgrex pattern |{word:/'s/;lemma:be} . {lemma:be}| matched at 2:'s
1 Rebecca Rebecca PROPN SG-NOM Number=Sing 3 nsubj _ SpaceAfter=No
2 's be AUX PRES-AUX Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 aux _ _
3 been be AUX PERF Tense=Past|VerbForm=Part 0 root _ _
4 to to ADP _ _ 6 case _ _
5 the the DET DEF Definite=Def|PronType=Art 6 det _ _
6 Sputnik Sputnik PROPN SG-NOM Number=Sing 3 obl _ _
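As a cross-check that does not need a CoreNLP server, the same "immediately followed by" condition can be sketched in plain Python. This is only an illustration of what the pattern asks for, not how Semgrex evaluates it; the helper names `read_conllu` and `consecutive_be` are hypothetical, and the sample rows abbreviate the feature column to `_`.

```python
def read_conllu(text):
    """Yield sentences as lists of (id, form, lemma, upos, head) tuples."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip MWT ranges such as 1-2 and empty nodes such as 1.1
        if not cols[0].isdigit():
            continue
        sentence.append((int(cols[0]), cols[1], cols[2], cols[3], int(cols[6])))
    if sentence:
        yield sentence


def consecutive_be(sentence):
    """Return ids of 's/be tokens immediately followed by another token with lemma be."""
    return [cur[0] for cur, nxt in zip(sentence, sentence[1:])
            if cur[1] == "'s" and cur[2] == "be" and nxt[2] == "be"]


# First four tokens of the "He's been waiting" sentence, features abbreviated to "_"
SAMPLE = "\n".join("\t".join(row) for row in [
    ("1", "He", "he", "PRON", "PERS-P3SG-NOM", "_", "4", "nsubj", "_", "SpaceAfter=No"),
    ("2", "'s", "be", "AUX", "PRES", "_", "4", "aux", "_", "_"),
    ("3", "been", "be", "AUX", "PERF", "_", "4", "aux", "_", "_"),
    ("4", "waiting", "wait", "VERB", "ING", "_", "0", "root", "_", "_"),
])

for sent in read_conllu(SAMPLE):
    print(consecutive_be(sent))  # prints [2]
```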
There are cases where there is an intervening word, though, and this query does not find those. We can observe that there are two cases we might care about. In the first sentence, `'s` and `been` are both dependents of `waiting`. We can try to find that as follows:
python3 stanza/server/semgrex.py --input_file UD_English-LinES/en_lines-ud-train.conllu "{} > {word:/'s/;lemma:be}=has > {lemma:be}" --matches_only
Note that the `'s` word is labeled as a named node, `has`, in this query. However, this query actually has an error. Matching nodes is not exclusive, meaning that the second `> {lemma:be}` can always match the same node that matched `=has`. We can fix that in a variety of ways. One such solution is to name both dependents, then enforce that they are different by requiring one to precede the other:
python3 stanza/server/semgrex.py --input_file UD_English-LinES/en_lines-ud-train.conllu "{} > {word:/'s/;lemma:be}=has > {lemma:be}=other : {}=has .. {}=other" --matches_only
The problem with this query is that it produces some spurious matches. For example, in the dev set:
# sent_id = en_lines-ud-dev-doc6-3874
# text = If there's a father of the state, it's got to be him or no one.
# semgrex pattern |{} > {word:/'s/;lemma:be}=has > {lemma:be}=other : {}=has .. {}=other| matched at 15:him has=3:'s other=11:'s
# semgrex pattern |{} > {word:/'s/;lemma:be}=has > {lemma:be}=other : {}=has .. {}=other| matched at 15:him has=3:'s other=14:be
# semgrex pattern |{} > {word:/'s/;lemma:be}=has > {lemma:be}=other : {}=has .. {}=other| matched at 15:him has=11:'s other=14:be
In this sentence, we actually would want to match `it has got to be ...`, so words `11` and `12`, not `11` and `14`. The tag on `12` was `AUX`, which we will use later.
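The three matches reported for the dev-set sentence can be reproduced with a small enumeration, which also shows why the ordering constraint is needed. This is a plain-Python sketch of the matching logic, not Semgrex itself. The only data used are the three `be` dependents of `15:him` named in the match lines above.

```python
from itertools import product

# Dependents of 15:him with lemma "be", as (id, word) pairs from the match lines above
children = [(3, "'s"), (11, "'s"), (14, "be")]

# =has must be an 's token; =other may be any dependent with lemma be
has_candidates = [idx for idx, word in children if word == "'s"]
other_candidates = [idx for idx, _ in children]

# Without a constraint, the same node can fill both roles, e.g. (3, 3)
unconstrained = list(product(has_candidates, other_candidates))

# Requiring has to precede other removes self-matches and reversed pairs
ordered = [(h, o) for h, o in unconstrained if h < o]

print(unconstrained)  # six pairs, including (3, 3) and (11, 11)
print(ordered)        # [(3, 11), (3, 14), (11, 14)]
```

The ordered pairs are exactly the three matches printed above, none of which is the `11`/`12` pair we actually want.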
Furthermore, there is this sentence:
# text = Edward Shinza's one of the few who did his stretch and got his head split open ...
# semgrex pattern |{} > {word:/'s/;lemma:be}=has > {lemma:be}=other : {}=has .. {}=other| matched at 4:one has=3:'s other=29:'s
In the test set, though, we find this sentence, which is exactly what we were looking for: a pair separated by an intervening word. We note that the tag for that intervening word is `ADV`:
# sent_id = en_lines-ud-test-doc6-4988
# text = He's never been sent anywhere where there was anything left to do, he said.
# semgrex pattern |{} > {word:/'s/;lemma:be}=has > {lemma:be}=other : {}=has .. {}=other| matched at 5:sent has=2:'s other=4:been
1 He he PRON PERS-P3SG-NOM Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 5 nsubj:pass _ SpaceAfter=No
2 's be AUX PRES-AUX Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 5 aux:pass _ _
3 never never ADV NEG _ 5 advmod _ _
4 been be AUX PERF Tense=Past|VerbForm=Part 5 aux:pass _ _
5 sent send VERB PASS Tense=Past|VerbForm=Part|Voice=Pass 16 ccomp _ _
So, finally, let us try to express the relation we want in plain English:
- A node with `'s` and lemma `be`…
- which is followed by an `AUX`…
- with no words other than `ADV` between them…
- and they are both children of the same parent
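The four conditions above can be written out directly as a plain-Python check over one sentence. This is a sketch for sanity-checking results, not a substitute for the Semgrex patterns; the function name `find_has_candidates` and the dict layout are assumptions made for this example. The sample tokens are the first five words of the test-set sentence shown earlier.

```python
def find_has_candidates(sentence):
    """Return (has_id, aux_id) pairs satisfying the four conditions.

    sentence: list of dicts with id, form, lemma, upos, head keys, in word order.
    """
    hits = []
    for i, tok in enumerate(sentence):
        # Condition 1: the word is 's with lemma be
        if tok["form"] != "'s" or tok["lemma"] != "be":
            continue
        for nxt in sentence[i + 1:]:
            # Condition 3: only ADV may sit between the two words
            if nxt["upos"] == "ADV":
                continue
            # Conditions 2 and 4: the next non-ADV word is an AUX with the same parent
            if nxt["upos"] == "AUX" and nxt["head"] == tok["head"]:
                hits.append((tok["id"], nxt["id"]))
            break  # the first non-ADV word decides either way
    return hits


sentence = [
    {"id": 1, "form": "He", "lemma": "he", "upos": "PRON", "head": 5},
    {"id": 2, "form": "'s", "lemma": "be", "upos": "AUX", "head": 5},
    {"id": 3, "form": "never", "lemma": "never", "upos": "ADV", "head": 5},
    {"id": 4, "form": "been", "lemma": "be", "upos": "AUX", "head": 5},
    {"id": 5, "form": "sent", "lemma": "send", "upos": "VERB", "head": 16},
]

print(find_has_candidates(sentence))  # prints [(2, 4)]
```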
First we express that the immediately following word is any `AUX`, not just `be`, and indeed all of the matches for this are correct:
python3 stanza/server/semgrex.py --input_file UD_English-LinES/en_lines-ud-train.conllu "{} > {word:/'s/;lemma:be}=has > {cpos:AUX}=other : {}=has . {}=other" --matches_only
To only allow `ADV` in between, we can first try this pattern:
"{} > {word:/'s/;lemma:be}=has > {cpos:AUX}=other : {}=has .. ({cpos:ADV} .. {}=other)"
There is a theoretical problem here, which is that this pattern only asserts a positive match: it requires an `ADV` somewhere between `has` and `other`, and says nothing about any other words in between. As it turns out, we are lucky in this particular treebank, and there are no spurious sentences that match. If there were, we could eliminate them with a negative regex, such as
{}=has !.. ({cpos:/((?!ADV).*)/} .. {}=other)
Up until now, we have been doing everything with shell commands, so one would either need to escape shell metacharacters such as `!` or simply use the `--semgrex_file` flag.
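The negative lookahead in that regex can be checked on its own with Python's `re` module. This is a sketch under the assumption that Semgrex applies the regex to the whole attribute value, which `re.fullmatch` mirrors here. The second half uses the standard library's `shlex.quote` to show one way of quoting the fragment so that a shell leaves the `!` alone.

```python
import re
import shlex

# The regex from the negative pattern above: any tag that does not start with ADV
NOT_ADV = re.compile(r"((?!ADV).*)")

for tag in ["ADV", "AUX", "VERB", "ADP"]:
    # Only ADV is rejected; the other tags match
    print(tag, bool(NOT_ADV.fullmatch(tag)))

# The fragment contains no single quotes, so shlex.quote wraps it in single quotes
fragment = "{}=has !.. ({cpos:/((?!ADV).*)/} .. {}=other)"
print(shlex.quote(fragment))
```

A full pattern that also contains `'s` needs more involved quoting, which is one more reason to prefer a pattern file.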
In the second sentence from our original results, we found:
# text = Rebecca's been to the Sputnik and she says it's terrific now.
Here, `'s` is a dependent of `been`. Hypothetically, the sentence could be
# text = Rebecca's already been to the Sputnik and she says it's terrific now.
Since `'s` is a dependent of `been`, it is much easier to find:
python3 stanza/server/semgrex.py --input_file UD_English-LinES/en_lines-ud-train.conllu "{word:/'s/;lemma:be}=has < {cpos:AUX}" --matches_only
This picks up one more sentence:
# text = He's just been to Denmark or somewhere because his mother died.
To summarize:
python3 stanza/server/semgrex.py --input_file UD_English-LinES/en_lines-ud-train.conllu "{} > {word:/'s/;lemma:be}=has > {cpos:AUX}=other : {}=has . {}=other" --matches_only
python3 stanza/server/semgrex.py --input_file UD_English-LinES/en_lines-ud-train.conllu "{} > {word:/'s/;lemma:be}=has > {cpos:AUX}=other : {}=has .. ({cpos:ADV} .. {}=other)" --matches_only
python3 stanza/server/semgrex.py --input_file UD_English-LinES/en_lines-ud-train.conllu "{word:/'s/;lemma:be}=has < {cpos:AUX}" --matches_only
and apply to each of train, dev, and test.
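Running three patterns over three files is nine commands, which can be generated rather than typed. The sketch below only builds and prints the commands. It assumes the dev and test files follow the same naming as the train file (`en_lines-ud-dev.conllu`, `en_lines-ud-test.conllu`), and `build_commands` is a hypothetical helper name.

```python
import shlex

PATTERNS = [
    "{} > {word:/'s/;lemma:be}=has > {cpos:AUX}=other : {}=has . {}=other",
    "{} > {word:/'s/;lemma:be}=has > {cpos:AUX}=other : {}=has .. ({cpos:ADV} .. {}=other)",
    "{word:/'s/;lemma:be}=has < {cpos:AUX}",
]
SPLITS = ["train", "dev", "test"]


def build_commands(treebank_dir="UD_English-LinES"):
    """Return one argument list per (split, pattern) combination."""
    commands = []
    for split in SPLITS:
        # Assumed file naming, by analogy with the train file used above
        input_file = f"{treebank_dir}/en_lines-ud-{split}.conllu"
        for pattern in PATTERNS:
            commands.append(["python3", "stanza/server/semgrex.py",
                             "--input_file", input_file, pattern, "--matches_only"])
    return commands


for command in build_commands():
    # shlex.join shows a shell-safe rendering of each argument list
    print(shlex.join(command))
```

Passing each list to `subprocess.run(command, check=True)` would execute it without going through a shell, so the pattern needs no escaping.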
Example Ssurgeon edit
There is also a Python interface to Ssurgeon, which uses Semgrex to search for patterns and then edits the matching trees based on an edit script.
The edit to change the lemma of a word is really quite simple:
editNode -node <name> -lemma <lemma>
so in this case, we do
python3 stanza/server/ssurgeon.py --input_file UD_English-LinES/en_lines-ud-train.conllu --semgrex "{word:/'s/;lemma:be}=has < {cpos:AUX}" "editNode -node has -lemma have" --output_file en_lines-ud_train.conllu
We can do an entire directory at once with the `--input_dir` and `--output_dir` flags:
python3 stanza/server/ssurgeon.py --input_dir UD_English-LinES --semgrex "{word:/'s/;lemma:be}=has < {cpos:AUX}" "editNode -node has -lemma have" --output_dir remapped
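After an edit like this, it can be reassuring to list exactly which lemmas changed. The following is a minimal sketch that compares two CoNLL-U strings line by line. It assumes the edited output keeps the same line layout as the input, which may not hold for every edit, and `changed_lemmas` is a hypothetical helper name. The sample rows come from the `Rebecca's been` sentence with features abbreviated to `_`.

```python
def changed_lemmas(before, after):
    """Return (id, form, old_lemma, new_lemma) for each token whose lemma differs."""
    changes = []
    for old_line, new_line in zip(before.splitlines(), after.splitlines()):
        old_cols = old_line.split("\t")
        new_cols = new_line.split("\t")
        # Token lines have exactly ten tab-separated columns; skip everything else
        if len(old_cols) != 10 or len(new_cols) != 10:
            continue
        if old_cols[2] != new_cols[2]:
            changes.append((old_cols[0], old_cols[1], old_cols[2], new_cols[2]))
    return changes


def make_rows(lemma_of_s):
    """Build a three-token sample with the given lemma on the 's token."""
    rows = [
        ("1", "Rebecca", "Rebecca", "PROPN", "SG-NOM", "_", "3", "nsubj", "_", "SpaceAfter=No"),
        ("2", "'s", lemma_of_s, "AUX", "PRES-AUX", "_", "3", "aux", "_", "_"),
        ("3", "been", "be", "AUX", "PERF", "_", "0", "root", "_", "_"),
    ]
    return "\n".join("\t".join(row) for row in rows)


print(changed_lemmas(make_rows("be"), make_rows("have")))
# prints [('2', "'s", 'be', 'have')]
```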
There is also the ability to provide an entire edit script, so that it is not necessary to run each of the three semgrex expressions listed above separately. For example, this script, saved as `has_edit.txt`:
{word:/'s/;lemma:be}=has < {cpos:AUX}
editNode -node has -lemma have
{} > {word:/'s/;lemma:be}=has > {cpos:AUX}=other : {}=has .. ({cpos:ADV} .. {}=other)
editNode -node has -lemma have
{} > {word:/'s/;lemma:be}=has > {cpos:AUX}=other : {}=has . {}=other
editNode -node has -lemma have
will edit all of the `'s` we found above at once:
python3 stanza/server/ssurgeon.py --input_dir UD_English-LinES --edit_file has_edit.txt --output_dir remapped
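Since the three blocks share one edit, the script file can also be generated. This sketch writes the same six lines shown above, alternating pattern and edit. It copies that layout as displayed and makes no claim about other layouts the edit-file parser may accept; `write_edit_script` is a hypothetical helper name.

```python
from pathlib import Path
import tempfile

EDIT = "editNode -node has -lemma have"
PATTERNS = [
    "{word:/'s/;lemma:be}=has < {cpos:AUX}",
    "{} > {word:/'s/;lemma:be}=has > {cpos:AUX}=other : {}=has .. ({cpos:ADV} .. {}=other)",
    "{} > {word:/'s/;lemma:be}=has > {cpos:AUX}=other : {}=has . {}=other",
]


def write_edit_script(path):
    """Write alternating pattern and edit lines; return the lines written."""
    lines = []
    for pattern in PATTERNS:
        lines.append(pattern)
        lines.append(EDIT)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Written to a temporary directory here; in practice the path would be has_edit.txt
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "has_edit.txt"
    write_edit_script(script)
    print(script.read_text(encoding="utf-8"))
```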
Combining MWT
There is a wide variety of Ssurgeon edits possible. Another is `combineMWT`, which marks some nodes as MWT. For example, the following edit script will find many of the MWTs and mark them:
{}=first . {word:/'s|n't|'ll|'ve/}=second
combineMWT -node first -node second
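To see which word pairs the pattern above selects, the regex for the second word can be tried in plain Python. This sketch only mimics the selection step, not the `combineMWT` edit itself, and assumes the regex must match the whole word. The token lists are hypothetical examples and `mwt_pairs` is a hypothetical helper name.

```python
import re

# Same alternation as the second node in the pattern above
SECOND = re.compile(r"'s|n't|'ll|'ve")


def mwt_pairs(words):
    """Return (index_of_first_word, combined_text) for each selected adjacent pair."""
    return [(i, words[i] + words[i + 1])
            for i in range(len(words) - 1)
            if SECOND.fullmatch(words[i + 1])]


print(mwt_pairs(["I", "wo", "n't", "go"]))          # prints [(1, "won't")]
print(mwt_pairs(["He", "'s", "been", "waiting"]))   # prints [(0, "He's")]
```

Because the selection looks only at the word form, every `'s` is picked up regardless of its role in the sentence.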
More Information
More information on Semgrex patterns is available on its javadoc page.
More information on Ssurgeon is available on its javadoc page.
A writeup of Semgrex and Ssurgeon was published at GURT 2023.