Client Regex Usage

Using TokensRegex, Semgrex and Tregex with the client

Separate client functions are provided to run TokensRegex, Semgrex, and Tregex pattern matching with the CoreNLP client. The following example shows how to start a new client and use TokensRegex to find patterns in text:

text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."
with CoreNLPClient(
        annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse'],
        timeout=30000,
        memory='16G') as client:

    # Use tokensregex patterns to find who wrote a sentence.
    pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/'
    matches = client.tokensregex(text, pattern)
    # sentences contains a list with matches for each sentence.
    print(len(matches["sentences"])) # prints: 3
    # length tells you whether or not there are any matches in this sentence
    print(matches["sentences"][1]["length"]) # prints: 1
    # You can access matches like most regex groups.
    print(matches["sentences"][1]["0"]["text"]) # prints: "Chris wrote a simple sentence"
    print(matches["sentences"][1]["0"]["1"]["text"]) # prints: "Chris"

Aside from surface-level patterns, the CoreNLPClient also allows you to use CoreNLP to match patterns against syntactic structures. Here is an example that shows how to use Semgrex and Tregex on the same piece of text:

text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."
with CoreNLPClient(
        annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse'],
        timeout=30000,
        memory='16G') as client:

    # Use semgrex patterns to directly find who wrote what.
    pattern = '{word:wrote} >nsubj {}=subject >obj {}=object'
    matches = client.semgrex(text, pattern)
    # sentences contains a list with matches for each sentence.
    print(len(matches["sentences"])) # prints: 3
    # length tells you whether or not there are any matches in this sentence
    print(matches["sentences"][1]["length"]) # prints: 1
    # You can access matches like most regex groups.
    print(matches["sentences"][1]["0"]["text"]) # prints: "wrote"
    print(matches["sentences"][1]["0"]["$subject"]["text"]) # prints: "Chris"
    print(matches["sentences"][1]["0"]["$object"]["text"]) # prints: "sentence"

    # Tregex example
    pattern = 'NP'
    matches = client.tregex(text, pattern)
    # You can access matches similarly
    print(matches['sentences'][1]['1']['match']) # prints: "(NP (DT a) (JJ simple) (NN sentence))\n"
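
Each per-sentence Tregex entry maps stringified match indices to dictionaries holding the matched subtree. Assuming those index keys are the only entries, all NP matches can be listed like this (a sketch continuing the example above):

# Print every NP subtree found in each sentence.
for i, sentence in enumerate(matches["sentences"]):
    for key in sorted(sentence, key=int):
        print(i, key, sentence[key]["match"].strip())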

Using Semgrex on depparse

New in v1.1

Note that each of the previous methods relies on CoreNLP for the language processing. It is also possible to use Semgrex to search dependencies produced by the depparse module of stanza.

For example:

import stanza
import stanza.server.semgrex as semgrex

# stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

doc = nlp("Banning opal removed all artifact decks from the meta.  I miss playing lantern.")
semgrex_results = semgrex.process_doc(doc,
                                      "{pos:NN}=object <obl {}=action",
                                      "{cpos:NOUN}=thing <obj {cpos:VERB}=action")
print(semgrex_results)

This runs CoreNLP's Semgrex by launching a new Java process. Although a single launch is inexpensive, the cost adds up over many calls, so it is best to group all of the Semgrex patterns to be run on a single doc into a single function call, as the two-pattern example above does.

The result will be a CoreNLP_pb2.SemgrexResponse protobuf object, which contains nested results: one entry per sentence, each with one entry per Semgrex query. For this snippet, the result will look like:

result {
  result {
    match {
      index: 9
      node {
        name: "action"
        index: 3
      }
      node {
        name: "object"
        index: 9
      }
    }
  }
  result {
    match {
      index: 6
      node {
        name: "action"
        index: 3
      }
      node {
        name: "thing"
        index: 6
      }
    }
    match {
      index: 2
      node {
        name: "action"
        index: 1
      }
      node {
        name: "thing"
        index: 2
      }
    }
  }
}
result {
  result {
  }
  result {
    match {
      index: 4
      node {
        name: "action"
        index: 3
      }
      node {
        name: "thing"
        index: 4
      }
    }
  }
}
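
The protobuf can be traversed directly using the field names visible in the printout: the outer repeated result field has one entry per sentence, each containing an inner result per pattern, and each match records the matched word's index along with its named nodes. A sketch of walking the response above:

# Walk the SemgrexResponse: sentences -> patterns -> matches.
for sent_idx, sentence in enumerate(semgrex_results.result):
    for pat_idx, pattern in enumerate(sentence.result):
        for match in pattern.match:
            nodes = {node.name: node.index for node in match.node}
            print(sent_idx, pat_idx, match.index, nodes)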

Semgrex as a context

New in v1.2.1

In the next release, it will be possible to use Semgrex as a Python context manager. This allows multiple calls per Java process, which should reduce overhead in situations where there are lots of small queries.

from stanza.server.semgrex import Semgrex

# classpath works as in the UniversalEnhancer example below
with Semgrex(classpath="$CLASSPATH") as sem:
    result = sem.process(doc, patterns)
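
As a fuller sketch of the intended usage (assuming the Semgrex context class is exported from stanza.server.semgrex alongside process_doc), one Java process can serve many small queries:

import stanza
from stanza.server.semgrex import Semgrex

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
texts = ["Banning opal removed all artifact decks from the meta.",
         "I miss playing lantern."]

# The Java process is launched once and reused for every call.
with Semgrex(classpath="$CLASSPATH") as sem:
    for text in texts:
        doc = nlp(text)
        print(sem.process(doc, "{cpos:NOUN}=thing <obj {cpos:VERB}=action"))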

TokensRegex

New in v1.2.1

Similar to the Semgrex interface, there is a TokensRegex interface which allows running TokensRegex patterns on documents processed with stanza. For example:

import stanza
from stanza.server.tokensregex import process_doc

nlp = stanza.Pipeline('en',
                      processors='tokenize')

doc = nlp('Uro ruined modern.  Fortunately, Wotc banned him')
print(process_doc(doc, "him", "ruined"))

The result gives the locations of "him" and "ruined":

match {
  match {
    sentence: 1
    match {
      text: "him"
      begin: 4
      end: 5
    }
  }
}
match {
  match {
    sentence: 0
    match {
      text: "ruined"
      begin: 1
      end: 2
    }
  }
}
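
The response mirrors the printed structure: one outer entry per pattern, each holding that pattern's matches with the sentence index and the token begin/end offsets. A sketch of reading it, capturing the response instead of printing it:

# Walk the TokensRegex response: patterns -> matches -> spans.
response = process_doc(doc, "him", "ruined")
for pat_idx, pattern in enumerate(response.match):
    for m in pattern.match:
        print(pat_idx, m.sentence, m.match.text, m.match.begin, m.match.end)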

Universal Enhancements

New in v1.2.1

Currently, the depparse annotator produces only basic dependencies. The CoreNLP package includes a tool to convert basic UD to enhanced UD, and we now include a way to communicate with that tool.

import stanza
from stanza.server.ud_enhancer import UniversalEnhancer

nlp = stanza.Pipeline('en',
                      processors='tokenize,pos,lemma,depparse')

with UniversalEnhancer(language="en", classpath="$CLASSPATH") as enhancer:
    doc = nlp("This is the car that I bought")
    result = enhancer.process(doc)
    print(result.sentence[0].enhancedDependencies)

You can see that there is an “extra” dependency in the output:

...
edge {
  source: 4
  target: 7
  dep: "acl:relcl"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 7
  target: 4
  dep: "obl:relobj"
  isExtra: true
  sourceCopy: 0
  targetCopy: 0
  language: Any
}
edge {
  source: 7
  target: 6
  dep: "nsubj"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
...

To use this, you need to supply either the language or a pronouns_pattern which describes how to identify relative pronouns in the language of interest. For example, the pattern for English is "(?i:that|what|which|who|whom|whose)". Note that most languages are not yet supported by name, but we are more than happy to receive contributions for how to find relative pronouns in other languages.
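
As an illustration, passing the English pattern explicitly instead of the language name would look like this (a sketch; the pipeline setup is the same as above):

# Equivalent to language="en", supplying the relative-pronoun pattern directly.
with UniversalEnhancer(pronouns_pattern="(?i:that|what|which|who|whom|whose)",
                       classpath="$CLASSPATH") as enhancer:
    result = enhancer.process(doc)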