Advanced Usage & Client Customization

In this section, we describe how to customize the client options so that you can annotate a different language, use a different CoreNLP model, or gain finer control over how the CoreNLP client or server starts.

Overview

By default, the CoreNLP server will run the following English annotators:

tokenize,ssplit,pos,lemma,ner,depparse,coref,kbp

There are a variety of ways to customize a CoreNLP pipeline, including:

  • using a different list of annotators (e.g. tokenize,ssplit,pos)
  • processing a different language (e.g. French)
  • using custom models (e.g. my-custom-depparse.gz)
  • returning different output formats (e.g. JSON)

These customizations are achieved by specifying properties.

The first step is always to import CoreNLPClient:

from stanza.server import CoreNLPClient

When starting a CoreNLP server via Stanza, a user can choose what properties to initialize the server with. For instance, here is an example of launching a server with a different parser model that returns JSON:

CUSTOM_PROPS = {"parse.model": "edu/stanford/nlp/models/srparser/englishSR.beam.ser.gz"}

with CoreNLPClient(properties=CUSTOM_PROPS, output_format="json") as client:
    ann = client.annotate(text)  # text: the string to annotate

Or one could launch a server with CoreNLP French defaults as in this example:

with CoreNLPClient(properties="french") as client:
    ann = client.annotate(text)  # text: the string to annotate

When communicating with a CoreNLP server via Stanza, a user can send specific properties for one-time use with a single request. These request-level properties allow a dynamic NLP application to apply different pipelines depending on the input text.

For instance, one could switch between German and French pipelines:

french_text = "Emmanuel Macron est le président de la France."
german_text = "Angela Merkel ist die deutsche Bundeskanzlerin."

with CoreNLPClient() as client:
    french_ann = client.annotate(french_text, properties="fr")
    german_ann = client.annotate(german_text, properties="de")

If a user has created custom biomedical and financial models, they could switch between them based on what kind of document they are processing:

BIOMEDICAL_PROPS = {
    "depparse.model": "/path/to/biomedical-parser.gz",
    "ner.model": "/path/to/biomedical-ner.ser.gz"
}
FINANCE_PROPS = {
    "depparse.model": "/path/to/finance-parser.gz",
    "ner.model": "/path/to/finance-ner.ser.gz"
}

with CoreNLPClient() as client:
    bio_ann = client.annotate(bio_text, properties=BIOMEDICAL_PROPS)
    finance_ann = client.annotate(finance_text, properties=FINANCE_PROPS)
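The switch between property sets can be written as a small lookup. The sketch below is only an illustration: `PROPS_BY_DOMAIN` and `annotate_by_domain` are hypothetical names, the model paths are the placeholders from the example above, and `client` is assumed to be any object with an `annotate(text, properties=...)` method, such as a started CoreNLPClient.

```python
# Placeholder model paths, copied from the example above.
BIOMEDICAL_PROPS = {
    "depparse.model": "/path/to/biomedical-parser.gz",
    "ner.model": "/path/to/biomedical-ner.ser.gz",
}
FINANCE_PROPS = {
    "depparse.model": "/path/to/finance-parser.gz",
    "ner.model": "/path/to/finance-ner.ser.gz",
}

# Hypothetical mapping from a document's domain label to its request-level properties.
PROPS_BY_DOMAIN = {"biomedical": BIOMEDICAL_PROPS, "finance": FINANCE_PROPS}


def annotate_by_domain(client, text, domain):
    """Annotate text with the properties registered for its domain.

    Unknown domains fall back to whatever the server was started with.
    """
    props = PROPS_BY_DOMAIN.get(domain)
    if props is None:
        return client.annotate(text)
    return client.annotate(text, properties=props)
```

Inside a `with CoreNLPClient() as client:` block, this would be called as `annotate_by_domain(client, bio_text, "biomedical")`.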

CoreNLP Server Start Options (Pipeline)

There are three ways to specify pipeline properties when starting a CoreNLP server:

| Properties Type | Example | Description |
| --- | --- | --- |
| Stanford CoreNLP supported language | french | One of {arabic, chinese, english, french, german, spanish} (or the ISO 639-1 code); this uses the Stanford CoreNLP defaults for that language |
| Python dictionary | {'annotators': 'tokenize,ssplit,pos', 'pos.model': '/path/to/custom-model.ser.gz'} | A Python dictionary specifying the properties; the properties will be written to a temporary file |
| File path | /path/to/server.props | Path on the file system or CLASSPATH to a properties file |

Below are examples that illustrate how to use the three different types of properties:

  • Using a language name:
    with CoreNLPClient(properties='french') as client:
        ann = client.annotate(text)  # text: the string to annotate

    As introduced above, this option lets you switch quickly between languages; a default list of models is used for each language.

  • Using a Python dictionary:
    with CoreNLPClient(properties={
        'annotators': 'tokenize,ssplit,pos',
        'pos.model': '/path/to/custom-model.ser.gz'
    }) as client:
        ann = client.annotate(text)

    This option lets you override the default models used by the server by providing (model name, model path) pairs.

  • Using a properties file:
    with CoreNLPClient(properties='/path/to/server.props') as client:
        ann = client.annotate(text)

    This option gives the finest level of control over which annotators and models the server uses. For details on how to write a properties file, please see the instructions on configuring CoreNLP property files.
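To make the dictionary and file options concrete, here is a minimal sketch that writes a dictionary out as a properties file by hand. `write_props_file` is a hypothetical helper, not part of Stanza. It assumes plain ASCII keys and values with no characters that need escaping in Java properties syntax (backslashes in Windows paths, for example, would).

```python
import os
import tempfile


def write_props_file(props, path=None):
    """Write a dict as `key = value` lines and return the path of the file."""
    if path is None:
        # Create a fresh temporary file; the caller is responsible for deleting it.
        fd, path = tempfile.mkstemp(suffix=".props")
        os.close(fd)
    with open(path, "w", encoding="utf-8") as f:
        for key, value in props.items():
            f.write(f"{key} = {value}\n")
    return path


server_props = {
    "annotators": "tokenize,ssplit,pos",
    "pos.model": "/path/to/custom-model.ser.gz",  # placeholder path from the text
}
props_path = write_props_file(server_props)
```

The resulting `props_path` could then be passed as `CoreNLPClient(properties=props_path)`, the third option in the list above.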

For convenience one can also specify the list of annotators and the desired output_format in the CoreNLPClient constructor.

| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| annotators | str | "tokenize,ssplit,lemma,pos,ner,depparse" | The default list of CoreNLP annotators the server will use |
| output_format | str | "serialized" | The default output format to use for the server response, unless otherwise specified. If set to "serialized", the response will be converted to local Python objects (see usage examples here). |

The values for those two arguments will override any additional properties supplied at construction time.

with CoreNLPClient(properties='french', annotators='tokenize,ssplit,mwt,pos,ner,parse', output_format='json') as client:
    ann = client.annotate(text)  # text: the string to annotate
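The override rule can be pictured as a dictionary merge in which the explicit arguments win. This is a model of the documented precedence only, not Stanza's internal code: `effective_properties` is a hypothetical function, it handles only dictionary properties, and it assumes the output format is stored under the `outputFormat` key used elsewhere on this page.

```python
def effective_properties(properties=None, annotators=None, output_format=None):
    """Return the properties that would apply once the explicit arguments override the dict."""
    merged = dict(properties or {})
    if annotators is not None:
        merged["annotators"] = annotators  # explicit annotators win
    if output_format is not None:
        merged["outputFormat"] = output_format  # explicit output format wins
    return merged


base = {"annotators": "tokenize,ssplit,pos,parse", "outputFormat": "text"}
merged = effective_properties(base, annotators="tokenize,ssplit", output_format="json")
```

Here `merged` ends up with the annotators and output format from the explicit arguments, while `base` is left unchanged.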

CoreNLP Server Start Options (Server)

In addition to customizing the pipeline the server will run, a variety of server specific properties can be specified at server construction time.

Here we provide a list of commonly-used arguments that you can initialize your CoreNLPClient with, along with their default values and descriptions:

| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| endpoint | str | http://localhost:9000 | The host and port on which the CoreNLP server will run; change this when the default port 9000 is occupied. |
| classpath | str | None | Classpath to use for CoreNLP. None uses the classpath set by the $CORENLP_HOME environment variable; "$CLASSPATH" uses the system CLASSPATH; any other string is used as given. |
| timeout | int | 60000 | The maximum amount of time, in milliseconds, to wait for an annotation to finish before cancelling it. |
| threads | int | 5 | The number of threads to hit the server with. If, for example, the server is running on an 8 core machine, you can specify this to be 8, and the client will allow you to make 8 simultaneous requests to the server. |
| memory | str | "5G" | The memory used by the CoreNLP server process. |
| start_server | stanza.server.StartServer | FORCE_START | Whether to start the CoreNLP server when initializing the Python CoreNLPClient object. By default the CoreNLP server will be started using the provided options. Alternatively, DONT_START doesn't start a new CoreNLP server and attempts to connect to an existing server instance at endpoint; TRY_START tries to start a new server instance at the endpoint provided, but doesn't fail like FORCE_START if one is already running there. Note that this Enum is new in Stanza v1.1; previous versions only support boolean input. |
| stdout | file | sys.stdout | The standard output used by the CoreNLP server process. |
| stderr | file | sys.stderr | The standard error used by the CoreNLP server process. |
| be_quiet | bool | False | If set to False, the server process will print detailed error logs. Useful for diagnosing errors. |
| max_char_length | int | 100000 | The maximum number of characters that will be accepted and processed by the CoreNLP server in a single request. |
| preload | bool | True | Load the annotators immediately upon server start; otherwise the annotators will be lazily loaded when the first annotation request is made. |
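Because of max_char_length, a document longer than the limit has to be split before it is sent. The following is a minimal sketch of one way to do that; `split_for_server` is a hypothetical helper, and breaking at whitespace is an assumption of this sketch (a cut can still fall in the middle of a sentence, which may affect annotators that look across sentences).

```python
def split_for_server(text, max_chars=100000):
    """Split text into chunks of at most max_chars characters, preferring whitespace breaks."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut just after the last space or newline inside the window.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1
        chunks.append(text[start:end])
        start = end
    return chunks


chunks = split_for_server("aaa bbb ccc", max_chars=5)
```

Each chunk can then be passed to `client.annotate` separately; joining the chunks reproduces the original text.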

Here is a quick example that specifies a list of annotators to load, allocates 8G of memory to the server, uses plain text output format, and requests the server to print detailed error logs during annotation:

with CoreNLPClient(
    annotators='tokenize,ssplit,pos,lemma,ner',
    output_format='text',
    memory='8G',
    be_quiet=False) as client:
    ann = client.annotate(text)  # text: the string to annotate
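The threads option matters only if the calling code actually issues requests in parallel. Below is a sketch of one way to do so with the standard library. `annotate_concurrently` is a hypothetical helper, and it assumes that `client.annotate` can safely be called from several threads at once.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_concurrently(client, texts, max_workers=5):
    """Annotate several texts in parallel and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `texts` in its results.
        return list(pool.map(client.annotate, texts))
```

With a client created with `threads=8`, one would typically pass `max_workers=8` so that the number of in-flight requests matches what the client allows.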

CoreNLP Server Start Options (Advanced)

Apart from the above options, there are some very advanced settings that you may need to customize how the CoreNLP server will start in the background. They are summarized in the following table:

| Option | Description |
| --- | --- |
| server_id | ID for the server, label attached to server's shutdown key file |
| status_port | Port to server status check endpoints |
| uriContext | URI context for server |
| strict | Obey strict HTTP standards |
| ssl | If true, start server with (an insecure) SSL connection |
| key | .jks file to load if ssl is enabled |
| username | The username component of a username/password basic auth credential |
| password | The password component of a username/password basic auth credential |
| blockList | A list of IPv4 addresses to ban from using the server |

You can also find more documentation for the server's startup options on the CoreNLP Server website.

Here we highlight two common use cases for these options.

Changing server ID when using multiple CoreNLP servers on a machine

When a CoreNLP server is started, it will write a special shutdown key file to the local disk, to indicate its running status. This will create an issue when multiple servers need to be run simultaneously on a single machine, since a second server won’t be able to write and delete its own shutdown key file. This is easily solvable by giving a special server ID to the second server instance, when the client is initialized:

with CoreNLPClient(server_id='second-server-name') as client:
    ann = client.annotate(text)  # text: the string to annotate
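When several servers share a machine, each one also needs its own port, since two processes cannot listen on the same one. The sketch below generates matching endpoint and server_id pairs; `local_server_configs` and the `corenlp-server` prefix are hypothetical names chosen for illustration.

```python
def local_server_configs(count, base_port=9000, prefix="corenlp-server"):
    """Build one {endpoint, server_id} dict per server, each with a distinct port and ID."""
    return [
        {
            "endpoint": f"http://localhost:{base_port + i}",
            "server_id": f"{prefix}-{i}",
        }
        for i in range(count)
    ]


configs = local_server_configs(2)
```

Each dictionary can be unpacked into the constructor, for example `CoreNLPClient(**configs[1])`, since endpoint and server_id are both constructor options listed above.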

Protecting a CoreNLP server with a password

You can even password-protect a CoreNLP server process, so that other users on the same machine won’t be able to access or change your CoreNLP server:

with CoreNLPClient(username='myusername', password='1234') as client:

Now you’ll need to provide the same username and password when you call the annotate function of the client, so that the request can authenticate itself with the server:

ann = client.annotate(text, username='myusername', password='1234')

Easy, right?

Switching Languages

Stanza by default starts an English CoreNLP pipeline when a client is initialized. You can switch to a different language by setting a simple properties argument when the client is initialized. The following example shows how to start a client with default French models:

with CoreNLPClient(properties='french') as client:
    ann = client.annotate(text)  # text: the string to annotate

Alternatively, you can also use the ISO 639-1 code for a language:

with CoreNLPClient(properties='fr') as client:
    ann = client.annotate(text)  # text: the string to annotate

This will initialize a CoreNLPClient object with the default set of French models. If you want to further customize the models used by the CoreNLP server, please read on.

Using a CoreNLP server on a remote machine

With the endpoint option, you can even connect to a remote CoreNLP server running on a different machine:

with CoreNLPClient(endpoint='http://remote-server-address:9000') as client:
    ann = client.annotate(text)  # text: the string to annotate
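When the server is already running elsewhere, it may also make sense to tell the client not to launch its own. The sketch below combines the endpoint option with the StartServer.DONT_START value described in the server options table (Stanza v1.1 or later). `check_endpoint` and `connect_to_remote` are hypothetical helpers, and the address is the placeholder from the example above.

```python
from urllib.parse import urlparse


def check_endpoint(endpoint):
    """Return (host, port) for an http(s) endpoint, or raise ValueError."""
    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https") or not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected something like http://host:9000, got {endpoint!r}")
    return parsed.hostname, parsed.port


def connect_to_remote(endpoint):
    """Create a client for an already-running server without starting a local one."""
    # Imported lazily so that check_endpoint can be used without Stanza installed.
    from stanza.server import CoreNLPClient, StartServer

    check_endpoint(endpoint)
    return CoreNLPClient(endpoint=endpoint, start_server=StartServer.DONT_START)


host, port = check_endpoint("http://remote-server-address:9000")
```

The returned client can be used in the same way as the others on this page, for example `with connect_to_remote('http://remote-server-address:9000') as client:`.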

Dynamically Changing Properties for Each Annotation Request

Properties for the CoreNLP pipeline can be set for each individual annotation request. If properties are set for a particular request, they override the server's initialization properties for that request. This lets you change your annotation needs dynamically, without starting a new client and server from scratch.

Request-level properties can be specified with a Python dictionary or with the name of a CoreNLP supported language.

Here is an example of making a request with a custom dictionary of properties:

FRENCH_CUSTOM_PROPS = {
    'annotators': 'tokenize,ssplit,pos,parse', 'tokenize.language': 'fr',
    'pos.model': 'edu/stanford/nlp/models/pos-tagger/french/french.tagger',
    'parse.model': 'edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz',
    'outputFormat': 'text'
}

with CoreNLPClient() as client:
    ann = client.annotate(text, properties=FRENCH_CUSTOM_PROPS)

Alternatively, request-level properties can simply be a language that you want to run the CoreNLP pipeline for:

ann = client.annotate(text, properties='german')

Similarly to CoreNLPClient initialization, you can also specify the annotators and output format for CoreNLP for individual annotation requests as:

ann = client.annotate(text, properties=FRENCH_CUSTOM_PROPS, annotators='tokenize,ssplit,pos', output_format='json')
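With output_format='json', the annotation comes back as a parsed Python dictionary rather than as local document objects. The sketch below pulls word and part-of-speech pairs out of such a response. It assumes the usual CoreNLP JSON layout (a top-level `sentences` list whose items carry a `tokens` list with `word` and `pos` fields); `tokens_with_pos` is a hypothetical helper and `sample_response` is a hand-written, heavily trimmed stand-in for a real response.

```python
def tokens_with_pos(ann_json):
    """Collect (word, pos) pairs from a CoreNLP JSON annotation."""
    pairs = []
    for sentence in ann_json.get("sentences", []):
        for token in sentence.get("tokens", []):
            # pos is missing if the pos annotator was not run for this request.
            pairs.append((token["word"], token.get("pos")))
    return pairs


# Hand-written sample in the assumed layout, not real server output.
sample_response = {
    "sentences": [
        {
            "index": 0,
            "tokens": [
                {"index": 1, "word": "Bonjour", "pos": "INTJ"},
                {"index": 2, "word": "!", "pos": "PUNCT"},
            ],
        }
    ]
}
pairs = tokens_with_pos(sample_response)
```

In real use, `ann` from the request above would take the place of `sample_response`.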