Advanced Usage & Client Customization

In this section, we introduce how to customize client options so that you can annotate text in a different language, use different CoreNLP models, or gain finer control over how the CoreNLP client or server starts.

Switching Language

Stanza by default starts an English CoreNLP pipeline when a client is initialized. You can switch to a different language by setting the properties argument at initialization. The following example shows how to start a client with default French models:

from stanza.server import CoreNLPClient

with CoreNLPClient(properties='french') as client:
    ann = client.annotate(text)  # make annotation requests with the French pipeline

Alternatively, you can also use the ISO 639-1 code for a language:

with CoreNLPClient(properties='fr') as client:
    ann = client.annotate(text)

This will initialize a CoreNLPClient object with the default set of French models. If you want to further customize the models used by the CoreNLP server, please read on.
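For example, here is a minimal end-to-end sketch (the example sentence is illustrative) that annotates a French sentence with the default French models and prints each token with its part-of-speech tag. By default the response is a protobuf Document, whose sentences and tokens can be iterated directly:

from stanza.server import CoreNLPClient

text = "Le chat dort sur le canapé."  # illustrative example sentence

with CoreNLPClient(properties='french', annotators='tokenize,ssplit,pos') as client:
    ann = client.annotate(text)
    # ann is a protobuf Document; walk its sentences and tokens
    for sentence in ann.sentence:
        for token in sentence.token:
            print(token.word, token.pos)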

Using Customized Models by Setting Client Properties

Without further customization, a background CoreNLP server will use a default list of models for a language. This usually works very well out of the box. However, sometimes it becomes useful to customize the models used by the CoreNLP server. For example, you might want to use a dictionary-based NER model instead of a statistical one, or you might want to switch to the faster shift-reduce parser instead of the default PCFG parser.

Similar to switching languages, the models used by the server can also be set via the properties argument when initializing the CoreNLPClient object. This argument accepts three types of values:

| Properties Type | Example | Description |
| --- | --- | --- |
| Stanford CoreNLP supported language | french | One of {arabic, chinese, english, french, german, spanish}; this will use Stanford CoreNLP defaults for that language |
| Python dictionary | {'annotators': 'tokenize,ssplit,pos', 'pos.model': '/path/to/custom-model.ser.gz'} | A Python dictionary specifying the properties; the properties will be written to a temporary file |
| File path | /path/to/server.props | Path on the file system or CLASSPATH to a properties file |

Below are examples that illustrate how to use the three different types of properties:

  • Using a language name:
    with CoreNLPClient(properties='french') as client:
        ann = client.annotate(text)


    As introduced above, this option allows a quick switch between languages; a default list of models will be used for each language.

  • Using a Python dictionary:
    with CoreNLPClient(properties={
          'annotators': 'tokenize,ssplit,pos',
          'pos.model': '/path/to/custom-model.ser.gz'
      }) as client:
        ann = client.annotate(text)  # annotation uses the custom POS model


    This option allows you to override the default models used by the server by providing (property name, property value) pairs, as with the custom POS model above.

  • Using a properties file:
    with CoreNLPClient(properties='/path/to/server.props') as client:
        ann = client.annotate(text)


    This option allows the finest level of control over which annotators and models will be used in the server. For details on how to write such a file, please see the instructions on configuring CoreNLP properties files; a small sketch is also shown below.
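For reference, server.props is a standard Java properties file with one property per line. A minimal sketch, with purely illustrative values:

# server.props -- illustrative values only
annotators = tokenize,ssplit,pos,lemma,ner
pos.model = /path/to/custom-model.ser.gz
ner.useSUTime = false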

Commonly-used CoreNLP Client Options

Using customized properties is the most common way to customize your CoreNLP server. However, sometimes you may want even more control over different aspects of the client-server setup, such as the port the client uses to communicate with the server, the output format of the server's responses, or the amount of memory allocated to the server.

Here we provide a list of commonly-used arguments that you can initialize your CoreNLPClient with, along with their default values and descriptions:

| Option name | Type | Default | Description |
| --- | --- | --- | --- |
| annotators | str | "tokenize,ssplit,lemma,pos,ner,depparse" | The default list of CoreNLP annotators the server will use |
| properties | - | None | See the "Setting Client Properties" section above |
| endpoint | str | http://localhost:9000 | The host and port that the CoreNLP server will run on; change this if the default port 9000 is occupied |
| classpath | str | None | Classpath to use for CoreNLP. None means using the classpath as set by the $CORENLP_HOME environment variable; "$CLASSPATH" means to use the system CLASSPATH; otherwise, the given string is used |
| timeout | int | 15000 | The maximum amount of time, in milliseconds, to wait for an annotation to finish before cancelling it |
| threads | int | 5 | The number of threads to hit the server with. If, for example, the server is running on an 8-core machine, you can specify this to be 8, and the client will allow you to make 8 simultaneous requests to the server |
| output_format | str | "serialized" | The default output format for the server response, unless otherwise specified. If set to "serialized", the response will be converted to local Python objects (see usage examples here). For a list of all supported output formats, see the CoreNLP output options page |
| memory | str | "4G" | The memory used by the CoreNLP server process |
| start_server | bool | True | Whether to start the CoreNLP server when initializing the Python CoreNLPClient object. By default the CoreNLP server will be started using the provided options |
| stdout | file | sys.stdout | The standard output used by the CoreNLP server process |
| stderr | file | sys.stderr | The standard error used by the CoreNLP server process |
| be_quiet | bool | True | If set to False, the server process will print detailed error logs. Useful for diagnosing errors |
| max_char_length | int | 100000 | The max number of characters that will be accepted and processed by the CoreNLP server in a single request |
| preload | bool | True | Load the annotators immediately upon server start; otherwise the annotators will be lazily loaded when the first annotation request is made |

Here is a quick example that specifies a list of annotators to load, allocates 8G of memory to the server, uses plain text output format, and requests the server to print detailed error logs during annotation:

with CoreNLPClient(
    annotators='tokenize,ssplit,pos,lemma,ner',
    output_format='text',
    memory='8G',
    be_quiet=False) as client:
    ann = client.annotate(text)
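Note that when output_format is not "serialized", annotate returns the server's raw response rather than a protobuf Document; with "text", for instance, the result should be a plain string that you can print directly (a sketch, inside the with block above):

    print(client.annotate('Stanza is a great tool.'))  # pretty-printed annotation text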

Using a CoreNLP server on a remote machine

With the endpoint option, you can even connect to a CoreNLP server running on a different, remote machine:

with CoreNLPClient(endpoint='http://remote-server-address:9000') as client:
    ann = client.annotate(text)
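Since start_server defaults to True (see the table above), you will typically also want to pass start_server=False when the remote server is already up and running, so that the client does not try to launch a local server process. A minimal sketch:

from stanza.server import CoreNLPClient

# connect to an already-running remote server without starting a local one
with CoreNLPClient(endpoint='http://remote-server-address:9000',
                   start_server=False) as client:
    ann = client.annotate(text)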

More Advanced CoreNLP Server Settings

Apart from the above options, there are some more advanced settings that you may need in order to customize how the CoreNLP server starts in the background. They are summarized in the following table:

| Option | Description |
| --- | --- |
| server_id | ID for the server; label attached to the server's shutdown key file |
| status_port | Port for the server's status check endpoints |
| uriContext | URI context for the server |
| strict | Obey strict HTTP standards |
| ssl | If true, start the server with an (insecure) SSL connection |
| key | The .jks key file to load if ssl is enabled |
| username | The username component of a username/password basic auth credential |
| password | The password component of a username/password basic auth credential |
| blacklist | A list of IPv4 addresses to ban from using the server |

You can also find more documentation for the server's startup options on the CoreNLP Server website.

Here we highlight two common use cases for these options.

Changing server ID when using multiple CoreNLP servers on a machine

When a CoreNLP server is started, it writes a special shutdown key file to the local disk to indicate its running status. This creates an issue when multiple servers need to run simultaneously on a single machine, since a second server won't be able to write and delete its own shutdown key file. This is easily solved by giving the second server instance its own server ID when the client is initialized:

with CoreNLPClient(server_id='second-server-name') as client:
    ann = client.annotate(text)
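Note that each server instance also needs its own port (and, depending on your setup, the second server's status_port may need adjusting as well). Here is a sketch of running two servers side by side, where the second endpoint port is illustrative:

from stanza.server import CoreNLPClient

# run two CoreNLP servers on one machine: each gets its own port and,
# via server_id, its own shutdown key file
with CoreNLPClient(server_id='first-server-name') as client1, \
     CoreNLPClient(server_id='second-server-name',
                   endpoint='http://localhost:9001') as client2:
    ann1 = client1.annotate(text)  # served on port 9000
    ann2 = client2.annotate(text)  # served on port 9001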

Protecting a CoreNLP server with a password

You can even password-protect a CoreNLP server process, so that other users on the same machine won’t be able to access or change your CoreNLP server:

with CoreNLPClient(username='myusername', password='1234') as client:
    ...  # annotation requests must now authenticate themselves; see below

Now you’ll need to provide the same username and password when you call the annotate function of the client, so that the request can authenticate itself with the server:

ann = client.annotate(text, username='myusername', password='1234')

Easy, right?

Dynamically Changing Properties for Each Annotation Request

Properties for the CoreNLP pipeline can also be set for each individual annotation request. If properties are set for a particular request, they override the server's initialization properties. This allows you to dynamically change your annotation needs, without starting a new client-server session from scratch.

Request-level properties can be registered with Stanza's CoreNLPClient to maximize efficiency: upon registration, the client maintains a properties_cache mapping keys to particular property settings. Alternatively, request-level properties can be specified as a Stanford CoreNLP supported language to use the language defaults, or as a full Python dictionary for maximal flexibility.

Here is an example of how to register a set of properties with the client’s properties_cache, and how to use those properties via the key for annotation:

FRENCH_CUSTOM_PROPS = {
    'annotators': 'tokenize,ssplit,pos,parse',
    'tokenize.language': 'fr',
    'pos.model': 'edu/stanford/nlp/models/pos-tagger/french/french.tagger',
    'parse.model': 'edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz',
    'outputFormat': 'text'
}

with CoreNLPClient(annotators='tokenize,ssplit,pos') as client:
    client.register_properties_key('fr-custom', FRENCH_CUSTOM_PROPS)
    ann = client.annotate(text, properties_key='fr-custom')

Alternatively, request-level properties can simply be a language that you want to run the CoreNLP pipeline for:

ann = client.annotate(text, properties='german')

Or, a dictionary that specifies all properties you want to set/override:

ann = client.annotate(text, properties=FRENCH_CUSTOM_PROPS)

As with CoreNLPClient initialization, you can also specify the annotators and output format of CoreNLP for individual annotation requests:

ann = client.annotate(text, properties=FRENCH_CUSTOM_PROPS, annotators='tokenize,ssplit,pos', output_format='json')
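Putting these pieces together, here is a minimal sketch (the example sentences are illustrative) that serves English requests with the server's initialization properties and French requests with registered custom properties, all within a single client-server session:

from stanza.server import CoreNLPClient

FRENCH_POS_PROPS = {
    'annotators': 'tokenize,ssplit,pos',
    'tokenize.language': 'fr',
    'pos.model': 'edu/stanford/nlp/models/pos-tagger/french/french.tagger',
    'outputFormat': 'text'
}

with CoreNLPClient(annotators='tokenize,ssplit,pos') as client:
    client.register_properties_key('fr-pos', FRENCH_POS_PROPS)
    # English request, using the server's initialization properties
    en_ann = client.annotate('CoreNLP is a great tool.')
    # French request, overriding them via the registered key
    fr_ann = client.annotate('Le chat dort.', properties_key='fr-pos')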