If your text is all lowercase, all uppercase, or badly and
inconsistently capitalized (many web forums, texts, twitter, etc.)
then this will negatively effect the performance of most of our
annotators. Most of our annotators were trained on data that is
standardly edited and capitalized full sentences.
There are two strategies available to address this that may help. One
is to try to first correctly capitalize the text with a
truecaser, and then to process the text with the standard models.
See the TrueCaseAnnotator for how to do this.
An example command for using regular annotators following truecasing
java -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,truecase,pos,lemma,ner,depparse -truecase.overwriteText true -file caseless.txt -outputFormat json
The other strategy is to use models more suited to ill-capitalized text.
The GATE folk made an English POS tagger model trained on twitter
text. You can get it from the extensions page.
We have made slightly different Stanford CoreNLP models for the tagger, parser, and NER
that ignore capitalization. We have only trained such models
for English, but the same method could be used for other languages.
To use these models, you need to download a jar file with caseless
models. Prior to version 3.6, caseless models were packaged separately as
their own jar file (approximately treating “caseless English” like a
separate language). Starting with version 3.6, caseless models for
English were included in the new comprehensive english jar file. You
can find these jar files on the Release history page.
Be sure to include the path to the case-insensitive models jar in the
-cp classpath flag and then you can
ask for these models to be used like this:
Important note (2016): The caseless NER model
released with version 3.6.0 was defective and has very poor
performance. Sorry! Stuff happens. If you want good caseless NER,
you should either run with caseless models from a 3.5.x series
release (all of which are compatible with version 3.6.0) or download
the new fixed model from version 3.7.0 or in the HEAD jar, which is available from
our GitHub page. Since the
version 3.5.x releases have a separate caseless jar, it is easy to
also specify it as a dependency; you just have to make sure that it
appears on your classpath before other jars which contain models
with the same name.
To train your own caseless models, you need one additional property,
which asks for a function to be called before a token is processed
which leads to the case of all words being ignored. We use by default
a function that also Americanizes the spelling of certain words:
wordFunction = edu.stanford.nlp.process.LowercaseAndAmericanizeFunction
but there is also simply:
wordFunction = edu.stanford.nlp.process.LowercaseFunction