Chinese and Arabic Segmenter
This software is for “tokenizing” or “segmenting” the words of Chinese or Arabic text. Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.
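As a rough illustration of what punctuation splitting and possessive separation mean for English, here is a minimal sketch. It is not the Stanford Tokenizer and does not reproduce its rules; the `tokenize` function and its regular expression are hypothetical and cover only these two cases.

```python
import re

# Hypothetical toy pattern (an assumption, not the Stanford Tokenizer's rules):
#   's\b      -> possessive clitic split off as its own token
#   \w+       -> a run of word characters
#   [^\w\s]   -> any single punctuation character
TOKEN_RE = re.compile(r"'s\b|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, possessive, and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("Stanford's tokenizer splits punctuation, too."))
# ['Stanford', "'s", 'tokenizer', 'splits', 'punctuation', ',', 'too', '.']
```

Whitespace already delimits most English tokens, so a handful of rules goes a long way. Chinese and Arabic need a statistical segmenter because the boundaries are not marked in the text.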
The Stanford Word Segmenter currently supports Arabic and Chinese. (The Stanford Tokenizer can be used for English, French, and Spanish.) The provided segmentation schemes have been found to work well for a variety of applications.
The system requires Java 1.8+ to be installed. We recommend at least 1G of memory for documents that contain long sentences. For files with shorter sentences (e.g., 20 tokens), you can decrease the memory requirement by changing the option java -mx1g in the run scripts.
Arabic
Arabic is a root-and-template language with abundant bound clitics. These clitics include possessives, pronouns, and discourse connectives. The Arabic segmenter splits only these clitics off from the words they attach to. Segmenting clitics attached to words reduces lexical sparsity and simplifies syntactic analysis.
The Arabic segmenter model processes raw text according to the Penn Arabic Treebank 3 (ATB) standard. It is an implementation of the segmenter described in:
Will Monroe, Spence Green, and Christopher D. Manning. 2014. Word Segmentation of Informal Arabic with Domain Adaptation. In ACL.
Chinese
Chinese is standardly written without spaces between words (as are some other languages). This software will split Chinese text into a sequence of words, defined according to some word segmentation standard. It is a Java implementation of the CRF-based Chinese Word Segmenter described in:
Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning. 2005. A Conditional Random Field Word Segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing.
Two models with two different segmentation standards are included: Chinese Penn Treebank standard and Peking University standard.
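CRF-based segmenters commonly cast segmentation as a per-character labeling problem: each character is tagged as either starting a new word or continuing the current one. The sketch below shows only that encoding and its inverse. The binary label set, the function names, and the example segmentation are assumptions for illustration; the code neither runs nor describes the released models.

```python
def words_to_labels(words):
    """Encode a segmentation as one label per character.

    1 = the character begins a word, 0 = it continues the current word.
    (Assumed binary scheme, for illustration only.)
    """
    labels = []
    for word in words:
        labels.extend([1] + [0] * (len(word) - 1))
    return labels


def labels_to_words(text, labels):
    """Rebuild the word sequence from raw text and per-character labels."""
    words = []
    for char, label in zip(text, labels):
        if label == 1 or not words:
            words.append(char)   # start a new word
        else:
            words[-1] += char    # extend the current word
    return words


# Illustrative segmentation; a real standard may segment differently.
example = ["我", "喜欢", "自然", "语言", "处理"]
labels = words_to_labels(example)
print(labels)
print(labels_to_words("".join(example), labels))
```

Under this view, choosing a segmentation standard amounts to choosing which label sequences count as correct for the training data.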
On May 21, 2008, we released a version that makes use of lexicon features. With external lexicon features, the segmenter segments more consistently and also achieves a higher F-measure when trained and tested on the bakeoff data. This version is close to the CRF-Lex segmenter described in:
Pi-Chuan Chang, Michel Galley and Chris Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. In WMT.
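For readers unfamiliar with how segmentation quality is scored, the following sketch computes word-level precision, recall, and F-measure by comparing word spans (character offsets) between a reference and a predicted segmentation. It follows the usual definition of these metrics and is not the official bakeoff scoring script; the function names and the example sentences are assumptions for illustration.

```python
def word_spans(words):
    """Return the set of (start, end) character offsets for each word."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, predicted_words):
    """Word-level precision, recall, and F-measure for one sentence."""
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    correct = len(gold & predicted)  # words with identical boundaries
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Illustrative data: the prediction merges two reference words into one.
gold = ["我", "喜欢", "自然", "语言", "处理"]
predicted = ["我", "喜欢", "自然语言", "处理"]
print(segmentation_prf(gold, predicted))
```

In this example three of the four predicted words match the reference, giving precision 0.75, recall 0.6, and an F-measure of about 0.667.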
The older version (2006-05-11) without external lexicon features is still available for download, but we recommend using the latest version.
Another new feature of recent releases is that the segmenter can now output k-best segmentations. An example of how to train the segmenter is now also available.
Tutorials
- Michelle Fullwood wrote a nice tutorial on segmenting and parsing Chinese with the Stanford NLP tools.
- This page from linguisticsweb has a few Windows examples (but text is a bit sparse).
Download
The Chinese and Arabic versions of CoreNLP use the segmenter for tokenization, and the segmenter package is available in all recent versions of CoreNLP. Bugfixes are primarily released through CoreNLP.
Previous versions of the segmenter are also available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation and a Java API. The segmenter code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing is available. If you don’t need a commercial license, but would like to support maintenance of these tools, we welcome gift funding.
The download is a zipped file consisting of model files, compiled code, and source files. If you unpack the zip file, you should have everything needed. Simple scripts are included to invoke the segmenter.
Download Stanford Word Segmenter version 4.2.0
Mailing Lists
We have 3 mailing lists for the Stanford Word Segmenter, all of which are shared with other JavaNLP tools (with the exclusion of the parser). Each address is at @lists.stanford.edu:
java-nlp-user
This is the best list to post to in order to send feature requests, make announcements, or for discussion among JavaNLP users. (Please ask support questions on Stack Overflow using the stanford-nlp tag.)
You have to subscribe to be able to use this list. Join the list via this webpage or by emailing java-nlp-user-join@lists.stanford.edu. (Leave the subject and message body empty.) You can also look at the list archives.
java-nlp-announce
This list will be used only to announce new versions of Stanford JavaNLP tools. So it will be very low volume (expect 2-4 messages a year). Join the list via this webpage or by emailing java-nlp-announce-join@lists.stanford.edu. (Leave the subject and message body empty.)
java-nlp-support
This list goes only to the software maintainers. It’s a good address for licensing questions, etc. For general use and support questions, you’re better off using Stack Overflow or joining and using java-nlp-user. You cannot join java-nlp-support, but you can mail questions to java-nlp-support@lists.stanford.edu.
Extensions: Packages by others using Stanford Word Segmenter
- F#/C#/.NET: Sergey Tihon has ported Stanford NER to F# (and other .NET languages, such as C#), using IKVM. It’s available on NuGet.
- Python: The Stanford Word Segmenter is incorporated into nltk’s tokenize package.
Release History
Version | Date | Description |
---|---|---|
4.2.0 | 2020-11-17 | Updated for compatibility |
4.0.0 | 2020-04-19 | New Chinese segmenter trained off of CTB 9.0 |
3.9.2 | 2018-10-16 | Updated for compatibility |
3.9.1 | 2018-02-27 | Updated for compatibility |
3.8.0 | 2017-06-09 | Updated for compatibility |
3.7.0 | 2016-10-31 | Updated for compatibility |
3.6.0 | 2015-12-09 | Updated for compatibility |
3.5.2 | 2015-04-20 | Updated for compatibility |
3.5.1 | 2015-01-29 | Updated for compatibility |
3.5.0 | 2014-10-26 | Upgrade to Java 8 |
3.4.1 | 2014-08-27 | Updated for compatibility |
3.4 | 2014-06-16 | Updated Arabic model |
3.3.1 | 2014-01-04 | Bugfix release |
3.3.0 | 2013-11-12 | Updated for compatibility |
3.2.0 | 2013-06-20 | Improved line by line handling |
1.6.8 | 2013-04-04 | ctb7 model, -nthreads option |
1.6.7 | 2012-11-11 | Bugfixes for both Arabic and Chinese, Chinese segmenter can now load data from a jar file |
1.6.6 | 2012-07-09 | Improved Arabic model |
1.6.5 | 2012-05-22 | Fixed encoding problems, supports stdin for Chinese segmenter |
1.6.4 | 2012-05-07 | Included Arabic model |
1.6.3 | 2012-01-08 | Minor bug fixes |
1.6.2 | 2011-09-14 | Improved thread safety |
1.6.1 | 2011-06-19 | Fixed empty document bug when training new models |
1.6 | 2011-05-15 | Models updated to be slightly more accurate; code correctly released so it now builds; updated for compatibility with other Stanford releases |
1.5 | 2008-05-21 | Added external lexicon features; able to output k-best segmentations |
1.0 | 2006-05-11 | Initial release |