Lexicalized Parser

About

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as “phrases”) and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s. You can try out our parser online.

Package contents

This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The original version of this parser was mainly written by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Extensive additional work (internationalization and language-specific modeling, flexible input/output, grammar compaction, lattice parsing, k-best parsing, typed dependencies output, user support, etc.) has been done by Roger Levy, Christopher Manning, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Bill MacCartney, Anna Rafferty, Spence Green, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, and Jenny Finkel.

The lexicalized probabilistic parser implements a factored product model, with separate PCFG phrase structure and lexical dependency experts, whose preferences are combined by efficient exact inference, using an A* algorithm. Alternatively, the software can be used simply as an accurate unlexicalized stochastic context-free grammar parser. Either of these yields a statistical parsing system with good performance. A GUI is provided for viewing the phrase structure tree output of the parser.
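
In practice, the choice between the factored model and the plain PCFG comes down to which serialized grammar you load. A rough sketch, assuming (as in the sample command at the end of this page) that the serialized grammars englishFactored.ser.gz and englishPCFG.ser.gz are available in the current directory and that input.txt is your own plain-text file:

java -mx600m edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" englishFactored.ser.gz input.txt
java -mx200m edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" englishPCFG.ser.gz input.txt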

As well as providing an English parser, the parser can be and has been adapted to work with other languages. A Chinese parser based on the Chinese Treebank, a German parser based on the Negra corpus, and Arabic parsers based on the Penn Arabic Treebank are also included. The parser has also been used for other languages, such as Italian, Bulgarian, and Portuguese.

The parser provides Universal Dependencies (v1) and Stanford Dependencies output as well as phrase structure trees. Typed dependencies are otherwise known as grammatical relations. This style of output is available only for English and Chinese. For more details, please refer to the Stanford Dependencies webpage and the Universal Dependencies v1 documentation. (See also the current Universal Dependencies documentation; we have not yet updated to it.)

Shift-reduce constituency parser

As of version 3.4 in 2014, the parser includes the code necessary to run a shift-reduce parser, a much faster constituent parser with competitive accuracy. Models for this parser are linked below.
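
Because the shift-reduce parser requires part-of-speech tagged input (see the licensing fine print below), a typical use tags each sentence before parsing. The following minimal sketch is modeled on the demo code shipped with the parser; the model and tagger paths shown are the usual ones inside the released model jars, but they are assumptions that may differ between versions.

import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.Tree;

public class ShiftReduceSketch {
  public static void main(String[] args) {
    // Paths as usually packaged in the model jars (assumed; adjust for your release).
    ShiftReduceParser parser = ShiftReduceParser.loadModel(
        "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
    MaxentTagger tagger = new MaxentTagger(
        "edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger");

    String text = "The strongest rain ever recorded in India shut down the financial hub of Mumbai.";
    DocumentPreprocessor sentences = new DocumentPreprocessor(new StringReader(text));
    for (List<HasWord> sentence : sentences) {
      // The shift-reduce parser expects pre-tagged input, so tag the sentence first.
      List<TaggedWord> tagged = tagger.tagSentence(sentence);
      Tree tree = parser.apply(tagged);
      tree.pennPrint();
    }
  }
}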

Neural-network dependency parser

In version 3.5.0 (October 2014) we released a high-performance dependency parser powered by a neural network. The parser outputs typed dependency parses for English and Chinese. The models for this parser are included in the general Stanford Parser models package.
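
The dependency parser is likewise usable directly from Java. A minimal sketch, again modeled on the demo code in the distribution; the tagger model path is an assumption and may differ between releases.

import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.nndep.DependencyParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.GrammaticalStructure;

public class NNDepSketch {
  public static void main(String[] args) {
    // DEFAULT_MODEL resolves to the English model in the models package.
    DependencyParser parser = DependencyParser.loadFromModelFile(DependencyParser.DEFAULT_MODEL);
    MaxentTagger tagger = new MaxentTagger(
        "edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger"); // assumed path

    String text = "Officials said the storm closed airports.";
    DocumentPreprocessor sentences = new DocumentPreprocessor(new StringReader(text));
    for (List<HasWord> sentence : sentences) {
      List<TaggedWord> tagged = tagger.tagSentence(sentence); // the parser needs POS tags
      GrammaticalStructure gs = parser.predict(tagged);
      System.out.println(gs.typedDependencies());
    }
  }
}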

Dependency scoring

The package includes a tool for scoring generic dependency parses, in the class edu.stanford.nlp.trees.DependencyScoring. It computes F1 and labeled attachment scores for dependency trees. The included usage message gives a detailed description of how to use the tool.
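
For example, running the class without arguments (assuming the parser jar is on your classpath) should print that usage message:

java -cp stanford-parser.jar edu.stanford.nlp.trees.DependencyScoring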

Usage notes

The current version of the parser requires Java 8 or later. (You can also download older versions of the parser: version 1.4, which runs under JDK 1.4; version 2.0, which runs under JDK 1.5; and version 3.4.1, which runs under JDK 1.6; but those distributions are no longer supported.) The parser also requires a reasonable amount of memory (at least 100MB to run as a PCFG parser on sentences up to 40 words in length; typically around 500MB of memory to parse similarly long newswire sentences using the factored model).

The parser is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation, a Java parsing GUI, and a Java API.

The download is a 261 MB zipped file (mainly consisting of included grammar data files). If you unpack the zip file, you should have everything needed. Simple scripts are included to invoke the parser on a Unix or Windows system. For another system, you merely need to similarly configure the classpath.
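
For example, on a Unix system a first run typically looks something like the following; the directory, script, and test-file names are assumptions based on the usual layout of the unpacked distribution:

cd stanford-parser-full-2020-11-17
./lexparser.sh data/testsent.txt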

Licensing

The parser code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing is available. (Fine print: The traditional (dynamic programmed) Stanford Parser does part-of-speech tagging as it works, but the newer shift-reduce constituency parser and neural-network dependency parser require pre-tagged input. For convenience, we include the part-of-speech tagger code, but not the tagger models, with the parser download. However, if you want to use these parsers under a commercial license, then you need a license to both the Stanford Parser and the Stanford POS tagger. Or you can get the whole bundle of Stanford CoreNLP.) If you don’t need a commercial license, but would like to support maintenance of these tools, we welcome gift funding: use this form and write “Stanford NLP Group open source software” in the Special Instructions.

Citing the Stanford Parser

The main technical ideas behind how these parsers work appear in these papers. Feel free to cite one or more of the following papers or people depending on what you are using. Since the parser is regularly updated, we appreciate it if papers with numerical results reflecting parser performance mention the version of the parser being used!

For the neural-network dependency parser:
Danqi Chen and Christopher D. Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. Proceedings of EMNLP 2014.

For the Compositional Vector Grammar parser (starting at version 3.2):
Richard Socher, John Bauer, Christopher D. Manning and Andrew Y. Ng. 2013. Parsing With Compositional Vector Grammars. Proceedings of ACL 2013

For the Shift-Reduce Constituency parser (starting at version 3.2):
This parser was written by John Bauer. You can thank him and cite the web page describing it. You can also cite the original research papers of others mentioned on that page.

For the PCFG parser (which also does POS tagging):
Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.

For the factored parser (which also does POS tagging):
Dan Klein and Christopher D. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10.

For the Universal Dependencies representation:
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In LREC 2016.

For the English Universal Dependencies converter and the enhanced English Universal Dependencies representation:
Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks. In LREC 2016.

For the (English) Stanford Dependencies representation:
Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC 2006.

For the German parser:
Anna Rafferty and Christopher D. Manning. 2008. Parsing Three German Treebanks: Lexicalized and Unlexicalized Baselines. In ACL Workshop on Parsing German.

For the Chinese Parser:
Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? ACL 2003, pp. 439-446.

For the Chinese Stanford Dependencies:
Pi-Chuan Chang, Huihsin Tseng, Dan Jurafsky, and Christopher D. Manning. 2009. Discriminative Reordering with Chinese Grammatical Relations Features. In Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation.

For the Arabic parser:
Spence Green and Christopher D. Manning. 2010. Better Arabic Parsing: Baselines, Evaluations, and Analysis. In COLING 2010.

For the French parser:
Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning. 2011. Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French. In EMNLP 2011.

For the Spanish parser:
Most of the work on Spanish was by Jon Gauthier. There is no published paper, but you can thank him and/or cite this webpage: https://nlp.stanford.edu/software/spanish-faq.html

Questions about the parser?

  1. If you’re new to parsing, you can start by running the GUI to try out the parser. Scripts are included for Linux (lexparser-gui.sh) and Windows (lexparser-gui.bat).
  2. Take a look at the Javadoc lexparser package documentation and LexicalizedParser class documentation. (Point your web browser at the index.html file in the included javadoc directory and navigate to those items.) A short example of the Java API appears after this list.
  3. Look at the parser FAQ for answers to common questions.
  4. If none of that helps, please see our email guidelines for instructions on how to reach us for further assistance.
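
If you want to call the parser from your own code, the following is a minimal sketch of the Java API, modeled on the ParserDemo class included in the download; the grammar path shown is the usual one, but treat it and the helper class names as assumptions to check against the Javadoc for your version.

import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

public class LexParserSketch {
  public static void main(String[] args) {
    // Load the serialized English PCFG grammar (path as usually found in the models jar).
    LexicalizedParser lp = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

    // Parse an already tokenized sentence.
    String[] sent = { "The", "parser", "builds", "a", "phrase", "structure", "tree", "." };
    List<CoreLabel> words = SentenceUtils.toCoreLabelList(sent);
    Tree parse = lp.apply(words);
    parse.pennPrint();

    // Convert the phrase structure tree to typed dependencies (English and Chinese only).
    TreebankLanguagePack tlp = lp.treebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
  }
}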

Download

The standard CoreNLP download includes the code for the lexicalized parser and all of its siblings described above.

This standard download includes models for Arabic, Chinese, English, French, German, and Spanish. Additional models that we do not release with the standalone parser, including the shift-reduce models, can be found in the models jars for each language.

Older versions are linked below.

Extensions: Packages by others using the parser

Java

  • tydevi Typed Dependency Viewer that makes a picture of the Stanford Dependencies analysis of a sentence. By Bernard Bou.
  • DependenSee A Dependency Parse Visualisation Tool that makes pictures of Stanford Dependency output. By Awais Athar. (GitHub)
  • GATE plug-in. By the GATE Team (esp. Adam Funk).
  • GrammarScope grammatical relation browser. GUI, especially focusing on grammatical relations (typed dependencies), including an editor. By Bernard Bou.

PHP

  • PHP-Stanford-NLP. Supports POS Tagger, NER, Parser. By Anthony Gentile (agentile).

Python/Jython

Ruby

.NET / F# / C#

OS X

  • If you use Homebrew, you can install the Stanford Parser with: brew install stanford-parser

Release history

Version | Date | Description | Models
4.2.0 | 2020-11-17 | Retrain English models with treebank fixes | arabic chinese english french german spanish
4.0.0 | 2020-05-22 | Model tokenization updated to UDv2.0 | arabic chinese english french german spanish
3.9.2 | 2018-10-17 | Updated for compatibility | arabic chinese english french german spanish
3.9.1 | 2018-02-27 | New French and Spanish UD models, misc. UD enhancements, bug fixes | arabic chinese english french german spanish
3.8.0 | 2017-06-09 | Updated for compatibility | arabic chinese english french german spanish
3.7.0 | 2016-10-31 | New UD models | arabic chinese english french german spanish
3.6.0 | 2015-12-09 | Updated for compatibility | chinese english french german spanish
3.5.2 | 2015-04-20 | Switch to universal dependencies | shift reduce parser models
3.5.1 | 2015-01-29 | Dependency parser fixes and model improvements | shift reduce parser models
3.5.0 | 2014-10-31 | Neural-network dependency parser | shift reduce parser models
3.4.1 | 2014-08-27 | Add Spanish models | shift reduce parser models
3.4 | 2014-06-16 | Shift-reduce parser, dependency improvements, French parser uses CC tagset | shift reduce parser models
3.3.1 | 2014-01-04 | English dependency “infmod” and “partmod” combined into “vmod”, other minor dependency improvements |
3.3.0 | 2013-11-12 | English dependency “attr” removed, other dependency improvements, imperative training data added |
3.2.0 | 2013-06-20 | New CVG-based English model with higher accuracy |
2.0.5 | 2013-04-05 | Dependency improvements, -nthreads option, ctb7 model |
2.0.4 | 2012-11-12 | Improved dependency code extraction efficiency, other dependency changes |
2.0.3 | 2012-07-09 | Minor bug fixes |
2.0.2 | 2012-05-22 | Some models now support training with extra tagged, non-tree data |
2.0.1 | 2012-03-09 | Caseless English model included, bugfix for enforced tags |
2.0 | 2012-02-03 | Threadsafe! |
1.6.9 | 2011-09-14 | Improved recognition of imperatives, dependencies now explicitly include a root, parser knows osprey is a noun |
1.6.8 | 2011-06-19 | New French model, improved foreign language models, bug fixes |
1.6.7 | 2011-05-18 | Minor bug fixes |
1.6.6 | 2011-04-20 | Internal code and API changes (ArrayLists rather than Sentence; use of CoreLabel objects) to match tagger and CoreNLP |
1.6.5 | 2010-11-30 | Further improvements to English Stanford Dependencies and other minor changes |
1.6.4 | 2010-08-20 | More minor bug fixes and improvements to English Stanford Dependencies and question parsing |
1.6.3 | 2010-07-09 | Improvements to English Stanford Dependencies and question parsing, minor bug fixes |
1.6.2 | 2010-02-26 | Improvements to Arabic parser models, and to English and Chinese Stanford Dependencies |
1.6.1 | 2008-10-26 | Slightly improved Arabic and German parsing, and Stanford Dependencies |
1.6 | 2007-08-19 | Added Arabic, k-best PCFG parsing; improved English grammatical relations |
1.5.1 | 2006-06-11 | Improved English and Chinese grammatical relations; fixed UTF-8 handling |
1.5 | 2005-07-21 | Added grammatical relations output; fixed bugs introduced in 1.4 |
1.4 | 2004-03-24 | Made PCFG faster again (by FSA minimization); added German support |
1.3 | 2003-09-06 | Made parser over twice as fast; added tokenization options |
1.2 | 2003-07-20 | Halved PCFG memory usage; added support for Chinese |
1.1 | 2003-03-25 | Improved parsing speed; included GUI, improved PCFG grammar |
1.0 | 2002-12-05 | Initial release |

Sample input and output

The parser can read various forms of plain text input and can output various analysis formats, including part-of-speech tagged text, phrase structure trees, and a grammatical relations (typed dependency) format. For example, consider the text:

The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

The following output shows part-of-speech tagged text, then a context-free phrase structure grammar representation, and finally a typed dependency representation. All of these are different views of the output of the parser.

The/DT strongest/JJS rain/NN ever/RB recorded/VBN in/IN India/NNP
shut/VBD down/RP the/DT financial/JJ hub/NN of/IN Mumbai/NNP ,/,
snapped/VBD communication/NN lines/NNS ,/, closed/VBD airports/NNS
and/CC forced/VBD thousands/NNS of/IN people/NNS to/TO sleep/VB in/IN
their/PRP$ offices/NNS or/CC walk/VB home/NN during/IN the/DT night/NN
,/, officials/NNS said/VBD today/NN ./.
 
(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))
 
det(rain-3, The-1)
amod(rain-3, strongest-2)
nsubj(shut-8, rain-3)
nsubj(snapped-16, rain-3)
nsubj(closed-20, rain-3)
nsubj(forced-23, rain-3)
advmod(recorded-5, ever-4)
partmod(rain-3, recorded-5)
prep_in(recorded-5, India-7)
ccomp(said-40, shut-8)
prt(shut-8, down-9)
det(hub-12, the-10)
amod(hub-12, financial-11)
dobj(shut-8, hub-12)
prep_of(hub-12, Mumbai-14)
conj_and(shut-8, snapped-16)
ccomp(said-40, snapped-16)
nn(lines-18, communication-17)
dobj(snapped-16, lines-18)
conj_and(shut-8, closed-20)
ccomp(said-40, closed-20)
dobj(closed-20, airports-21)
conj_and(shut-8, forced-23)
ccomp(said-40, forced-23)
dobj(forced-23, thousands-24)
prep_of(thousands-24, people-26)
aux(sleep-28, to-27)
xcomp(forced-23, sleep-28)
poss(offices-31, their-30)
prep_in(sleep-28, offices-31)
xcomp(forced-23, walk-33)
conj_or(sleep-28, walk-33)
dobj(walk-33, home-34)
det(night-37, the-36)
prep_during(walk-33, night-37)
nsubj(said-40, officials-39)
root(ROOT-0, said-40)
tmod(said-40, today-41)

This output was generated with the command:

java -mx200m edu.stanford.nlp.parser.lexparser.LexicalizedParser -retainTMPSubcategories -outputFormat "wordsAndTags,penn,typedDependencies" englishPCFG.ser.gz mumbai.txt