Arabic Parser FAQ

Stanford Arabic Parser IAQ
What tokenization of Arabic does the parser assume?
What character encoding do you assume?
What characters are encoded?
What POS tag set does the parser use?
What phrasal category set does the parser use?
What’s not in the box?
What data are the parsers trained on?
How well do the parsers work?
Can you give me some examples of how to use the parser for Arabic?
Can you get dependencies output from the Arabic parser?
Where does the Arabic-specific source code live?

The grey “GALE ROSETTA” notes are only for people involved in that project; they don’t apply to regular users.

Much of the information here is also applicable to the Arabic part of speech tagger, such as discussion of word segmentation and tag sets.

What tokenization of Arabic does the parser assume?

The parser assumes precisely the tokenization of Arabic used in the Penn Arabic Treebank (ATB). You must provide input to the parser that is tokenized in this way or the resulting parses will be terrible. We do now have a software component for segmenting Arabic, but you have to download and run it first; it isn’t included in the parser (see the list of tools at the end of this answer). The Arabic parser itself simply uses a whitespace tokenizer. As far as we are aware, ATB tokenization has only an extensional definition; it isn’t written down anywhere. Segmentation is done based on the morphological analyses generated by the Buckwalter analyzer. The segmentation can be characterized thus:

* Almost all clitics are separated off as separate words. This includes clitic pronouns, prepositions, and conjunctions. However, the clitic determiner (definite article) "Al" (ال) is _not_ separated off. Inflectional and derivational morphology is not separated off. 
* [GALE ROSETTA: These separated off clitics are not overtly marked as proclitics/enclitics, although we do have a facility to strip off the '+' and '#' characters that the IBM segmenter uses to mark enclitics and proclitics, respectively. See the example below using the option `-escaper edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper`]
* Parentheses are rendered `-LRB-` and `-RRB-`
* Quotes are rendered as (ASCII) straight single and double quotes (`'` and `"`), not as curly quotes or LaTeX-style quotes (unlike the Penn English Treebank).
* Dashes are represented with the ASCII hyphen character (U+002D). 
* Non-break space is not used. 
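
The punctuation conventions in the list above can be approximated with a small preprocessing step. This is only a sketch: the FAQ names the target forms but not the source characters, so the particular curly quotes and dashes mapped here are assumptions, and the function name `atb_punctuation` is hypothetical. It does not perform clitic segmentation.

```python
import re

# Assumed source characters; the FAQ specifies only the ASCII targets.
_CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> straight
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> straight
    "\u2013": "-", "\u2014": "-",   # en and em dashes -> ASCII hyphen
    "\u00a0": " ",                  # non-break space -> plain space
})

_BRACKETS = {"(": "-LRB-", ")": "-RRB-"}


def atb_punctuation(line: str) -> str:
    """Map punctuation in one line of text to the ATB-style forms."""
    line = line.translate(_CHAR_MAP)
    # Set parentheses off as separate tokens before renaming them.
    line = re.sub(r"([()])", r" \1 ", line)
    tokens = [_BRACKETS.get(tok, tok) for tok in line.split()]
    return " ".join(tokens)
```

The output is whitespace-separated, which matches the whitespace tokenizer the parser uses.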

There are some tools available that can do the necessary clitic segmentation:

* [Stanford Word Segmenter](https://nlp.stanford.edu/software/segmenter.html)
* [`http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html`](http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html)
* [`http://www1.ccls.columbia.edu/~mdiab/`](http://www1.ccls.columbia.edu/~mdiab/)
* [`http://www.spencegreen.com/2011/01/19/howto-basic-arabic-preprocessing-for-nlp/`](http://www.spencegreen.com/2011/01/19/howto-basic-arabic-preprocessing-for-nlp/)

What character encoding do you assume?

The present release provides a grammar (arabicFactored.ser.gz) for real Arabic, for which the default encoding is UTF-8, but for which another encoding (such as a legacy Arabic encoding) can be specified on the command line with the -encoding charset flag. (Previous releases also provided grammars for the Buckwalter encoding of Arabic in ASCII (arabicFactoredBuckwalter.ser.gz and either atbP3FactoredBuckwalter.ser.gz or atb3FactoredBuckwalter.ser.gz, depending on the parser release). They may return if there is interest.)
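
As an alternative to passing -encoding, text in a legacy encoding can be converted to UTF-8 before parsing. The sketch below assumes Windows-1256 (`cp1256`) as the legacy encoding purely for illustration; the function names are hypothetical, and the right source encoding depends on your data.

```python
from pathlib import Path


def transcode_to_utf8(raw: bytes, source_encoding: str = "cp1256") -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def transcode_file(src: str, dst: str, source_encoding: str = "cp1256") -> None:
    """Rewrite a whole text file as UTF-8 (source encoding is an assumption)."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```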

What characters are encoded?

The parsers are trained on unvocalized Arabic. One grammar (atbP3FactoredBuckwalter.ser.gz or atb3FactoredBuckwalter.ser.gz) is trained on input represented exactly as it is found in the Penn Arabic Treebank. The other grammars (arabicFactored.ser.gz and arabicFactoredBuckwalter.ser.gz) are trained on a more normalized form of Arabic. This form deletes the tatweel character and other diacritics beyond the short vowel markers which are sometimes not written (Alef with hamza or madda becomes simply Alef, and Alef maksura becomes Yaa), and prefers ASCII characters (Arabic punctuation and number characters are mapped to corresponding ASCII characters). Your accuracy will suffer unless you normalize text in this way, because words are recognized simply based on string identity. [GALE ROSETTA: This is precisely the mapping that the IBM ar_normalize_v5.pl script does for you.]
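
The normalization described above might look roughly like the following. It is a sketch of the description in this answer, not the IBM script or the code used to prepare the training data; the exact character inventory (which diacritics, which punctuation marks, which digit block) is an assumption, and `normalize_arabic` is a hypothetical name.

```python
import re

TATWEEL = "\u0640"
# Tanween, short vowels, shadda and sukun (U+064B..U+0652); assumed inventory.
DIACRITICS = re.compile("[\u064b-\u0652]")

_LETTER_MAP = {
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0649": "\u064a",  # alef maksura -> yaa
}
_PUNCT_MAP = {
    "\u060c": ",",  # Arabic comma
    "\u061b": ";",  # Arabic semicolon
    "\u061f": "?",  # Arabic question mark
}
# Arabic-Indic digits U+0660..U+0669 -> ASCII 0..9
_DIGIT_MAP = {chr(0x0660 + d): str(d) for d in range(10)}

_TABLE = str.maketrans({**_LETTER_MAP, **_PUNCT_MAP, **_DIGIT_MAP, TATWEEL: None})


def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, then fold letters, punctuation, digits."""
    return DIACRITICS.sub("", text).translate(_TABLE)
```

Because words are matched by string identity, the same function would need to be applied to every input sentence before parsing.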

What POS tag set does the parser use?

The parser uses an “augmented Bies” tag set. The so-called “Bies mapping” maps down the full morphological analyses from the Buckwalter analyzer that appear in the LDC Arabic Treebanks to a subset of the POS tags used in the Penn English Treebank (but some with different meanings). We augment this set to represent which words have the determiner “Al” (ال) cliticized to them. These extra tags start with “DT”, and appear for all parts of speech that can be preceded by “Al”, so we have DTNN, DTCD, etc. This is an early definition of the Bies mapping. For something more up-to-date with recent updates of the Arabic Treebank tag taxonomy, it is also useful to look at the recent documentation and articles. In particular, the Bies mapping is defined in the file http://catalog.ldc.upenn.edu/docs/LDC2010T13/atb1-v4.1-taglist-conversion-to-PennPOS-forrelease.lisp, included with recent ATB releases. This revised version now includes a few new tags that are not in the English PTB tag set (NOUN_QUANT, ADJ_NUM, and VN). We also include them now.
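
When consuming the parser's output, it can be handy to separate the “DT” augmentation from the underlying tag. The helper below is a hypothetical sketch based only on the naming pattern described above (DTNN, DTCD, etc.); it assumes that a bare “DT”, if it occurs, is an ordinary tag and should be left alone.

```python
from typing import Tuple


def split_determiner_tag(tag: str) -> Tuple[bool, str]:
    """Return (has_al, base_tag) for an augmented Bies tag."""
    # A tag longer than "DT" that starts with "DT" marks a word carrying "Al".
    if tag.startswith("DT") and len(tag) > 2:
        return True, tag[2:]
    return False, tag
```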

What phrasal category set does the parser use?

The set used in the Penn Arabic Treebank. See the original Penn Arabic Treebank Guidelines, or, better, the up-to-date Penn Arabic Treebank Guidelines.

What’s not in the box?

The parser download does not include components for normalizing or segmenting Arabic text. You might look at the Stanford Word Segmenter download, or the segmentation tools from CADIM, such as the one available on Mona Diab’s homepage (but note that if they also separate off the “Al” (ال) clitic, then you will need to glue it back on in a postprocessing step). [GALE ROSETTA: IBM has an ATB segmenter and a Perl script that does the appropriate normalization. Their segmenter marks proclitics and enclitics with ‘#’ and ‘+’. These need to be removed for parsing, but we do provide an escaper which does this.]
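
For illustration, removing the ‘#’ and ‘+’ markers mentioned above could be sketched as follows. This is not the implementation of the escaper shipped with the parser. It assumes that ‘#’ trails a proclitic (as in the supplied test file) and that ‘+’ leads an enclitic; the latter position is an assumption, and `strip_ibm_markers` is a hypothetical name.

```python
def strip_ibm_markers(line: str) -> str:
    """Remove clitic markers from whitespace-separated tokens."""
    tokens = []
    for tok in line.split():
        if len(tok) > 1 and tok.endswith("#"):
            tok = tok[:-1]   # proclitic: drop the trailing '#'
        elif len(tok) > 1 and tok.startswith("+"):
            tok = tok[1:]    # enclitic: drop the leading '+' (assumed position)
        tokens.append(tok)
    return " ".join(tokens)
```

Tokens that consist only of ‘#’ or ‘+’ are left untouched, so that genuine punctuation survives.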

What data are the parsers trained on?

Two of the 3 grammars (arabicFactored.ser.gz and arabicFactoredBuckwalter.ser.gz) are trained on the training data of the “Mona Diab” (a.k.a. “Johns Hopkins 2005 Workshop”) data splits of parts 1-3 of the Penn Arabic Treebank. The other grammar (atbP3FactoredBuckwalter.ser.gz or atb3FactoredBuckwalter.ser.gz) is trained on a decimation of the ATBp3 treebank data. (That is, heading sentence-by-sentence through the trees, you put 8 sentences in training, 1 in development, and then 1 in test, and then repeat.) This is the data split that has been used at UPenn (see S. Kulick et al., TLT 2006).
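
The decimation split described above can be sketched in a few lines. The function name `decimate` is hypothetical, and the sketch assumes the sentences are already in treebank order.

```python
def decimate(sentences):
    """Split sentences 8/1/1 into train, dev, and test, cycling every 10."""
    train, dev, test = [], [], []
    for i, sent in enumerate(sentences):
        slot = i % 10
        if slot < 8:
            train.append(sent)   # first 8 of each block of 10
        elif slot == 8:
            dev.append(sent)     # 9th goes to development
        else:
            test.append(sent)    # 10th goes to test
    return train, dev, test
```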

How well do the parsers work?

The table below shows the parser’s performance on the development test data sets, as defined above. Here, “factF1” is the Parseval F1 of Labeled Precision and Recall, and “factDA” is the dependency accuracy of the factored parser (based on untyped dependencies imputed from “head rules”). This is for sentences of 40 words or less, and discarding “sentences” (the bylines at the start of articles) that are just an “X” constituent. The performance of the UTF-8 and Buckwalter grammars is basically identical, because only the character encoding is different (and so it is not shown separately). Note that we do get value from the extra data in parts 1 and 2 of the ATB (more value than it first appears, because a decimation data split is always advantageous to a parser), and that dependency accuracy is relatively better than constituency accuracy (we regard this as evidence of inconsistent constituency annotation in the ATB).

| Grammar | factF1 | factDA | factEx | pcfgF1 | depDA | factTA | num |
|---|---|---|---|---|---|---|---|
| arabicFactored.ser.gz | 77.44 | 84.05 | 13.27 | 69.49 | 80.07 | 96.09 | 1567 |
| atb3FactoredBuckwalter.ser.gz | 75.76 | 83.08 | 14.41 | 68.09 | 77.75 | 95.91 | 951 |
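
As a reminder of what a labeled-bracket F1 measures, here is a simplified sketch. It treats each constituent as a hypothetical `(label, start, end)` tuple and ignores the finer conventions of a real evaluation (such as how punctuation is handled), so it is not the procedure that produced the numbers in the table. It returns a fraction, whereas the table reports percentages.

```python
from collections import Counter


def parseval_f1(gold, guess):
    """F1 of labeled precision and recall over (label, start, end) brackets."""
    gold_c, guess_c = Counter(gold), Counter(guess)
    # Multiset intersection: each gold bracket can be matched at most once.
    matched = sum((gold_c & guess_c).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(guess_c.values())
    recall = matched / sum(gold_c.values())
    return 2 * precision * recall / (precision + recall)
```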

Can you give me some examples of how to use the parser for Arabic?

Sure! These parsing examples are for the 3 test files supplied with the parser. They assume you are sitting in the root directory of the parser distribution. [GALE ROSETTA: The last illustrates the removal of the IBM “+” and “#” marks mentioned earlier.]

    $ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser arabicFactored.ser.gz arabic-onesent-utf8.txt
    Loading parser from serialized file arabicFactored.ser.gz ... done [14.3 sec].
    Parsing file: arabic-onesent-utf8.txt with 1 sentences.
    Parsing [sent. 1 len. 8]: و نشر العدل من خلال قضاء مستقل .
    (ROOT
      (S (CC و)
        (VP (VBD نشر)
          (NP (DTNN العدل))
          (PP (IN من)
            (NP (NN خلال)
              (NP (NN قضاء) (JJ مستقل)))))
        (PUNC .)))

    Parsed file: arabic-onesent-utf8.txt [1 sentences].
    Parsed 8 words in 1 sentences (5.15 wds/sec; 0.64 sents/sec).
    $ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser arabicFactoredBuckwalter.ser.gz arabic-onesent-buck.txt
    Loading parser from serialized file arabicFactoredBuckwalter.ser.gz ... done [9.4 sec].
    Parsing file: arabic-onesent-buck.txt with 1 sentences.
    Parsing [sent. 1 len. 8]: w n$r AlEdl mn xlAl qDA' mstql .
    (ROOT
      (S (CC w)
        (VP (VBD n$r)
          (NP (DTNN AlEdl))
          (PP (IN mn)
            (NP (NN xlAl)
              (NP (NN qDA') (JJ mstql)))))
        (PUNC .)))

    Parsed file: arabic-onesent-buck.txt [1 sentences].
    Parsed 8 words in 1 sentences (7.92 wds/sec; 0.99 sents/sec).
    $ cat arabic-onesent-ibm-utf8.txt
    و# نشر العدل من خلال قضاء مستقل .
    $ java -cp stanford-parser.jar -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -escaper edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper arabicFactored.ser.gz arabic-onesent-ibm-utf8.txt
    Loading parser from serialized file arabicFactored.ser.gz ... done [9.3 sec].
    Parsing file: arabic-onesent-ibm-utf8.txt with 1 sentences.
    Parsing [sent. 1 len. 8]: و نشر العدل من خلال قضاء مستقل .
    (ROOT
      (S (CC و)
        (VP (VBD نشر)
          (NP (DTNN العدل))
          (PP (IN من)
            (NP (NN خلال)
              (NP (NN قضاء) (JJ مستقل)))))
        (PUNC .)))

    Parsed file: arabic-onesent-ibm-utf8.txt [1 sentences].
    Parsed 8 words in 1 sentences (5.87 wds/sec; 0.73 sents/sec).
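
The invocations above all share one shape: JVM arguments, the parser class, any options, then the grammar and the input file. If you drive the parser from a script, that shape can be captured in a small helper. This is a sketch; `parser_command` is a hypothetical name, and the defaults simply mirror the examples above.

```python
PARSER_CLASS = "edu.stanford.nlp.parser.lexparser.LexicalizedParser"


def parser_command(grammar, text_file, options=(),
                   jar="stanford-parser.jar", heap="500m"):
    """Build the argument list for one command-line parser run."""
    return [
        "java", "-cp", jar, f"-mx{heap}",
        PARSER_CLASS,
        *options,            # options go before the grammar, as in the examples
        grammar, text_file,
    ]
```

Passing the resulting list to `subprocess.run(cmd, check=True)` from the root directory of the parser distribution would launch the parser. Options are passed through verbatim, for example `["-escaper", "edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper"]`.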

Can you get dependencies output from the Arabic parser?

You can ask for dependencies output, with the -outputFormat dependencies option. At present, there is no typed dependencies (grammatical relations) analysis available for Arabic, and so asking for typedDependencies will throw an UnsupportedOperationException. (Caution: With UTF-8 Arabic, the dependencies output may appear to be reversed, because dependencies are being displayed right-to-left (depending on the bidi support of your terminal program). But they are correct, really.)

Where does the Arabic-specific source code live?

Much of the Arabic-specific code, including the ArabicHeadFinder and the ArabicTreebankLanguagePack, is defined inside the edu.stanford.nlp.trees.international.arabic package. But parser-specific code and the top-level entry to Arabic language resources are found in the edu.stanford.nlp.parser.lexparser package. There, you will find the classes ArabicTreebankParserParams and ArabicUnknownWordSignatures.

For general questions, see also the Parser FAQ. Please send any other questions or feedback, or extensions and bugfixes, to parser-user@lists.stanford.edu or parser-support@lists.stanford.edu.