The cleanxml annotator removes XML tags from an input document, which
lets Stanford CoreNLP strip most XML from a document before processing
it. This functionality is implemented as a finite automaton. It works
well for typical XML, but complex constructions and CDATA sections are
not handled correctly. If you need full and correct handling of XML, run
your XML documents through an XML parser (such as the one included in
the Java standard library) and pass the appropriate text nodes to
Stanford CoreNLP.
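As a rough illustration of the parser-based route described above, the following sketch uses the XML parser bundled with Java to pull the text out of a document, including a CDATA section. The class and method names (XmlTextExtractor, extractText) are made up for this example, and taking all text under the root element is an assumption; a real document may need more selective node handling.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Stanford CoreNLP.
class XmlTextExtractor {

    /** Parses the XML string and returns the text of all text and CDATA nodes under the root. */
    static String extractText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        // getTextContent() concatenates the descendant text nodes, CDATA included.
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<xml>Stanford University is located in California. "
                   + "<![CDATA[It is a <great> university.]]></xml>";
        // The resulting plain text is what would be handed to Stanford CoreNLP.
        System.out.println(extractText(xml));
    }
}
```

The extracted string can then be processed without the cleanxml annotator, since no markup remains in it.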
The cleanxml annotator supports many complex processing options. For
example, you can choose to delete only some XML tags, to treat certain
XML tags as ending a sentence, or to treat them as marking the speaker
in a dialog. The cleanxml annotator should be placed after tokenize in
the processing order.
For example, if run with the annotators
annotators = tokenize, cleanxml, ssplit, pos, lemma, ner, parse, dcoref
and given the text
<xml>Stanford University is located in California. It is a great university.</xml>
Stanford CoreNLP generates essentially the same output as for the default
input.txt example. The only difference from the original output is in the CharacterOffsets values.
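The change in offsets can be illustrated with plain string arithmetic. This sketch assumes that the reported offsets are measured against the original input, tags included, so each token moves by the length of the markup that precedes it; it does not call Stanford CoreNLP, and the class name OffsetShift is invented for the example.

```java
// Hypothetical illustration, not part of Stanford CoreNLP.
class OffsetShift {

    /** Returns the character offset of the first occurrence of token in text. */
    static int offsetOf(String text, String token) {
        return text.indexOf(token);
    }

    public static void main(String[] args) {
        String plain = "Stanford University is located in California. It is a great university.";
        String tagged = "<xml>" + plain + "</xml>";

        // Same token, measured once in the plain text and once in the tagged text.
        int before = offsetOf(plain, "California");
        int after = offsetOf(tagged, "California");
        System.out.println("plain offset:  " + before);
        System.out.println("tagged offset: " + after);
        System.out.println("shift:         " + (after - before));
    }
}
```

Under that assumption the shift equals the length of the opening tag, here five characters for <xml>.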
Some of the supported options are documented below. At present, there
are others that can only be found in the source code.
Discard XML tag tokens that match this regular expression. For example, the default .* will discard all XML tags.
Treat tags that match this regular expression as the end of a sentence, or none if the expression is an empty string. For example, p will treat <p> as the end of a sentence.
If this is true, allow errors such as unclosed tags. Otherwise, such XML will cause an exception.
A regular expression that specifies which tags to treat as the reference date of a document.
A regular expression that specifies which tags to treat as the document id of a document.
A regular expression that specifies which tags to treat as the document type of a document.
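A configuration using the first three options might be set up as in the sketch below. The option names are not given in the text above; clean.xmltags, clean.sentenceendingtags and clean.allowflawedxml are the names assumed here, and they should be checked against the CleanXmlAnnotator source before being relied on. The class name CleanXmlConfig is invented for the example.

```java
import java.util.Properties;

// Hypothetical helper, not part of Stanford CoreNLP.
class CleanXmlConfig {

    /** Builds pipeline properties with cleanxml placed after tokenize. */
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("annotators",
                "tokenize, cleanxml, ssplit, pos, lemma, ner, parse, dcoref");
        // Assumed option names; verify them against the CoreNLP source.
        props.setProperty("clean.xmltags", ".*");            // discard all XML tags
        props.setProperty("clean.sentenceendingtags", "p");  // <p> ends a sentence
        props.setProperty("clean.allowflawedxml", "true");   // tolerate unclosed tags
        return props;
    }

    public static void main(String[] args) {
        // Print the settings so they can be inspected or copied into a properties file.
        build().list(System.out);
    }
}
```

With Stanford CoreNLP on the classpath, the resulting Properties object would typically be passed to the StanfordCoreNLP constructor.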