Data Conversion
This page describes how to convert between Stanza’s Document, the CoNLL-U format, and native Python objects. The first four examples represent exactly the same document.
Document to Python Object
A Document instance is returned after text is annotated by the Pipeline.
Here’s how we can convert this Document object to a native Python object:
import stanza
nlp = stanza.Pipeline('en', processors='tokenize,pos')
doc = nlp('Test sentence.') # doc is class Document
dicts = doc.to_dict() # dicts is List[List[Dict]], representing each token / word in each sentence in the document
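To make the List[List[Dict]] shape concrete, here is a minimal sketch that walks the structure in plain Python. It does not call Stanza: the literal is copied from the example dicts on this page, so it is an assumption that doc.to_dict() returns exactly these keys in your Stanza version, and words_with_tags is a hypothetical helper name.

```python
# Stand-in for the value returned by doc.to_dict(); copied from this page's
# example rather than produced by running a pipeline (assumption).
dicts = [[
    {'id': 1, 'text': 'Test', 'upos': 'NOUN', 'xpos': 'NN',
     'feats': 'Number=Sing', 'misc': 'start_char=0|end_char=4'},
    {'id': 2, 'text': 'sentence', 'upos': 'NOUN', 'xpos': 'NN',
     'feats': 'Number=Sing', 'misc': 'start_char=5|end_char=13'},
    {'id': 3, 'text': '.', 'upos': 'PUNCT', 'xpos': '.',
     'misc': 'start_char=13|end_char=14'},
]]


def words_with_tags(dicts):
    """Flatten List[List[Dict]] into (sentence_index, text, upos) tuples."""
    rows = []
    for sent_idx, sentence in enumerate(dicts):   # outer list: sentences
        for word in sentence:                     # inner list: words
            rows.append((sent_idx, word['text'], word.get('upos')))
    return rows


flat = words_with_tags(dicts)
for sent_idx, text, upos in flat:
    print(sent_idx, text, upos)
```

The outer list holds sentences and each inner list holds one dict per token / word, so a two-level loop is enough to reach every annotation.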
Python Object to Document
A Document can also be instantiated with a native Python object, and then passed to the Pipeline for further annotations.
The code below shows an example of converting a native Python object to a Document:
from stanza.models.common.doc import Document
dicts = [[{'id': 1, 'text': 'Test', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing', 'misc': 'start_char=0|end_char=4'}, {'id': 2, 'text': 'sentence', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing', 'misc': 'start_char=5|end_char=13'}, {'id': 3, 'text': '.', 'upos': 'PUNCT', 'xpos': '.', 'misc': 'start_char=13|end_char=14'}]] # dicts is List[List[Dict]], representing each token / word in each sentence in the document
doc = Document(dicts) # doc is class Document
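The misc values in the example above carry character offsets as a 'key=value|key=value' string. As a rough sketch of how such a string can be read back in plain Python (parse_misc and char_span are hypothetical helpers for illustration, not part of Stanza):

```python
def parse_misc(misc):
    """Split a 'key=value|key=value' string into a dict.

    An empty string or '_' is treated as "no entries".
    """
    fields = {}
    if not misc or misc == '_':
        return fields
    for item in misc.split('|'):
        key, _, value = item.partition('=')  # value is '' if '=' is absent
        fields[key] = value
    return fields


def char_span(word):
    """Return (start_char, end_char) for a word dict, read from its misc field."""
    misc = parse_misc(word.get('misc', ''))
    return int(misc['start_char']), int(misc['end_char'])


text = 'Test sentence.'
word = {'id': 2, 'text': 'sentence', 'upos': 'NOUN', 'xpos': 'NN',
        'feats': 'Number=Sing', 'misc': 'start_char=5|end_char=13'}

start, end = char_span(word)
recovered = text[start:end]  # slice of the raw text covered by this word
print(recovered)
```

The offsets index into the original text, so slicing with them should give back the word's surface form.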
CoNLL to Python Object
CoNLL-U is a widely used format for Universal Dependencies treebanks. Here is an example of converting CoNLL-U data to a native Python object with the help of Stanza:
from stanza.utils.conll import CoNLL
conll = [[['1', 'Test', '_', 'NOUN', 'NN', 'Number=Sing', '0', '_', '_', 'start_char=0|end_char=4'], ['2', 'sentence', '_', 'NOUN', 'NN', 'Number=Sing', '1', '_', '_', 'start_char=5|end_char=13'], ['3', '.', '_', 'PUNCT', '.', '_', '2', '_', '_', 'start_char=13|end_char=14']]] # conll is List[List[List]], representing each token / word in each sentence in the document
dicts = CoNLL.convert_conll(conll) # dicts is List[List[Dict]], representing each token / word in each sentence in the document
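Each inner list above follows the ten CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sketch below illustrates that column-to-key mapping in plain Python. It is not Stanza's implementation: row_to_dict is a hypothetical helper, the key names are taken from the dicts on this page, and it assumes simple integer ids (no multi-word token ranges such as 1-2).

```python
# CoNLL-U column order, using the dict key names seen on this page.
FIELDS = ['id', 'text', 'lemma', 'upos', 'xpos',
          'feats', 'head', 'deprel', 'deps', 'misc']


def row_to_dict(row):
    """Map one 10-column CoNLL-U row to a dict, skipping '_' (unspecified)."""
    word = {}
    for name, value in zip(FIELDS, row):
        if value == '_':
            continue
        # Assumption: ids and heads are plain integers in this sketch.
        word[name] = int(value) if name in ('id', 'head') else value
    return word


row = ['1', 'Test', '_', 'NOUN', 'NN', 'Number=Sing',
       '0', '_', '_', 'start_char=0|end_char=4']
word = row_to_dict(row)
print(word)
```

Whether the real converter keeps the head column or casts ids the same way may differ by Stanza version, so treat this only as a reading aid for the column layout.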
Python Object to CoNLL
It might sometimes be desirable to output or process data in the CoNLL-U format after text has been annotated by Stanza. Here is an example of converting a native Python object to the CoNLL-U format:
from stanza.utils.conll import CoNLL
dicts = [[{'id': 1, 'text': 'Test', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing', 'misc': 'start_char=0|end_char=4'}, {'id': 2, 'text': 'sentence', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing', 'misc': 'start_char=5|end_char=13'}, {'id': 3, 'text': '.', 'upos': 'PUNCT', 'xpos': '.', 'misc': 'start_char=13|end_char=14'}]] # dicts is List[List[Dict]], representing each token / word in each sentence in the document
conll = CoNLL.convert_dict(dicts) # conll is List[List[List]], representing each token / word in each sentence in the document
convert_dict is now marked as deprecated, as internally we use the Document object everywhere. If you have a use case where you need it, please let us know!
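If all you need is CoNLL-U text from dicts shaped like the ones above, the format is simple enough to write by hand: one tab-separated line of ten columns per word, with '_' for missing values. The following is a minimal sketch under that assumption; sentence_to_conllu is a hypothetical helper, not a Stanza function, and it ignores comment lines and multi-word tokens.

```python
# CoNLL-U column order, using the dict key names seen on this page.
FIELDS = ['id', 'text', 'lemma', 'upos', 'xpos',
          'feats', 'head', 'deprel', 'deps', 'misc']


def sentence_to_conllu(sentence):
    """Render one sentence (a list of word dicts) as CoNLL-U lines."""
    lines = []
    for word in sentence:
        # Missing keys become '_', the CoNLL-U marker for "unspecified".
        columns = [str(word.get(name, '_')) for name in FIELDS]
        lines.append('\t'.join(columns))
    return '\n'.join(lines)


sentence = [
    {'id': 1, 'text': 'Test', 'upos': 'NOUN', 'xpos': 'NN',
     'feats': 'Number=Sing', 'misc': 'start_char=0|end_char=4'},
    {'id': 2, 'text': 'sentence', 'upos': 'NOUN', 'xpos': 'NN',
     'feats': 'Number=Sing', 'misc': 'start_char=5|end_char=13'},
    {'id': 3, 'text': '.', 'upos': 'PUNCT', 'xpos': '.',
     'misc': 'start_char=13|end_char=14'},
]

conllu_text = sentence_to_conllu(sentence)
print(conllu_text)
```

In a full file, sentences are separated by a blank line.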
CoNLL to Document
New in v1.2.1
There is a mechanism for converting CoNLL files directly to a Stanza Document:
from stanza.utils.conll import CoNLL
doc = CoNLL.conll2doc("extern_data/ud2/ud-treebanks-v2.7/UD_Italian-ISDT/it_isdt-ud-train.conllu")
This can be used with a pipeline on pretokenized text to reprocess parts of the document. For example, this will reprocess the tags in the ISDT dataset. Note that nlp(doc) alters the doc object in place in this example.
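import stanza
from stanza.utils.conll import CoNLL
doc = CoNLL.conll2doc("extern_data/ud2/ud-treebanks-v2.7/UD_Italian-ISDT/it_isdt-ud-train.conllu")
nlp = stanza.Pipeline(lang='it', processors='tokenize,pos', tokenize_pretokenized=True)
doc = nlp(doc) # doc is modified in place; the return value is the same Document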
Document to CoNLL
New in v1.2.1
There is a corresponding mechanism for writing back the document:
CoNLL.write_doc2conll(doc, "output.conllu") # doc is the Document from the previous example
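A quick way to inspect a written file without Stanza is to read it back as rows. The sketch below assumes the standard CoNLL-U layout (comment lines starting with #, ten tab-separated columns per word, a blank line between sentences). read_conllu is a hypothetical helper, and the sample file is built by hand in a temporary directory rather than produced by write_doc2conll.

```python
import os
import tempfile


def read_conllu(path):
    """Read a CoNLL-U file into List[List[List[str]]] (sentences of rows)."""
    sentences, current = [], []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.rstrip('\n')
            if not line:                      # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
            elif not line.startswith('#'):    # skip comment lines
                current.append(line.split('\t'))
    if current:                               # file without trailing blank line
        sentences.append(current)
    return sentences


rows = [
    ['1', 'Test', '_', 'NOUN', 'NN', 'Number=Sing', '0', '_', '_', 'start_char=0|end_char=4'],
    ['2', 'sentence', '_', 'NOUN', 'NN', 'Number=Sing', '1', '_', '_', 'start_char=5|end_char=13'],
    ['3', '.', '_', 'PUNCT', '.', '_', '2', '_', '_', 'start_char=13|end_char=14'],
]
sample = '# text = Test sentence.\n' + '\n'.join('\t'.join(r) for r in rows) + '\n\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'output.conllu')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(sample)
    sentences = read_conllu(path)

print(len(sentences), 'sentence(s),', len(sentences[0]), 'rows')
```

The result has the same List[List[List]] shape as the conll value used earlier on this page.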