Data Conversion
This page describes how to seamlessly convert between Stanza’s Document, the CoNLL-U format, and native Python objects. We show four examples that represent exactly the same document.
Document to Python Object
A Document instance will be returned after text is annotated by the Pipeline.
Here’s how we can convert this Document object to a native Python object:
import stanza
nlp = stanza.Pipeline('en', processors='tokenize,pos')
doc = nlp('Test sentence.') # doc is class Document
dicts = doc.to_dict() # dicts is List[List[Dict]], representing each token / word in each sentence in the document
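To make the List[List[Dict]] shape concrete, here is a minimal sketch that walks such a structure without calling Stanza at all. The sample values are copied from the dictionaries shown on this page rather than produced by running a pipeline, and `tagged_words` is a hypothetical helper, not part of the Stanza API.

```python
# Sample shaped like the List[List[Dict]] described above: the outer list
# holds sentences, each inner list holds one dict per token / word.
# Values are copied from this page, not generated by a pipeline run.
dicts = [[
    {'id': 1, 'text': 'Test', 'upos': 'NOUN', 'xpos': 'NN'},
    {'id': 2, 'text': 'sentence', 'upos': 'NOUN', 'xpos': 'NN'},
    {'id': 3, 'text': '.', 'upos': 'PUNCT', 'xpos': '.'},
]]


def tagged_words(sentences):
    """Hypothetical helper: return one list of (text, upos) pairs per sentence."""
    return [[(word['text'], word.get('upos')) for word in sentence]
            for sentence in sentences]


pairs = tagged_words(dicts)
print(pairs)
```

Because the result of the conversion is plain lists and dicts, ordinary comprehensions like this are enough to inspect or filter the annotations.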
Python Object to Document
A Document can also be instantiated from a native Python object and then passed to the Pipeline for further annotation.
The code below shows an example of converting a native Python object to a Document:
from stanza.models.common.doc import Document
dicts = [[{'id': 1, 'text': 'Test', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing', 'misc': 'start_char=0|end_char=4'}, {'id': 2, 'text': 'sentence', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing', 'misc': 'start_char=5|end_char=13'}, {'id': 3, 'text': '.', 'upos': 'PUNCT', 'xpos': '.', 'misc': 'start_char=13|end_char=14'}]] # dicts is List[List[Dict]], representing each token / word in each sentence in the document
doc = Document(dicts) # doc is class Document
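The 'misc' values in the dictionaries above pack character offsets into a 'key=value|key=value' string. As a rough illustration of how those offsets relate to the original text, the sketch below parses one such string in plain Python. `parse_misc` is a hypothetical helper written for this example and is not a Stanza function; the input text 'Test sentence.' is the one used earlier on this page.

```python
def parse_misc(misc):
    """Hypothetical helper: split 'key=value|key=value' into a dict of strings."""
    if not misc or misc == '_':
        return {}
    # partition() returns (key, '=', value); [::2] keeps key and value.
    return dict(item.partition('=')[::2] for item in misc.split('|'))


text = 'Test sentence.'
word = {'id': 2, 'text': 'sentence', 'misc': 'start_char=5|end_char=13'}

fields = parse_misc(word['misc'])
start, end = int(fields['start_char']), int(fields['end_char'])

# The offsets index into the raw input text, end-exclusive.
print(text[start:end])
```

When building dictionaries by hand, a check like this can help confirm that the offsets you supply actually match the token text.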
CoNLL to Python Object
CoNLL-U is a widely used format for Universal Dependencies data. Here is an example of converting CoNLL-U data to a native Python object with the help of Stanza:
from stanza.utils.conll import CoNLL
conll = [[['1', 'Test', '_', 'NOUN', 'NN', 'Number=Sing', '0', '_', '_', 'start_char=0|end_char=4'], ['2', 'sentence', '_', 'NOUN', 'NN', 'Number=Sing', '1', '_', '_', 'start_char=5|end_char=13'], ['3', '.', '_', 'PUNCT', '.', '_', '2', '_', '_', 'start_char=13|end_char=14']]] # conll is List[List[List]], representing each token / word in each sentence in the document
dicts = CoNLL.convert_conll(conll) # dicts is List[List[Dict]], representing each token / word in each sentence in the document
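Conceptually, this conversion maps the ten CoNLL-U columns of each row onto dictionary keys. The sketch below illustrates that idea in plain Python; it is not Stanza's implementation. `rows_to_dicts` and `FIELDS` are hypothetical names, the key names follow the dictionaries shown on this page, and unlike a real converter this version keeps every value as a string and simply drops fields that hold the '_' placeholder.

```python
# Column order of a CoNLL-U row; 'text' is used for the FORM column to
# match the dictionaries on this page. These names are assumptions made
# for this sketch.
FIELDS = ['id', 'text', 'lemma', 'upos', 'xpos',
          'feats', 'head', 'deprel', 'deps', 'misc']


def rows_to_dicts(sentences):
    """Hypothetical helper: turn List[List[List]] rows into List[List[Dict]].

    Fields equal to '_' (the CoNLL-U "unspecified" marker) are omitted.
    A token whose text is literally '_' would need special handling.
    """
    return [[{key: value for key, value in zip(FIELDS, row) if value != '_'}
             for row in sentence]
            for sentence in sentences]


conll = [[
    ['1', 'Test', '_', 'NOUN', 'NN', 'Number=Sing', '0', '_', '_', 'start_char=0|end_char=4'],
    ['3', '.', '_', 'PUNCT', '.', '_', '2', '_', '_', 'start_char=13|end_char=14'],
]]

converted = rows_to_dicts(conll)
print(converted[0][0])
```

The output of the actual Stanza conversion may differ in value types and in which keys are present, so treat this only as a picture of the column-to-key mapping.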
Python Object to CoNLL
It might sometimes be desirable to output or process data in the CoNLL-U format after text has been annotated by Stanza. Here is an example of converting a native Python object to the CoNLL-U format:
from stanza.utils.conll import CoNLL
dicts = [[{'id': '1', 'text': 'Test', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing', 'misc': 'start_char=0|end_char=4'}, {'id': '2', 'text': 'sentence', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Number=Sing', 'misc': 'start_char=5|end_char=13'}, {'id': '3', 'text': '.', 'upos': 'PUNCT', 'xpos': '.', 'misc': 'start_char=13|end_char=14'}]] # dicts is List[List[Dict]], representing each token / word in each sentence in the document
conll = CoNLL.convert_dict(dicts) # conll is List[List[List]], representing each token / word in each sentence in the document
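A List[List[List]] value like the one above is still a Python object, not CoNLL-U text. As a minimal sketch of the last step, the code below joins the fields of each row with tabs, puts one row per line, and ends each sentence with a blank line, which is how CoNLL-U files are laid out. `rows_to_text` is a hypothetical helper for illustration, not a Stanza function, and the rows are copied from the CoNLL example on this page.

```python
def rows_to_text(sentences):
    """Hypothetical helper: serialize List[List[List]] rows as CoNLL-U text.

    Fields are tab-separated, each token is on its own line, and every
    sentence is followed by a blank line.
    """
    blocks = ['\n'.join('\t'.join(row) for row in sentence)
              for sentence in sentences]
    return '\n\n'.join(blocks) + '\n\n'


conll = [[
    ['1', 'Test', '_', 'NOUN', 'NN', 'Number=Sing', '0', '_', '_', 'start_char=0|end_char=4'],
    ['2', 'sentence', '_', 'NOUN', 'NN', 'Number=Sing', '1', '_', '_', 'start_char=5|end_char=13'],
    ['3', '.', '_', 'PUNCT', '.', '_', '2', '_', '_', 'start_char=13|end_char=14'],
]]

conll_text = rows_to_text(conll)
print(conll_text, end='')
```

The resulting string can be written to a file with the usual `open(..., 'w', encoding='utf-8')` pattern.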