Biomedical & Clinical Model Performance

Here we report the performance of Stanza’s biomedical and clinical models, including the syntactic analysis pipelines and the NER models. For more detailed evaluation and analysis, please see our biomedical models description paper.

Syntactic Analysis Performance

In the table below you can find the performance of Stanza’s biomedical syntactic analysis pipelines. All models are evaluated on the test split of the corresponding datasets. Note that all scores reported are from an end-to-end evaluation on the official test sets (from raw text to the full CoNLL-U file with syntactic annotations), and are generated with the CoNLL 2018 UD shared task official evaluation script.

Note that while the results on the CRAFT and GENIA treebanks are based on human-annotated oracle data, the results on the MIMIC treebank are based on automatically generated silver data, and therefore should not be treated as a rigorous oracle-based evaluation.

| Category | Treebank | Package Name | Tokens | Sentences | UPOS | XPOS | Lemmas | UAS | LAS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
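As a concrete illustration of what the end-to-end evaluation measures, the sketch below scores a toy CoNLL-U prediction against a gold annotation by computing UAS (correct heads) and LAS (correct heads and dependency labels). This is a simplified stand-in for the official CoNLL 2018 shared task script, which additionally aligns system and gold tokenizations; the sentences and helper names here are invented for illustration.

```python
# Simplified UAS/LAS scoring over a toy CoNLL-U pair.
# The official CoNLL 2018 script also aligns mismatched tokenizations;
# here we assume identical tokenization for brevity.

def parse_conllu(text):
    """Return a list of (form, head, deprel) tuples from word lines."""
    rows = []
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip multiword ranges and empty nodes
            continue
        rows.append((cols[1], int(cols[6]), cols[7]))
    return rows

def uas_las(gold, pred):
    """Compute (UAS, LAS) assuming identical tokenization."""
    gold_rows, pred_rows = parse_conllu(gold), parse_conllu(pred)
    assert len(gold_rows) == len(pred_rows), "tokenization must match"
    total = len(gold_rows)
    uas = sum(g[1] == p[1] for g, p in zip(gold_rows, pred_rows)) / total
    las = sum(g[1:] == p[1:] for g, p in zip(gold_rows, pred_rows)) / total
    return uas, las

# Toy gold/predicted annotations (invented example sentence).
GOLD = """\
1\tMice\t_\t_\t_\t_\t2\tnsubj\t_\t_
2\tdevelop\t_\t_\t_\t_\t0\troot\t_\t_
3\ttumors\t_\t_\t_\t_\t2\tobj\t_\t_
"""
PRED = """\
1\tMice\t_\t_\t_\t_\t2\tnsubj\t_\t_
2\tdevelop\t_\t_\t_\t_\t0\troot\t_\t_
3\ttumors\t_\t_\t_\t_\t2\tnmod\t_\t_
"""

if __name__ == "__main__":
    # All three heads are correct, but one dependency label is wrong.
    print(uas_las(GOLD, PRED))
```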

NER Performance

In the table below you can find the performance of Stanza’s biomedical and clinical NER models, compared with the BioBERT and scispaCy models. All numbers reported are micro-averaged F1 scores. We used canonical train/dev/test splits for all datasets, whenever such splits exist.

Results for BioBERT are from their v1.1 models as reported in the BioBERT paper; results for scispaCy are from the medium sized models as reported in the scispaCy paper.

| Category | Dataset | Domain & Types | Stanza | BioBERT | scispaCy |
| --- | --- | --- | --- | --- | --- |
| Bio | BC5CDR | Chemical, Disease | 88.08 | - | 83.92 |
| | BioNLP13CG | 16 types in Cancer Genetics | 84.34 | - | 77.60 |
| | JNLPBA | Protein, DNA, RNA, Cell line, Cell type | 76.09 | 77.49 | 73.21 |
| Clinical | i2b2 | Problem, Test, Treatment | 88.13 | 86.73 | - |
| | Radiology | 5 types in Radiology | 84.80 | - | - |
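The micro-averaged F1 scores above pool true positives, false positives, and false negatives across all entity types before computing precision and recall, so more frequent types carry more weight. A minimal sketch, using invented per-type counts:

```python
# Micro-averaged F1: pool TP/FP/FN counts across entity types,
# then compute precision, recall, and F1 once over the pooled counts.

def micro_f1(counts):
    """counts: dict mapping entity type -> (tp, fp, fn)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for a two-type dataset (not real evaluation data).
counts = {
    "Chemical": (90, 10, 10),
    "Disease":  (80, 20, 20),
}

if __name__ == "__main__":
    print(round(micro_f1(counts), 4))
```

With these counts, both pooled precision and recall are 170/200 = 0.85, so the micro F1 is 0.85; a macro average would instead average the per-type F1 scores with equal weight.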