Biomedical & Clinical Model Performance
Here we report the performance of Stanza’s biomedical and clinical models, including the syntactic analysis pipelines and the NER models. For more detailed evaluation and analysis, please see our biomedical models description paper.
In the table below you can find the performance of Stanza’s biomedical syntactic analysis pipelines. All models are evaluated on the test split of the corresponding datasets. Note that all scores reported are from an end-to-end evaluation on the official test sets (from raw text to the full CoNLL-U file with syntactic annotations), and are generated with the CoNLL 2018 UD shared task official evaluation script.
Note that while the results on the CRAFT and GENIA treebanks are based on human-annotated oracle data, the results on the MIMIC treebank are based on automatically generated silver data, and therefore should not be treated as a rigorous oracle-based evaluation.
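To make the syntactic evaluation concrete, below is a minimal sketch of how attachment scores (UAS/LAS) are computed per token. This is an illustrative simplification: the official CoNLL 2018 shared task script additionally aligns system tokens with gold tokens and reports F1, since end-to-end tokenization may differ from the gold segmentation; the sketch assumes identical tokenization.

```python
def attachment_scores(gold, pred):
    """Compute UAS and LAS over aligned tokens.

    gold, pred: lists of (head_index, deprel) pairs, one per token.
    UAS counts correct heads; LAS counts correct head + relation.
    Assumes identical tokenization (unlike the official aligner).
    """
    assert len(gold) == len(pred) and gold, "token lists must align and be non-empty"
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return uas / n, las / n

# Hypothetical 3-token sentence: one dependency label is wrong,
# but every head is right, so UAS = 3/3 and LAS = 2/3.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
uas, las = attachment_scores(gold, pred)
```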
In the table below you can find the performance of Stanza’s biomedical and clinical NER models, and their comparisons to the BioBERT models and scispaCy models. All numbers reported are micro-averaged F1 scores. We used canonical train/dev/test splits for all datasets, whenever such splits exist.
| Category | Dataset | Domain & Types | Stanza | BioBERT | scispaCy |
| --- | --- | --- | --- | --- | --- |
| Bio | BioNLP13CG | 16 types in Cancer Genetics | 84.34 | – | 77.60 |
| Bio | JNLPBA | Protein, DNA, RNA, Cell line, Cell type | 76.09 | 77.49 | 73.21 |
| Clinical | i2b2 | Problem, Test, Treatment | 88.13 | 86.73 | – |
| Clinical | Radiology | 5 types in Radiology | 84.80 | – | – |
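For reference, micro-averaged F1 pools true positives, false positives, and false negatives across all entity types before computing a single F1, so frequent types weigh more heavily than rare ones. The sketch below uses exact span matching on hypothetical (start, end, type) tuples; it illustrates the metric, not the exact scoring script used for these numbers.

```python
def micro_f1(gold_entities, pred_entities):
    """Micro-averaged F1 over entity spans with exact-match scoring.

    Entities are (start, end, type) tuples. TP/FP/FN are pooled
    across all entity types before computing precision/recall/F1.
    """
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)          # predicted spans that exactly match gold
    fp = len(pred - gold)          # predicted spans with no gold match
    fn = len(gold - pred)          # gold spans the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: one span has the wrong type, which counts as
# both a false positive and a false negative under exact matching.
gold = [(0, 2, "PROBLEM"), (5, 7, "TEST"), (9, 10, "TREATMENT")]
pred = [(0, 2, "PROBLEM"), (5, 7, "TREATMENT"), (9, 10, "TREATMENT")]
score = micro_f1(gold, pred)
```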