Available Biomedical & Clinical Models

At a high level, Stanza currently provides packages that support Universal Dependencies (UD)-compatible syntactic analysis and named entity recognition (NER) for both English biomedical literature and English clinical notes. Officially offered packages include:

  • 2 UD-compatible biomedical syntactic analysis pipelines, trained with human-annotated treebanks;
  • 1 UD-compatible clinical syntactic analysis pipeline, trained with silver data;
  • 8 accurate biomedical NER models augmented with contextualized representations;
  • 2 clinical NER models, including one specialized in radiology reports.

All our syntactic analysis pipelines are compatible with the Universal Dependencies v2 framework. Here we briefly introduce our models and their usage; for full details on the creation of these models and their evaluation, please check out our biomedical models description paper.

⚒️  For a quick introduction on how to download and use these models, please visit the Biomedical Models Download & Usage page.

Biomedical & Clinical Syntactic Analysis Pipelines

The following table lists the syntactic analysis pipelines currently offered in Stanza. You can find more information about the POS tags, morphological features, and syntactic relations used on the Universal Dependencies website.

Table Notes

  1. Package Name: this column lists the “keyword” that’s needed to download this model and construct the pipeline. See usage for more details.
  2. New Tokenization: this column lists whether the pipeline is compatible with the new UD-friendly tokenization. The most notable difference is that the new tokenization splits hyphenated words into multiple tokens (e.g., up-regulation becomes three tokens: up, -, regulation), whereas the old tokenization keeps hyphenated words whole.
  3. Source Corpora: this column describes the genres and domains of the text in each model's training corpora.

| Category | Treebank | Package Name | New Tokenization | Source Corpora | Treebank Doc |
| --- | --- | --- | --- | --- | --- |
| Bio | CRAFT | craft | Yes | Full-text biomedical articles related to the Mouse Genome Informatics database; general English Web Treebank. | CRAFT homepage |
| Bio | GENIA | genia | No | PubMed abstracts related to “human”, “blood cells”, and “transcription factors”. | GENIA homepage |
| Clinical | MIMIC | mimic | Yes | All types of MIMIC-III clinical notes; general English Web Treebank. | MIMIC-III homepage |
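
As a minimal sketch of how the Package Name and New Tokenization columns are used, the snippet below records the three packages from the table and builds a pipeline from a package name with `stanza.download` and `stanza.Pipeline`. The helper names (`SYNTACTIC_PACKAGES`, `toy_hyphen_split`, `build_syntactic_pipeline`, `print_parse`) are hypothetical and exist only for illustration. `toy_hyphen_split` is not Stanza's tokenizer; it only mimics the hyphen behavior described in the table notes.

```python
import re

# Package names and tokenization flags copied from the table above.
SYNTACTIC_PACKAGES = {
    "craft": {"category": "Bio", "new_tokenization": True},
    "genia": {"category": "Bio", "new_tokenization": False},
    "mimic": {"category": "Clinical", "new_tokenization": True},
}


def toy_hyphen_split(text, new_tokenization):
    """Toy illustration of the two tokenization styles (NOT Stanza's tokenizer).

    Splits on whitespace; for the new tokenization it also makes each
    hyphen a separate token.
    """
    tokens = []
    for word in text.split():
        if new_tokenization:
            # The capturing group keeps the hyphen as its own token.
            tokens.extend(t for t in re.split(r"(-)", word) if t)
        else:
            tokens.append(word)
    return tokens


def build_syntactic_pipeline(package):
    """Download (if needed) and construct the pipeline for one table entry."""
    if package not in SYNTACTIC_PACKAGES:
        raise ValueError(f"unknown syntactic package: {package!r}")
    import stanza  # imported lazily so the helpers above work without it

    stanza.download("en", package=package)
    return stanza.Pipeline("en", package=package)


def print_parse(text, package="craft"):
    """Print UPOS tag, head index, and dependency relation for each word."""
    nlp = build_syntactic_pipeline(package)
    doc = nlp(text)
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.upos, word.head, word.deprel)
```

Calling `print_parse("We observed up-regulation of the gene.")` downloads the craft package on first use and requires network access; the toy splitter runs without Stanza installed.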

Biomedical & Clinical NER Models

The following table lists all biomedical and clinical NER models supported by Stanza, pretrained on the corresponding NER datasets.

| Category | Corpus | Package Name | Supported Entity Types |
| --- | --- | --- | --- |
| Bio | AnatEM | anatem | ANATOMY |
| Bio | BC5CDR | bc5cdr | CHEMICAL, DISEASE |
| Bio | BC4CHEMD | bc4chemd | CHEMICAL |
| Bio | BioNLP13CG | bionlp13cg | 16 types in Cancer Genetics (* see below for a full list) |
| Bio | JNLPBA | jnlpba | PROTEIN, DNA, RNA, CELL_LINE, CELL_TYPE |
| Bio | Linnaeus | linnaeus | SPECIES |
| Bio | NCBI-Disease | ncbi_disease | DISEASE |
| Bio | S800 | s800 | SPECIES |
| Clinical | i2b2-2010 | i2b2 | PROBLEM, TEST, TREATMENT |
| Clinical | Radiology | radiology | ANATOMY, OBSERVATION, ANATOMY_MODIFIER, OBSERVATION_MODIFIER, UNCERTAINTY |

  • The 16 entity types in the BioNLP13CG model include: AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE.

In addition to being trained on these datasets, all NER models in Stanza are augmented with pretrained character-level language models for improved accuracy. For the bio NER models, the language models are pretrained on publicly available PubMed abstracts; for the clinical NER models, they are pretrained on clinical notes from the publicly available MIMIC-III database.
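
The NER package names in the table above can be selected through the `processors` argument of `stanza.download` and `stanza.Pipeline`. The sketch below is illustrative only: the helper names (`NER_MODELS`, `models_for_entity`, `pipeline_config`, `extract_entities`) are hypothetical, and pairing bio NER models with the craft syntactic package and clinical NER models with mimic is an assumption made for this example, not a requirement stated here.

```python
# Categories and entity types copied from the NER table above.
NER_MODELS = {
    "anatem": ("Bio", ["ANATOMY"]),
    "bc5cdr": ("Bio", ["CHEMICAL", "DISEASE"]),
    "bc4chemd": ("Bio", ["CHEMICAL"]),
    "bionlp13cg": ("Bio", [
        "AMINO_ACID", "ANATOMICAL_SYSTEM", "CANCER", "CELL",
        "CELLULAR_COMPONENT", "DEVELOPING_ANATOMICAL_STRUCTURE",
        "GENE_OR_GENE_PRODUCT", "IMMATERIAL_ANATOMICAL_ENTITY",
        "MULTI-TISSUE_STRUCTURE", "ORGAN", "ORGANISM",
        "ORGANISM_SUBDIVISION", "ORGANISM_SUBSTANCE",
        "PATHOLOGICAL_FORMATION", "SIMPLE_CHEMICAL", "TISSUE",
    ]),
    "jnlpba": ("Bio", ["PROTEIN", "DNA", "RNA", "CELL_LINE", "CELL_TYPE"]),
    "linnaeus": ("Bio", ["SPECIES"]),
    "ncbi_disease": ("Bio", ["DISEASE"]),
    "s800": ("Bio", ["SPECIES"]),
    "i2b2": ("Clinical", ["PROBLEM", "TEST", "TREATMENT"]),
    "radiology": ("Clinical", [
        "ANATOMY", "OBSERVATION", "ANATOMY_MODIFIER",
        "OBSERVATION_MODIFIER", "UNCERTAINTY",
    ]),
}

# Assumption for this sketch: which syntactic package to pair with each category.
DEFAULT_SYNTACTIC_PACKAGE = {"Bio": "craft", "Clinical": "mimic"}


def models_for_entity(entity_type):
    """Return the sorted package names whose models tag the given entity type."""
    return sorted(
        name for name, (_, types) in NER_MODELS.items() if entity_type in types
    )


def pipeline_config(ner_package):
    """Build keyword arguments shared by stanza.download and stanza.Pipeline."""
    category, _ = NER_MODELS[ner_package]
    return {
        "lang": "en",
        "package": DEFAULT_SYNTACTIC_PACKAGE[category],
        "processors": {"ner": ner_package},
    }


def extract_entities(text, ner_package):
    """Run the chosen NER model and return (entity text, entity type) pairs."""
    import stanza  # imported lazily so the lookups above work without it

    config = pipeline_config(ner_package)
    stanza.download(**config)
    nlp = stanza.Pipeline(**config)
    return [(ent.text, ent.type) for ent in nlp(text).entities]
```

For example, `models_for_entity("DISEASE")` lists the packages that tag diseases, and `extract_entities(note_text, "i2b2")` would tag PROBLEM, TEST, and TREATMENT spans in a clinical note (`note_text` being any string you supply); the latter downloads models on first use.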