NER Models

System Performance on NER Corpora

In the table below you can find the performance of Stanza’s pretrained NER models. All numbers reported are micro-averaged F1 scores. We used canonical train/dev/test splits for all datasets except for the WikiNER datasets, for which we used random splits.

The Ukrainian model and its score were provided by gawy, the Armenian model by ShakeHakobyan, and the Polish model by Karol Saputa.

Language             lcode  Corpus           Types  F1     Since
Afrikaans            af     NCHLT            4      80.08
Armenian             hy     ARMTDP           18     87.96  1.5.0
Arabic               ar     AQMAR            4      74.3
Bulgarian            bg     BSNLP 2019       5      83.21  1.2.1
Chinese              zh     OntoNotes        18     79.2
Danish               da     DDT              4      80.95  1.4.0
Dutch                nl     CoNLL02          4      89.2
Dutch                nl     WikiNER          4      94.8
English              en     CoNLL03          4      92.1
English              en     OntoNotes        18     88.8
Finnish              fi     Turku            6      87.04  1.2.1
French               fr     WikiNER          4      92.9
German               de     CoNLL03          4      81.9
German               de     GermEval2014     4      85.2
Hungarian            hu     Combined         4      -      1.2.1
Italian              it     FBK              3      87.92  1.2.3
Japanese             ja     GSD              22     81.01  1.4.0
Kazakh               kk     KazNERD          25     94.94  1.4.1
Marathi              mr     L3Cube           6      84.19  1.4.1
Myanmar              my     UCSY             7      95.86  1.4.0
Norwegian-Bokmaal    nb     Norne            8      84.79  1.4.0
Norwegian-Nynorsk    nn     Norne            8      80.16  1.4.0
Persian              fa     Arman            6      80.07  1.4.0
Polish               pl     NKJP             6      88.73  1.4.1
Russian              ru     WikiNER          4      92.9
Sindhi               sd     SiNER            11     84.74  1.5.0
Spanish              es     CoNLL02          4      88.1
Spanish              es     AnCora           4      88.6
Swedish              sv     SUC3 (shuffled)  8      85.66  1.4.0
Swedish              sv     SUC3 (licensed)  8      82.54  1.4.0
Thai                 th     LST20            10     79.65  1.4.1
Turkish              tr     Starlang         5      81.65  1.4.0
Ukrainian            uk     languk           4      86.05
Vietnamese           vi     VLSP             4      82.44  1.2.1

Notes on NER Corpora

We have provided links to all NER datasets used to train the released models on our available NER models page. Below we provide notes on the tag categories used by several of these corpora.

Tag category notes

  • For packages with 4 named entity types, supported types include PER (Person), LOC (Location), ORG (Organization) and MISC (Miscellaneous)
    • The Vietnamese VLSP model spells out the entire tag, though: PERSON, LOCATION, ORGANIZATION, MISCELLANEOUS.
  • For packages with 18 named entity types, supported types include PERSON, NORP (Nationalities/religious/political group), FAC (Facility), ORG (Organization), GPE (Countries/cities/states), LOC (Location), PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL (details can be found on page 21 of the OntoNotes documentation).
  • The BSNLP19 dataset(s) use EVENT, LOCATION, ORGANIZATION, PERSON, PRODUCT.
  • The Italian FBK dataset uses LOC, ORG, PER
  • The Marathi L3Cube dataset uses ED (Designation), NED (Date), NEL (Location), NEM (Measure), NEO (Organization), NEP (Person), NETI (Time)
  • The Myanmar UCSY dataset uses LOC (Location), NE (Misc), ORG (Organization), PNAME (Person), RACE, TIME, NUM
  • The Japanese GSD dataset uses 22 tags: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, MOVEMENT, NORP, ORDINAL, ORG, PERCENT, PERSON, PET_NAME, PHONE, PRODUCT, QUANTITY, TIME, TITLE_AFFIX, WORK_OF_ART
  • The Kazakh KazNERD dataset uses 25 tags: ADAGE, ART, CARDINAL, CONTACT, DATE, DISEASE, EVENT, FACILITY, GPE, LANGUAGE, LAW, LOCATION, MISCELLANEOUS, MONEY, NON_HUMAN, NORP, ORDINAL, ORGANISATION, PERCENTAGE, PERSON, POSITION, PRODUCT, PROJECT, QUANTITY, TIME
  • The Norwegian Norne dataset uses 8 tags for both NB and NN: DRV, EVT, GPE, LOC, MISC, ORG, PER, PROD
  • The Persian Arman dataset uses 6 tags: event, fac, loc, org, pers, pro
  • The Polish NKJP dataset uses 6 tags: date, geogName, orgName, persName, placeName, time
  • The Sindhi SiNER dataset uses 11 tags: ART, EVENT, FAC, GPE, LANGUAGE, LOC, NORP, ORG, OTHERS, PERSON, TITLE
  • The Thai LST20 dataset uses 10 tags: Person (PER), Title (TTL), Designator (DES), Organization (ORG), Location (LOC), Brand (BRN), Date and time (DTM), Measurement unit (MEA), Number (NUM), and Terminology (TRM)
  • The Turkish Starlang dataset uses 5 tags: LOCATION, MONEY, ORGANIZATION, PERSON, TIME