Language Identification
Table of contents
- Overview
- Get The Language ID Model And Resources For English and French
- Basic Language ID Example
- Use A Custom Model
- Apply Text Cleaning To Tweets
- Restricting Language Predictions To A Subset Of Languages
- Basic Multilingual Pipeline Example
- Configure Multilingual Pipeline
- Set Multilingual Pipeline Cache Size
- Training Your Own Model
- Evaluate A Model
Overview
With Stanza, you can detect the language of a text and route texts in different languages to different language-specific pipelines. The currently distributed model is a character-level Bi-LSTM trained on text snippets from the UD 2.5 dataset. The model works on a variety of text types, including short text snippets (10 characters), sentences, tweets, and paragraphs.
Currently the model detects the following languages:
af ar be bg bxr ca cop cs cu da de el en es et eu fa fi fr fro ga gd gl got grc he hi hr hsb hu hy id it ja kk kmr ko la lt lv lzh mr mt nl nn no olo orv pl pt ro ru sk sl sme sr sv swl ta te tr ug uk ur vi wo zh-hans zh-hant
Get The Language ID Model And Resources For English and French
import stanza
stanza.download(lang="multilingual")
stanza.download(lang="en")
stanza.download(lang="fr")
Basic Language ID Example
With the langid processor, one can identify the language of text. The detected language will be stored in the lang field of the Document.
from stanza.models.common.doc import Document
from stanza.pipeline.core import Pipeline
nlp = Pipeline(lang="multilingual", processors="langid")
docs = ["Hello world.", "Bonjour le monde!"]
docs = [Document([], text=text) for text in docs]
nlp(docs)
print("\n".join(f"{doc.text}\t{doc.lang}" for doc in docs))
Use A Custom Model
nlp = Pipeline(lang="multilingual", processors="langid", langid_model_path="/path/to/model.pt")
Apply Text Cleaning To Tweets
If you are running on tweets, it helps to clean the text before submitting it to the model. Text cleaning removes shortened URLs, hashtags, user handles, and emojis. It is turned off by default.
nlp = Pipeline(lang="multilingual", processors="langid", langid_clean_text=True)
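To give a rough idea of what this kind of cleaning involves, here is a minimal standalone sketch. The regular expressions and the clean_tweet helper are hypothetical and written only for illustration; Stanza's own cleaning rules may differ.

```python
import re

# Hypothetical patterns for illustration; Stanza's own cleaning rules may differ.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
HANDLE_RE = re.compile(r"@\w+")
# A rough emoji range covering common pictographs and symbols (not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_tweet(text):
    """Remove URLs, hashtags, user handles, and emojis, then collapse whitespace."""
    for pattern in (URL_RE, HASHTAG_RE, HANDLE_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(clean_tweet("Bonjour le monde! #ilovefrance @ami https://t.co/abc123"))
```

After cleaning, only the natural-language part of the tweet is left for the classifier to look at.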
Restricting Language Predictions To A Subset Of Languages
In some scenarios you may know that the text can only be in one of a small subset of languages. The language id module can be configured to predict only from this subset. This example demonstrates restricting predictions to English or French.
nlp = Pipeline(lang="multilingual", processors="langid", langid_lang_subset=["en","fr"])
If you are using the MultilingualPipeline, you can set this by adding langid_lang_subset to the lang_id_config:
from stanza.pipeline.multilingual import MultilingualPipeline
lang_id_config = {"langid_lang_subset": ["ar", "hi"]}
nlp = MultilingualPipeline(lang_id_config=lang_id_config)
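Conceptually, restricting predictions means ignoring the scores of every language outside the subset and picking the best of the rest. The sketch below illustrates that idea with plain Python; the predict_in_subset function and the score values are hypothetical and are not part of Stanza.

```python
from typing import Dict, List, Optional


def predict_in_subset(scores: Dict[str, float], lang_subset: Optional[List[str]] = None) -> str:
    """Return the highest-scoring language, optionally limited to lang_subset."""
    if lang_subset:
        # Keep only the languages the caller allows.
        candidates = {lang: s for lang, s in scores.items() if lang in lang_subset}
    else:
        candidates = scores
    if not candidates:
        raise ValueError("no candidate languages left after applying the subset")
    return max(candidates, key=candidates.get)


# Made-up scores for a short, ambiguous snippet.
scores = {"en": 0.40, "fr": 0.15, "nl": 0.45}
print(predict_in_subset(scores))                # unrestricted: nl
print(predict_in_subset(scores, ["en", "fr"]))  # restricted: en
```

Restricting the subset is most useful for short or ambiguous text, where a closely related language might otherwise score slightly higher.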
Basic Multilingual Pipeline Example
A MultilingualPipeline will detect the language of text and run the appropriate language-specific Stanza pipeline on the text. The MultilingualPipeline will maintain a cache of pipelines for each language. This example demonstrates handling some English and French text. Each example is classified as English or French, and then an appropriate English or French pipeline is run on the text.
from stanza.pipeline.multilingual import MultilingualPipeline
nlp = MultilingualPipeline()
docs = ["Hello world!", "C'est une phrase française.", "This is an English sentence."]
docs = nlp(docs)
for doc in docs:
    print("---")
    print(f"text: {doc.text}")
    print(f"lang: {doc.lang}")
    print(f"{doc.sentences[0].dependencies_string()}")
Option name | Type | Default | Description |
---|---|---|---|
model_dir | str | DEFAULT_MODEL_DIR | Where language id and language specific resources are stored. |
lang_id_config | dict | None | Configurations to use for language identification. |
lang_configs | dict | None | Mapping of language name -> pipeline configurations for that language. |
ld_batch_size | int | 64 | Batch size to use for language identification. |
max_cache_size | int | 10 | Max number of pipelines to cache. |
Configure Multilingual Pipeline
You can configure the language identification system and each language-specific pipeline in the MultilingualPipeline. When the MultilingualPipeline is constructed, it can be given a dictionary with one entry per language, where each entry is a dictionary with that language’s settings. The langid processor itself can also be configured with a separate dictionary of langid settings.
This example demonstrates activating the text cleaning for the language identification module and setting the cached English pipeline’s NER model.
from stanza.pipeline.multilingual import MultilingualPipeline
lang_id_config = {"langid_clean_text": True}
lang_configs = {"en": {"processors": {"ner": "conll03"}}}
nlp = MultilingualPipeline(lang_id_config=lang_id_config, lang_configs=lang_configs)
docs = ["Hello world.", "Bonjour le monde! #thisisfrench #ilovefrance"]
docs = nlp(docs)
for doc in docs:
    print("---")
    print(f"text: {doc.text}")
    print(f"lang: {doc.lang}")
    print(f"{doc.sentences[0].dependencies_string()}")
Set Multilingual Pipeline Cache Size
A MultilingualPipeline keeps a cache of pipelines for each language. The maximum size of the cache can be configured at pipeline construction.
When a document is processed, its language is detected, and the appropriate pipeline is used. If the cache is at capacity and a new language is detected, the least recently used language pipeline is removed and a new pipeline is added.
nlp = MultilingualPipeline(max_cache_size=2)
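The eviction behavior described above is a least-recently-used (LRU) policy. The following toy cache sketches that policy in isolation. The PipelineCache class and the build_pipeline callback are hypothetical names used for illustration, and this is not Stanza's actual implementation.

```python
from collections import OrderedDict


class PipelineCache:
    """Toy LRU cache keyed by language code (illustrative only)."""

    def __init__(self, build_pipeline, max_cache_size=10):
        self.build_pipeline = build_pipeline  # callable: lang -> pipeline object
        self.max_cache_size = max_cache_size
        self._cache = OrderedDict()

    def get(self, lang):
        if lang in self._cache:
            # Cache hit: mark this language as most recently used.
            self._cache.move_to_end(lang)
            return self._cache[lang]
        if len(self._cache) >= self.max_cache_size:
            # Cache full: evict the least recently used pipeline.
            self._cache.popitem(last=False)
        self._cache[lang] = self.build_pipeline(lang)
        return self._cache[lang]

    def languages(self):
        """Cached languages, ordered from least to most recently used."""
        return list(self._cache)


# A string stands in for a real pipeline so the sketch runs without any models.
cache = PipelineCache(build_pipeline=lambda lang: f"<pipeline for {lang}>", max_cache_size=2)
for lang in ["en", "fr", "en", "de"]:
    cache.get(lang)
print(cache.languages())  # ['en', 'de']
```

With a cache size of 2, the French pipeline is evicted when German arrives, because English was used more recently. A small cache saves memory but forces pipelines to be rebuilt when many languages alternate.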
Training Your Own Model
You can train your own model with the lang_identifier.py script.
The data should be stored in a directory with three files: train.jsonl, dev.jsonl, and test.jsonl. The data format is one entry per line; each entry is a JSON object specifying the text and the language label.
{"text": "Hello world.", "label": "en"}
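As a minimal sketch of producing files in this format, the snippet below writes the three splits with the standard json module. The write_splits helper and the tiny example data are hypothetical, and a temporary directory is used here only so the sketch leaves nothing behind; in practice you would point it at your data directory.

```python
import json
import tempfile
from pathlib import Path


def write_splits(data_dir, splits):
    """Write one <name>.jsonl file per split, one JSON object per line."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, examples in splits.items():
        with open(data_dir / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for text, label in examples:
                # ensure_ascii=False keeps non-ASCII text readable in the file.
                f.write(json.dumps({"text": text, "label": label}, ensure_ascii=False) + "\n")


# Tiny made-up data set; a real one needs many examples per language.
splits = {
    "train": [("Hello world.", "en"), ("Bonjour le monde!", "fr")],
    "dev": [("This is an English sentence.", "en")],
    "test": [("C'est une phrase française.", "fr")],
}

with tempfile.TemporaryDirectory() as tmp:
    write_splits(tmp, splits)
    written_files = sorted(p.name for p in Path(tmp).iterdir())
    first_line = (Path(tmp) / "train.jsonl").read_text(encoding="utf-8").splitlines()[0]

print(written_files)
print(first_line)
```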
Training can be launched with the following command (assuming the *.jsonl files are in a directory called data):
python -m stanza.models.lang_identifier --data-dir data --eval-length 10 --randomize --save-name model.pt --num-epochs 100
This command will run training with the data in train.jsonl and evaluate with data in dev.jsonl.
When the --randomize option is used, snippets of between 5 and 20 characters are sampled from each training example and used as the final training examples for each epoch. So in one epoch the training example “This is an English sentence.” might yield “This is”, “an English”, and “sentence.”, and in another it might yield “This is an” and “English sentence.”
The length of the snippets can be set with --randomize-lengths-range.
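To make the idea of per-epoch random snippets concrete, here is a minimal sketch. The sample_snippets function is hypothetical and cuts at arbitrary character positions; the training script's actual sampling may differ, for example in how it treats word boundaries.

```python
import random


def sample_snippets(text, length_range=(5, 20), rng=None):
    """Cut text into consecutive snippets with lengths drawn from length_range.

    The final snippet may be shorter than the minimum if the text runs out.
    """
    if rng is None:
        rng = random.Random()
    low, high = length_range
    snippets = []
    start = 0
    while start < len(text):
        size = rng.randint(low, high)  # inclusive on both ends
        snippets.append(text[start:start + size])
        start += size
    return snippets


text = "This is an English sentence."
rng = random.Random(0)  # seeded only to make the sketch reproducible
for epoch in range(2):
    # Each call draws new lengths, so each epoch sees different snippets.
    print(epoch, sample_snippets(text, rng=rng))
```

Because the lengths are redrawn every epoch, the model sees many different short views of the same sentence.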
To get the best performance on short strings (character length=10), it is crucial to train on relatively short examples.
The --eval-length option determines the length of the examples used for validation.
Evaluate A Model
A trained model can be evaluated on any data set with the following command:
python -m stanza.models.lang_identifier --data-dir data --load-model model.pt --mode eval --eval-length 50 --save-name model-results.jsonl
This command will look for the file test.jsonl in data and produce evaluation numbers for the data in that file.
The overall accuracy will be displayed, and a .jsonl file will be produced with various evaluation info, including the accuracy, the confusion matrix, and per-language F1, precision, and recall.
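For readers unfamiliar with these metrics, the sketch below shows how accuracy, a confusion matrix, and per-language precision, recall, and F1 can be derived from gold and predicted labels. The evaluate function and the dictionary keys it returns are hypothetical; they do not describe the layout of the file the evaluation script writes.

```python
def evaluate(gold, pred):
    """Compute accuracy, a confusion matrix, and per-language P/R/F1."""
    labels = sorted(set(gold) | set(pred))
    # confusion[g][p] counts examples with gold label g predicted as p.
    confusion = {g: {p: 0 for p in labels} for g in labels}
    for g, p in zip(gold, pred):
        confusion[g][p] += 1

    accuracy = sum(confusion[lang][lang] for lang in labels) / len(gold)

    per_language = {}
    for lang in labels:
        true_pos = confusion[lang][lang]
        predicted = sum(confusion[g][lang] for g in labels)  # column sum
        actual = sum(confusion[lang].values())               # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_language[lang] = {"precision": precision, "recall": recall, "f1": f1}

    return {"accuracy": accuracy, "confusion_matrix": confusion, "per_language": per_language}


# Made-up labels: one English example is misclassified as French.
gold = ["en", "en", "fr", "fr"]
pred = ["en", "fr", "fr", "fr"]
results = evaluate(gold, pred)
print(results["accuracy"])
print(results["per_language"])
```

Per-language scores are worth checking alongside overall accuracy, since closely related languages can be confused with each other even when the overall number looks high.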