Adding a new Sentiment model
End to End Sentiment example
Starting with the next release of Stanza (1.4.1), there will be a new mechanism for training Sentiment models.
Here is a complete end to end example on how to build a Sentiment model for a previously unknown language. For this example, we will use a Spanish dataset:
http://tass.sepln.org/2020/?page_id=74
There are multiple tasks on that page. Task 2 does not have gold annotations (as of July 2022), so Task 1 is the easy choice.
OS
We will work on Linux for this. It is possible to recreate most of these steps on Windows or another OS, with the exception that the environment variables need to be set differently.
As a reminder, Stanza only supports Python 3.6 or later. If you are using an earlier version of Python, it will not work.
Codebase
This is a previously unknown dataset, so it will require some code changes. Accordingly, we first clone the stanza git repo and check out the dev branch. We will then create a new branch for our code changes.
git clone git@github.com:stanfordnlp/stanza.git
cd stanza
git checkout dev
git checkout -b spanish_sentiment
Environment
There are many environment variables mentioned in the usage page, along with a config.sh script which can set them up. However, ultimately only two are relevant for a Sentiment model, $SENTIMENT_BASE and $SENTIMENT_DATA_DIR.
Both of these have reasonable defaults, but we can still customize them.
$SENTIMENT_BASE determines where the raw, unchanged datasets go.
The purpose of the data preparation scripts is to put processed forms of this data in $SENTIMENT_DATA_DIR. Once this is done, the execution script will expect to find the data in that directory.
Your OS may use a different startup script than ~/.bashrc. On Windows, you can update these variables in "Edit the system environment variables" in the Control Panel.
In ~/.bashrc, we can add the following lines. Here are a couple of values we use on our cluster to organize shared data:
export SENTIMENT_BASE=/u/nlp/data/sentiment/stanza
export SENTIMENT_DATA_DIR=/nlp/scr/$USER/data/sentiment
Since you will be running Python programs directly from the git checkout of Stanza, you will need to make sure the current directory (.) is in your PYTHONPATH.
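As a quick sanity check before running any scripts, the two variables can be inspected from Python. This is only a minimal sketch: `resolve_sentiment_dirs` is a hypothetical helper, and the fallback paths are placeholders chosen for illustration, not necessarily the defaults Stanza itself uses.

```python
import os

# Placeholder fallbacks for illustration only; Stanza's real defaults may differ.
FALLBACK_BASE = "extern_data/sentiment"
FALLBACK_DATA_DIR = "data/sentiment"


def resolve_sentiment_dirs(environ=None):
    """Return (raw_dir, processed_dir) from the environment, with fallbacks."""
    if environ is None:
        environ = os.environ
    raw_dir = environ.get("SENTIMENT_BASE", FALLBACK_BASE)
    processed_dir = environ.get("SENTIMENT_DATA_DIR", FALLBACK_DATA_DIR)
    return raw_dir, processed_dir


if __name__ == "__main__":
    raw_dir, processed_dir = resolve_sentiment_dirs()
    print("raw datasets:      ", raw_dir)
    print("processed datasets:", processed_dir)
```

If the printed paths are not the ones you exported, the shell running Python has most likely not picked up the startup script yet.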
Language code
If your dataset involves a language currently not in Stanza, it may need the language code added to Stanza. Feel free to open an issue on our github or send us a PR.
Data download
There are several files on the TASS 2020 site which refer to Task 1 (1.1, 1.2). We download them all to a directory:
$SENTIMENT_BASE/spanish/tass2020
It will not be necessary to unzip them, as the script will process the zip files directly.
For a new sentiment dataset, please arrange to have it downloaded to $SENTIMENT_BASE/<lang>/...
Processing raw data to .json
The Sentiment model uses a .json format for storing text and annotations. It may not be strictly necessary to use .json, but this gives us an easy way to store tokens with spaces (such as can be found in Vietnamese).
Unfortunately, unlike NER datasets or especially the UD ecosystem of datasets, there is no standard format for sentiment datasets. What you will need to do is write code to turn the dataset into three lists (train, dev, and test) of labels and text. An example can be seen in this PR:
https://github.com/stanfordnlp/stanza/pull/1104
The code to translate the test set is here:
Code to read and process the TASS2020 test set
The train & dev sets are here:
Code to read and process the TASS2020 train and dev sets
These code paths both read the .zip files in the dataset and turn them into lists of labels and text. Originally the text is raw strings, untokenized, but we then turn it into lists of words.
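For a new dataset, reading a zip archive without unpacking it can be sketched with the standard `zipfile` module. This is not the code from the PR: `read_labeled_lines` is a hypothetical helper, and the file layout (tab-separated lines of id, text, label in members ending in `.tsv`) is an assumption that must be checked against the actual files.

```python
import csv
import io
import zipfile


def read_labeled_lines(zip_path, member_suffix=".tsv"):
    """Read labels and texts from every matching member of a zip file.

    Assumes each line is: id <TAB> text <TAB> label. This layout is an
    assumption for illustration; check the actual dataset files.
    """
    labels, texts = [], []
    with zipfile.ZipFile(zip_path) as zin:
        for name in sorted(zin.namelist()):
            if not name.endswith(member_suffix):
                continue
            with zin.open(name) as raw:
                # QUOTE_NONE keeps quote characters in the text from being
                # interpreted as CSV quoting.
                reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"),
                                    delimiter="\t", quoting=csv.QUOTE_NONE)
                for row in reader:
                    if len(row) < 3:
                        continue  # skip blank or malformed lines
                    labels.append(row[2])
                    texts.append(row[1])
    return labels, texts
```

Returning two parallel lists keeps the reader independent of tokenization and of the output format, so each of those steps can be changed separately.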
Internally, the sentiment tool uses word vectors as a base input layer. It is also possible to use the pretrained charlm or a transformer. Because of the use of word vectors, though, it is expected that the text be tokenized into lists of words.
Code to tokenize the TASS2020 text
All languages for which we have any support already have a tokenizer. An unknown language with no tokenizer will need a separate mechanism to handle this.
Only load the Pipeline once, then pass it around. Otherwise, loading will be very expensive.
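The "load once, pass it around" pattern can be sketched as follows. `build_pipeline` and `tokenize_texts` are hypothetical helper names, not the functions from the PR; the only Stanza calls used are `stanza.Pipeline` and the `sentences` / `tokens` / `text` attributes of the resulting document.

```python
def build_pipeline(lang="es"):
    """Create a tokenize-only Stanza pipeline. Call this once and reuse it."""
    import stanza  # imported lazily so tokenize_texts can be used without it
    return stanza.Pipeline(lang, processors="tokenize")


def tokenize_texts(pipeline, texts):
    """Turn raw strings into lists of words using an already-loaded pipeline."""
    tokenized = []
    for text in texts:
        doc = pipeline(text)
        words = [token.text
                 for sentence in doc.sentences
                 for token in sentence.tokens]
        tokenized.append(words)
    return tokenized


# Intended usage (not executed here, since it needs the Spanish models):
#   pipeline = build_pipeline("es")            # expensive, do this once
#   train_words = tokenize_texts(pipeline, train_texts)
#   dev_words = tokenize_texts(pipeline, dev_texts)
```

Passing the pipeline in as an argument, rather than constructing it inside the function, is what keeps the expensive model load from being repeated for each split.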
To write the data to the .json format used by the training tool, there is a SentimentDatum tuple which you can use to store a single item, a write_list function for writing a list of SentimentDatum, and a write_dataset function which writes three lists, train, dev, and test. Please refer to process_es_tass2020.py for examples of how to use them.
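To make the target format concrete, here is a standalone sketch of what these helpers do conceptually. It deliberately does not import Stanza: `Datum`, `write_list_sketch`, and `write_dataset_sketch` are hypothetical stand-ins, and both the JSON field names and the `<name>.<split>.json` file naming are assumptions. For real work, use the actual helpers as shown in process_es_tass2020.py.

```python
import json
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for Stanza's SentimentDatum; field names are assumed.
Datum = namedtuple("Datum", ["sentiment", "text"])


def write_list_sketch(out_filename, dataset):
    """Write one list of items as a JSON array of {sentiment, text} objects."""
    with open(out_filename, "w", encoding="utf-8") as fout:
        json.dump([item._asdict() for item in dataset], fout,
                  ensure_ascii=False, indent=1)


def write_dataset_sketch(splits, out_directory, dataset_name):
    """Write the train, dev, and test lists to three files in out_directory."""
    os.makedirs(out_directory, exist_ok=True)
    for split, name in zip(splits, ("train", "dev", "test")):
        path = os.path.join(out_directory, f"{dataset_name}.{name}.json")
        write_list_sketch(path, split)


if __name__ == "__main__":
    train = [Datum("2", ["Me", "encanta"])]
    dev = [Datum("0", ["Qué", "mal"])]
    test = [Datum("1", ["Es", "una", "película"])]
    with tempfile.TemporaryDirectory() as tmp:
        write_dataset_sketch((train, dev, test), tmp, "es_tass2020")
        print(sorted(os.listdir(tmp)))
```

Storing the text as a list of strings rather than a single string is what allows tokens that themselves contain spaces to survive a round trip.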
Once the code to translate the dataset is written, we add some documentation and a function call to stanza/utils/datasets/sentiment/prepare_sentiment_dataset.py to keep everything organized in one place.
Function to call the conversion
Entry connecting the function in the dataset mapping
The intention is that, once the conversion code is written and registered, the new dataset can be prepared through the general preparation script:
python3 -m stanza.utils.datasets.sentiment.prepare_sentiment_dataset es_tass2020
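The two pieces listed above, a conversion function plus an entry in the dataset mapping, follow a simple name-to-function dispatch pattern. The sketch below illustrates that pattern only: `convert_tass2020`, its signature, `DATASET_MAPPING`, and `prepare` are hypothetical names, not the actual contents of prepare_sentiment_dataset.py.

```python
def convert_tass2020(dataset_name):
    """Placeholder for the real conversion function (hypothetical signature)."""
    return f"converted {dataset_name}"


# Maps the dataset name given on the command line to its conversion function.
DATASET_MAPPING = {
    "es_tass2020": convert_tass2020,
}


def prepare(dataset_name):
    """Look up and run the conversion registered for dataset_name."""
    if dataset_name not in DATASET_MAPPING:
        known = ", ".join(sorted(DATASET_MAPPING))
        raise ValueError(f"Unknown dataset {dataset_name!r}; known datasets: {known}")
    return DATASET_MAPPING[dataset_name](dataset_name)
```

With this shape, adding a dataset means writing one function and adding one dictionary entry, and a mistyped dataset name fails with a message listing the valid choices.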
Labels
Typically we use 0, 1, and 2 to represent Negative, Neutral, and Positive in three-class sentiment tasks. This is not necessary, though; the tool can process any number of labels, and it can also process labels which are not numeric!
Technically the tool can be used for any sentence classification task, not just sentiment.
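If you do want to follow the 0 / 1 / 2 convention, remapping a dataset's own labels is a small step in the conversion code. A minimal sketch, assuming the source labels are the strings `N`, `NEU`, and `P` (an assumption to verify against the dataset); `LABEL_MAP` and `remap_labels` are hypothetical names.

```python
# Assumed source labels for illustration; check the dataset for the real ones.
LABEL_MAP = {"N": 0, "NEU": 1, "P": 2}


def remap_labels(labels, label_map=LABEL_MAP):
    """Convert dataset-specific labels to class ids, failing loudly on unknowns."""
    unknown = sorted(set(labels) - set(label_map))
    if unknown:
        raise ValueError(f"Unmapped labels: {unknown}")
    return [label_map[label] for label in labels]
```

Raising on unmapped labels, instead of silently dropping those items, makes an unexpected label in a new dataset visible at conversion time rather than during training.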
Word Vectors
The base version of the tool uses word vectors as an input layer to the classifier.
Please refer to the word vectors section of NER models to add a new set of word vectors.
Stanza already has Spanish word vectors, so we do not need to do anything to add them. In fact, the run_sentiment script will attempt to automatically download them.
Training!
At this point, everything is ready to push the button and start training.
python -m stanza.utils.training.run_sentiment es_tass2020
Charlm and Bert
There is a pretrained character model that ships with Stanza, and there is also support in the Sentiment model for HuggingFace transformers. There is more description of how to use them in the corresponding section of the NER models page.
Stanza also has a pretrained Spanish charlm, so we do not need to do anything to add that. The run_sentiment script will also attempt to automatically download that model.
Contributing back
If you like, you can open a PR with your code changes and post the models somewhere we can integrate them. It would be much appreciated!
Citations
The original sentiment model is based on “Convolutional Neural Networks for Sentence Classification”, by Yoon Kim.
Improvements to the model are incorporated from “Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling” by Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, Bo Xu.
The TASS 2020 dataset is introduced in “Overview of TASS 2020: Introducing Emotion Detection” by M. Vega, M. C. Díaz-Galiano, et al.