Adding a new Sentiment model

End to End Sentiment example

Starting with the next release of Stanza (1.4.1), there will be a new mechanism for training Sentiment models.

Here is a complete end-to-end example of how to build a Sentiment model for a previously unknown language. For this example, we will use a Spanish dataset:

http://tass.sepln.org/2020/?page_id=74

There are multiple tasks on that page. Task 2 does not have gold annotations (as of July 2022), so Task 1 is the easy choice.

OS

We will work on Linux for this. It is possible to recreate most of these steps on Windows or another OS, with the exception that the environment variables need to be set differently.

As a reminder, Stanza only supports Python 3.6 or later. If you are using an earlier version of Python, it will not work.

Codebase

This is a previously unknown dataset, so it will require some code changes. Accordingly, we first clone the stanza git repo and check out the dev branch. We will then create a new branch for our code changes.

git clone git@github.com:stanfordnlp/stanza.git
cd stanza
git checkout dev
git checkout -b spanish_sentiment

Environment

There are many environment variables mentioned in the usage page, along with a config.sh script which can set them up. However, ultimately only two are relevant for a Sentiment model, $SENTIMENT_BASE and $SENTIMENT_DATA_DIR.

Both of these have reasonable defaults, but we can still customize them.

$SENTIMENT_BASE determines where the raw, unchanged datasets go.

The data preparation scripts put processed forms of this data in $SENTIMENT_DATA_DIR. Once this is done, the training script will expect to find the data in that directory.

In ~/.bashrc, we can add the following lines. These are the values we use on our cluster to organize shared data:

export SENTIMENT_BASE=/u/nlp/data/sentiment/stanza
export SENTIMENT_DATA_DIR=/nlp/scr/$USER/data/sentiment

Note:

Your OS may use a different startup script than ~/.bashrc.

Note:

On Windows, you can update these variables under "Edit the system environment variables" in the Control Panel.
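
A script can read the two variables at run time instead of hard-coding paths. The following is a minimal sketch, not Stanza's own lookup code: the helper name sentiment_paths is hypothetical, and the fallback values are placeholders chosen for illustration.

```python
import os

def sentiment_paths(env=None):
    """Return (raw_dir, processed_dir) taken from the environment.

    The fallback values are placeholders for this sketch and are
    not guaranteed to match Stanza's actual defaults.
    """
    env = os.environ if env is None else env
    raw_dir = env.get("SENTIMENT_BASE", "extern_data/sentiment")
    processed_dir = env.get("SENTIMENT_DATA_DIR", "data/sentiment")
    return raw_dir, processed_dir

if __name__ == "__main__":
    raw_dir, processed_dir = sentiment_paths()
    print("raw datasets:  ", raw_dir)
    print("processed data:", processed_dir)
```

Passing a plain dictionary as env makes the helper easy to exercise without touching the real environment.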

Since you will be running Python programs directly from the git checkout of Stanza, you will need to make sure the current directory (.) is in your PYTHONPATH.
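
One way to confirm that Python resolves stanza to the git checkout rather than to an installed copy is to ask where the module would be imported from. This is a small sketch using the standard library's importlib.util.find_spec; the helper name module_origin is hypothetical.

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

if __name__ == "__main__":
    # Run this from the root of the git checkout. The printed path should
    # point inside the checkout, not inside site-packages.
    print(module_origin("stanza"))
```

If the printed path points at site-packages, or nothing is found, PYTHONPATH is probably not set as intended.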

Language code

If your dataset involves a language currently not in Stanza, it may need the language code added to Stanza. Feel free to open an issue on our GitHub or send us a PR.

Data download

There are several files on the TASS 2020 site which refer to Task 1 (1.1, 1.2). We download them all to a directory:

$SENTIMENT_BASE/spanish/tass2020

It will not be necessary to unzip them, as the script will process the zip files directly.

For a new sentiment dataset, please arrange to have it downloaded to $SENTIMENT_BASE/<lang>/...
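
Reading the archives without unzipping them can be done with Python's zipfile module. The sketch below is an illustration only: the function name read_zip_members is hypothetical, and the .tsv suffix is an assumption about the member file names, not a description of the TASS archives.

```python
import io
import zipfile

def read_zip_members(zip_path, suffix=".tsv"):
    """Yield (member_name, list_of_lines) for each matching file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(suffix):
                continue
            with archive.open(name) as raw:
                # archive.open returns a binary stream; wrap it to decode text
                text = io.TextIOWrapper(raw, encoding="utf-8")
                lines = [line.rstrip("\n") for line in text]
            yield name, lines
```

Each member is fully read before it is yielded, so the caller never holds a handle into the archive.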

Processing raw data to .json

The Sentiment model uses a .json format for storing text and annotations. It may not be strictly necessary to use .json, but this gives us an easy way to store tokens with spaces (such as can be found in Vietnamese).

Unfortunately, unlike NER datasets or especially the UD ecosystem of datasets, there is no standard format for sentiment datasets. What you will need to do is write code to turn the dataset into three lists (train, dev, and test) of labels and text. An example can be seen in this PR:

https://github.com/stanfordnlp/stanza/pull/1104

The code to translate the test set is here:

Code to read and process the TASS2020 test set

The train & dev sets are here:

Code to read and process the TASS2020 train and dev sets

These code paths both read the .zip files in the dataset and turn them into lists of labels and text. Originally the text is raw strings, untokenized, but we then turn it into lists of words.
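
The step from raw lines to lists of labels and text might look like the sketch below. It is not the code from the PR: the three-column tab-separated layout (id, text, label), the label names N, NEU, and P, and the function name parse_labeled_lines are all assumptions made for illustration.

```python
def parse_labeled_lines(lines, label_map):
    """Split 'id<TAB>text<TAB>label' lines into a list of labels and a list of texts.

    The three-column layout is an assumption made for this sketch.
    """
    labels, texts = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pieces = line.split("\t")
        if len(pieces) != 3:
            raise ValueError("Unexpected number of columns: %r" % line)
        _, text, label = pieces
        labels.append(label_map[label])
        texts.append(text)
    return labels, texts

# Hypothetical mapping from dataset labels to 0 = Negative, 1 = Neutral, 2 = Positive
LABEL_MAP = {"N": 0, "NEU": 1, "P": 2}
```

Raising on a malformed line, rather than skipping it silently, makes format surprises in a new dataset visible early.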

Internally, the sentiment tool uses word vectors as a base input layer. It is also possible to use the pretrained charlm or a transformer. Because of the use of word vectors, though, it is expected that the text be tokenized into lists of words.

Code to tokenize the TASS2020 text

All languages for which we have any support already have a tokenizer. An unknown language with no tokenizer will need a separate mechanism to handle this.

Note:

Only load the Pipeline once, then pass it around. Otherwise, loading will be very expensive.
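
A sketch of this pattern is shown below: the Pipeline is built once and then handed to a helper that tokenizes every split. The helper names build_tokenizer and tokenize_texts are hypothetical, and flattening each text into a single list of token strings is an assumption about the desired output, not necessarily what the linked code does.

```python
def build_tokenizer(lang="es"):
    """Create a tokenize-only Pipeline. Call this once and reuse the result."""
    import stanza  # imported lazily so tokenize_texts can be used without it
    return stanza.Pipeline(lang, processors="tokenize")

def tokenize_texts(pipe, texts):
    """Turn each raw string into a flat list of token strings."""
    tokenized = []
    for text in texts:
        doc = pipe(text)
        words = [token.text for sentence in doc.sentences for token in sentence.tokens]
        tokenized.append(words)
    return tokenized

# Intended usage (requires Stanza and its Spanish models):
#   pipe = build_tokenizer("es")
#   train_words = tokenize_texts(pipe, train_texts)
#   dev_words = tokenize_texts(pipe, dev_texts)
```

Because tokenize_texts only calls pipe(text) and reads doc.sentences, the same pipeline object can be shared across the train, dev, and test conversions.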

To write the data in the .json format used by the training tool, there is a SentimentDatum tuple which you can use to store a single item, a write_list function for writing a list of SentimentDatum, and a write_dataset function which writes three lists: train, dev, and test. Please refer to process_es_tass2020.py for examples of how to use them.
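
To make the shape of the output concrete, here is a self-contained stand-in for those helpers. It does not import or reproduce Stanza's implementation: the names Datum, write_list_sketch, and write_dataset_sketch, the field names, and the <dataset>.<split>.json file naming are assumptions for illustration. For real use, prefer the helpers shown in process_es_tass2020.py.

```python
import json
import os
from collections import namedtuple

# Stand-in for Stanza's SentimentDatum; the field names are an assumption
Datum = namedtuple("Datum", ["sentiment", "text"])

def write_list_sketch(out_filename, dataset):
    """Write one list of Datum items as a JSON array of objects."""
    with open(out_filename, "w", encoding="utf-8") as fout:
        json.dump([item._asdict() for item in dataset], fout, ensure_ascii=False, indent=1)

def write_dataset_sketch(splits, out_directory, dataset_name):
    """Write the train, dev, and test lists to <dataset_name>.<split>.json files."""
    os.makedirs(out_directory, exist_ok=True)
    for split_name, items in zip(("train", "dev", "test"), splits):
        path = os.path.join(out_directory, "%s.%s.json" % (dataset_name, split_name))
        write_list_sketch(path, items)
```

Storing text as a list of tokens in JSON is what allows a single token to contain a space.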

Once the code to translate the dataset is written, we add some documentation and a function call to stanza/utils/datasets/sentiment/prepare_sentiment_dataset.py to keep everything organized in one place.

Documentation example

Function to call the conversion

Entry connecting the function in the dataset mapping

Once the new dataset has been added to the general preparation script, it can be prepared with:

python3 -m stanza.utils.datasets.sentiment.prepare_sentiment_dataset es_tass2020
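
The dataset mapping can be pictured as a dictionary from dataset names to conversion functions. The sketch below shows that dispatch pattern in isolation; the names convert_es_tass2020, DATASET_MAPPING, and prepare are hypothetical and are not taken from prepare_sentiment_dataset.py.

```python
def convert_es_tass2020(in_directory, out_directory, dataset_name):
    """Placeholder for the conversion routine of one dataset."""
    return "converted %s from %s to %s" % (dataset_name, in_directory, out_directory)

# Hypothetical registry: dataset name -> conversion function
DATASET_MAPPING = {
    "es_tass2020": convert_es_tass2020,
}

def prepare(dataset_name, in_directory, out_directory):
    """Look up the dataset by name and run its conversion."""
    if dataset_name not in DATASET_MAPPING:
        known = ", ".join(sorted(DATASET_MAPPING))
        raise ValueError("Unknown dataset %s; known datasets: %s" % (dataset_name, known))
    return DATASET_MAPPING[dataset_name](in_directory, out_directory, dataset_name)
```

With this layout, supporting another dataset means writing one conversion function and adding one entry to the dictionary.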

Labels

Typically we use 0, 1, and 2 to represent Negative, Neutral, and Positive in three class sentiment tasks. This is not necessary, though; the tool can process any number of labels, and it can also process labels which are not numeric!

Note:

Technically the tool can be used for any sentence classification task, not just sentiment.
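
The idea that labels need not be numeric can be illustrated with a small label index built from whatever labels appear in the data. This is a sketch of the general technique, not of how the tool stores labels internally; the function names are hypothetical.

```python
def build_label_index(labels):
    """Map each distinct label to an integer id, ordered by the label's string form."""
    return {label: idx for idx, label in enumerate(sorted(set(labels), key=str))}

def encode_labels(labels, label_index):
    """Replace each label with its integer id."""
    return [label_index[label] for label in labels]
```

For example, the labels joy, anger, joy, fear produce an index over three classes, and the same two functions work unchanged for the usual 0, 1, 2 labels.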

Word Vectors

The base version of the tool uses word vectors as an input layer to the classifier.

Please refer to the word vectors section of NER models to add a new set of word vectors.

Note:

Stanza already has Spanish word vectors, so we do not need to do anything to add them. In fact, the run_sentiment script will attempt to automatically download them.
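
Conceptually, a word vector input layer looks up one vector per token, which is why the text has to arrive as lists of words. The toy sketch below illustrates only that lookup; the two-dimensional vectors, the lowercasing, and the zero vector for unknown words are assumptions for illustration and do not describe the classifier's actual handling.

```python
def embed_tokens(tokens, vectors, unknown):
    """Look up one vector per token, falling back to `unknown` for missing words."""
    return [vectors.get(token.lower(), unknown) for token in tokens]

# Toy two-dimensional vectors; real pretrained vectors have far more dimensions
TOY_VECTORS = {
    "bueno": [0.9, 0.1],
    "malo": [-0.8, 0.2],
}
UNKNOWN = [0.0, 0.0]
```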

Training!

At this point, everything is ready to push the button and start training.

python -m stanza.utils.training.run_sentiment es_tass2020

Charlm and Bert

There is a pretrained character model that ships with Stanza, and there is also support in the Sentiment model for HuggingFace transformers. There is more description of how to use them in the corresponding section of the NER models page.

Note:

Stanza also has a pretrained Spanish charlm, so we do not need to do anything to add that. The run_sentiment script will also attempt to automatically download that model.

Contributing back

If you like, you can open a PR with your code changes and post the models somewhere we can integrate them. It would be much appreciated!

Citations

The original sentiment model is based on “Convolutional Neural Networks for Sentence Classification”, by Yoon Kim.

Improvements to the model are incorporated from “Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling”, by Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu.

The TASS 2020 dataset is introduced in “Overview of TASS 2020: Introducing Emotion Detection”, by M. Vega, M. C. Díaz-Galiano, et al.
