Sentiment Analysis
Description
Sentiment is added to the stanza pipeline by using a CNN classifier.
Name | Annotator class name | Requirement | Generated Annotation | Description |
---|---|---|---|---|
sentiment | SentimentProcessor | tokenize | sentiment | Adds the sentiment annotation to each Sentence in the Document |
Options
Option name | Type | Default | Description |
---|---|---|---|
‘model_path’ | string | depends on the language | Where to load the model. |
‘pretrain_path’ | string | depends on the language | Which set of pretrained word vectors to use. Can be changed for existing models, but this is not recommended, as the models are trained to work specifically with one set of word vectors. |
‘batch_size’ | int | None | If None, run everything at once. If set to an integer, break processing into chunks of this size. |
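As a rough sketch of how these options might be supplied: Stanza generally accepts processor options as Pipeline keyword arguments prefixed with the processor name, so `batch_size` would be passed as `sentiment_batch_size`. That naming is assumed here rather than stated above, and the helper names `sentiment_options` and `build_pipeline` are hypothetical.

```python
def sentiment_options(model_path=None, pretrain_path=None, batch_size=None):
    """Build Pipeline keyword arguments for the sentiment processor.

    Assumes Stanza's usual '<processor>_<option>' naming for processor options.
    Options left as None are omitted so the per-language defaults apply.
    """
    options = {
        "sentiment_model_path": model_path,
        "sentiment_pretrain_path": pretrain_path,
        "sentiment_batch_size": batch_size,
    }
    return {name: value for name, value in options.items() if value is not None}


def build_pipeline(**kwargs):
    """Create an English tokenize+sentiment pipeline with the given options."""
    import stanza  # imported lazily so sentiment_options works without Stanza installed

    return stanza.Pipeline(lang="en", processors="tokenize,sentiment",
                           **sentiment_options(**kwargs))
```

For example, `build_pipeline(batch_size=50)` would ask the sentiment processor to work in chunks of 50 while keeping the default model and word vectors.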
Example Usage
The SentimentProcessor adds a label for sentiment to each Sentence. The existing models each support negative, neutral, and positive, represented by 0, 1, 2 respectively. Custom models could support any set of labels as long as you have training data.
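Since the labels are plain integers, a small lookup can make output easier to read. The following is a minimal sketch for the existing three-class models only. The names `SENTIMENT_LABELS` and `sentiment_name` are illustrative and not part of Stanza.

```python
# Label scheme of the existing models: 0 = negative, 1 = neutral, 2 = positive
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def sentiment_name(score):
    """Return a readable name for a sentiment label.

    Labels outside the three-class scheme (for example from a custom model)
    are returned as a string of the raw value rather than guessed at.
    """
    return SENTIMENT_LABELS.get(score, str(score))
```

A custom model with a different label set would need its own mapping.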
Simple code example
import stanza
nlp = stanza.Pipeline(lang='en', processors='tokenize,sentiment')
doc = nlp('I hate that they banned Mox Opal')
for i, sentence in enumerate(doc.sentences):
    print("%d -> %d" % (i, sentence.sentiment))
The output produced (aside from logging) will be:
0 -> 0
This represents a negative sentiment.
In some cases, such as datasets with one sentence per line or Twitter data, you want to guarantee that there is exactly one sentence per document processed. You can do this by turning off sentence splitting.
import stanza
nlp = stanza.Pipeline(lang='en', processors='tokenize,sentiment', tokenize_no_ssplit=True)
doc = nlp('Jennifer has pretty antennae. I hope I meet her someday')
for i, sentence in enumerate(doc.sentences):
    print("%d -> %d" % (i, sentence.sentiment))
The output produced (aside from logging) will be:
0 -> 2
This represents a positive sentiment.
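For a dataset with one sentence per line, the same idea can be applied line by line. The sketch below assumes `nlp` is a pipeline built with `tokenize_no_ssplit=True` as above, so each line yields a single sentence. The helper name `classify_lines` is hypothetical.

```python
def classify_lines(nlp, lines):
    """Return (text, sentiment) pairs, one per non-empty input line.

    `nlp` is any callable that returns a document with a `sentences` list,
    such as a Stanza pipeline created with tokenize_no_ssplit=True.
    """
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of sending empty text to the pipeline
        doc = nlp(text)
        # With sentence splitting turned off, the line is treated as one sentence
        results.append((text, doc.sentences[0].sentiment))
    return results
```

Each line is processed as its own document here, which is simple but not batched. For large files it may be worth grouping the calls.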
Available models
There are currently three models available: English, Chinese, and German.
English
English is trained on the following data sources:
Stanford Sentiment Treebank, including extra training sentences
MELD, text only
The score on this model is not directly comparable to existing SST models, as this is using a 3 class projection of the 5 class data and includes several additional data sources (hence the sstplus designation). However, training this model on 2 class data using higher dimension word vectors achieves the 87 score reported in the original CNN classifier paper. On a three class projection of the SST test data, the model trained on multiple datasets gets 70.0%.
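To illustrate what a 3 class projection of 5 class data can look like, here is a minimal sketch. It assumes the common convention of merging the two negative classes and the two positive classes. The exact mapping used to build the released model is not stated here, and the function name `project_sst_label` is hypothetical.

```python
def project_sst_label(label):
    """Map a 5 class SST label (0..4) onto the 3 class scheme (0..2).

    Assumed mapping: 0, 1 (very negative, negative) -> 0 (negative)
                     2    (neutral)                 -> 1 (neutral)
                     3, 4 (positive, very positive) -> 2 (positive)
    """
    if label not in (0, 1, 2, 3, 4):
        raise ValueError("expected a 5 class label in 0..4, got %r" % (label,))
    if label < 2:
        return 0
    if label == 2:
        return 1
    return 2
```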
Chinese
The Chinese model is trained using the polarity signal from the following dataset:
http://a1-www.is.tokushima-u.ac.jp/member/ren/Ren-CECps1.0/Ren-CECps1.0.html
We were unable to find standard scores or even standard splits for this dataset.
Using the gsdsimp word vectors package, training with extra trained word vectors added to the existing word vectors, we built a model which gets 0.694 test accuracy on a random split of the training data. The split can be recreated using process_ren_chinese.py.
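For readers unfamiliar with the idea, a reproducible random split can be sketched as follows. This is not the logic of process_ren_chinese.py. The function name, the fractions, and the seed are all assumptions chosen for illustration.

```python
import random


def random_split(items, dev_fraction=0.1, test_fraction=0.1, seed=1234):
    """Shuffle items with a fixed seed and cut them into train/dev/test lists."""
    shuffled = list(items)
    # A private Random instance keeps the split reproducible without
    # touching the global random state
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test
```

Fixing the seed is what allows the same split to be recreated later, which matters when no standard split exists for a dataset.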
German
The German model is built from sb10k, a dataset of German tweets.
The original sb10k paper cited an F1 score of 65.09. Without using the distant learning step, but using the SB word vectors, we achieved 63 F1. The included model uses the standard German word2vec vectors and only gets 60.5 F1. We considered this an acceptable tradeoff compared to redistributing the much larger tweet word vectors.
We tried training with the longer snippets of text from Usage and Scare, but this seemed to have a noticeable negative effect on the accuracy.