Sentiment is added to the stanza pipeline by using a CNN classifier.
| Name | Annotator class name | Requirement | Generated Annotation | Description |
| --- | --- | --- | --- | --- |
| sentiment | SentimentProcessor | tokenize | sentiment | Adds the sentiment annotation to each Sentence. |
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| 'model_path' | string | depends on the language | Where to load the model. |
| 'pretrain_path' | string | depends on the language | Which set of pretrained word vectors to use. Can be changed for existing models, but this is not recommended, as the models are trained to work specifically with one set of word vectors. |
| 'batch_size' | int | None | If None, run everything at once. If set to an integer, break processing into chunks of this size. |
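Following Stanza's usual convention, these options can be passed to the pipeline by prefixing them with the processor name. The sketch below is illustrative only: the file paths are placeholders, not actual model locations.

```python
import stanza

# Minimal sketch of overriding sentiment processor options.
# The paths below are placeholders and depend on where your models live.
nlp = stanza.Pipeline(
    lang='en',
    processors='tokenize,sentiment',
    sentiment_model_path='saved_models/sentiment/en_example.pt',    # placeholder path
    sentiment_pretrain_path='saved_models/pretrain/en_example.pt',  # placeholder path
    sentiment_batch_size=32,                                        # process in chunks of 32
)
```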
SentimentProcessor adds a label for sentiment to each
Sentence. The existing models each support negative, neutral, and positive, represented by 0, 1, 2 respectively. Custom models could support any set of labels as long as you have training data.
```python
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,sentiment')
doc = nlp('I hate that they banned Mox Opal')
for i, sentence in enumerate(doc.sentences):
    print(i, sentence.sentiment)
```
The output produced (aside from logging) will be:
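```
0 0
```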
This represents a negative sentiment.
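Since the bundled models use the 0/1/2 encoding described above, a small lookup table makes the output easier to read. This is an illustrative sketch; the dictionary and sample text below are not part of the Stanza API.

```python
import stanza

# Human-readable names for the three-way labels used by the bundled models;
# custom models may use a different label set.
SENTIMENT_NAMES = {0: 'negative', 1: 'neutral', 2: 'positive'}

nlp = stanza.Pipeline(lang='en', processors='tokenize,sentiment')
doc = nlp('I hate that they banned Mox Opal. The new cards look great, though.')
for i, sentence in enumerate(doc.sentences):
    print(i, SENTIMENT_NAMES[sentence.sentiment], sentence.text)
```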
There are currently three models available: English, Chinese, and German.
English is trained on the following data sources:
- Stanford Sentiment Treebank, including extra training sentences
- MELD, text only
The score of this model is not directly comparable to existing SST models, as it uses a 3-class projection of the 5-class data and includes several additional data sources (hence the
sstplus designation). However, training this model on 2-class data with higher-dimension word vectors achieves the 87% score reported in the original CNN classifier paper. On a 3-class projection of the SST test data, the model trained on multiple datasets reaches 70.0% accuracy.
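For reference, a common way to collapse the 5-class SST labels into 3 classes is to merge the two negative classes and the two positive classes. The exact projection used during training is an assumption here, so treat this as an illustration.

```python
def project_sst_label(five_class_label: int) -> int:
    """Collapse a 5-class SST label (0-4) into 3 classes.

    0, 1 -> 0 (negative); 2 -> 1 (neutral); 3, 4 -> 2 (positive).
    This mirrors the usual convention; the actual training scripts may differ.
    """
    if five_class_label <= 1:
        return 0
    if five_class_label == 2:
        return 1
    return 2
```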
The Chinese model is trained using the polarity signal from the Ren-CECps corpus.
We were unable to find standard scores or even standard splits for this dataset.
Using the gsdsimp word vectors package, with extra trained word vectors added to the existing vectors, we built a model that reaches 0.694 test accuracy on a random split of the training data. The split can be recreated using process_ren_chinese.py.
The German model is built from sb10k, a dataset of German tweets.
The original sb10k paper cited an F1 score of 65.09. Without using the distant learning step, but using the SB word vectors, we achieved 63 F1. The included model uses the standard German word2vec vectors and only reaches 60.5 F1. We considered this acceptable rather than redistributing the much larger tweet word vectors.