Recently, Martin Illecker (one of my master's students) finished his studies in information systems. In his master's thesis, he developed a sentiment analysis approach for tweets that combines sentiment lexica with machine learning algorithms. In particular, the approach makes use of part-of-speech tagging, which delivers important information for the creation of the vectors representing the tweets. These vectors are subsequently classified by an SVM for the final sentiment detection. The figure below depicts the workflow behind the approach; please consult the master's thesis for details.
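To make the vector creation step a bit more tangible, here is a minimal sketch of how lexicon scores and part-of-speech information could be combined into a fixed-length feature vector. It is not the feature set from the thesis: the toy lexicon, the coarse tag set, the chosen features and the function name `tweet_to_vector` are all assumptions made for illustration, and the tweet is assumed to be already tagged.

```python
from collections import Counter

# Hypothetical toy lexicon (word -> polarity score); not a lexicon from the thesis.
TOY_LEXICON = {"great": 0.8, "love": 0.9, "awful": -0.9, "slow": -0.4}

# Hypothetical coarse tag set; a real POS tagger would supply the tags.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV"]


def tweet_to_vector(tagged_tokens):
    """Map a POS-tagged tweet [(word, tag), ...] to a fixed-length feature vector."""
    scores = [
        TOY_LEXICON[word.lower()]
        for word, _ in tagged_tokens
        if word.lower() in TOY_LEXICON
    ]
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    n_tokens = max(len(tagged_tokens), 1)  # avoid division by zero on empty tweets

    # Lexicon features: summed positive and negative scores, plus the extremes.
    lexicon_features = [
        sum(s for s in scores if s > 0),
        sum(s for s in scores if s < 0),
        max(scores, default=0.0),
        min(scores, default=0.0),
    ]
    # POS features: relative frequency of each tag in the tweet.
    pos_features = [tag_counts[tag] / n_tokens for tag in POS_TAGS]
    return lexicon_features + pos_features


# Made-up, pre-tagged example tweet.
tweet = [("I", "PRON"), ("love", "VERB"), ("this", "DET"),
         ("great", "ADJ"), ("phone", "NOUN")]
vector = tweet_to_vector(tweet)
print(vector)
```

Vectors of this kind, one per tweet, are what a classifier such as an SVM would then be trained on.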
By using this approach, we are able to compete with the winners of the SemEval 2013 competition. The best F-value obtained in 2013 was 0.69 by the team NRC-Canada, followed by 0.6527 by the team GU-MLT-LT. Our approach reaches an F-value of 0.6685, which would have placed it second in the competition.
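For readers unfamiliar with the measure, the following sketch shows how such an F-value can be computed. It assumes that the F-value is the average of the F1 scores of the positive and the negative class, which to my knowledge is how the SemEval 2013 task scored submissions. The label lists are made up, and the helper names `f1` and `f_value` are my own.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_value(gold, predicted, classes=("positive", "negative")):
    """Average the per-class F1 over the given classes (assumed scoring scheme)."""
    scores = []
    for cls in classes:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == cls and p == cls)
        fp = sum(1 for g, p in pairs if g != cls and p == cls)
        fn = sum(1 for g, p in pairs if g == cls and p != cls)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Made-up labels, for illustration only.
gold = ["positive", "positive", "negative", "neutral", "negative", "positive"]
pred = ["positive", "neutral", "negative", "neutral", "positive", "positive"]
example_f = f_value(gold, pred)

# Ranking the F-values reported above, highest first.
reported = {"NRC-Canada": 0.69, "Senti-Storm": 0.6685, "GU-MLT-LT": 0.6527}
ranking = sorted(reported, key=reported.get, reverse=True)
print(ranking.index("Senti-Storm") + 1)
```

Note that the neutral class still matters under this scheme: a positive tweet that is misclassified as neutral lowers the recall of the positive class.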
Besides the quality of the sentiment detection, Martin also focused on the throughput of the detection algorithm. He therefore used Apache Storm to parallelize all the tasks previously described (hence the name of the approach, Senti-Storm). We deployed the system on an Amazon c3.8xlarge EC2 instance and were able to detect the sentiment of 27,876 tweets per second using ten nodes. This allows us to perform sentiment detection on more than 2 billion tweets per day. In other words, our approach can detect the sentiment of tweets in real time while still achieving high prediction quality.
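The daily figure follows directly from the measured rate, as this short calculation shows. The per-node number is only an average and assumes that the load is spread evenly across the ten nodes, which was not measured separately here.

```python
TWEETS_PER_SECOND = 27_876      # measured throughput reported above
NODES = 10                      # number of nodes used in the deployment
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tweets_per_day = TWEETS_PER_SECOND * SECONDS_PER_DAY

# Assumption: load is distributed evenly, so this is only an average per node.
tweets_per_second_per_node = TWEETS_PER_SECOND / NODES

print(f"{tweets_per_day:,} tweets per day")
print(f"{tweets_per_second_per_node:,.1f} tweets per second per node on average")
```

This works out to roughly 2.4 billion tweets per day.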