Wikidata Recommenders

At this week's OpenSym Conference we will present our evaluation of property recommender systems for Wikidata and, more generally, collaborative knowledge bases. The (admittedly way too long) title of our paper is: "An Empirical Evaluation of Property Recommender Systems for Wikidata and Collaborative Knowledge Bases".

Wikidata is a popular example of a collaboratively filled and maintained knowledge base. Such knowledge bases mostly rely on a community of committed people who add and edit data. These users are often supported by recommender systems during the process of entering and editing data. In principle, data is entered in triple form: subject-property-object, where property-object pairs (so-called "statements") are used to describe a subject. Wikidata supports its editors with a so-called "property suggester", which recommends further suitable properties for a given subject.

In this work, we evaluate different recommendation algorithms serving this purpose: the approach by Abedjan and Naumann, the current Wikidata recommender, and the Snoopy approach we developed a couple of years ago.
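To make the setting concrete, here is a minimal, hypothetical sketch (not the actual Wikidata or Snoopy implementation) of co-occurrence-based property suggestion: given the properties already present on a subject, absent properties are ranked by how strongly they co-occur with the present ones in other items. Items and scores are made up for illustration.

```python
from collections import Counter

# Toy item descriptions: the set of properties already used per subject.
items = [
    {"P31", "P569", "P19"},   # a person: instance of, date of birth, place of birth
    {"P31", "P569", "P106"},  # a person: instance of, date of birth, occupation
    {"P31", "P571"},          # an organization: instance of, inception
]

def suggest(present, items, k=3):
    """Rank properties absent from `present` by co-occurrence with the present ones."""
    scores = Counter()
    for item in items:
        overlap = len(present & item)
        if overlap:
            for prop in item - present:
                scores[prop] += overlap
    return [prop for prop, _ in scores.most_common(k)]

print(suggest({"P31", "P569"}, items))  # properties of similar items rank first
```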


Recommender Evaluation: Recall@k

We identify three important factors influencing the quality of recommendations: (i) incorporating classifying properties into the rule creation process, as done by the Wikidata approach (WD), (ii) ranking according to confidence values, as performed by Abedjan and Naumann (AN), and (iii) incorporating contextual information into the ranking process, as proposed by the Snoopy approach (SN_context). We find that the current implementation of the Wikidata Entity Suggester works better than the other presented approaches. In the course of our analyses, we identify two key aspects that are essential for the quality of recommendations: incorporating classifying properties and using contextual information to rank the property recommendation candidates. Combining the current Wikidata Entity Suggester with Snoopy's ranking strategy, which facilitates contextual information, significantly increases the performance of the current Wikidata recommender, as can be seen in the Recall@k evaluation above (WD_context).
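For reference, Recall@k in this setting simply measures which fraction of an item's withheld properties appears among the top-k recommendations. A minimal sketch, with hypothetical property IDs:

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of the withheld (relevant) properties found in the top-k recommendations."""
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)

# Hypothetical example: 2 of the 4 withheld properties appear in the top 5.
recs = ["P106", "P27", "P569", "P19", "P21", "P735"]
held_out = ["P106", "P569", "P18", "P40"]
print(recall_at_k(recs, held_out, k=5))  # 2/4 = 0.5
```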

You can find our open-source implementation of the underlying evaluation framework and of the evaluated algorithms here: https://github.com/dbisibk/PropertyRecommenderEvaluator/

 

For more details, please check out our OpenSym paper:

  • [PDF] E. Zangerle, W. Gassler, M. Pichl, S. Steinhauser, and G. Specht, “An Empirical Evaluation of Property Recommender Systems for Wikidata and Collaborative Knowledge Bases,” in Proceedings of the 12th International Symposium on Open Collaboration, New York, NY, USA, 2016.
    [Bibtex]
    @inProceedings{opensym16,
    author = {Zangerle, Eva and Gassler, Wolfgang and Pichl, Martin and Steinhauser, Stefan and Specht, G\"{u}nther},
    title = {An Empirical Evaluation of Property Recommender Systems for Wikidata and Collaborative Knowledge Bases},
    booktitle = {Proceedings of the 12th International Symposium on Open Collaboration},
    series = {OpenSym '16},
    year = {2016},
    location = {Berlin, Germany},
    publisher = {ACM},
    address = {New York, NY, USA},
    note = {(to appear)}
    }

Can Twitter predict future music charts?

This week, I'll be presenting our paper "Can Microblogs Predict Music Charts? An Analysis of the Relationship Between #Nowplaying Tweets and Music Charts" at the 17th International Society for Music Information Retrieval Conference (ISMIR) in New York.

In short, we examined to what extent Twitter is a useful source for predicting music charts (in particular, the Billboard Hot 100) and focused on the following research questions (RQ):

– RQ1: To what extent do #nowplaying tweets resemble the Billboard Hot 100?
– RQ2: How are #nowplaying tweets and the Billboard Hot 100 temporally related?
– RQ3: How can Twitter data be exploited for predicting music charts?

For our analysis, we crawled the Billboard Hot 100 charts of 2014 and 2015 (886 tracks) and used the #nowplaying tweets sent in the same timeframe. #nowplaying tweets are tweets in which users share the music they are currently listening to, as in the following example: "#nowplaying The Beatles – Hey Jude".
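Extracting artist and track from such a tweet can be sketched with a naive pattern; real tweets are far noisier, so this is an illustration, not our actual extraction pipeline:

```python
import re

# Naive pattern for tweets like "#nowplaying The Beatles - Hey Jude";
# matches a hyphen or an en dash as the artist/track separator.
PATTERN = re.compile(r"#nowplaying\s+(?P<artist>.+?)\s*[-–]\s*(?P<track>.+)", re.IGNORECASE)

def parse_nowplaying(tweet):
    """Return (artist, track) if the tweet matches the pattern, else None."""
    m = PATTERN.search(tweet)
    return (m.group("artist"), m.group("track")) if m else None

print(parse_nowplaying("#nowplaying The Beatles – Hey Jude"))
```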

We extracted roughly 111 million #nowplaying tweets from our #nowplaying dataset and computed the tracks occurring both in the tweets and on the charts; this overlap serves as the dataset underlying our analyses.

For RQ1, we computed the correlation between the number of #nowplaying tweets about a given track and the chart rank of that track, for every week and every track in our dataset. Testing different methods for aggregating the tweets into a weekly Twitter-based score per track (e.g., the mean, median, or sum of #nowplaying tweets per day), we only find a mild correlation between #nowplaying tweets and charts data.
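A correlation check of this kind can be sketched as follows; the weekly numbers are made up. Since chart ranks grow downwards (1 is best), a useful signal shows up as a negative correlation between tweet volume and rank:

```python
import numpy as np

# Hypothetical weekly values for one track: tweet counts and chart ranks.
tweets_per_week = np.array([120, 340, 510, 480, 300, 150], dtype=float)
chart_rank     = np.array([ 95,  60,  22,  18,  40,  70], dtype=float)

# Pearson correlation between the two weekly series.
r = np.corrcoef(tweets_per_week, chart_rank)[0, 1]
print(round(r, 3))  # strongly negative for this toy data
```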

For RQ2 and RQ3, we model our data as time series. In a first experiment, we investigate whether Twitter data would allow for predicting future charts from a temporal perspective, i.e., whether trends appear on Twitter earlier than they are reflected in the charts. To this end, we perform a cross-correlation analysis of the Twitter and charts time series for each track. This analysis shows that for 41% of all tracks there is a negative lag; these tracks appear, and possibly trend, earlier on Twitter. However, for 48% of all tracks we observe a positive lag, and for 11% we do not observe any lag.
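The lag extraction can be sketched with a plain cross-correlation over z-normalized series (synthetic data; in the toy example below, the charts trail the tweet volume by two weeks, so the Twitter series leads with a lag of -2):

```python
import numpy as np

def best_lag(twitter, charts):
    """Lag (in weeks) at which the Twitter series best aligns with the charts.
    A negative lag means the Twitter signal leads the charts."""
    t = (twitter - twitter.mean()) / twitter.std()
    c = (charts - charts.mean()) / charts.std()
    xcorr = np.correlate(t, c, mode="full")          # covers lags -(n-1) .. n-1
    lags = np.arange(-(len(t) - 1), len(t))
    return int(lags[np.argmax(xcorr)])

# Synthetic series: the charts repeat the tweet volume two weeks later.
twitter = np.array([0, 0, 5, 9, 4, 1, 0, 0, 0], dtype=float)
charts  = np.array([0, 0, 0, 0, 5, 9, 4, 1, 0], dtype=float)
print(best_lag(twitter, charts))  # -2
```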


Cross-correlation result: lags (in weeks)

For the 41% of tracks with a negative lag, we propose three different models for predicting future charts:

– Autoregression based on the Billboard time series (BB).
– Extraction of the lag from the cross-correlation analysis, shifting the base series accordingly and computing an autoregression based on the shifted difference between the Twitter and Billboard series (T).
– A multivariate model based on the Twitter and Billboard time series (V).

Evaluating these models by computing the root mean squared error (RMSE) of the predictions shows that the multivariate model works best; hence, Twitter information can indeed be useful for predicting charts. Relying solely on Twitter, in contrast, leads to a substantially higher RMSE than relying on charts data or on a multivariate model combining Twitter and charts data. The RMSE distribution can be seen in the following boxplot (we do not show the T model here, as its RMSE was substantially higher).

Boxplot of RMSE of the Billboard-based model (BB) and the multivariate model (V).

Our evaluations showed that the multivariate model not only significantly decreases the RMSE compared to the other models (p < 0.05; Mann-Whitney), it also significantly decreases the variance (p < 0.05; Levene).
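The two tests can be reproduced with SciPy on synthetic per-track RMSE samples; the data below is made up to mimic the observed effect and is not our actual evaluation data:

```python
import numpy as np
from scipy.stats import mannwhitneyu, levene

rng = np.random.default_rng(0)
# Hypothetical per-track RMSE samples: the multivariate model (V) is
# shifted lower and has a smaller spread than the Billboard-only model (BB).
rmse_bb = rng.normal(loc=12.0, scale=4.0, size=200)
rmse_v  = rng.normal(loc=9.0,  scale=2.0, size=200)

# Mann-Whitney U: are V's errors systematically lower than BB's?
_, p_location = mannwhitneyu(rmse_v, rmse_bb, alternative="less")
# Levene: do the two RMSE distributions differ in variance?
_, p_variance = levene(rmse_bb, rmse_v)

print(p_location < 0.05, p_variance < 0.05)
```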

For us, the major takeaways from these analyses are:

– From a temporal perspective, there is a positive lag for 48% of all tracks; 11% do not feature a lag.
– 41% of all tracks feature a negative lag and would hence allow for a prediction.
– The multivariate model based on Twitter and Billboard charts data significantly reduces the RMSE compared to the Billboard-based model (p < 0.05; Mann-Whitney).
– The multivariate model also shows a significantly lower variance of the RMSE than the Billboard-based model (p < 0.05; Levene).
– There is a mild correlation between track playcounts on Twitter and the Billboard charts (p < 0.01; Pearson).

 

  • [PDF] E. Zangerle, M. Pichl, B. Hupfauf, and G. Specht, “Can Microblogs Predict Music Charts? An Analysis of the Relationship between #Nowplaying Tweets and Music Charts,” in Proceedings of the 17th International Society for Music Information Retrieval Conference 2016 (ISMIR 2016), 2016.
    [Bibtex]
    @inProceedings{ismir16,
    booktitle ={{Proceedings of the 17th International Society for Music Information Retrieval Conference 2016 (ISMIR 2016)}},
    title = {{Can Microblogs Predict Music Charts? An Analysis of the Relationship between #Nowplaying Tweets and Music Charts }},
    publisher = {ISMIR},
    year = {2016},
    author = {Zangerle, Eva and Pichl, Martin and Hupfauf, Benedikt and Specht, G\"unther},
    note = {(to appear)}
    }

Real-time Sentiment Analysis for Twitter

Recently, Martin Illecker (one of my master's students) finished his studies in information systems. In his master's thesis, he developed a sentiment analysis approach for tweets that is based on a combination of sentiment lexica and machine learning. In particular, we make use of part-of-speech tagging, which delivers important information for creating the feature vectors representing the tweets. These vectors are subsequently classified by an SVM for the final sentiment detection. The figure below depicts the workflow behind the approach; please consult the master's thesis for details.

Storm Topology

Using this approach, we are able to compete with the winners of the SemEval 2013 competition. The best F-score obtained in 2013 was 0.69 by the team NRC-Canada, followed by 0.6527 by the team GU-MLT-LT. Our approach reaches an F-score of 0.6685, virtually claiming second place in the competition.

Besides the quality of the sentiment detection, Martin also focused on its throughput. He used Apache Storm to parallelize all of the tasks described above (hence, the approach is called Senti-Storm). We deployed the system on Amazon c3.8xlarge EC2 instances and were able to detect the sentiment of 27,876 tweets per second using ten nodes. This amounts to sentiment detection for more than 2 billion tweets each and every day, which means that our approach performs real-time sentiment detection of tweets while still reaching high prediction quality.

 


WebScience Conference and Twitter Cybercrime

A couple of weeks ago, I was at the WebScience conference in Bloomington, Indiana. I had submitted a position paper about how to proceed with research on cybercrime and cyberwarfare in the context of Twitter to the Cybercrime / Cyberwarfare workshop which then got accepted.

My paper is titled "Cybercrime on Twitter: Shifting the User Back into Focus". The main point I am trying to make in this publication is that research on cybercrime and fraud on Twitter has mostly been concerned with (i) how to detect spam and hacked accounts or (ii) how cyber-criminals behave. However, the hacked user, a central element in this field, is mostly neglected in today's research. Therefore, I propose to shift the focus back onto the user and his or her needs. From my perspective, this requires the following points:

  • Study user behavior: we have to understand users in order to prevent future hacks and to provide better support mechanisms (e.g., understand how a user's trust in a social network changes after he or she has been hacked, or how a user perceives the risk of being hacked).
  • Support the user: this point is about how to inform and support the user with regard to hacks and is closely tied to the previous point, as it requires a deeper understanding of the user. Supporting the user incorporates both preventing hacks through increased awareness and a better understanding of, e.g., how to recapture a hacked account.
  • Get the big picture: we have to not only focus on single aspects of security and user analysis, but also regain an understanding of the user experience and perception as a whole with regard to fraud on Twitter.
  • Get interdisciplinary: in order to see and analyze users and their behavior from different perspectives, we have to get interdisciplinary. Therefore, we have to get together with psychologists, social scientists, data mining experts and also human-computer interaction specialists.
  • Work together: this is all about fostering cooperation 🙂 (e.g., sharing source code, experiences and also data between researchers)

If you are interested in tackling any of the above points with me, just get in touch 🙂

  • [PDF] E. Zangerle and G. Specht, “Cybercrime on Twitter: Shifting the User Back into Focus,” in Proceedings of the WebScience Cybercrime / Cyberwar Workshop, co-located with WebSci14, 2014.
    [Bibtex]
    @InProceedings{cybercrime14,
    author={Zangerle, Eva and Specht, G\"unther},
    title = {{Cybercrime on Twitter: Shifting the User Back into Focus}},
    year = {2014},
    booktitle = {Proceedings of the WebScience Cybercrime / Cyberwar Workshop, co-located with WebSci14},
    note = {published online at http://webscience-cybercrime-workshop.blogs.usj.edu.lb/2014/05/25/accepted-presentations/}
    }

Publication News

Yeeeha, our paper “Sorry, I was hacked”—A Classification of Compromised Twitter Accounts has been accepted at the ACM Symposium on Applied Computing 🙂 In particular, it has been accepted for the Social Network and Media Analysis (SONAMA) track. Our work features an analysis of Twitter users whose accounts have been compromised and aims at analyzing how users deal with such a compromise. I'm really looking forward to presenting our work at Dongguk University, Gyeongju, Korea. If you are interested in this work, I just uploaded the PDF to the publications section, or contact me 🙂

Our two journal articles have (finally) been published too. The first article elaborates on how to support and guide users during collaborative content creation and appeared in the Journal on Future Generation Computer Systems (impact factor 1.978). The second article is about how text similarity measures influence the quality of hashtag recommendations for tweets and is published in Springer's Social Network Analysis and Mining journal.

Furthermore, my dissertation has been featured in Datenbank-Spektrum, the journal of the German Informatics Society.

  • [DOI] E. Zangerle, “Dissertationen: Leveraging Recommender Systems for the Creation and Maintenance of Structure within Collaborative Social Media Platforms,” Datenbank-Spektrum, vol. 13, iss. 3, p. 239, 2013.
    [Bibtex]
    @article{dbspektrum,
    title = {Dissertationen: Leveraging Recommender Systems for the Creation and Maintenance of Structure within Collaborative Social Media Platforms},
    author = {Zangerle, Eva},
    journal = {Datenbank-Spektrum},
    volume = {13},
    number = {3},
    year = {2013},
    pages = {239},
    doi = {10.1007/s13222-013-0138-6},
    }
  • [PDF] [DOI] E. Zangerle, W. Gassler, and G. Specht, “On the impact of text similarity functions on hashtag recommendations in microblogging environments,” Social Network Analysis and Mining, vol. 3, iss. 4, pp. 889-898, 2013.
    [Bibtex]
    @article{snam,
    year={2013},
    issn={1869-5450},
    journal={Social Network Analysis and Mining},
    volume={3},
    number={4},
    doi={10.1007/s13278-013-0108-x},
    title={On the impact of text similarity functions on hashtag recommendations in microblogging environments},
    url={http://dx.doi.org/10.1007/s13278-013-0108-x},
    publisher={Springer Vienna},
    author={Zangerle, Eva and Gassler, Wolfgang and Specht, G\"unther},
    pages={889-898},
    language={English},
    note = {(The final publication is available at link.springer.com.)}
    }
  • [PDF] E. Zangerle and G. Specht, ““Sorry, I was hacked”—A Classification of Compromised Twitter Accounts,” in Proceedings of the 29th ACM Symposium on Applied Computing, Gyeongju, Korea, 2014, pp. 587-593.
    [Bibtex]
    @inproceedings{sac14,
    author = {Zangerle, Eva and Specht, G\"unther},
    title = {{``Sorry, I was hacked''---A Classification of Compromised Twitter Accounts}},
    publisher = {ACM},
    year = {2014},
    booktitle = {Proceedings of the 29th ACM Symposium on Applied Computing},
    address = {Gyeongju, Korea},
    pages = {587--593}
    }
  • [DOI] W. Gassler, E. Zangerle, and G. Specht, “Guided Curation of Semistructured Data in Collaboratively-built Knowledge Bases,” Journal on Future Generation Computer Systems, vol. 31, pp. 111-119, 2014.
    [Bibtex]
    @article{fgcs,
    author = {Gassler, Wolfgang and Zangerle, Eva and Specht, G\"unther},
    title = {{Guided Curation of Semistructured Data in Collaboratively-built Knowledge Bases}},
    journal = {Journal on Future Generation Computer Systems},
    publisher = {Elsevier Science Publishers},
    year = {2014},
    note = {impact factor 1.978.},
    url = {http://www.sciencedirect.com/science/article/pii/S0167739X13001076},
    pages = {111-119},
    volume = {31},
    doi = {10.1016/j.future.2013.05.008},
    }

The second revised edition is here – now with a NoSQL section!

The second revised edition: MySQL 5.6 – Das umfassende Handbuch

After an intensive revision phase, the time has finally come! We are happy to present the second edition of our book, covering the latest MySQL version, 5.6! Besides many small adjustments, we have of course also covered all the new features of MySQL 5.6, such as the brand-new NoSQL interface and the new full-text index of the InnoDB engine.

Further information is available from Galileo or Amazon.

 

New in the second edition, you will read:

  • how to use MySQL efficiently and performantly via the NoSQL interface
  • how a full-text index can now (finally) be used for text search on InnoDB tables
  • how to use MySQL with server-side JavaScript via Node.js
  • how the new security features let you secure your system even better and more conveniently
  • how to track down performance bottlenecks in your system with the extended Performance Schema and fire up the turbo

News

Just added two new publications on the publications page:

  • Eva Zangerle, Wolfgang Gassler, and Günther Specht. On the impact of text similarity functions on hashtag recommendations in microblogging environments. Social Network Analysis and Mining, 2013. to appear.
  • Wolfgang Gassler, Eva Zangerle, and Günther Specht. Guided Curation of Semistructured Data in Collaboratively-built Knowledge Bases. Journal on Future Generation Computer Systems, 2013. impact factor 1.978, to appear.