Project

The use of social media for analyzing and predicting the impact of events on the popularity of places, products and people

Code

178EA0413

Duration

01 January 2013 → 26 June 2016

Funding

Regional and community funding: IWT/VLAIO

Promotor

Bart Dhoedt

Fellow

Steven Van Canneyt

Research disciplines

Engineering and technology
- Computer hardware
- Computer theory
- Scientific computing
- Other computer engineering, information technology and mathematical engineering

Keywords

social media, big data, data science, generic framework

Project description

Data science is a field that has gained a lot of interest in the last few years, and has heavily influenced research and business practices. Many companies and organizations nowadays use data science to make better business decisions. Additionally, data science leads to new opportunities in the scientific community, not only to verify or disprove existing models, but also to consider problems from a totally different perspective and to model them at a much larger scale. For instance, detecting anomalies in monitoring data of equipment and software at an early stage may prevent failure of machines and software, significantly reducing costs. The main reason underlying the rise of data science is that almost every sector of the modern economy now has access to more data than was imaginable even a decade ago. IBM estimates that 90% of the global data today has been created in the past two years. Such very large sets of data are referred to as ‘big data’ and are often described using the four Vs: the extreme Volume of data, the wide Variety of types of data, the Velocity at which the data must be processed, and the Veracity of big data.
In this dissertation, social media such as Twitter or Facebook play an important role. Social media are particularly promising for the field of data science, due to their large data volume, broad user base, and real-time nature. In the first place, social media can be used to discover information before it is picked up and stored in structured databases. Examples include the use of social media to detect events, even before they are reported in traditional media. Secondly, as social media form an important tool to attract new customers, it becomes increasingly important for companies and organizations to monitor, analyze and optimize the interactions on social media in relation to their brand, products, and ideas. A number of challenges arise when applying data science to social media data. In this dissertation, we address three major challenges.
The first challenge we address is that social media content is often very short, noisy and diverse, and therefore very difficult to interpret automatically. For instance, more than 50% of the messages on Twitter do not contain useful information and are mostly random thoughts, self-promotion or presence maintenance such as ‘I’m back’ or ‘Just watched TV’. In other words, methods should be constructed that efficiently extract the useful content from the very large and diverse social media data. As a first step to tackle this challenge, we propose an approach that discovers and characterizes places of interest using social media. In particular, we investigate how geographically annotated textual information obtained from Flickr photos and Twitter posts can be used to discover new places of a given type such as ‘hotel’ or ‘school’ to extend semantic databases of places. For several place types, our proposed methodology finds places that are not yet contained in the databases used by Foursquare, Google, LinkedGeoData and Geonames. We have extended this work by introducing a method for discovering the semantic type of events which are extracted from social media. We have in particular focused on how the semantic type such as ‘conference’ or ‘sport event’ is influenced by the spatio-temporal grounding of the event, the profile of its attendees, and the semantic type of the venue and other entities which are associated with the event. Experimental results show that our methodology can be used to discover events of a given semantic type which are not mentioned in the Upcoming datasets, by analyzing social media data. As our last contribution on structured information extraction, we consider the extraction of newsworthy topics from social media. The proposed method allows automatically mining social media streams to provide journalists with a set of headlines and complementary information that summarizes the current newsworthy topics. Independent evaluation shows the effectiveness of the proposed methodology.
Secondly, working with a large amount of real-time data requires distinctive new techniques and technologies. Frameworks should be developed to handle and analyze a large volume of data in real-time. In this dissertation, we propose a generic framework which can be used to monitor and analyze the consumption patterns of users on news websites. The framework monitors the popularity and features of online news articles in real-time, and can be easily scaled to handle millions of visits and thousands of articles. Our framework has been thoroughly evaluated on two quite different news websites: a young online news company that focuses on accessing readers through social media (newsmonkey), and the online platform from an established national broadcaster, with a more traditional take on news consumption (deredactie.be). We show that our generic data-driven framework and analysis approach are well suited for both use cases, and lead to new insights into online news sharing and consumption behavior.
The last challenge we handle in this dissertation is to predict the popularity of media content (e.g., news articles) in social media. This is very challenging due to the high skewness in the popularity distribution and due to the very large and complex set of factors which influence the popularity. Therefore, advanced prediction methodologies should be constructed that can model the complex dependencies between the features of content and their final popularity. To address this challenge, we propose in this dissertation a novel methodology to model and predict the popularity of online news. We first conduct a thorough analysis of the view patterns of online news, and their underlying distributions. This knowledge is then used to better predict the popularity of articles, compared to various existing methods. By means of a new real-world dataset, we show that the combination of features related to content, meta-data, and the temporal behavior leads to significantly improved predictions, compared to existing approaches which only consider features based on the historical popularity of the considered articles.
As most industries will become more data-driven, the applicability of our contributions will grow and the insights gained in this dissertation can form a sound basis for further research. For example, the online news monitoring and prediction framework has been deployed at newsmonkey and deredactie.be. The framework will also be made available for other news websites, to monitor and analyze their data and to optimize their strategy. In future research, the knowledge presented in this dissertation on consumption behavior of online news and its predicted popularity can be used to construct methodologies which actively suggest how the publishing strategy can be optimized.