Social Media for Large Studies of Behavior

Derek Ruths, Jürgen Pfeffer


Large-scale studies of human behavior in social media need to be held to higher methodological standards. Studies that combine machine learning, natural language processing, network analysis, and statistics to measure population structure and human behavior now operate at unprecedented scale. However, problems exist that should not be ignored. Different social media platforms have substantial and differing population biases, and researchers are often unaware of platform changes. It can also be hard to separate underlying psychology from platform-driven behavior: homophily ("birds of a feather flock together"), transitivity ("the friend of a friend is a friend"), and propinquity ("those close by form a tie") are all well known to the designers of social media platforms and, to increase platform use and adoption, have been built into their link-suggestion algorithms. In a similar vein, what users perceive a platform to be for (e.g., that Twitter is for political discourse) can change behavior. As the authors state, "online social platforms are building tools to serve a specific, practical purpose— not necessarily to represent social behavior or provide good data for research."
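The link-suggestion mechanisms the authors mention typically exploit transitivity: recommend the people who share the most friends with you. A minimal, purely illustrative sketch of such a "friend of a friend" heuristic (the function, graph, and scoring are assumptions for illustration, not any platform's actual algorithm):

```python
from collections import Counter

def suggest_links(adjacency, user, k=3):
    """Rank non-neighbors of `user` by the number of mutual friends (triadic closure)."""
    friends = adjacency[user]
    scores = Counter()
    for friend in friends:
        for fof in adjacency[friend]:
            if fof != user and fof not in friends:
                scores[fof] += 1  # each shared friend is one "vote"
    return [candidate for candidate, _ in scores.most_common(k)]

# Toy undirected friendship graph (hypothetical users).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
print(suggest_links(graph, "a"))  # "d" shares two friends (b and c) with "a"
```

The point for researchers is that observed tie formation then partly reflects this ranking rule rather than unmediated social preference.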

Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose

Fred Morstatter, Jürgen Pfeffer, Huan Liu, Kathleen M. Carley


ABSTRACT: Twitter is a social media giant famous for the exchange of short, 140-character messages called "tweets". In the scientific community, the microblogging site is known for openness in sharing its data. It provides a glance into its millions of users and billions of tweets through a "Streaming API" which provides a sample of all tweets matching some parameters preset by the API user. The API service has been used by many researchers, companies, and governmental institutions that want to extract knowledge in accordance with a diverse array of questions pertaining to social media. The essential drawback of the Twitter API is the lack of documentation concerning what and how much data users get. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. In this work we embark on answering this question by comparing data collected using Twitter's sampled API service with data collected using the full, albeit costly, Firehose stream that includes every single published tweet.

SUMMARY: The authors find that the Streaming API data estimates the top hashtags well when n is large, but is often misleading when n is small. Similarly, topical analysis becomes more accurate with more data from the Streaming API. By analyzing retweet User × User networks, the authors were able to identify, on average, 50–60% of the top 100 key players when creating the networks from one day of Streaming API data; aggregating several days of data can increase the accuracy substantially. Finally, when using the geographic bounding box on the Twitter Streaming API, the authors find, surprisingly, that they get the complete set of geotagged tweets despite sampling. Although the number of geotagged tweets is generally very small (1%), researchers using this information at least have representative data.
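The top-hashtag finding can be illustrated with a toy experiment. This is an assumed synthetic setup, not the paper's actual pipeline: we draw a heavy-tailed "firehose" of hashtags, take a ~1% sample, and measure how much of the full stream's top-n set the sample recovers.

```python
import random
from collections import Counter

def top_n_overlap(full_counts, sample_counts, n):
    """Fraction of the full stream's top-n hashtags recovered from the sample."""
    top_full = {tag for tag, _ in full_counts.most_common(n)}
    top_sample = {tag for tag, _ in sample_counts.most_common(n)}
    return len(top_full & top_sample) / n

random.seed(0)
# Synthetic "firehose": hashtag popularity follows a heavy tail (Zipf-like weights).
tags = [f"#tag{i}" for i in range(200)]
firehose = random.choices(tags, weights=[1 / (i + 1) for i in range(200)], k=50_000)
sample = [t for t in firehose if random.random() < 0.01]  # ~1% sample

full_counts, sample_counts = Counter(firehose), Counter(sample)
print(top_n_overlap(full_counts, sample_counts, 10))
print(top_n_overlap(full_counts, sample_counts, 100))
```

In such simulations the very popular head of the distribution is usually recovered reliably, while the long tail of rarer hashtags is where sampling noise dominates, matching the paper's large-n versus small-n contrast.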

Quantifying the Invisible Audience in Social Networks

Michael S. Bernstein, Eytan Bakshy, Moira Burke, Brian Karrer


We find that social media users consistently underestimate their audience size for their posts, guessing that their audience is just 27% of its true size. Despite the variation, users typically reach 61% of their friends each month. Together, our results begin to reveal the invisible undercurrents of audience attention and behavior in online social networks.

The Critical Periphery in the Growth of Social Protests

Barberá, P., Wang, N., Bonneau, R., Jost, J.T., Nagler, J., Tucker, J., and González-Bailón, S.


ABSTRACT: Social media have provided instrumental means of communication in many recent political protests. The efficiency of online networks in disseminating timely information has been praised by many commentators; at the same time, users are often derided as “slacktivists” because of the shallow commitment involved in clicking a forwarding button. Here we consider the role of these peripheral online participants, the immense majority of users who surround the small epicenter of protests, representing layers of diminishing online activity around the committed minority. We analyze three datasets tracking protest communication in different languages and political contexts through the social media platform Twitter and employ a network decomposition technique to examine their hierarchical structure. We provide consistent evidence that peripheral participants are critical in increasing the reach of protest messages and generating online content at levels that are comparable to core participants. Although committed minorities may constitute the heart of protest movements, our results suggest that their success in maximizing the number of online citizens exposed to protest messages depends, at least in part, on activating the critical periphery. Peripheral users are less active on a per capita basis, but their power lies in their numbers: their aggregate contribution to the spread of protest messages is comparable in magnitude to that of core participants. An analysis of two other datasets unrelated to mass protests strengthens our interpretation that core-periphery dynamics are characteristically important in the context of collective action events. Theoretical models of diffusion in social networks would benefit from increased attention to the role of peripheral nodes in the propagation of information and behavior.