Digital Social Research - Giuseppe A. Veltri - ebook

To analyse social and behavioural phenomena in our digitalized world, it is necessary to understand the main research opportunities and challenges specific to online and digital data. This book presents an overview of the many techniques that are part of the fundamental toolbox of the digital social scientist. Placing online methods within the wider tradition of social research, Giuseppe Veltri discusses the principles and frameworks that underlie each technique of digital research. This practical guide covers methodological issues such as dealing with different types of digital data, construct validity, representativeness and big data sampling. It looks at different forms of unobtrusive data collection methods (such as web scraping and social media mining) as well as obtrusive methods (including qualitative methods, web surveys and experiments). Special extended attention is given to computational approaches to statistical analysis, text mining and network analysis. Digital Social Research will be a welcome resource for students and researchers across the social sciences and humanities carrying out digital research (or interested in the future of social research).

Number of pages: 429



Front Matter



1 Social Research Using Digital Data and Methods

Self-Reported and Behavioural Data

Big Data

The Construct Validity Problem

Representativeness and Access

‘Native’ or Complex Digital Methods

Digital Structured, Unstructured and Semi-structured Data

2 Unobtrusive vs Obtrusive Methods

Web Scraping and News Sources

Social Media Data

Collecting Data from APIs

Understanding Social Media Data

Tools and Instruments


3 Online Obtrusive Data Collection Methods

Online Qualitative Research Methods

Web Surveys



4 Quantitative Data Analysis Reloaded

Quantitative Analysis and Digital Data

Conventional and Computational Approaches

Further Differences

Model-based Recursive Partitioning

Further Readings


5 Networks and Data


Key Concepts

Basic Network Metrics

Network-level Metrics

Types of Networks

Property of Networks

Longitudinal Network Analysis


Further Readings


6 Text Mining

From Content Analysis to Text Mining

Text Mining

Text-mining Pre-processing Basic Concepts

Parts of Speech Tagging

Sentiment Analysis

Topic Models

Semantic Networks


Further Readings


7 Final Remarks

On Digital Social Research




End User License Agreement

List of Tables

Chapter 1

Table 1.1: A schematic overview of the two ‘systems of thinking’ underlying human behaviour
Table 1.2: Cross-tabulation between typology of data and modes of behaviour
Table 1.3: Differences between structured and unstructured data

Chapter 2

Table 2.1: Online obtrusive and unobtrusive data collection methods
Table 2.2: Different formats of XML

Chapter 3

Table 3.1: Comparison between qualitative and quantitative research in social sciences
Table 3.2: Online interview types
Table 3.3: Cross-tabulation between types of online in-depth interviews
Table 3.4: A list of further analytical approaches to digital data
Table 3.5: Factual and counterfactual outcomes
Table 3.6: Pros and cons of online experiments

Chapter 4

Table 4.1: Comparison between the conventional approach of quantitative analysis in the soc…

Chapter 5

Table 5.1: Network elements and terms in different disciplines

List of Figures

Chapter 1

Figure 1.1: The three Vs of big data
Figure 1.2: The development process from concepts to variables for designed data
Figure 1.3: The repurposing process from available data to concepts for organic data
Figure 1.4: The partiality of relationship between a concept and its indicators

Chapter 2

Figure 2.1: Mapping of technologies required for scraping
Figure 2.2: Working with web texts combining methods
Figure 2.3: API to end-user workflow
Figure 2.4: Conceptual model for the social media platform Twitter
Figure 2.5: Social media entities and some of their relations
Figure 2.6: Conceptual diagram of how explicit relations between social media data entities …
Figure 2.7: The two-mode to one-mode projection
Figure 2.8: Implicit relations between content/resources, metadata, groups

Chapter 3

Figure 3.1: Combination of synchronous, asynchronous, active and passive modes of data colle…
Figure 3.2: Scrolling and paging design approaches for web surveys

Chapter 4

Figure 4.1: Types of multivariate methods
Figure 4.2: The two cultures of modelling
Figure 4.3: Example of model-based recursive partitioning

Chapter 5

Figure 5.1: Fundamental network features
Figure 5.2: Forum and newsgroups network relationships
Figure 5.3: Hyperlink analysis
Figure 5.4: Facebook friendship network relationships
Figure 5.5: The Twitter network of relationships
Figure 5.6: Linear snowball sampling
Figure 5.7: Exponential non-discriminative snowball sampling
Figure 5.8: Exponential discriminative snowball sampling
Figure 5.9: Examples of centralities and nodes
Figure 5.10: Degree, closeness and betweenness centralities
Figure 5.11: Ego network and circles
Figure 5.12: A two-mode network
Figure 5.13: A network visualization of the women by events study
Figure 5.14: A two-mode network between users and posts
Figure 5.15: Multilayer network example
Figure 5.16: Distribution of links in a non-scale-free network (left) and in a scale-free one…
Figure 5.17: Examples of a random network (left), a scale-free network (centre) and a small-w…

Chapter 6

Figure 6.1: Comparison between bag-of-words only and use of n-grams (bi-grams in this case)
Figure 6.2: Example of supervised sentiment analysis analytical flow using Twitter data
Figure 6.3: LDA process of topic creation
Figure 6.4: A graphical model of LDA
Figure 6.5: Semantic network of lexical units’ co-occurrences from tweets related to climate…



To my family and friends

Digital Social Research

Giuseppe A. Veltri


Copyright © Giuseppe A. Veltri 2020

The right of Giuseppe A. Veltri to be identified as Author of this Work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.

First published in 2020 by Polity Press

Polity Press, 65 Bridge Street, Cambridge CB2 1UR, UK

Polity Press, 101 Station Landing, Suite 300, Medford, MA 02155, USA

All rights reserved. Except for the quotation of short passages for the purpose of criticism and review, no part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher.

ISBN-13: 978-1-5095-2933-9

A catalogue record for this book is available from the British Library.

Library of Congress Cataloging-in-Publication Data
Names: Veltri, Giuseppe A., author.
Title: Digital social research / Giuseppe A. Veltri.
Description: Medford, MA : Polity Press, [2019] | Includes bibliographical references and index.
Identifiers: LCCN 2019014898 (print) | LCCN 2019015511 (ebook) | ISBN 9781509529339 (Epub) | ISBN 9781509529308 (hardback) | ISBN 9781509529315 (pbk.)
Subjects: LCSH: Social media--Research. | Social sciences--Research--Data processing. | Social sciences--Research--Methodology.
Classification: LCC HM742 (ebook) | LCC HM742 .V45 2019 (print) | DDC 302.23/1--dc23
LC record available at

The publisher has used its best endeavours to ensure that the URLs for external websites referred to in this book are correct and active at the time of going to press. However, the publisher has no responsibility for the websites and can make no guarantee that a site will remain live or that the content is or will remain appropriate.

Every effort has been made to trace all copyright holders, but if any have been overlooked the publisher will be pleased to include any necessary credits in any subsequent reprint or edition.

For further information on Polity, visit our website:



auto-logistic actor attribute models


application program interface


computer-assisted qualitative data analysis


classification and regression trees


confirmatory factor analysis


chi-square automatic interaction detection


computational social science


correlated topic model


data-generating process


Digital Methods Initiative


electronic data interchange


exponential random graph models


factor analysis


General Data Protection Regulation


hypertext transfer protocol


information-communication infrastructure


information retrieval


JavaScript Object Notation


latent class analysis


latent Dirichlet allocation


latent semantic analysis


multitrait–multimethod matrix


named entity recognition


natural language processing


not only SQL


online focus group


ordinary least square


principal component analysis


randomized controlled trial


relational database management system


representational state transfer


rich site summary


structural equation modelling


social network analysis


simple object access protocol


structured query language


simple random sample


singular value decomposition


uniform resource locator


units, treatments, observation operations, settings


Voson Activity Units


volunteer geographic information


World Wide Web


In July 2014, a group of researchers published a paper titled ‘Experimental evidence of massive-scale emotional contagion through social networks’ (Kramer et al., 2014), a study conducted on the social media platform Facebook involving hundreds of thousands of its users as participants. This study went under the spotlight not so much because of its scientific findings but because it exemplified how social scientific research has been changing, in terms both of opportunities and of associated risks. Hundreds of thousands of participants were manipulated, in experimental terms, without knowing that they were part of a study that was being conducted on a scale unprecedented in social science research. Since then, terms like ‘big data’, digital research and web social science have inundated conferences and paper abstracts. Social scientists have reacted fundamentally in one of two ways: wild enthusiasm or stern scepticism. For the enthusiasts, the availability of digital data collected by multiple means of recording our digital traces represented the long-awaited turn in an increasingly difficult reality of data collection. For the sceptic, while appreciating the potential, digital research raised questions about the quality of such data; it posed questions about issues of data access and ownership, and started a debate about integrating online and offline data. Of course, there are plenty of researchers who fall in between these two broad categories, and that is where this book ideally situates itself. Its approach is to provide a critical overview of the common methods used to carry out digital research, and, at the same time, to have a particular sensitivity to methodological principles and theoretical issues that are not easily dismissed by the new availability of data about society and people.

The enthusiasm exists, perhaps, because of a personal academic history. During my PhD years, I experienced at first hand the move from analogue to digital data and the increasing presence of software-assisted data management and analysis. Often, at that time, learning to use the latest software would provide a sense of thrill, and new possibilities of research would open up in front of you. However, soon enough, after this feeling of great potential, old problems and questions of research methodology would return, throwing cold water on your enthusiasm.

A similar cycle of enthusiasm and doubt is part and parcel of being a digital researcher these days. There is little doubt that digital data are changing the way social science is done. One example of this is the perception of value that data now have. During my training as a researcher, I was taught that data are those precious things that are very hard to obtain and therefore, once in our grasp, should be exploited to the fullest. Today, most social scientists collecting digital data have datasets sleeping in their hard disks or in the Cloud. Because of the scarcity of data, each data collection was often maniacally crafted in terms of the instruments used – for example, a questionnaire – with already a pretty clear idea of how the analysis would be conducted. As data have become much cheaper, the amount of planning and analytical strategy appears to be decreasing. The latter point is also due to the increased obsolescence of data. In a context of fast, continuous and affordable data collection, data become ‘old’ very quickly and yet the use of archives is problematic because of access issues. As the entire research process speeds up, data are collected, analysed and archived very quickly in order to be able to move on to the next project.

At the same time, digital social research is fast-moving, and for several reasons. The first is that the division between online and offline research is increasingly fading. The usual distinction between research ‘about the Internet’ and ‘through the Internet’, where the first term refers to human behaviour and social phenomena specific to the online world, while the second label refers to using the Internet as a field of research to conduct a study that could also be conducted offline, does not hold up well these days. Digital data defy formal definition but, for practical reasons, we can describe them as the digital traces of human behaviour and opinions recorded by a wide set of digital services operating in different domains of society (e.g., financial, transportation, health, commercial, social). The nature of digital data is continuously expanding as digital services emerge and many different objects acquire the capacity of recording information about their use and the environment in which they operate, the so-called ‘Internet of Things’.

The second reason to consider digital data, especially those from social media, as fast-moving is the consideration that their use by people is evolving over time. For example, Facebook has become a main social arena for many people; their initial naive use of the platform that probably existed when it started is long gone. People are now strategic in their use of their Facebook presence. One example of this is the positivity bias that Facebook content has (Spottswood and Hancock, 2016). Almost like a large-scale Hawthorne effect, in which individuals modify an aspect of their behaviour in response to their awareness of being observed, people know that their digital presence is observable and therefore adapt to this visibility.

The third reason is that social digital data are becoming increasingly complex. To illustrate this point, let’s take an example based on one of the fundamental research instruments in social research: the questionnaire. In the pre-digital age, a questionnaire would collect data designed by its makers in terms of answers to questions as well as some contextual data provided by interviewers if a door-to-door collection was part of the design. As surveys started to be conducted by telephone, different types of data became available to researchers – for example, the duration of the task of completing the survey (speed of response is used as a proxy of quality, as we will discuss in Chapter 3). Online surveys have enlarged the type of data that are collected in a questionnaire. Together with the outcome of questions, a plethora of metadata and paradata can be collected and analysed jointly with the ‘main’ data. Metadata (data about data) and paradata (data about process) are empirical measurements about the process of creating survey data, in other words recordings about the fieldwork process. They include time spent per screen, keystrokes and mouse clicks, change of answers, etc.
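To make the notion of paradata more concrete, here is a minimal sketch in Python. It assumes a hypothetical event log exported by a web survey; the field names (`respondent`, `screen`, `event`, `t`) and all values are illustrative assumptions and do not come from any particular survey platform. The sketch derives two of the measures mentioned above: time spent per screen and the number of answer changes.

```python
from collections import defaultdict

# Hypothetical paradata log: one record per event, timestamp `t` in seconds.
# Field names and values are illustrative assumptions, not a real export format.
events = [
    {"respondent": "r1", "screen": 1, "event": "enter", "t": 0.0},
    {"respondent": "r1", "screen": 1, "event": "answer", "t": 4.2},
    {"respondent": "r1", "screen": 1, "event": "answer", "t": 7.5},  # answer changed
    {"respondent": "r1", "screen": 1, "event": "leave", "t": 9.0},
    {"respondent": "r1", "screen": 2, "event": "enter", "t": 9.0},
    {"respondent": "r1", "screen": 2, "event": "answer", "t": 10.1},
    {"respondent": "r1", "screen": 2, "event": "leave", "t": 11.0},
]


def summarize_paradata(events):
    """Return {(respondent, screen): {"seconds": float, "answer_changes": int}}."""
    entered = {}                # time at which each screen was opened
    answers = defaultdict(int)  # number of answer events per screen
    summary = {}
    for e in sorted(events, key=lambda e: e["t"]):
        key = (e["respondent"], e["screen"])
        if e["event"] == "enter":
            entered[key] = e["t"]
        elif e["event"] == "answer":
            answers[key] += 1
        elif e["event"] == "leave":
            summary[key] = {
                "seconds": e["t"] - entered[key],
                # every answer after the first one counts as a change
                "answer_changes": max(answers[key] - 1, 0),
            }
    return summary


paradata = summarize_paradata(events)
```

Summaries of this kind can then be analysed jointly with the substantive answers, for example by treating very short screen times as one possible indicator of low response quality.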

The latter is just one example of how a relatively uncomplicated type of data familiar to social scientists is now a potentially complex, multidimensional object of analysis. Needless to say, data collected from online sources are often of this type, with a degree of complexity sometimes much higher than previously encountered. This is a somewhat new situation for social scientists – too much choice can be overwhelming and confusing. The ‘digital challenge’ will be a crucial one because the ‘thirst’ for methodological innovation in social sciences is due to the enduring crisis that has characterized most of the widely used existing techniques. Surveys are exemplary in this case, a pillar methodology across so many different disciplines that is suffering a long-lasting crisis due to the increased difficulty in assessing response rates and sampling frames, and limited capacity in capturing variables that are not self-reported measures but important proxies. Similar considerations concern the in-depth interview, another important instrument of data collection in social science. One criticism concerns the translation of a technique developed before the advent of digital media and the related question of the implications of interviews carried out via computer-mediated communication.

Increasingly, self-reported surveys and interviews measuring human motivations and behaviour are under scrutiny and being compared to more ‘organic’ sources of data (Curti et al., 2015). This is not to say that digital data do not raise a substantial amount of concern regarding the tendency to consider these as ‘organic’: the current debate relates to the kind of critical awareness that should accompany all methods used by social scientists. Perhaps for historical reasons, the artificial nature of traditional methods has been long forgotten until recently, when their capacity for generating quality data has become increasingly problematic.

Such limitations are even more clear if we consider two further aspects: first, the vast majority of social science data from surveys and interviews are cross-sectional without a longitudinal temporal dimension (Abbott, 2001); second, most social science datasets are coarse aggregations of variables because of the limitations in what can be asked from self-reported instruments. Digital data are forcing innovation on both accounts, moving from static snapshots to dynamic processes and from coarse aggregations to high resolutions of data. The interesting by-product of these innovations is the possibility of an increased focus in the social sciences on processes rather than structures. For the first time, we can obtain longitudinal baseline norms, variance, patterns and cyclical behaviour. This requires thinking beyond the simple causality of the traditional scientific method into extended systemic models of correlation, association and episode triggering. Network analysis is a good example here: the availability of longitudinal relational data sparked the recent methodological and theoretical innovations about the dynamics of networks (Barabási and Posfai, 2016).

The aim of this book is to provide an overview and understanding of the most used digital research methods of data collection and the associated analytical strategies, paying particular attention to the methodological and theoretical issues that still need further reflection and discussion. This work is the outcome of my research experience as well as of teaching done over five years at the University of Leicester on the MA course in new media and society, at the methodology summer school of the London School of Economics and at my current institution, the University of Trento, with the addition of several courses taught across Europe and Asia. The backbone of this book consists of material developed for two courses: ‘Research methods for the online world’ and its complementary and more applied module ‘New media, online persuasion and behaviour’.

Equally important has been the experience of several large-scale behavioural research projects conducted for several European institutions on topics such as online transparency of platforms (Lupiáñez-Villanueva et al., 2018), online advertising (Lupiáñez-Villanueva et al., 2016) and online gambling (Codagnone et al., 2014). These have all been invaluable in learning how to study digital phenomena within the context of providing evidence for the development of public policies.

There is a vast number of books dedicated to mastering specific research techniques and the aim here is not to emulate such texts. Instead, the aim is to contextualize each technique in the study of human behaviour and societies. The use of the term social sciences is meant to describe the reality that there are different disciplines dealing with human affairs. Sociology, political science, anthropology, psychology and economics all have their own epistemological positions and methodological preferences. Digital social research does not escape the same condition: it is conducted with different aims, theories and methods depending on the academic discipline of context. Most of the content of this book should be applicable to all social science disciplines, but it is inevitable that some discipline-related emphasis is present. Therefore, it is better to make it explicit that most of the author’s digital research has been carried out in the context of social psychological, behavioural and sociological studies. While there is increasing collaboration across the social sciences, those familiar with different approaches and from different disciplines will recognize an implicit set of assumptions and research goals, and will not, hopefully, be put off.

The first chapter is dedicated to the nature of digital data from the perspective of a social scientist. This is because the complexity of digital data is large and requires particular attention in their use for social research. The second chapter is dedicated to one category of data collection methods for the digital world. This category is labelled ‘unobtrusive’, meaning that these methods do not require the active participation of individuals: data are collected without directly engaging with people, who are not aware, unless they are notified, that their data are being used for research. However, these methods pose new challenges to researchers in that they have to take into account the design of these platforms when they draw conclusions about social phenomena. Both the so-called ‘affordance’ of technological infrastructures and their political economy need to be considered (Madsen, 2015; Fuchs, 2015).

The third chapter concerns methods that have found their digital evolution: surveys, focus groups and experiments have found a new life online, albeit with some caveats. These are obtrusive methods; they require active engagement by participants. While they have a consolidated history of practice, their extension to the digital domain also poses challenges and raises opportunities.

Chapter 4 is dedicated to what I believe is a crucial issue: the epistemological and methodological changes and challenges that digital data are bringing to social science research. The emphasis here is on the increasingly common use of analytical methods coming from computer science in the domain of digital social science. This point has been the object of debate among methodologists and it is a crucial obstacle in finding common ground with computer scientists in joint projects.

Chapter 5 presents an overview of network analysis, a longstanding tradition in social science research that has found new life thanks to the availability of digital relational data, that is, data about relationships between actors. It is largely applied to datasets from social media, but also to other relational data such as emails, telephone calls and text messages.

Chapter 6 deals with one of the most interesting developments in methodology for the social scientist: text mining. Text has always been part of the data collected by social science research, particularly in the qualitative tradition. Content analysis has been the dominant way of quantifying text characteristics and the most common analytical strategy adopted by researchers who have to manage large quantities of text. The digital age has, among other changes, brought an exponential growth of text produced by people. It is the experience of the everyday use of social media, but also blogs and forums. Never before has so much text spontaneously produced by people been available. At the same time, ever more sophisticated methods of automatic analysis of texts have been developed, allowing researchers to analyse anything from a few hundred to millions of documents (see, for example, Sudhahar et al., 2015). While these types of automatic analysis do not aim at substituting the in-depth understanding that human analysis can provide, they do provide a unique bird’s-eye view of a large set of documents that was simply impossible to have before. Just as the Nazca Lines, a series of large ancient geoglyphs in the Nazca Desert in southern Peru, are observable in their entirety only from the sky above, so text-mining techniques allow researchers to detect common patterns or even structures across many different documents.
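As a toy illustration of this bird’s-eye view, the following sketch counts in how many documents each word appears (its document frequency) and lists the words shared by every document. The three-sentence corpus and the deliberately naive tokenizer are assumptions made for illustration only; real text-mining pipelines involve far more pre-processing.

```python
import re
from collections import Counter

# Invented mini-corpus, used only to illustrate the idea.
docs = [
    "Climate change is a global challenge",
    "The climate debate on social media",
    "Social media users discuss climate policy",
]


def tokenize(text):
    """Lower-case the text and keep alphabetic runs only (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(docs):
    """Count the number of documents in which each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # set(): count each term once per document
    return df


df = document_frequency(docs)
# Terms that occur in every document of the corpus.
shared_terms = sorted(term for term, n in df.items() if n == len(docs))
```

Even this simple count shows a pattern that holds across all documents at once, which is the kind of overview that becomes valuable when the corpus contains thousands or millions of texts.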

The last chapter is dedicated to a few general remarks about doing digital social research, among which we will discuss the ethical aspects of this kind of study. The recent Cambridge Analytica scandal about the misuse, for commercial purposes, of Facebook data has grabbed the attention of millions of citizens across the globe. At the same time, the introduction in the European Union of the new GDPR (General Data Protection Regulation) has changed the rules of the game, including those for social scientists (European Commission, 2018). There is no doubt that this is a complex and important issue that deserves an entire book by itself. I cannot provide a lengthy discussion of the ethical and legal implications of using digital data and will therefore limit discussion to what I believe are the most salient points. I will also mention the issue of access in terms of inequality of research opportunities across the research community. It is no mystery that many large digital platforms, including the largest social media, are owned by American companies. The consequence has been that North American academic institutions have historically had stronger relationships with these private entities than European and Asian universities.

After this rather long introduction, we move next to discuss the nature of digital data for the purpose of social research. Exciting opportunities are emerging, while, at the same time, old and new methodological challenges are not easily settled. These challenges are the future of social science research: if we do not include our digital social life in our research practices, our capacity to understand human societies is greatly diminished. And yet, the critical eye that social scientists have learned to exercise needs to be sharp in a research domain in which digital data have become the most valuable asset for very large sectors of the economy and are also more and more crucial for the political evolution of democracies and non-democracies alike.

1 Social Research Using Digital Data and Methods

Self-Reported and Behavioural Data

All social research methods have underlying assumptions about human nature and, in particular, about the way people make decisions, create their opinions and behave. In fact, this is one of the aspects that differentiates the various disciplines within the social sciences. Psychology, economics and sociology all have different models of how people behave. Besides the theoretical implications of such differences, the consequences for the type of methodology employed are substantial. Depending on which underlying model is selected, a particular research method is considered appropriate to study human behaviour.

For a long time, in economics and beyond, a model of how people make decisions, known as the ‘rational choice theory’, has been considered the baseline. According to this model, people’s preferences have a well-defined structure and the choice between courses of action is an almost automatic mechanism in which the individual applies his or her system of preferences to a limited set of options (for example, the set of products that fall within the budget available). In other social sciences, the most common underlying model of human behaviour was unbalanced by an ‘oversocialized view’, ‘a conception of people as so overwhelmingly sensitive to the opinions of others, and hence obedient to the dictates of consensually developed norms and values, internalized through socialization, that obedience is not burdensome but unthinking and automatic’ (Granovetter, 2017: 11; see also Di Maggio, 1997).

Both models look primarily at people’s conscious thought processes as determining what they think and believe and how they act. Deviations from either economic rationality or forms of ‘sociological rationality’ were labelled as ‘irrational’. In such a context, the way to elicit data and study people’s behaviour relies on what people themselves report about their opinions, social norms, attitudes and beliefs. These are often defined as ‘self-reported’ data, meaning that researchers rely on participants of a study to report on something they have done or on what they think or believe. Surveys and interviews of all sorts are examples of self-reported data. In contrast to this approach, there are observational or behavioural data, data about the actual actions and behaviour carried out by someone. To better appreciate the difference, let’s take the example of asking someone how many times she or he goes to the gym every month, and compare the response to the actual tracking of their movements, for example, on a GPS phone or watch. The two pieces of information can differ dramatically. Researchers in the social sciences have learned to live with limitations of self-reported data, such as the social desirability bias (Kreuter et al., 2009). The concept of social desirability rests on the notion that there are social norms governing some behaviours and attitudes and that people may misrepresent themselves in order to appear to comply with these norms. This is the reason why participants might provide inaccurate information about their behaviour to researchers. At the same time, people have difficulty in verbalizing accurately what they have done, felt and thought. Recalling events from memory is not easy either (Gaskell et al., 2000). In other words, self-reported measures have their limitations, but they have been the most common way of conducting social research related to human behaviour.
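The gym example can be sketched in a few lines of Python. The participant identifiers and all the numbers below are invented for illustration, and the two dictionaries stand in for a survey answer and a device-based record respectively; the point is only to show how the gap between self-reported and behavioural data might be quantified.

```python
# Hypothetical monthly gym visits: what participants say versus what a
# tracking device records. All values are invented for illustration.
self_reported = {"p1": 12, "p2": 8, "p3": 4, "p4": 10}
tracked = {"p1": 7, "p2": 8, "p3": 1, "p4": 5}


def reporting_gaps(self_reported, tracked):
    """Per-participant difference: self-reported minus tracked visits."""
    return {
        pid: self_reported[pid] - tracked[pid]
        for pid in self_reported
        if pid in tracked  # compare only participants present in both sources
    }


gaps = reporting_gaps(self_reported, tracked)
mean_gap = sum(gaps.values()) / len(gaps)                        # average over-reporting
share_over = sum(1 for g in gaps.values() if g > 0) / len(gaps)  # share who over-report
```

A positive mean gap in such a comparison would be consistent with over-reporting of a socially desirable behaviour, although in practice tracking data carry measurement errors of their own.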

However, the biggest challenge to self-reported data has come from a shift in the model of human behaviour. Since the late 1990s, psychologists have distinguished between two systems of thought with different capacities and processes (Kahneman, 2011; Kahneman and Frederick, 2002; Metcalfe and Mischel, 1999; Sloman, 1996; Smith and DeCoster, 2000; Lichtenstein and Slovic, 2006), which have been referred to as System 1 and System 2 (Evans and Stanovich, 2013). System 1 (S1) is made up of intuitive thoughts of great capacity, is based on associations acquired through experience and quickly and automatically calculates information. System 2 (S2), on the other hand, involves low-capacity reflective thinking, is based on rules acquired through culture or formal learning, and calculates information in a relatively slow and controlled manner. The processes associated with these systems have been defined as Type 1 (fast, automatic, unconscious) and Type 2 (slow, conscious, controlled) respectively (see Table 1.1). The perspective of the dual system became increasingly popular, even outside the academy, after the publication of Daniel Kahneman’s book Thinking, Fast and Slow (2011). Kahneman was awarded the Nobel Memorial Prize in 2002 for his contribution to the explanation of individual economic behaviour through the elaboration of the ‘prospect theory’ (see Kahneman and Tversky, 2008).

Table 1.1 A schematic overview of the two ‘systems of thinking’ underlying human behaviour

System 1 | System 2
Quick, automatic, no effort, no sense of voluntary control | Slow, effortful, attention to mental activities requiring it
Continuous construal of what is going on at any instant | Good at cost/benefit analysis, but lazy and saddled by decision paralysis (cognitive overload)
Quick (reflexive) | Deliberate (reflective)
Use shortcuts |
When it plays | When it plays
When speed is critical; to avoid decision paralysis | May take over when System 1 cannot process data
When System 2 is lazy or not activated (not worth, no energy, lack of awareness) | May correct/override System 1 if effort shows that intuition or impulse is wrong

The so-called ‘dual model’ of the mind is now the most widely supported way of understanding human behaviour at the individual level. The model has also been applied outside psychology, for example in sociology (Moore, 2017; Lizardo et al., 2016) and in political science (Achen and Bartels, 2017), and the implications of Kahneman and Tversky’s work have led to the research programme known as behavioural economics, which has had a great impact on traditional microeconomic theory.

The underlying model of human behaviour has thus shifted from the ‘theory of rational decision-making’, or rational choice theory, to one that portrays human beings as characterized by ‘bounded rationality’. In other words, they are rational within limits, and the ‘irrational’ is not some mysterious and almost metaphysical force but the outcome of systematic errors and biases that originate in how our cognition and emotions work (and interact).

A more precise model of human behaviour and decision-making has implications for social science research methodology and in particular for the aforementioned distinction between self-reported and observational/behavioural data. The dual model of thinking restores the importance of unconscious thought processes, and also of contextual and environmental influences on human behaviour, which are hard to capture in studies that use only self-reported measures and instruments. Traditionally, collecting behavioural data has been very difficult and expensive for social scientists: people’s actual behaviour could be tracked only for small groups and for very limited periods of time. The availability of digital data has brought a large increase in behavioural data; we now have digital traces of people’s actual behaviour that were quite simply never available before.

The combination of a relatively new and powerful foundational model of human behaviour and decision-making, offered by the dual model, with the availability of behavioural data, thanks to the digital traces recorded by a multitude of services and tools, is very promising for social scientists. Before continuing this line of argument, let’s clarify one point that might attract criticism. Considering human behaviour as the outcome of the mutual influence of conscious action, unconscious heuristics, biases and environmental influences does not mean a return to a form of reductionism in which people’s opinions count for nothing. Self-reported data will remain an important source of information for social scientists; at the same time, behavioural data will complement them in helping us to understand complex social phenomena. If we cross-tabulate the typology of data with the modes of human behaviour and decision-making, as shown in Table 1.2, the complementarity becomes clearer.

Table 1.2 Cross-tabulation between typology of data and modes of behaviour

Typology of data | Type of human behaviour
Self-reported | System 2, rational deliberation, attitudes, conscious description
Behavioural | System 1, heuristics use and context/environment influence

The distinction between self-reported and behavioural data is no longer mainly theoretical because the new opportunities for collecting the latter are unprecedented. These opportunities open up new research directions, as well as the possibility of reviewing current theories and existing models. Table 1.2 reports a distinction that is particularly useful for those who are interested in studying human behaviour at the micro and meso levels, that is to say at the individual and the group levels of analysis, but it is less pertinent to the macro level.

According to Granovetter (2017: 13), both overrational and oversocialized models of human behaviour are atomistic in nature: ‘both share a conception of action by atomized actors. In the under-socialized account, atomization results from narrow pursuit of self-interest; in the oversocialized one, from behavioural patterns having been internalized and thus little affected by ongoing social relations.’ Behavioural digital data can have, among other features, a great deal of information about social relations and people’s embeddedness; they can help overcome such an atomized view (we will return to this issue later in the book).

However, the increased availability of data about people’s behaviour does not free us from the biases generated by the design and aims of digital platforms. People’s behaviour is constrained by the platform they use; for example, it is not possible to write an essay on Twitter unless we split it across a large number of individual tweets. There are, therefore, several potential confounding factors, as we will elaborate in the section below on construct validity.

Returning to the issue of the different levels of analysis, another distinction is relevant at all levels: the one between static and dynamic data. The large majority of data collected in the social sciences have been ‘static’, that is, collected at a single point in time. The reason is that longitudinal data collection, in which data are collected over a period of time, was very difficult and expensive. The only exceptions were analyses of documents and data that were archived and therefore accessible, for example newspapers and administrative data collected by governments or other institutions. Relying on static data has produced an involuntary emphasis on theories that focus on ‘structures’ rather than on processes (Abbott, 2001). In other words, it has been historically difficult for social scientists to observe the dynamic unfolding of social events, especially at the macro level, because collecting data for this purpose was extremely complex and demanding in terms of resources. Most surveys are cross-sectional, meaning that they are carried out once or twice, and the same applies to interviews and other forms of data collection. Digital data bring a much greater capacity for recording and using longitudinal data for social scientific purposes. Obviously, digital data have existed for only a few decades, but future researchers might have at their disposal longitudinal datasets that were absent in the past. The dynamic nature of digital data might prove more enriching than their raw size; ‘big data’ concerns not only size but also resolution, as we will discuss later.

Behavioural digital data are the focus of a new generation of social scientists who believe in their potential to renew current theories and frameworks, which were developed under conditions of data scarcity, with different models of human behaviour and an overreliance on self-reported data. It is too early to say what changes increased data availability will bring, but this is the most exciting aspect of the use of digital data for social scientific research. Data collected from the digital world are not without problems, however, and they pose specific challenges to researchers. In the next section, we will discuss the nature of big data and later we will look at some of the methodological issues concerning their application in the context of social science research. After that, we will distinguish between different types of digital data, with an emphasis on unstructured and semi-structured data, given that these are particularly interesting, as well as challenging, for social scientists.

Big Data

Currently, the most discussed type of digital data in the social sciences is so-called ‘big data’. The expression covers different objects and technologies and is therefore one of those umbrella terms to which it is virtually impossible to attach a consensual definition. A common description of big data refers to the three Vs (Figure 1.1) of volume, velocity and variety. Volume refers to the quantity of data produced: this is the most salient feature of big data, the sheer amount of data created by digital services and goods. Velocity stands for the fast-moving nature of digital data that are often produced ‘on the fly’; for example, when we search for something on a search engine, the exact list of results is generated instantly from our query. The third V stands for variety, the multiplicity of formats that data can have in the digital world. This variety is a source of richness but also of trouble for social science researchers, because big data are generated by a vast range of largely invisible processes with frequently incomparable dimensions and different degrees of dimensionality.

Figure 1.1 The three Vs of Big Data

Source: based on Claverie-Berge (2012).

This is an important point that deserves consideration. The large majority of big data, from the most common, such as social media and search engine data, to transactions at self-check-outs in hotels or supermarkets, are generated for different and specific purposes. They are not designed by a researcher who already has in mind a theoretical framework of reference and an analytical strategy. In contrast, surveys are designed data-harvesting instruments. Survey designers are experts in the art of eliciting the types of records that allow the processes that generated them to be inferred, and that contribute in pre-understood ways to the statistical modelling and sample selection controls that will be used to model them. Surveys are deliberately designed to tame the effects of multiply entangled correlations. Big data, by contrast, are just a large conglomerate of such correlations; very often they are not carefully designed. Twitter and big national surveys have both been used to analyse public opinion, but their data are different and so what they can reveal about public opinion is different in each case. Sentiment analysis of Twitter data, in which the emotional valence of tweets is computed by text mining, is now a popular way of tracking public opinion, and it is a task for which surveys are not well suited. However, unpacking people’s attitudes about public issues is probably still better served by carefully designed surveys and associated samples. From this point of view, the debate between big data enthusiasts and sceptics should be formulated differently. There are social research questions and issues for which big data are interesting and others for which ‘traditional’ social scientific methods are still more reliable and useful.
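To give a sense of what computing the ‘emotional valence’ of a tweet by text mining can mean in its very simplest form, the following sketch scores short texts against a tiny word list. The lexicon, the example messages and the function name `valence` are all invented for this illustration; actual sentiment analysis relies on much larger, validated lexicons or on trained models.

```python
import re

# Toy lexicon (an assumption for illustration, not a validated resource).
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "terrible": -1, "hate": -1}

def valence(text: str) -> int:
    """Sum the lexicon scores of the words in a text; 0 means neutral or unknown."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

# Invented example messages.
messages = [
    "I love the new policy, great news",
    "Terrible decision, I hate it",
    "The vote is on Tuesday",
]

scores = [valence(m) for m in messages]
print(scores)
```

Even this toy version shows why such scores differ from survey answers: the number summarizes the wording of a message that someone happened to post, not a considered response to a designed question.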

Therefore, one of the first characteristics of big data, highly relevant for the social scientist, is that as social research data they are ‘organic’ rather than ‘designed’. Currently, data are becoming a cheap commodity, simply because society has created systems that automatically track transactions of all sorts. For example, Internet search engines build datasets with every entry; Twitter generates tweet data continuously; traffic cameras digitally count cars; scanners record purchases; Internet sites capture and store mouse clicks. Collectively, human society is assembling massive amounts of behavioural data. If we think of these processes as an ecosystem, it is self-measuring on an increasingly broad scale. Indeed, we might label these data as ‘organic’, a now-natural feature of this ecosystem. Big data are ‘organic’ in the sense that they are created by different actors in the context not of research but of producing or delivering goods or services. This is in contrast to ‘designed’ data, those that are collected when we design experiments, questionnaires, focus groups, etc. and that do not exist until they are collected.

Social scientists are not completely new to this way of using data. There is a longstanding tradition of secondary data analysis, but there are some important differences as well. Secondary data in the social sciences generally indicate the reuse of existing datasets collected either by official institutions or by other researchers (Vartanian, 2011). These datasets might not have been collected for research purposes (although they often are), but they are usually publicly accessible and their methodological features are quite transparent. Some datasets are of very high quality, for example those from academic or research institutions and governmental organizations; others might be less reliable. What secondary data have in common with big data is the idea of repurposing: data that were initially collected for other aims are repurposed for new specific research goals set by the secondary analyst. The difference is that, for big data, especially those collected by private companies, the lack of transparency about how the data are collected or coded is a problem that digital social researchers have to face.

Repurposing data requires a good understanding of the context in which the data were generated in the first place. In other words, these data are not ‘natural’: they are the outcome of design decisions and socioeconomic processes, and were therefore created with certain goals and trade-offs. The task is to identify the weaknesses of the repurposed data and, at the same time, to find their strengths. A good practice for social scientists, which applies not only to big data, is to think about the ideal dataset for their research and then compare it with what is available. This makes salient the problems and opportunities of what is available.

In the previous sections, we have highlighted how digital data, including in their big data format, possess some useful characteristics for social scientific research. There are essentially three positive features:

Increased size and resolution. Big data are by definition ‘big’, but what this adjective really means requires further elaboration. Size refers to the sheer number of cases or participants that we can include in our research. A much-debated ambition of big data is to move from samples to populations, in other words to include in a study not a carefully selected and hopefully representative portion of a population to infer something about it, but a direct analysis of the entire population itself. Another way of interpreting the ‘big’ in big data concerns not the number of cases but their increased ‘resolution’ (Housley et al., 2014). By resolution, we mean the number of data points available for each subject. To explain the concept, let’s take a comparative example. If we design a questionnaire with twenty questions, we will collect twenty data points per participant. Historically in the social sciences, the number of data points available per person has been limited by constraints in the data collection processes. It has been unfeasible to use questionnaires with two hundred questions because participants would not be able to answer without great fatigue, which would affect the quality of the data collected. Therefore, the amount of data collected for each individual has always been a careful trade-off between different variables determined by research goals and the planned analytical strategy. One of the most attractive features of big data is the possibility of having a much larger set of data points per individual. (However, it is a different story to have coherent data points to be used in one specific ‘construct’, something that we will discuss in the next section about construct validity.) Let’s take the example of the data collected by Facebook for each of its users. While it is impossible to know exactly how many variables are collected by the American social media company, what is already visible to researchers is in the order of hundreds of variables. Such richness represents an intriguing possibility for social scientists, but at the same time poses the challenge of how to deal with so much data for each individual. The availability of many data points is particularly interesting for the measurement of stratified variables, that is, variables that measure complex constructs made up of several components. Already, in survey research, complex research objects are measured by multiple items (questions) that are supposed to measure different sub-aspects and that can be combined. This is an important methodological point to which we will return later in the section about construct validity because big data also present problems in testing the validity of constructs.

Big data are long, that is, longitudinal. The other useful feature of big data for social scientists is their lengthy time span. The longitudinal nature of big data creates a novel opportunity for the social sciences, which have historically encountered high costs and technical difficulties in collecting data over long periods of time. Being able to study the unfolding of social events can help us to understand their processes. For example, Russell Neumann et al. (2014) revised the famous agenda-setting theory by studying more than 100 million active Twitter users, archives of approximately 160 million active blogs, and 300,000 forums and message boards for one entire calendar year. Besides the size of the datasets in terms of users and documents involved, one year of tracking such data is a step forward from previous studies based on a much more limited time span. The span of the available digital data is limited by how long the related technologies have been in widespread use, but the digitalization of archives of documents, when complete and in a good state, provides researchers with unprecedented opportunities. In general, the continuous recording of digital data means that the ‘recorder is always on’ and therefore we can reconstruct people’s expressed opinions, behaviour and choices for longer periods of time.

Non-reactive heterogeneity, including behavioural. Another important feature of big data is that they are not, in most cases, collected by means of direct elicitation from people. While surveys, interviews, experiments, etc. require the active engagement of participants, most digital data are collected in the background. People are barely aware of the process of data collection while they are using services or tools such as smartphones, tablets or PCs. The advantage of this almost invisible footprint is that it makes the Hawthorne effect less likely. At the same time, this opacity of the data collection process does raise ethical concerns, as we will discuss later in the book. There is another important consequence of the invisibility of digital data collection: digital data and their enhanced large version, big data, are better suited to capturing behavioural information than traditional social scientific instruments. We have already discussed the difference between self-reported and behavioural data. The latter type has traditionally been difficult to collect accurately for large groups of people and over a long time. Now, let’s take the example of my smartphone GPS data: for as long as the GPS has been active, say over the past year, it has accurately tracked my movements. One year of someone’s movements can reveal a great deal and is therefore an object of privacy protection. However, the possibility of using the aggregated data of a group of people or of a specific community in order to study mobility on roads and urban design issues is already being exploited by some city planners.
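The notions of size, resolution and longitudinal span can be illustrated with a toy event log. The sketch below is a minimal example under invented assumptions: the users, dates and kinds of trace are made up, and real platform data would be far larger and messier. It counts the number of subjects (size), the data points per subject (resolution) and the volume of traces per month (a simple longitudinal view).

```python
from collections import defaultdict
from datetime import date

# Invented event log: (user, date, kind of digital trace).
events = [
    ("u1", date(2023, 1, 5), "post"),
    ("u1", date(2023, 1, 20), "like"),
    ("u1", date(2023, 3, 2), "post"),
    ("u2", date(2023, 2, 11), "search"),
    ("u2", date(2023, 2, 12), "post"),
    ("u3", date(2023, 3, 30), "like"),
]

points_per_user = defaultdict(int)   # resolution: data points per subject
events_per_month = defaultdict(int)  # longitudinal view: activity over time

for user, day, _kind in events:
    points_per_user[user] += 1
    events_per_month[(day.year, day.month)] += 1

size = len(points_per_user)            # number of subjects in the log
mean_resolution = len(events) / size   # average data points per subject
months = sorted(events_per_month)      # ordered (year, month) keys

print("size:", size)
print("mean resolution:", mean_resolution)
for month in months:
    print(month, events_per_month[month])
```

Note that the log contains no questions and no answers: every record is a by-product of use, which is what makes such data non-reactive, and also what makes it necessary to ask what each trace actually measures.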