Tag Archives: big data

The limits of social media big data

A new book chapter by Rob Kitchin has been published in The Sage Handbook of Social Media Research Methods, edited by Luke Sloan and Anabel Quan-Haase. The chapter is titled ‘Big data – hype or revolution’ and provides a general introduction to big data, new epistemologies and data analytics, with the latter part focusing on social media data. The text below is a sample taken from a section titled ‘The limits of social media big data’.

The discussion so far has argued that there is something qualitatively different about big data from small data and that it opens up new epistemological possibilities, some of which have more value than others. In general terms, it has been intimated that big data does represent a revolution in measurement that will inevitably lead to a revolution in how academic research is conducted; that big data studies will replace small data ones. However, this is unlikely to be the case for a number of reasons.

Whilst small data may be limited in volume and velocity, they have a long history of development across science, state agencies, non-governmental organizations and business, with established methodologies and modes of analysis, and a record of producing meaningful answers. Small data studies can be much more finely tailored to answer specific research questions and to explore in detail and in-depth the varied, contextual, rational and irrational ways in which people interact and make sense of the world, and how processes work. Small data can focus on specific cases and tell individual, nuanced and contextual stories.

Big data is often being repurposed to try and answer questions for which it was never designed. For example, geotagged Twitter data have not been produced to provide answers with respect to the geographical concentration of language groups in a city and the processes driving such spatial autocorrelation. We should perhaps not be surprised then that it only provides a surface snapshot, albeit an interesting snapshot, rather than deep penetrating insights into the geographies of race, language, agglomeration and segregation in particular locales. Moreover, big data might seek to be exhaustive, but as with all data they are both a representation and a sample. What data are captured is shaped by: the field of view/sampling frame (where data capture devices are deployed and what their settings/parameters are; who uses a space or media, e.g., who belongs to Facebook); the technology and platform used (different surveys, sensors, lenses, textual prompts, layout, etc. all produce variances and biases in what data are generated); the context in which data are generated (unfolding events mean data are always situated with respect to circumstance); the data ontology employed (how the data are calibrated and classified); and the regulatory environment with respect to privacy, data protection and security (Kitchin, 2013, 2014a). Further, big data generally capture what is easy to ensnare – data that are openly expressed (what is typed, swiped, scanned, sensed, etc.; people’s actions and behaviours; the movement of things) – as well as data that are the ‘exhaust’, a by-product, of the primary task/output.

Small data studies then mine gold from working a narrow seam, whereas big data studies seek to extract nuggets through open-pit mining, scooping up and sieving huge tracts of land. These two approaches of narrow versus open mining have consequences with respect to data quality, fidelity and lineage. Given the limited sample sizes of small data, data quality – how clean (error and gap free), objective (bias free) and consistent (few discrepancies) the data are; veracity – the authenticity of the data and the extent to which they accurately (precision) and faithfully (fidelity, reliability) represent what they are meant to; and lineage – documentation that establishes provenance and fit for use; are of paramount importance (Lauriault, 2012). In contrast, it has been argued by some that big data studies do not need the same standards of data quality, veracity and lineage because the exhaustive nature of the dataset removes sampling biases and more than compensates for any errors or gaps or inconsistencies in the data or weakness in fidelity (Mayer-Schönberger and Cukier, 2013). The argument for such a view is that ‘with less error from sampling we can accept more measurement error’ (p. 13) and ‘tolerate inexactitude’ (p. 16).

Nonetheless, the warning ‘garbage in, garbage out’ still holds. The data can be biased due to the demographic being sampled (e.g., not everybody uses Twitter) or the data might be gamed or faked through false accounts or hacking (e.g., there are hundreds of thousands of fake Twitter accounts seeking to influence trending and direct clickstream trails) (Bollier, 2010; Crampton et al., 2012). Moreover, the technologies being used and their working parameters can affect the nature of the data. For example, which posts on social media are most read or shared is strongly affected by ranking algorithms, not simply interest (Baym, 2013). Similarly, APIs structure what data are extracted, for example, in Twitter only capturing specific hashtags associated with an event rather than all relevant tweets (Bruns, 2013), with González-Bailón et al. (2012) finding that different methods of accessing Twitter data – search APIs versus streaming APIs – produced quite different sets of results. As a consequence, there is no guarantee that two teams of researchers attempting to gather the same data at the same time will end up with identical datasets (Bruns, 2013). Further, the choice of which metadata and variables are generated and which are ignored paints a particular picture (Graham, 2012). With respect to fidelity there are question marks as to the extent to which social media posts really represent people’s views and the faith that should be placed in them. Manovich (2011: 6) warns that ‘[p]eoples’ posts, tweets, uploaded photographs, comments, and other types of online participation are not transparent windows into their selves; instead, they are often carefully curated and systematically managed’.

There are also issues of access to both small and big data. Small data produced by academia, public institutions, non-governmental organizations and private entities can be restricted in access, limited in use to defined personnel, or available for a fee or under license. Increasingly, however, public institution and academic data are becoming more open. Big data are, with a few exceptions such as satellite imagery and national security and policing, mainly produced by the private sector. Access is usually restricted behind paywalls and proprietary licensing, limited to ensure competitive advantage and to leverage income through their sale or licensing (CIPPIC, 2006). Indeed, it is somewhat of a paradox that only a handful of entities are drowning in the data deluge (boyd and Crawford, 2012) and companies such as mobile phone operators, app developers, social media providers, financial institutions, retail chains, and surveillance and security firms are under no obligation to share freely the data they collect through their operations. In some cases, a limited amount of the data might be made available to researchers or the public through Application Programming Interfaces (APIs). For example, Twitter allows a few companies to access its firehose (stream of data) for a fee for commercial purposes (and has the latitude to dictate terms with respect to what can be done with such data), but with a handful of exceptions researchers are restricted to a ‘gardenhose’ (c. 10 percent of public tweets), a ‘spritzer’ (c. one percent of public tweets), or to different subsets of content (‘white-listed’ accounts), with private and protected tweets excluded in all cases (boyd and Crawford, 2012). The worry is that the insights that privately owned and commercially sold big data can provide will be limited to a privileged set of academic researchers whose findings cannot be replicated or validated (Lazer et al., 2009).

Given the relative strengths and limitations of big and small data it is fair to say that small data studies will continue to be an important element of the research landscape, despite the benefits that might accrue from using big data such as social media data. However, it should be noted that small data studies will increasingly come under pressure to utilize the new archiving technologies, being scaled-up within digital data infrastructures in order that they are preserved for future generations, become accessible to re-use and combination with other small and big data, and more value and insight can be extracted from them through the application of big data analytics.

Rob Kitchin

Video: Data Politics and Internet of Things

In November 2016, CONNECT, The Programmable City and Maynooth University Social Science Institute (MUSSI) invited a panel of international and local experts from different disciplines to explore the broader political, economic and social implications of Internet of Things.

The panel included Linda Doyle (Trinity College Dublin), Anne Helmond (University of Amsterdam), Aphra Kerr (Maynooth University), Rob Kitchin (Maynooth University), Liz McFall (Open University) and Alison Powell (LSE). The video of the presentations by the panel members and also the discussion afterwards are available to view now.

For more details of the event, please see Science Gallery Dublin’s event page here, or here for a workshop organised earlier in the day.

New paper: The ethics of smart cities and urban science

A new paper by Rob Kitchin has been published in Philosophical Transactions A titled ‘The ethics of smart cities and urban science’ in a special issue on ‘The ethical impact of data science’.

Abstract

Software-enabled technologies and urban big data have become essential to the functioning of cities. Consequently, urban operational governance and city services are becoming highly responsive to a form of data-driven urbanism that is the key mode of production for smart cities. At the heart of data-driven urbanism is a computational understanding of city systems that reduces urban life to logic and calculative rules and procedures, which is underpinned by an instrumental rationality and realist epistemology. This rationality and epistemology are informed by and sustain urban science and urban informatics, which seek to make cities more knowable and controllable. This paper examines the forms, practices and ethics of smart cities and urban science, paying particular attention to: instrumental rationality and realist epistemology; privacy, datafication, dataveillance and geosurveillance; and data uses, such as social sorting and anticipatory governance. It argues that smart city initiatives and urban science need to be re-cast in three ways: a re-orientation in how cities are conceived; a reconfiguring of the underlying epistemology to openly recognize the contingent and relational nature of urban systems, processes and science; and the adoption of ethical principles designed to realize benefits of smart cities and urban science while reducing pernicious effects.

The paper is behind a paywall, so if you don’t have access and you’re interested in reading it, email Rob (rob.kitchin@nuim.ie) and he’ll send you a copy.

Big data and the city

A special issue of ‘Built Environment’ – Big Data and the City – edited by Mike Batty has just been published and includes a paper by Gavin McArdle and Rob Kitchin on improving the veracity of open and real-time urban data. Full details of contents below:

  • Editorial: Big Data, Cities and Herodotus by MICHAEL BATTY
  • Big Data and the City by MICHAEL BATTY
  • From Origins to Destinations: The Past, Present and Future of Visualizing Flow Maps by MATTHEW CLAUDEL, TILL NAGEL, and CARLO RATTI
  • Towards a Better Understanding of Cities Using Mobility Data by MAXIME LENORMAND and JOSÉ J. RAMASCO
  • Finding Pearls in London’s Oysters by JON READES, CHEN ZHONG, ED MANLEY, RICHARD MILTON and MICHAEL BATTY
  • A Classification of Multidimensional Open Data for Urban Morphology by ALEXANDROS ALEXIOU, ALEX SINGLETON, and PAUL A. LONGLEY
  • User-Generated Big Data and Urban Morphology by A.T. CROOKS, A. CROITORU, A. JENKINS, R. MAHABIR, P. AGOURIS and A. STEFANIDIS
  • Sensing Spatiotemporal Patterns in Urban Areas: Analytics and Visualizations Using the Integrated Multimedia City Data Platform by PIYUSHIMITA (VONU) THAKURIAH, KATARZYNA SILA-NOWICKA, and JORGE GONZALEZ PAULE
  • Playful Cities: Crowdsourcing Urban Happiness with Web Games by DANIELE QUERCIA
  • Big Data for Healthy Cities: Using Location-Aware Technologies, Open Data and 3D Urban Models to Design Healthier Built Environment by HARVEY J. MILLER and KRISTIN TOLLE
  • Improving the Veracity of Open and Real-Time Urban Data by GAVIN MCARDLE and ROB KITCHIN
  • Wise Cities: ‘Old’ Big Data and ‘Slow’ Real Time by FABIO CARRERA
  • Collecting and Visualizing Real-Time Urban Data Through City Dash-Boards by STEVEN GRAY, OLIVER O’BRIEN and STEPHAN HÜGEL

Seminar: “Understanding Human Behavior, the Environment and Our Cities Through Measurement & Analysis” by Marguerite Nyhan

We are delighted to have Dr. Marguerite Nyhan as a guest speaker on Tuesday 11th October at 4pm, Iontas Building, room 2.31 for the first of our Programmable City seminars this academic year 2016/17.

Dr. Marguerite Nyhan is a Post-Doctoral Researcher at Harvard University, based in the Department of Environmental Health. Prior to her current appointment, she led the Urban Environmental Research Team at Massachusetts Institute of Technology’s Senseable City Laboratory. Marguerite holds a PhD in Civil & Environmental Engineering from Trinity College Dublin. During her PhD, she was a Fulbright Scholar at MIT. Marguerite has spoken widely about her research including addressing the United Nations Environment Assembly in Kenya, and TEDx Dublin. She has also lectured in the Department of Urban Studies & Planning at MIT.

Marguerite will be talking about modeling and predicting interactions between human populations, urban systems, the natural environment and the built environment.

New Paper: Locative media and data-driven computing experiments

Sung-Yueh Perng, Rob Kitchin and Leighton Evans have a new open access paper, Locative media and data-driven computing experiments, published in Big Data & Society today. It examines the staging of locative data and computing experiments to envision urban futures, and its consequences. More details are in the abstract below and the paper can be downloaded at http://bds.sagepub.com/content/3/1/2053951716652161.

Abstract

Over the past two decades urban social life has undergone a rapid and pervasive geocoding, becoming mediated, augmented and anticipated by location-sensitive technologies and services that generate and utilise big, personal, locative data. The production of these data has prompted the development of exploratory data-driven computing experiments that seek to find ways to extract value and insight from them. These projects often start from the data, rather than from a question or theory, and try to imagine and identify their potential utility. In this paper, we explore the desires and mechanics of data-driven computing experiments. We demonstrate how both locative media data and computing experiments are ‘staged’ to create new values and computing techniques, which in turn are used to try and derive possible futures that are ridden with unintended consequences. We argue that using computing experiments to imagine potential urban futures produces effects that often have little to do with creating new urban practices. Instead, these experiments promote Big Data science and the prospect that data produced for one purpose can be recast for another and act as alternative mechanisms of envisioning urban futures.