Category Archives: news

The limits of social media big data

A new book chapter by Rob Kitchin has been published in The Sage Handbook of Social Media Research Methods, edited by Luke Sloan and Anabel Quan-Haase. The chapter is titled ‘Big data – hype or revolution’ and provides a general introduction to big data, new epistemologies and data analytics, with the latter part focusing on social media data. The text below is a sample taken from a section titled ‘The limits of social media big data’.

The discussion so far has argued that there is something qualitatively different about big data compared with small data, and that it opens up new epistemological possibilities, some of which have more value than others. In general terms, it has been intimated that big data represents a revolution in measurement that will inevitably lead to a revolution in how academic research is conducted, and that big data studies will replace small data ones. However, this is unlikely to be the case for a number of reasons.

Whilst small data may be limited in volume and velocity, they have a long history of development across science, state agencies, non-governmental organizations and business, with established methodologies and modes of analysis, and a record of producing meaningful answers. Small data studies can be much more finely tailored to answer specific research questions and to explore in detail and in-depth the varied, contextual, rational and irrational ways in which people interact and make sense of the world, and how processes work. Small data can focus on specific cases and tell individual, nuanced and contextual stories.

Big data are often repurposed to try to answer questions for which they were never designed. For example, geotagged Twitter data were not produced to provide answers with respect to the geographical concentration of language groups in a city and the processes driving such spatial autocorrelation. We should perhaps not be surprised, then, that they provide only a surface snapshot, albeit an interesting one, rather than deep, penetrating insights into the geographies of race, language, agglomeration and segregation in particular locales. Moreover, big data might seek to be exhaustive, but as with all data they are both a representation and a sample. What data are captured is shaped by: the field of view/sampling frame (where data capture devices are deployed and what their settings/parameters are; who uses a space or media, e.g., who belongs to Facebook); the technology and platform used (different surveys, sensors, lenses, textual prompts, layouts, etc. all produce variances and biases in what data are generated); the context in which data are generated (unfolding events mean data are always situated with respect to circumstance); the data ontology employed (how the data are calibrated and classified); and the regulatory environment with respect to privacy, data protection and security (Kitchin, 2013, 2014a). Further, big data generally capture what is easy to ensnare – data that are openly expressed (what is typed, swiped, scanned, sensed, etc.; people’s actions and behaviours; the movement of things) – as well as data that are the ‘exhaust’, a by-product, of the primary task/output.

Small data studies, then, mine gold from working a narrow seam, whereas big data studies seek to extract nuggets through open-pit mining, scooping up and sieving huge tracts of land. These two approaches of narrow versus open mining have consequences with respect to data quality, veracity and lineage. Given the limited sample sizes of small data, three properties are of paramount importance: data quality – how clean (error and gap free), objective (bias free) and consistent (few discrepancies) the data are; veracity – the authenticity of the data and the extent to which they accurately (precision) and faithfully (fidelity, reliability) represent what they are meant to; and lineage – documentation that establishes provenance and fitness for use (Lauriault, 2012). In contrast, some have argued that big data studies do not need the same standards of data quality, veracity and lineage because the exhaustive nature of the dataset removes sampling biases and more than compensates for any errors, gaps or inconsistencies in the data, or weaknesses in fidelity (Mayer-Schonberger and Cukier, 2013). The argument for such a view is that ‘with less error from sampling we can accept more measurement error’ (p. 13) and ‘tolerate inexactitude’ (p. 16).

Nonetheless, the warning ‘garbage in, garbage out’ still holds. The data can be biased due to the demographic being sampled (e.g., not everybody uses Twitter), or the data might be gamed or faked through false accounts or hacking (e.g., there are hundreds of thousands of fake Twitter accounts seeking to influence trending and direct clickstream trails) (Bollier, 2010; Crampton et al., 2012). Moreover, the technologies being used and their working parameters can affect the nature of the data. For example, which posts on social media are most read or shared is strongly affected by ranking algorithms, not simply interest (Baym, 2013). Similarly, APIs structure what data are extracted, for example, in Twitter only capturing specific hashtags associated with an event rather than all relevant tweets (Bruns, 2013), with González-Bailón et al. (2012) finding that different methods of accessing Twitter data – search APIs versus streaming APIs – produced quite different sets of results. As a consequence, there is no guarantee that two teams of researchers attempting to gather the same data at the same time will end up with identical datasets (Bruns, 2013). Further, the choice of metadata and variables that are generated, and which ones are ignored, paints a particular picture (Graham, 2012). With respect to fidelity, there are question marks as to the extent to which social media posts really represent people’s views and the faith that should be placed in them. Manovich (2011: 6) warns that ‘[p]eoples’ posts, tweets, uploaded photographs, comments, and other types of online participation are not transparent windows into their selves; instead, they are often carefully curated and systematically managed’.
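The non-replicability point can be illustrated with a toy simulation (this is our own sketch, not an example from the chapter; the message pool, hashtag and sampling rate are all hypothetical stand-ins for real platform data): two teams drawing a random ~1 percent sample from the same stream at the same time end up with largely different datasets, while a hashtag search over the same pool returns a different subset again.

```python
import random

# Hypothetical message pool: 100,000 messages, 5% carrying an event hashtag.
messages = [{"id": i, "tag": "#event" if i % 20 == 0 else "#other"}
            for i in range(100_000)]

def spritzer_sample(msgs, rate=0.01, seed=None):
    """Random ~1% sample, loosely analogous to a streaming 'spritzer'."""
    rng = random.Random(seed)
    return [m for m in msgs if rng.random() < rate]

def hashtag_search(msgs, tag="#event"):
    """Keyword filter, loosely analogous to a search-API query."""
    return [m for m in msgs if m["tag"] == tag]

# Two 'teams' collecting at the same time, via the same sampling method,
# see mostly different messages because each draws its own random sample.
team_a = spritzer_sample(messages, seed=1)
team_b = spritzer_sample(messages, seed=2)
search_results = hashtag_search(messages)

ids_a = {m["id"] for m in team_a}
ids_b = {m["id"] for m in team_b}
print(f"team A: {len(ids_a)} messages; team B: {len(ids_b)} messages")
print(f"shared between teams: {len(ids_a & ids_b)}")
print(f"hashtag search: {len(search_results)} messages")
```

With ~1,000 messages per sample drawn from 100,000, the two teams typically share only a handful of messages, and neither sample coincides with the keyword-filtered set – the access method, not the underlying activity, determines the dataset.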

There are also issues of access to both small and big data. Small data produced by academia, public institutions, non-governmental organizations and private entities can be restricted in access, limited in use to defined personnel, or available for a fee or under license. Increasingly, however, public institution and academic data are becoming more open. Big data are, with a few exceptions such as satellite imagery and national security and policing, mainly produced by the private sector. Access is usually restricted behind paywalls and proprietary licensing, limited to ensure competitive advantage and to leverage income through their sale or licensing (CIPPIC, 2006). Indeed, it is somewhat of a paradox that only a handful of entities are drowning in the data deluge (boyd and Crawford, 2012), and companies such as mobile phone operators, app developers, social media providers, financial institutions, retail chains, and surveillance and security firms are under no obligation to share freely the data they collect through their operations. In some cases, a limited amount of the data might be made available to researchers or the public through Application Programming Interfaces (APIs). For example, Twitter allows a few companies to access its firehose (stream of data) for a fee for commercial purposes (and retains the latitude to dictate terms with respect to what can be done with such data), but with a handful of exceptions researchers are restricted to a ‘gardenhose’ (c. 10 percent of public tweets), a ‘spritzer’ (c. one percent of public tweets), or to different subsets of content (‘white-listed’ accounts), with private and protected tweets excluded in all cases (boyd and Crawford, 2012). The worry is that the insights that privately owned and commercially sold big data can provide will be limited to a privileged set of academic researchers whose findings cannot be replicated or validated (Lazer et al., 2009).

Given the relative strengths and limitations of big and small data, it is fair to say that small data studies will continue to be an important element of the research landscape, despite the benefits that might accrue from using big data such as social media data. However, small data studies will increasingly come under pressure to utilize new archiving technologies and to be scaled up within digital data infrastructures, so that they are preserved for future generations, become accessible for re-use and combination with other small and big data, and yield more value and insight through the application of big data analytics.

Rob Kitchin

Post advertised: Postdoc on ProgCity project

We are seeking a postdoctoral researcher (14-month contract) to join the Programmable City project. The researcher will critically examine:

  • the political economy of smart city technologies and initiatives; the creation of smart city markets; the inter-relation of urban (re)development and smart city initiatives; the relationship between vendors, business lobby groups, economic development agencies, and city administrations; financialization and new business models; and/or,
  • the relationship between the political geography of city administration, governance arrangements, and smart city initiatives; political and legal geographies of testbed urbanism and smart city initiatives; smart city technologies and governmentality.

There will be some latitude to negotiate with the principal investigator the exact focus of the research undertaken. While some of the research will require primary fieldwork (Dublin/Boston), it is anticipated it will also involve the secondary analysis of data already generated by the project.

More details on the post and how to apply can be found on the university HR website.  Closing date: 5th December.

Seminar: “Understanding Human Behavior, the Environment and Our Cities Through Measurement & Analysis” by Marguerite Nyhan

We are delighted to have Dr. Marguerite Nyhan as a guest speaker on Tuesday 11th October at 4pm in the Iontas Building, room 2.31, for the first of our Programmable City seminars in the 2016/17 academic year.

Dr. Marguerite Nyhan is a Post-Doctoral Researcher at Harvard University, based in the Department of Environmental Health. Prior to her current appointment, she led the Urban Environmental Research Team at Massachusetts Institute of Technology’s Senseable City Laboratory. Marguerite holds a PhD in Civil & Environmental Engineering from Trinity College Dublin. During her PhD, she was a Fulbright Scholar at MIT. Marguerite has spoken widely about her research including addressing the United Nations Environment Assembly in Kenya, and TEDx Dublin. She has also lectured in the Department of Urban Studies & Planning at MIT.

Marguerite will be talking about modeling and predicting interactions between human populations, urban systems, the natural environment and the built environment.


Two new postdoctoral posts on ProgCity project

The Programmable City project is seeking two postdoctoral researchers (14-month contracts). Preferably the posts will critically examine one of the following:

• the production of software underpinning smart city technologies and how software developers translate rules, procedures and policies into a complex architecture of interlinked algorithms that manage and govern how people traverse or interact with urban systems; or,

• the political economy of smart city technologies and initiatives; the creation of smart city markets; the inter-relation of urban (re)development and smart city initiatives; the relationship between vendors, business lobby groups, economic development agencies, and city administrations; financialization and new business models; or,

• the relationship between the political geography of city administration, governance arrangements, and smart city initiatives; political and legal geographies of testbed urbanism and smart city initiatives; smart city technologies and governmentality.

We are prepared to consider any other proposal that critically interrogates the relationship between software, data and the production of smart cities, and there will be some latitude to negotiate with the principal investigator the exact focus of the research undertaken.

While some of the research will require primary fieldwork, it is anticipated it will also involve the secondary analysis of data already generated by the project.

The project will be based in the National Institute for Regional and Spatial Analysis (NIRSA) at Maynooth University.

More details on how to apply can be found on the University human resources site.  Closing date is 5th August.

Boston fieldwork

Boston and Brookline from the tower in Mount Auburn Cemetery.

From April 2nd to 30th, five members of the Programmable City team travelled to Boston (or rather, as we quickly learned, the Metro-Boston area, which is a conglomerate of 101 municipalities) to undertake fieldwork, staying in Cambridge. Over the course of a busy month the team:

  • conducted 75 interviews/focus groups;
  • had 25 informal meetings;
  • undertook participant observation at 3 civic hacks;
  • were given 4 tours of facilities and 2 of the city;
  • presented 7 invited talks (at MIT (3), Harvard, Northeastern, UMass Boston and Analog Devices);
  • attended 8 other workshops/conferences (Bits and Bricks at MIT; Using Technology to Engage Constituents and Improve Governance at Northeastern; Civic Media meetup at MIT; Urban Mobility in Green Cities at Boston Univ; Microsoft Civic Innovation; Climate Change Policy after Paris at Boston Univ; Digital GeoHumanities at Harvard; City Mart at NY Civic Hall).

The interviews were conducted with a range of different stakeholders including municipal, regional and state-level government officials, various agencies, university researchers, and companies.  The research focused on mapping out the smart city landscape in general terms, with a particular in-depth focus on various data-driven initiatives in the metro area, transportation solutions, civic hacking, the development of civic tech, procurement of smart city technologies, and emergency management response.

Along with the 29 interviews conducted on previous visits, we now have a rich dataset of over 100 interviews to analyse in order to make sense of the Boston Metro area’s use of smart city technologies and to compare with Dublin (for which we have a couple of hundred interviews).  That said, we’ve not quite finished with the fieldwork and a couple of team members will be back at some point to extend their work.  We’ll also be returning for the Association of American Geographers conference which is being held in Boston in 2017 to present some of our findings.

We would like to thank everyone who agreed to take part in our research and for generously sharing their knowledge, insights and time, and also for helping to introduce us to other potential interviewees and generally steer us in the right direction.  We very much appreciate the excellent hospitality we received during our visit.  The next task is to get all the interviews transcribed and to start the coding work.  No small task!

Rob Kitchin

Seminar: “Smartcontracts and smartcities, displacing power through authentication?”

We are delighted to have Dr. Gianluca Miscione as a guest speaker on Wednesday 18th May at 3pm in the Iontas Building, room 2.31, for the fourth of our Programmable City seminars this year.

Gianluca Miscione joined the Management Information Systems group at the School of Business of University College Dublin in June 2012. Previously, he worked as Assistant Professor in Geo-Information and Organization at the Department of Urban and Regional Planning and Geo-Information Management, Faculty of Geo-Information Science and Earth Observation, University of Twente, Netherlands. He received his Ph.D. in Information Systems and Organization from the Sociology Department of the University of Trento, in collaboration with the Sociology Department of Binghamton University New York and the School of International Service of American University in Washington DC. While at the Department of Informatics of the University of Oslo, he broadened his research to information infrastructures at the global scale. Gianluca has conducted and contributed to research in Europe, Latin America, India, East Africa, and on the Internet. His focus has remained on the interplay between technologies and organizing processes, with a specific interest in innovation, development, organizational change and trust.

Gianluca will be talking about organizing processes related to the automation of authentication in ‘smart contracts’, exploring what novel forms of ‘sociation’ smart contracts entangle with.
