Below is the first draft of a 1,000-word entry on Big Data by Rob Kitchin for the forthcoming International Encyclopedia of Geography, to be published by Wiley-Blackwell and the Association of American Geographers. It sets out a definition of big data, how they are produced and analyzed, and some of their pros and cons. Feedback on the content would be welcome.
Big data consist of huge volumes of diverse, fine-grained, interlocking data produced on a dynamic basis, much of which are spatially and temporally referenced. The generation, processing and storing of such data have been enabled by advances in computing and networking. Insight and value can be extracted from them using new data analytics, including forms of machine learning and visualisation. Proponents contend that big data are reshaping how knowledge is produced, business conducted, and governance enacted. However, there are concerns regarding access to big data, their quality and veracity, and the ethical implications of their use with respect to dataveillance, social sorting, security, and control creep.
Keywords: big data, analytics, visualisation, ethics
Defining big data
The term ‘big data’ can be traced to the mid-1990s, when it was first used to refer to the handling and analysis of massive datasets (Diebold 2012). It is only since 2008, however, that the term has gained traction, becoming a business and industry buzzword. Like many rapidly emerging concepts, big data have been variously defined, but most commentators agree that they differ from what might be termed ‘small data’ with respect to their traits of volume, velocity and variety (Zikopoulos et al. 2012). Traditionally, data have been produced in tightly controlled ways using sampling techniques that limit their scope, temporality and size. Even very large datasets, such as national censuses, have generally been restricted to 30-40 questions and are carried out once every ten years. Advances in computing hardware, software and networking have, however, enabled much wider scope for producing, processing, analyzing and storing massive amounts of diverse data on a continuous basis. Moreover, big data generation strives to be: exhaustive, capturing entire populations or systems (n=all); fine-grained in resolution and uniquely indexical in identification; relational in nature, containing common fields that enable the conjoining of different data sets; and flexible, holding the traits of extensionality (new fields can be added easily) and scalability (size can expand rapidly) (boyd and Crawford 2012; Kitchin 2013; Marz and Warren 2012; Mayer-Schonberger and Cukier 2013). Big data thus comprise huge volumes of diverse, fine-grained, interlocking data produced on a dynamic basis. For example, in 2012 Wal-Mart was generating more than 2.5 petabytes (a petabyte being roughly 2^50 bytes) of data relating to more than 1 million customer transactions every hour (Open Data Center Alliance 2012), and Facebook was processing 2.5 billion pieces of content (links, comments, etc.), 2.7 billion ‘Like’ actions and 300 million photo uploads per day (Constine 2012).
Such big data, their proponents argue, enable new forms of knowledge that produce disruptive innovations with respect to how business is conducted and governance enacted. Given that a large proportion of big data are georeferenced, they hold considerable promise for new kinds of spatial analysis and modelling.
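The relational trait described above, common fields that enable the conjoining of different data sets, can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the records, the field names (customer_id, store, amount, home_county) and the helper join_on are assumptions, not drawn from any real system.

```python
from collections import defaultdict

# Hypothetical records: two data sets that share a common field, "customer_id".
transactions = [
    {"customer_id": "c1", "store": "Dublin", "amount": 12.50},
    {"customer_id": "c2", "store": "Cork", "amount": 40.00},
    {"customer_id": "c1", "store": "Galway", "amount": 7.25},
]
loyalty_profiles = [
    {"customer_id": "c1", "home_county": "Kildare"},
    {"customer_id": "c2", "home_county": "Cork"},
]


def join_on(left, right, key):
    """Conjoin two lists of records wherever they share the same key value."""
    # Index the right-hand data set by the common field for fast lookup.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    # Merge each left-hand record with every matching right-hand record.
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


linked = join_on(transactions, loyalty_profiles, "customer_id")

# Total spend per home county, which neither data set could yield on its own.
spend_by_county = defaultdict(float)
for row in linked:
    spend_by_county[row["home_county"]] += row["amount"]
```

The point of the sketch is simply that a shared, uniquely indexical field lets separately produced data sets be linked, so that questions can be asked of the combined records that could not be asked of either set alone.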
Sources of big data
Big data are produced in three broad ways: through directed, automated and volunteered systems (Kitchin 2013). Directed systems are controlled by a human operator and include CCTV, spatial video and LiDAR scans. Automated systems capture data as an inherent function of the technology and include: the recording of retail purchases at the point of sale; transactions and interactions across digital networks (e.g., sending emails, internet banking); the use of digital devices such as mobile phones that record and communicate the history of their own utilisation; clickstream data that record navigation through a website or app; measurements from sensors embedded into objects or environments; the scanning of machine-readable objects such as transponders and barcodes; and machine-to-machine interactions across the internet. Volunteered systems rely on users to gift data through uploads and interactions and include engaging in social media (e.g., posting comments, observations and photos to social networking sites such as Facebook) and the crowdsourcing of data, wherein users generate data and then contribute them to a common platform (e.g., uploading GPS traces to OpenStreetMap).
Analyzing big data
Given their volume, variety and velocity, big data present significant analytical challenges for which traditional methods, designed to extract insights from scarce and static data, are not well suited. The solution has been the development of a new suite of data analytics rooted in research on artificial intelligence and expert systems, together with new forms of data visualisation and visual analytics, both of which rely on high-powered computing. Data analytics employ machine learning, in which computer algorithms iteratively evolve an understanding of a dataset, automatically recognizing complex patterns and constructing models that explain and predict such patterns and optimize outcomes (Han et al. 2011). Moreover, since different approaches have their own strengths and weaknesses, depending on the type of problem and data, an ensemble approach can be employed that builds multiple solutions using a variety of techniques to model and predict the same phenomenon. As such, it becomes possible to apply hundreds of different algorithms to a dataset to ensure that the most illuminating insights are produced. Given the enormous volume and velocity of big data, visualisation has proven a popular way of both making sense of data and communicating that sense. Visualisation methods seek to reveal the structure, pattern and trends of variables and their interconnections. Tens of thousands of data points can be plotted to reveal a structure that is otherwise hidden (e.g., mapping trends across millions of tweets to see how they vary across people and places) or the real-time dynamics of a phenomenon can be monitored using graphic and spatial interfaces (e.g., the flow of traffic across a city).
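The ensemble approach mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a description of any particular analytics package: the observations are invented, the three techniques (a mean baseline, a least-squares line and a nearest-neighbour lookup) are deliberately simple stand-ins, and combining their predictions by taking the median is only one of several possible choices.

```python
from statistics import mean, median

# Hypothetical observations: hour of day (x) and a traffic count (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 12.0, 14.0, 16.0, 18.0]


def fit_mean(xs, ys):
    """Baseline model: always predict the average observed value."""
    avg = mean(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Ordinary least-squares straight line through the observations."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_nearest(xs, ys):
    """Nearest-neighbour model: reuse the value of the closest observation."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def ensemble_predict(models, x):
    """Combine the individual predictions by taking their median."""
    return median(model(x) for model in models)


# Build several models of the same phenomenon using different techniques.
models = [fit(xs, ys) for fit in (fit_mean, fit_line, fit_nearest)]
prediction = ensemble_predict(models, 6.0)
```

Each technique fails in a different way (the baseline ignores the trend, the line extrapolates it indefinitely, the nearest-neighbour model cannot look beyond the observed range), so the combined estimate is less dependent on the weaknesses of any single one. In practice the component models and the combination rule would be chosen to suit the problem and the data.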
Pros and cons of big data
The hype surrounding big data exists for good reason. Big data offer the possibility of shifting from data-scarce to data-rich studies of all aspects of the world; from narrow to exhaustive samples; from static snapshots to dynamic vistas; from coarse aggregations to high resolutions; and from relatively simple models to complex, sophisticated simulations and predictions (Kitchin 2013). Moreover, big data consist of both qualitative and quantitative data, most of which are spatially and temporally referenced. Big data provide greater breadth, depth, scale and timeliness, and are inherently longitudinal in nature. They enable researchers to gain greater insights into various systems. For businesses and government, such data hold the promise of increased productivity, competitiveness, efficiency, effectiveness, utility, sustainability and securitisation and the potential to better manage organisations, leverage value and produce capital, govern people, and create better places (Kitchin 2014).
Big data are not without problems, however. For example, most big data are generated by private corporations such as mobile phone operators, app developers, social media providers, financial institutions, retail chains, and surveillance and security firms, none of which is under any obligation to share freely the data it generates. As such, access to such data is at present limited. There are also concerns with respect to how clean (error and gap free), objective (bias free) and consistent (few discrepancies) the data are, and with respect to their veracity: the extent to which they accurately (precision) and faithfully (fidelity, reliability) represent what they are meant to. Further, big data raise a number of ethical questions concerning the extent to which they facilitate dataveillance (surveillance through data records), infringe on privacy and other human rights, enable social sorting (the differential treatment of people in the provision of services), pose security concerns with regard to identity theft, and enable control creep, wherein data generated for one purpose are used for another (Kitchin 2014).
References
boyd, D. and Crawford, K. (2012) Critical questions for big data. Information, Communication and Society 15(5): 662–679.
Constine, J. (2012) How Big Is Facebook’s Data? 2.5 Billion Pieces Of Content And 500+ Terabytes Ingested Every Day, 22 August 2012, http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/ (last accessed 28 January 2013)
Diebold, F. (2012) A personal perspective on the origin(s) and development of ‘big data’: The phenomenon, the term, and the discipline. http://www.ssc.upenn.edu/~fdiebold/papers/paper112/Diebold_Big_Data.pdf (last accessed 5 February 2013)
Han, J., Kamber, M. and Pei, J. (2011) Data Mining: Concepts and Techniques. 3rd edition. Morgan Kaufmann, Waltham, MA.
Kitchin, R. (2013) Big data and human geography: Opportunities, challenges and risks. Dialogues in Human Geography 3(3): 262–267.
Kitchin, R. (2014) The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. Sage, London.
Marz, N. and Warren, J. (2012) Big Data: Principles and Best Practices of Scalable Realtime Data Systems. MEAP edition. Manning, Westhampton, NJ.
Mayer-Schonberger, V. and Cukier, K. (2013) Big Data: A Revolution that will Change How We Live, Work and Think. John Murray, London.
Open Data Center Alliance (2012) Big Data Consumer Guide. Open Data Center Alliance. http://www.opendatacenteralliance.org/docs/Big_Data_Consumer_Guide_Rev1.0.pdf (last accessed 11 February 2013)
Zikopoulos, P.C., Eaton, C., deRoos, D., Deutsch, T. and Lapis, G. (2012) Understanding Big Data. McGraw Hill, New York.