Rob Kitchin and Gavin McArdle have published a new Programmable City working paper (no. 21) – Urban data and city dashboards: Six key issues – on SocArXiv today. It is a pre-print of a chapter that will be published in Kitchin, R., Lauriault, T.P. and McArdle, G. (eds) (forthcoming) Data and the City. Routledge, London.
Abstract
This chapter considers the relationship between data and the city by critically examining six key issues with respect to city dashboards: epistemology, scope and access, veracity and validity, usability and literacy, use and utility, and ethics. While city dashboards provide useful tools for evaluating and managing urban services, understanding and formulating policy, and creating public knowledge and counter-narratives, our analysis reveals a number of conceptual and practical shortcomings. In order for city dashboards to reach their full potential we advocate a number of related shifts in thinking and praxes and forward an agenda for addressing the issues we highlight. Our analysis is informed by our endeavours in building the Dublin Dashboard.
A couple of weeks ago I attended the Web Summit in Dublin, a large tech entrepreneur event (my observations on the event are posted here). This week I spent three days at the Smart City Expo World Congress in Barcelona, another event that considered how technology is being used to reshape social and economic life, but one which had a very different vibe, a much more mixed constituency of exhibitors and speakers (a mix of tech companies, consultants, city administrations/officials, politicians, NGOs, and academics; over 400 cities sent representatives, 240 companies were present, and there were over 10,000 attendees), and for the most part a much more tempered discourse. We presented our work on the Dublin Dashboard and the use of indicators in knowing and governing cities, attended the congress (keynote talks, plenary panels, and parallel paper sessions) and toured around the expo (a trade fair made up mostly of company and city stands). I thought it would be useful to share my observations with respect to the event and in particular some of its absences.
A chunk of the Programmable City team attended the Web Summit in Dublin last week. I was fortunate to be asked to MC the Machine Stage for Tuesday afternoon (on smart cities/smart cars), and also presented a paper, participated in a panel discussion, and chaired a private panel session, all on smart cities. As was widely reported in the media, it was an enormous event attended by 22,000 people, with 600 speakers across nine stages, and hundreds of stands, many of which changed daily to accommodate them all. No doubt a huge amount of business was conducted, personal networks were extended, and thousands of pages of copy for newspapers, magazines and websites were filed.
To me, what was interesting about the event were the silences as much as what was presented and displayed. There were loads of very interesting apps and technologies demoed, many of which will have real-world impact. That said, there was also a lot of hype, hubris, hope, self-promotion, buzzwords (to my ear ‘disruption’, ‘smart’, ‘platform’, ‘internet of things’ and ‘use case’ were used a lot), Californian ideology (radical individualism, libertarianism, neoliberal economics, and tech utopianism), and heads in the sand. In contrast, there was an absence of critical reflection about three broad concerns.
You can write down equations that predict what people will do. That’s the huge change. So I have been running the big data conversation … It’s about the fact that you can now understand customers, employees, how we organise, in a quantitative, predictive way for the first time.
Predictive analytics is fervently discussed in the business world, if not yet fully taken up, and increasingly by public services, governments and medical practices seeking to exploit the value hidden in public archives or even in social media. In New York, for example, there is a geek squad in the Mayor’s office seeking to uncover deep and detailed relationships between the people living there and the government, while at the same time realising “how insanely complicated this city is”. An intriguing question remains as to the effectiveness of predictive analytics, the extent to which it can support and facilitate urban life, and the consequences for cities that are immersed in a deep sea of data, predictions and humans.
Let’s start with an Australian example. The Commonwealth Scientific and Industrial Research Organisation (CSIRO) has partnered with Queensland Health, Griffith University and Queensland University of Technology to develop the Patient Admission Prediction Tool (PAPT), which estimates the presentations of schoolies (Australian high-school leavers on week-long holidays after their final exams) to nearby hospitals. The PAPT derives its estimates from Queensland Health data on schoolies presentations in previous years, including statistics on the number of presentations, the parts of the body injured and the types of injury. Using these data, the PAPT benefits hospitals, their employees and patients through improved scheduling of hospital beds, procedures and staff, with the potential to save $23 million per annum if implemented in hospitals across Australia. As characterised by Dr James Lind, one of the benefits of adopting predictive analytics is a proactive rather than reactive approach towards planning and management:
People like working in a system that is proactive rather than reactive. When we are expecting a patient load everyone knows what their jobs [are], and you are more efficient with your time.
The patients are happy too, because they receive and finish treatment more quickly.
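To make the idea concrete, here is a minimal sketch of this kind of forecasting, assuming nothing more than invented counts of presentations from previous years; the real PAPT draws on much richer Queensland Health data and far more sophisticated modelling.

```python
# A minimal sketch of the idea behind a tool like PAPT: estimate the expected
# presentations for an upcoming event from counts recorded in previous years.
# All figures below are invented for illustration.

from statistics import mean, stdev

# Hypothetical counts of emergency presentations on each night of schoolies
# week (nights 1-7), for three previous years.
past_years = [
    [34, 41, 52, 48, 39, 36, 30],  # year 1
    [38, 45, 55, 51, 42, 37, 33],  # year 2
    [36, 43, 58, 50, 44, 35, 31],  # year 3
]

def forecast(history):
    """Return (expected, planning_upper_bound) presentations per night.

    expected             -- mean of past counts for that night
    planning_upper_bound -- mean plus two standard deviations, a crude margin
    """
    per_night = list(zip(*history))  # regroup the counts night by night
    return [(mean(n), mean(n) + 2 * stdev(n)) for n in per_night]

for night, (expected, upper) in enumerate(forecast(past_years), start=1):
    print(f"night {night}: expect ~{expected:.0f} presentations, "
          f"staff for up to {upper:.0f}")
```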
Can we find similar successes when predictive analytics is put into practice in other forms of urban governance? Moving the discussion back to US cities, and using policing as an example: policing work is shifting from reactive to proactive in many cities, at experimental or implementation stages. PredPol is predictive policing software produced by a startup company that has attracted considerable attention from police departments in the US and other parts of the world. Its success as a business, however, is partly down to its “energetic” marketing strategies, contractual obligations requiring client forces to refer the company to other law enforcement agencies, and so on.
Above all, the company’s claims of success are difficult to sustain on closer examination. The subjects of the analytics on which the software focuses are very specific: burglaries, robberies, vehicle thefts, thefts from vehicles and gun crimes. In other words, the crimes that have “plenty of data to chew on” for making predictions, and that are opportunistic crimes which are easier to deter through the presence of patrolling police (more details here).
This brings us to the issue of the “proven” and “continued” aspects of success, which are even more difficult and problematic aspects of policing work to pin down when evaluating the “effectiveness” and “success” of predictive policing. To prove that an algorithm performs well, the expectations against which it is built and tweaked have to be specified, not only for those who build the algorithm, but also for the people who will be entangled in the experiments in intended and unintended ways. In this sense, transparency and objectivity are important to predictive policing. Without acknowledging, considering and evaluating how both crimes and everyday life, both normality and abnormality, are transformed into algorithms, and without disclosing those algorithms for validation and consultation, a system of computational criminal justice can turn into, if not witch-hunting, then alchemy: put two or more elements into a magical pot, stir them and see what happens!

This is further complicated by the knowledge that there are already inherent inequalities in crime data, for example in reporting or sentencing, and that the perceived neutrality of algorithms can serve to justify cognitive biases that are well documented in the justice system: biases that could justify the rationale that someone should be treated more harshly because the person is already on a black list, without reconsidering how the person got onto such a list in the first place. There is an even darker side to predictive policing when mundane social activities are constantly treated as crime data, for instance when social network analysis is used to profile and incriminate groups and groupings of individuals. This is also a dynamic and contested field of play: while crime prediction practitioners (coders, private companies, government agencies and so on) appropriate personal data and private social media messages for purposes they were not intended for, criminals (or activists, for that matter) play with social media, if not yet with prediction results obtained by reverse engineering the algorithms, to plan coups, protests, attacks, etc.
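As a purely illustrative aside, the sketch below shows the basic mechanics of the kind of social network analysis alluded to above, using the networkx library and an invented contact list; it is not drawn from any actual policing system, and the point is simply that centrality scores are structural properties of ordinary social ties.

```python
# A toy example of social network analysis applied to interaction data
# (who contacts whom), scored by betweenness centrality. The graph is
# invented; this only shows the mechanics, not any real system.

import networkx as nx

contacts = [("ana", "ben"), ("ana", "cal"), ("ben", "cal"),
            ("cal", "dee"), ("dee", "eli"), ("eli", "fay"), ("dee", "fay")]

g = nx.Graph(contacts)
centrality = nx.betweenness_centrality(g)

# People who merely bridge two friendship circles score highest: a purely
# structural property, not evidence of wrongdoing.
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```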
For those who want to look further into how predictive policing is set up, proven, run and evaluated, there are ways of opening up the black box, at least partially, in order to reflect critically on what exactly it can achieve and how “success” is managed both in computer simulation and in police practice. The chief scientist of PredPol gave a lecture in which, as one commentator points out:
He discusses the mathematics/statistics behind the algorithm and, at one point, invites the audience not to take his word for its accuracy because he is employed by PredPol, but to take the equations discussed and plug in crime data (e.g. Chicago’s open source crime data) to see if the model has any accuracy.
The video of the lecture is here.
Furthermore, RAND provides a review of predictive policing methods and practices across many US cities. The report, which can be found here, analyses the advantages gained by various crime prediction methods as well as their limitations. Predictive policing, as the report shows, is far from a crystal ball, and it involves various levels of complexity to run and implement mathematically, computationally and organisationally. Predictions range from crime mapping to predicting crime hotspots given certain spatiotemporal characteristics of crimes (see the taxonomy on p. 19). As far as the predictions themselves are concerned, they are good as long as crimes in the future look similar to those in the past in their types, temporality and geographic prevalence, and as long as the data are good, which is a big if! Predictions are also better when they are further contextualised. Compared with predicting crimes without any help (not even from the intelligence that agents in the field can gather), applying mathematics to the guessing game creates a significant advantage, but the differences among the various methods are not as dramatic. Therefore, one of the crucial messages intended by reviewing and contextualising predictive methods is that:
It is important to bear in mind that the predictive methods discussed here do not predict where and when the next crime will be committed. Rather, they predict the relative level of risk that a crime will be associated with a particular time and place. The assumption is always that the past is prologue; predictions are made based on the analysis of past data. If the criminal adapts quickly to police interventions, then only data from the recent past will be useful to police departments. (p. 55)
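To illustrate what a “relative level of risk” means in practice, here is a toy sketch under the assumption of an invented set of incident records: past incidents are binned into grid cells and older ones are down-weighted, so the output ranks places by risk rather than predicting a specific crime. Operational systems use considerably more elaborate models.

```python
# Toy "relative risk" hotspot scoring: bin past incidents into grid cells
# and weight recent incidents more heavily. Coordinates and parameters are
# invented for illustration only.

from collections import defaultdict
from math import exp, log

CELL = 0.01        # grid cell size in degrees (roughly 1 km)
HALF_LIFE = 30.0   # days after which an incident's weight halves

# (longitude, latitude, days_ago) for hypothetical past incidents
incidents = [
    (-6.26, 53.35, 2), (-6.26, 53.35, 11), (-6.27, 53.34, 5),
    (-6.26, 53.35, 40), (-6.30, 53.33, 1), (-6.30, 53.33, 3),
]

def risk_surface(events):
    """Map each grid cell to a recency-weighted incident score."""
    scores = defaultdict(float)
    for lon, lat, days_ago in events:
        cell = (round(lon / CELL), round(lat / CELL))
        scores[cell] += exp(-days_ago * log(2) / HALF_LIFE)  # exponential decay
    return scores

ranked = sorted(risk_surface(incidents).items(), key=lambda kv: -kv[1])
for cell, score in ranked:
    print(f"cell {cell}: relative risk score {score:.2f}")
```

Note that the output says nothing about where the next crime will occur; it only orders cells by how much recent, similar activity they have seen, which is exactly the assumption that the past is prologue.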
However automated the process, human and organisational efforts are still required in many areas of practice. Activities such as finding relevant data, preparing them for analysis, and tweaking factors, variables and parameters all require human effort, teamwork, and the transformation of decisions into actions for reducing crime at the organisational level. Similarly, human and organisational efforts are needed again when the types and patterns of crime change, when targeted crimes shift, and when results have to be interpreted and integrated in relation to changing availabilities of resources.
Furthermore, the report reviews the issues of privacy, transparency, trust and civil liberties within existing legal and policy frameworks. It becomes apparent, however, that predictions and predictive analytics need careful and mindful design, responding to emerging ethical, legal and social issues (ELSI), because the impacts of predictive policing occur at individual and organisational levels, affecting the day-to-day life of residents, communities and frontline officers. While it is important to maintain and revisit existing legal requirements and frameworks, it is also important to respond to emerging information and data practices, and notions of “obscurity by design” and “procedural data due process” offer ways of rethinking and redesigning the relationships between privacy, data, algorithms and predictions. Even the term transparency needs further reflection if progress is to be made on what it means in the context of predictive analytics and how it can be achieved, taking into account renewed theoretical, ethical, practical and legal considerations. In this context, “transparent predictions” have been proposed, outlining the importance, and the potential unintended consequences, of rendering prediction processes interpretable to humans and driven by causation rather than correlation. Critical reflections on such a proposal are useful, for example this two-part series (1)(2), which further contextualises transparency both in prediction processes and in case-specific situations.
Additionally, IBM has partnered with the New York Fire Department and the Lisbon Fire Brigade. The main goal is to use predictive analytics to make smart cities safer by using predictions to allocate emergency response resources more effectively. Similarly, crowd behaviours have already been simulated in order to understand and predict how people would react in various crowd events in places such as major traffic and travel terminals, sports and concert venues, shopping malls and streets, busy traffic areas, etc. Simulation tools take into account data generated by sensors, as well as quantified embodied actions such as walking speeds, body sizes or disabilities, and it is not difficult to imagine such analyses taking further advantage of social media data, where sentiments and other psychological signals are expected to refine simulation results (a review of simulation tools).
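As a rough indication of what such simulation tools build on, here is a bare-bones agent-based sketch, with invented parameters, in which each agent has its own walking speed and heads for an exit; production simulators add collision avoidance, body sizes, sensor-derived densities and much more.

```python
# A bare-bones crowd egress sketch: agents with individual walking speeds
# move towards an exit, and the model reports how many have cleared over
# time. All parameters are invented for illustration.

import random

EXIT_X = 50.0   # metres from the starting area to the exit
STEP = 1.0      # time step in seconds

def simulate(n_agents=200, seed=1):
    random.seed(seed)
    # Walking speeds around ~1.3 m/s, with a slower minority standing in
    # for reduced mobility.
    speeds = [random.uniform(0.6, 1.0) if random.random() < 0.1
              else max(0.5, random.gauss(1.3, 0.2))
              for _ in range(n_agents)]
    positions = [random.uniform(0.0, 10.0) for _ in range(n_agents)]

    t = 0.0
    while any(p < EXIT_X for p in positions):
        positions = [min(p + s * STEP, EXIT_X) for p, s in zip(positions, speeds)]
        t += STEP
        if int(t) % 10 == 0:
            cleared = sum(p >= EXIT_X for p in positions)
            print(f"t={t:4.0f}s  cleared {cleared}/{n_agents}")
    return t

print(f"all agents cleared after ~{simulate():.0f} seconds")
```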
To bring the discussion to a pause: data, algorithms and predictions are quickly turning not only cities but also many other places into testbeds, as long as there are sensors (human and nonhuman) and the Internet. Data become available and different kinds of tests can then be run to verify ideas and hypotheses. As many articles have pointed out, data and algorithms are flawed, revealing and reinforcing unequal parts and aspects of cities and city lives. Tests and experiments, such as Facebook’s manipulation of user emotions in its experiments, can also make cities vulnerable when they are run without regard to embodied and emotionally charged humans. There is therefore a great deal more to say about data, algorithms and experiments, because the production of data, and experiments in making use of them, are always an intervention rather than an evaluation. We will come back to these topics in subsequent posts.
At the recent Conference of the Association of American Geographers, held in Tampa, April 8-12, I was asked to be a discussant for a set of three sessions concerning geographers’ engagement with big data. The first session was a general introductory panel on big data from a geographical perspective, the second consisted of a dozen or so short lightning talks (no more than 5 minutes each) about each speaker’s ongoing research, and the third presented demos of practical approaches researchers are taking to harvesting, curating and sharing big geo-data.
Rather than focus my discussion on the individual comments, papers and demos, I reflected more broadly on the presentations, which I felt had been overly focused on one particular kind of big data, namely social media, with a little crowdsourcing thrown in, and had approached it from a standpoint that was overly technical or quite narrowly conceived in conceptual terms. My argument was that we need to help develop, along with other social science disciplines, critical data studies (a term borrowed from Craig Dalton and Jim Thatcher) that fully appreciate and uncover the complex assemblages that produce, circulate, share/sell and utilise data in diverse ways, and that recognise the politics of data and the diverse work they do in the world. This also requires a critical examination of the ontology of big data and its varieties, which extend well beyond social media to include various forms of digital and automated surveillance, techno-social systems of work, exhaust from digital devices, sensors, scanners, the internet of things, interaction and transactional data, sousveillance, and various modes of volunteered data. It requires, too, a thorough consideration of big data’s technical and organisational shortcomings, its associated politics and ethics, and its consequences for the epistemologies, methodologies and practices of academia and various domains of everyday life. I concluded with a call for more synoptic, conceptual and normative analyses of big data, as well as detailed empirical research that examines all aspects of big data assemblages. In other words, I was advocating a more holistic and critical analysis of big data. Given the speed at which the age of big data is coming into being, such analyses are in my view very much needed to make sense of the changes occurring.
For another reflection on the sessions see Mark Graham’s comments on Zero Geography.
Below is the first draft of a 1000 word entry on Big Data by Rob Kitchin for the forthcoming International Encyclopedia of Geography to be published by Wiley-Blackwell and the Association of American Geographers. It sets out a definition of big data, how they are produced and analyzed, and some of their pros and cons. Feedback on the content would be welcome.
Abstract
Big data consist of huge volumes of diverse, fine-grained, interlocking data produced on a dynamic basis, much of which are spatially and temporally referenced. The generation, processing and storing of such data have been enabled by advances in computing and networking. Insight and value can be extracted from them using new data analytics, including forms of machine learning and visualisation. Proponents contend that big data are reshaping how knowledge is produced, business conducted, and governance enacted. However, there are concerns regarding access to big data, their quality and veracity, and the ethical implications of their use with respect to dataveillance, social sorting, security, and control creep.
Keywords: big data, analytics, visualisation, ethics
Defining big data
The etymology of ‘big data’ can be traced to the mid-1990s, when the term was first used to refer to the handling and analysis of massive datasets (Diebold 2012). It is only since 2008, however, that the term has gained traction, becoming a business and industry buzzword. Like many rapidly emerging concepts, big data has been variously defined, but most commentators agree that it differs from what might be termed ‘small data’ with respect to its traits of volume, velocity and variety (Zikopoulos et al., 2012). Traditionally, data have been produced in tightly controlled ways using sampling techniques that limit their scope, temporality and size. Even very large datasets, such as national censuses, have generally been restricted to 30-40 questions and are carried out once every ten years. Advances in computing hardware, software and networking have, however, enabled much wider scope for producing, processing, analyzing and storing massive amounts of diverse data on a continuous basis. Moreover, big data generation strives to be: exhaustive, capturing entire populations or systems (n=all); fine-grained in resolution and uniquely indexical in identification; relational in nature, containing common fields that enable the conjoining of different data sets; and flexible, holding the traits of extensionality (new fields can be added easily) and scalability (it can expand in size rapidly) (boyd and Crawford 2012; Kitchin 2013; Marz and Warren 2012; Mayer-Schonberger and Cukier 2013). Big data thus comprise huge volumes of diverse, fine-grained, interlocking data produced on a dynamic basis. For example, in 2012 Wal-Mart was generating more than 2.5 petabytes (1 petabyte = 2^50 bytes) of data relating to more than 1 million customer transactions every hour (Open Data Center Alliance 2012), and Facebook was processing 2.5 billion pieces of content (links, comments, etc), 2.7 billion ‘Like’ actions and 300 million photo uploads per day (Constine 2012). Such big data, their proponents argue, enable new forms of knowledge that produce disruptive innovations with respect to how business is conducted and governance enacted. Given that much big data are georeferenced, they hold much promise for new kinds of spatial analysis and modelling.
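Some back-of-the-envelope arithmetic, using only the figures quoted above, helps make that scale concrete:

```python
# Back-of-the-envelope arithmetic on the volumes cited above (Open Data
# Center Alliance 2012; Constine 2012), just to make the scale concrete.

PETABYTE = 2 ** 50                     # bytes

walmart_per_hour = 2.5 * PETABYTE      # bytes of transaction data per hour
print(f"Wal-Mart: ~{walmart_per_hour / 3600 / 2**30:,.0f} GiB per second")

facebook_items_per_day = 2.5e9 + 2.7e9 + 300e6   # content + 'Likes' + photos
print(f"Facebook: ~{facebook_items_per_day / 86400:,.0f} items per second")
```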
Sources of big data
Big data are produced in three broad ways: through directed, automated and volunteered systems (Kitchin 2013). Directed systems are controlled by a human operator and include CCTV, spatial video and LiDAR scans. Automated systems automatically capture data as an inherent function of the technology and include: the recording of retail purchases at the point of sale; transactions and interactions across digital networks (e.g., sending emails, internet banking); the use of digital devices such as mobile phones that record and communicate the history of their own utilisation; clickstream data that records navigation through a website or app; measurements from sensors embedded into objects or environments; the scanning of machine-readable objects such as transponders and barcodes; and machine to machine interactions across the internet. Volunteered systems rely on users to gift data through uploads and interactions and include engaging in social media (e.g., posting comments, observations, photos to social networking sites such as Facebook) and the crowdsourcing of data wherein users generate data and then contribute them into a common platform (e.g., uploading GPS-traces into OpenStreetMap).
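As a rough sketch of what a single record from each of these three sources might look like once ingested, consider the following; all field names and values are invented purely for illustration.

```python
# Hypothetical records from the three broad sources of big data described
# above. Field names and values are invented for illustration only.

directed = {    # operator-controlled capture, e.g. a CCTV system
    "camera_id": "CCTV-014", "timestamp": "2014-05-01T09:12:03Z",
    "frame_ref": "frame_000912.jpg",
}

automated = {   # exhaust from a digital transaction, e.g. clickstream data
    "session": "a41f9c", "timestamp": "2014-05-01T09:12:05Z",
    "url": "/products/123", "referrer": "/search?q=bike",
}

volunteered = { # user-gifted content, e.g. a crowdsourced GPS trace point
    "user": "mapper42", "timestamp": "2014-05-01T09:14:40Z",
    "lat": 53.3498, "lon": -6.2603, "platform": "OpenStreetMap",
}

for kind, record in [("directed", directed), ("automated", automated),
                     ("volunteered", volunteered)]:
    print(kind, "->", sorted(record))
```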
Analyzing big data
Given their volume, variety and velocity, big data present significant analytical challenges to which traditional methods, designed to extract insights from scarce and static data, are not well suited. The solution has been the development of a new suite of data analytics that are rooted in research around artificial intelligence and expert systems, and of new forms of data visualisation and visual analytics, both of which rely on high-powered computing. Data analytics seek to produce machine learning that iteratively evolves an understanding of datasets using computer algorithms, automatically recognizing complex patterns and constructing models that explain and predict such patterns and optimize outcomes (Han et al. 2011). Moreover, since different approaches have their strengths and weaknesses, depending on the type of problem and data, an ensemble approach can be employed that builds multiple solutions using a variety of techniques to model and predict the same phenomena. As such, it becomes possible to apply hundreds of different algorithms to a dataset to ensure that the most illuminating insights are produced. Given the enormous volumes and velocity of big data, visualisation has proven a popular way of both making sense of data and communicating that sense. Visualisation methods seek to reveal the structure, patterns and trends of variables and their interconnections. Tens of thousands of data points can be plotted to reveal a structure that is otherwise hidden (e.g., mapping trends across millions of tweets to see how they vary across people and places) or the real-time dynamics of a phenomenon can be monitored using graphic and spatial interfaces (e.g., the flow of traffic across a city).
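As a minimal illustration of the ensemble approach described above, the following sketch fits several different learners to the same synthetic dataset with scikit-learn and combines their predictions by majority vote; it illustrates the general idea rather than any particular big data system.

```python
# A minimal ensemble sketch: several different learners are fitted to the
# same data and their predictions are combined by majority vote.
# Uses scikit-learn and synthetic data purely for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three models with different strengths, combined by majority vote.
ensemble = VotingClassifier(estimators=[
    ("logit", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=5)),
    ("forest", RandomForestClassifier(n_estimators=100)),
])
ensemble.fit(X_train, y_train)
print("held-out accuracy:", round(ensemble.score(X_test, y_test), 3))
```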
Pros and cons of big data
The hype surrounding big data exists for good reason. Big data offer the possibility of shifting from data-scarce to data-rich studies of all aspects of the world; from narrow to exhaustive samples; from static snapshots to dynamic vistas; from coarse aggregations to high resolutions; and from relatively simple models to complex, sophisticated simulations and predictions (Kitchin 2013). Moreover, big data consist of both qualitative and quantitative data, most of which are spatially and temporally referenced. Big data provide greater breadth, depth, scale and timeliness, and are inherently longitudinal in nature. They enable researchers to gain greater insights into various systems. For businesses and government, such data hold the promise of increased productivity, competitiveness, efficiency, effectiveness, utility, sustainability and securitisation, and the potential to better manage organisations, leverage value and produce capital, govern people, and create better places (Kitchin 2014).
Big data are not without negative issues, however. For example, most big data are generated by private corporations such as mobile phone operators, app developers, social media providers, financial institutions, retail chains, and surveillance and security firms, none of whom are under any obligation to share freely the data they generate. As such, access to such data is at present limited. There are also concerns with respect to how clean (error and gap free), objective (bias free) and consistent (few discrepancies) the data are; their veracity and the extent to which they accurately (precision) and faithfully (fidelity, reliability) represent what they are meant to. Further, big data raise a number of ethical questions concerning the extent to which they facilitate dataveillance (surveillance through data records), infringe on privacy and other human rights, enable social sorting (providing differential treatment and access to services), pose security concerns with regard to identity theft, and enable control creep wherein data generated for one purpose are used for another (Kitchin 2014).
References
boyd, D. and Crawford, K. (2012) Critical questions for big data. Information, Communication and Society 15(5): 662-679
Constine, J. (2012) How Big Is Facebook’s Data? 2.5 Billion Pieces Of Content And 500+ Terabytes Ingested Every Day, 22 August 2012, http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/ (last accessed 28 January 2013)
Diebold, F. (2012) A personal perspective on the origin(s) and development of ‘big data’: The phenomenon, the term, and the discipline. http://www.ssc.upenn.edu/~fdiebold/papers/paper112/Diebold_Big_Data.pdf (last accessed 5 February 2013)
Han, J., Kamber, M. and Pei, J. (2011) Data Mining: Concepts and Techniques. 3rd edition. Morgan Kaufmann, Waltham, MA.
Kitchin, R. (2013) Big data and human geography: Opportunities, challenges and risks. Dialogues in Human Geography 3(3): 262-267
Kitchin, R. (2014) The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. Sage, London.
Marz, N. and Warren, J. (2012) Big Data: Principles and Best Practices of Scalable Realtime Data Systems. MEAP edition. Manning, Westhampton, NJ.
Mayer-Schonberger, V. and Cukier, K. (2013) Big Data: A Revolution that will Change How We Live, Work and Think. John Murray, London.
Open Data Center Alliance (2012) Big Data Consumer Guide. Open Data Center Alliance. http://www.opendatacenteralliance.org/docs/Big_Data_Consumer_Guide_Rev1.0.pdf (last accessed 11 February 2013)
Zikopoulos, P.C., Eaton, C., deRoos, D., Deutsch, T. and Lapis, G. (2012) Understanding Big Data. McGraw Hill, New York.