Rob Kitchin and Gavin McArdle have a new paper – What makes big data, big data? Exploring the ontological characteristics of 26 datasets – published in Big Data and Society.
Abstract
Big Data has been variously defined in the literature. In the main, definitions suggest that Big Data possess a suite of key traits: volume, velocity and variety (the 3Vs), but also exhaustivity, resolution, indexicality, relationality, extensionality and scalability. However, these definitions lack ontological clarity, with the term acting as an amorphous, catch-all label for a wide selection of data. In this paper, we consider the question ‘what makes Big Data, Big Data?’, applying Kitchin’s taxonomy of seven Big Data traits to 26 datasets drawn from seven domains, each of which is considered in the literature to constitute Big Data. The results demonstrate that only a handful of datasets possess all seven traits, and some do not possess volume and/or variety. Instead, there are multiple forms of Big Data. Our analysis reveals that the key definitional boundary markers are the traits of velocity and exhaustivity. We contend that Big Data as an analytical category needs to be unpacked, with the genus of Big Data further delineated and its various species identified. It is only through such ontological work that we will gain conceptual clarity about what constitutes Big Data, formulate how best to make sense of it, and identify how it might be best used to make sense of the world.
Rob Kitchin and Gavin McArdle have published a new paper entitled ‘The diverse nature of big data’, available as Programmable City Working Paper 15 on SSRN.
Abstract: Big data has been variously defined in the literature. In the main, definitions suggest that big data are those that possess a suite of key traits: volume, velocity and variety (the 3Vs), but also exhaustivity, resolution, indexicality, relationality, extensionality and scalability. However, these definitions lack ontological clarity, with the term acting as an amorphous, catch-all label for a wide selection of data. In this paper, we consider the question ‘what makes big data, big data?’, applying Kitchin’s (2013, 2014) taxonomy of seven big data traits to 26 datasets drawn from seven domains, each of which is considered in the literature to constitute big data. The results demonstrate that only a handful of datasets possess all seven traits, and some do not possess volume and/or variety. Instead, there are multiple forms of big data. Our analysis reveals that the key definitional boundary markers are the traits of velocity and exhaustivity. We contend that big data as an analytical category needs to be unpacked, with the genus of big data further delineated and its various species identified. It is only through such ontological work that we will gain conceptual clarity about what constitutes big data, formulate how best to make sense of it, and identify how it might be best used to make sense of the world.
Key words: big data, ontology, taxonomy, types, characteristics
For the past few years I’ve co-taught a professional development course for doctoral students on completing a thesis, getting a job, and publishing. The course draws liberally on a book I co-wrote with the late Duncan Fuller entitled The Academic’s Guide to Publishing. One thing we did not really cover in the book was how to write and place pieces that have impact; instead we provided more general advice about getting through the peer review process.
The general careers advice mantra of academia is now ‘publish or perish’. Often the utility and value of what is published is somewhat overlooked: if a piece got published, it is assumed it must have some inherent value. And yet a common observation is that most journal articles are barely read, let alone cited.
Both authors and editors want to publish material that is read and cited, so what does it take to produce work that editors are delighted to accept and readers find so useful that they cite it in their own work?
A taxonomy of publishing impact
The way I try to explain impact to early career scholars is through a discussion of writing and publishing a paper on airport security (see Figure 1). Written pieces of work, I argue, generally fall into one of four categories, with the impact of the piece rising as one traverses from Level 1 to Level 4.
Figure 1: Levels of research impact
Level 1: the piece is basically empiricist in nature and makes little use of theory. For example, I could write an article that provides a very detailed description of security in an airport and how it works in practice. This might be interesting, but would add little to established knowledge about how airport security works or how to make sense of it. Generally, such papers appear in trade magazines or national level journals and are rarely cited.
Level 2: the paper uses established theory to make sense of a phenomenon. For example, I could use Foucault’s theories of disciplining, surveillance and biopolitics to explain how airport security works to create docile bodies that passively submit to enhanced screening measures. Here, I am applying a theoretical frame that might provide a fresh perspective on a phenomenon if it has not been previously applied. I am not, however, providing new theoretical or methodological tools but drawing on established ones. As a consequence, the piece has limited utility, essentially constrained to those interested in airport security, and might be accepted in a low-ranking international journal.
Level 3: the paper extends/reworks established theory to make sense of phenomena. For example, I might argue that since the 1970s, when Foucault was formulating his ideas, there has been a radical shift in the technologies of surveillance from disciplining systems to capture systems that actively reshape behaviour. As such, Foucault’s ideas of governance need to be reworked or extended to adequately account for new algorithmic forms of regulating passengers and workers. My article could provide such a reworking, building on Foucault’s initial ideas to provide new theoretical tools that others can apply to their own case material. Such a piece will get accepted into high-ranking international journals due to its wider utility.
Level 4: the paper uses the study of a phenomenon to rethink a meta-concept or proposes a radically reworked or new theory. Here, the focus of attention shifts from how best to make sense of airport security to the meta-concept of governance, using the empirical case material to argue that it is not simply enough to extend Foucault’s thinking; rather, a new way of thinking is required to adequately conceptualize how governance is being operationalised. Such new thinking tends to be well cited because it can generally be applied to making sense of lots of phenomena, such as the governance of schools, hospitals, workplaces, etc. Of course, Foucault consistently operated at this level, which is why he is so often reworked at Levels 2 and 3, and is one of the most impactful academics of his generation (cited nearly 42,000 times in 2013 alone). Writing a Level 4 piece requires a huge amount of skill, knowledge and insight, which is why so few academics work and publish at this level. Such pieces will be accepted into the very top-ranked journals.
One way to think about this taxonomy is this: generally, those people who are the biggest names in their discipline, or across disciplines, have a solid body of published Level 3 and Level 4 material; they are so well known because they produce material and ideas that have high transfer utility. Those who are well known within a sub-discipline generally have a body of Level 2 and Level 3 material. Those who are barely known outside of their national context generally have Level 1/2 profiles (and also have relatively small bodies of published work).
In my opinion, the majority of papers being published in international journals are Level 2/borderline 3, with some minor extension/reworking that has limited utility beyond making sense of a specific phenomenon, or Level 3/borderline 2, with narrow, timid or uninspiring extension/reworking that constrains the paper’s broader appeal. Strong, bold Level 3 papers that have wider utility beyond the paper’s focus are less common, and Level 4 papers that really push the boundaries of thought and praxis are relatively rare. The majority of articles in national-level journals tend to be Level 2, and the majority of book chapters in edited collections are Level 1 or 2. It is not uncommon, in my experience, for authors to think the paper that they have written is a category above its real level (which is why they are often so disappointed with editor and referee reviews).
Does this basic taxonomy of impact work in practice?
I’ve not done a detailed empirical study, but I can draw on two sources of observation. First, my experience as an editor of two international journals (Progress in Human Geography, Dialogues in Human Geography), and for ten years an editor of another (Social and Cultural Geography), viewing download rates and citation analyses for papers published in those journals. It is clear from such data that the relationship between level and citation generally holds: those papers that push boundaries and provide new thinking tend to be better cited. There are, of course, some exceptions, and there are no doubt some Level 4 papers that are quite lowly cited for various reasons (e.g., their arguments are ahead of their time), but generally the cream rises. Most academics intuitively know this, which is why the most consistent response of referees and editors to submitted papers is to provide feedback that might help shift Level 2/borderline Level 3 papers (which are the most common kind of submission) up to solid Level 3 papers: pieces that provide new ways of thinking and doing and offer fresh perspectives and insights.
Second, by examining my own body of published work. Figure 2 displays the citation rates of all of my published material (books, papers, book chapters) divided into the four levels. There are some temporal effects (such as more recently published work not having had time to be cited) and some outliers (in this case, a textbook and a coffee table book), but the relationship is quite clear, especially when just articles are examined (Figure 3): the rate of citation increases across levels. (I’ve been fairly brutally honest in categorising my pieces, and what’s striking to me personally is how proportionally few Level 3 and 4 pieces I’ve published, which is something for me to work on.)
So what does this all mean?
Basically, if you want your work to have impact, you should try to write articles that meet Level 3 and 4 criteria; that is, produce novel material that provides new ideas, tools and methods that others can apply in their own studies. Creating such pieces is not easy or straightforward and demands a lot of reading, reflection and thinking, which is why it can be so difficult to get papers accepted into the top journals, and why the citation distribution curve is so heavily skewed, with a relatively small number of pieces having nearly all the citations (Figure 4 shows the skewing for my papers; my top-cited piece has the same number of citations as the 119 least-cited pieces combined).
Figure 4: Skewed distribution of citations
In my experience, papers with zero citations are nearly all Level 1 and 2 pieces. Those are not the kind of papers you should be striving to publish if you want your work to have impact.