Big Data’s Little Secrets (Part 2)

In my previous post I looked at some of the issues affecting the extent to which ‘big data’ gives a reliable picture of the world around us. In this post I want to take you through one of the least sexy—but most important—parts: the data itself. My point, again, is not to suggest that big data is fatally flawed, but to call into question some of the easy assumptions upon which we rely when working with this type of data, and the universality of the conclusions that we can draw from this type of research.

Operational & Analytical Data

When you really dig into the data sources that are typically used in big data research, the first thing that you need to understand is that you’re not looking at the ‘real’ output of the system. The extent of the difference between what is actually happening ‘live’ and what comes through the ‘cleaning’ process into the hands of a data scientist can be difficult to grasp. Don’t we just take the raw data and run it through our magical analytical algorithm? The simple, truthful answer is ‘no’.

After many years of working on ETL (Extract, Transform & Load) systems for a range of clients and research partners, I am frequently surprised that what I see on my monthly billing statement bears any resemblance at all to what I think it should be, given my usage. The fact that you get a credit card statement, a mobile phone statement, and even an electricity bill that are even mostly accurate is one of the wonders of modern technology, every bit as miraculous as WiFi or air travel.

As I alluded to in the previous post, there are enormous differences between Call Data Records (from the mobile phone billing system) and Handovers (from the network management system). The divergent scales of the two data sources are a good indicator of this but, more subtly, you need to remember that the role of the billing system is to ensure that customers are billed for their calls and texts, and that the role of the network logs is to record how well the network is performing. To start asking either for locational data about individuals is to force them to do something other than what they were designed for.
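To make the contrast concrete, here is a minimal sketch of the kind of record each system produces (illustrative field names only, not any operator’s actual schema); notice that neither is designed to say precisely where a person is.

```python
from dataclasses import dataclass

@dataclass
class CallDataRecord:
    """Billing-system record: exists so the call can be charged correctly."""
    subscriber_id: str
    called_number: str
    start_time: str        # when the call began
    duration_seconds: int  # what the customer is billed for
    serving_cell_id: str   # the cell used for charging, not a precise location

@dataclass
class HandoverEvent:
    """Network-management record: exists so engineers can monitor performance."""
    source_cell_id: str
    target_cell_id: str
    timestamp: str
    success: bool          # did the handover complete cleanly?
    signal_quality: float  # what the engineers actually care about

# Asking either record "where was this person?" means repurposing fields
# (serving_cell_id, source/target_cell_id) that were never meant to answer that.
```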

TfL Developers' Area

To see why this matters, let’s take a more straightforward case: tap-ins and tap-outs for London’s Oyster charging system and the capacity they give us to build a picture of Origin/Destination (O/D) flows. Here is a short list of just some of the travel-related events (there are, of course, all sorts of other, non-travel events as well) that can show up in the public data feed generated by the Oyster Card charging system:

  • You tap into a station only to discover that it’s too crowded and decide to take another route by tapping out of the same station.
  • You tap out of a station only to discover you’ve made a terrible mistake and that you should have gone two more stops and tap back in to the same station.
  • Your card is not properly read by the machine so you have to tap it against the reader multiple times.
  • Your ticket is not properly stored on the card so you have to ask a staff member to tap you out.
  • You accidentally tap someone else out of the system by following too closely behind them. Or you accidentally tap someone else in to the system by following too closely behind them.
  • You disembark from one bus because it’s too crowded and board a different one on the same route and so appear to board the same bus twice.
  • You use an intermediate validator when travelling on the Overground even though you are travelling through Zone 1.
  • You don’t use an intermediate validator when travelling by Overground even though you aren’t travelling through Zone 1.
  • You use an Out-of-Station Interchange (OSI) such as the one between Euston and Euston Square and so appear to be exiting, but your ticket is still valid and there is no charge for entering Euston Square.
  • You exit at a station where OSIs are allowed but it actually is the end of your journey so the OSI is invalidated automatically by the system the next time you tap in somewhere else.
  • You use a validator to move between National Rail and the TfL Network. Or vice versa.
  • You tap out of a mainline station, but began your journey outside the Oyster charging zone.
  • You tap in at an Oyster-enabled station with your Oyster pass, travel outside the zone using your rail card, and return later to a different Oyster-enabled station.
  • You exit the Tube at, say, Waterloo using your Oyster, take a train towards Deptford (using a National Rail ticket), and then tap on to a bus many miles away using your Oyster again.
  • The gates are locked open and so you can’t tap out.

The list goes on. Now multiply all of this by the number of different types of tickets (Pay-As-You-Go, Weekly, etc.) and the number of different Zones (adding in Extension Permits and such) where these events can happen, and you begin to have a sense of just how complex this system actually is. Plus, of course, there is the fact that you never have to tap out of a bus or tram.
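To see how quickly things unravel, here is a minimal sketch of the pairing logic most of us would write first: match each tap-in on a card to the next tap-out on the same card (the field names and event shape are hypothetical, not TfL’s actual feed format). Almost every event in the list above breaks at least one of its assumptions.

```python
from collections import defaultdict

def naive_od_flows(taps):
    """Pair each tap-in with the next tap-out on the same card.

    `taps` is an iterable of dicts like
    {"card_id": "...", "station": "...", "direction": "in" or "out"},
    already sorted by time. The function silently assumes one tap-in,
    then one tap-out, per journey -- exactly what the edge cases above break.
    """
    open_journeys = {}        # card_id -> station where the card tapped in
    flows = defaultdict(int)  # (origin, destination) -> journey count

    for tap in taps:
        card = tap["card_id"]
        if tap["direction"] == "in":
            # A second tap-in simply overwrites the first: the crowded-station
            # retreat, the failed card read and the OSI all get mangled here.
            open_journeys[card] = tap["station"]
        else:
            origin = open_journeys.pop(card, None)
            if origin is None:
                # Tap-out with no matching tap-in (locked-open gates,
                # journeys begun outside the zone): silently dropped.
                continue
            flows[(origin, tap["station"])] += 1

    return flows
```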

What you need to understand is that the Oyster system cannot ultimately tell the difference between any of these: it is simply trying to figure out whether your ticket is valid and whether the barriers should open or remain shut. There are rules that the gate uses to assess whether a ticket is valid; the response is logged and off you go. Oyster is a ticketing system, not a transit survey system, and so to ask an O/D question of it is immediately to try to make it do something for which it was not configured.
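As a rough illustration of that division of labour, the gate’s side of the conversation looks something like the sketch below (invented rules and thresholds, not TfL’s actual fare logic); note that nothing about the traveller’s intentions is ever recorded, only the validity decision.

```python
def gate_decision(card, station, direction):
    """Decide whether to open the barrier, and log the outcome.

    `card` is a dict of the fields the gate actually needs, e.g.
    {"card_id": "...", "balance": 3.20, "season_ticket_valid": False}.
    The rules below are invented placeholders for the real fare logic.
    """
    MIN_PAYG_BALANCE = 0.0  # hypothetical threshold for accepting a tap-in

    if card.get("season_ticket_valid"):
        valid = True
    elif direction == "in":
        valid = card.get("balance", 0.0) > MIN_PAYG_BALANCE
    else:
        valid = True  # tap-out is normally accepted so the journey can be closed

    # The only thing that reaches the data: card, station, direction, and
    # whether the barrier opened. Nothing about why the traveller was there.
    log_entry = {
        "card_id": card["card_id"],
        "station": station,
        "direction": direction,
        "accepted": valid,
    }
    return valid, log_entry
```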

Really Understanding the Data

So to ask a seemingly simple question about origin and destination flows (especially since you can’t track magnetic tickets, which are used by certain types of travellers; see: Sampling Issues) is actually to raise a whole load of subsidiary questions about the extent to which the data is able to answer the question you are asking and about the way in which the data was processed to make it relevant for your research.

Here’s a simple example: if you use an intermediate validator at, for instance, Highbury & Islington when switching between Tube and Overground then this is useful route choice information, but it’s not an origin or a destination as most of us would understand it. But you can’t just say “OK, well we’ll store one intermediate station if it comes up in the data” because: 1) there is no limit to the number of intermediate validations that can come up in a single journey; 2) there’s not one record coming through in the data that gives you the full story; and 3) some people use the validators to exit the station!
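As a sketch of why this is awkward, consider what a single journey with validator touches might look like to the analyst (illustrative stations and event shapes, not the real feed): the number of intermediate events is unbounded, and a validator record on its own is indistinguishable from an exit.

```python
# One traveller's events for a single journey, as the analyst might see them.
# Nothing in a validator record says "route choice" rather than "exit".
journey_events = [
    {"station": "Walthamstow Central",  "event": "gate_in"},
    {"station": "Highbury & Islington", "event": "validator"},  # interchange? exit? unclear
    {"station": "Canada Water",         "event": "validator"},  # a second intermediate touch
    {"station": "Surrey Quays",         "event": "gate_out"},
]

def intermediate_stations(events):
    """Everything between the gate-in and the gate-out: a variable-length list,
    so a single 'intermediate station' field in the processed data cannot hold it."""
    return [e["station"] for e in events if e["event"] == "validator"]

print(intermediate_stations(journey_events))  # ['Highbury & Islington', 'Canada Water']
```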

The issue is that you are now looking at business rules. Since the definition of a trip adopted by the analysts may not be that of the system itself, there is (inevitably) an imperfect match between the two. Better still, the definition adopted by one analyst may not be the same as that adopted by another: that’s why Transport for London and I have different answers to the question “How many people changed their behaviour during the Olympics?” It’s not that either of us is necessarily wrong (though it’s certainly possible that I am), but that our choice of rules or definitions dramatically impacts our results.
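Here is a toy illustration of how much the business rules matter (the rules are invented, not TfL’s and not mine): apply two different definitions of a ‘completed trip’ to the same stream of taps and you get two different answers to the same question.

```python
def count_trips(taps, max_gap_minutes, count_unclosed):
    """Count trips under one particular set of business rules.

    `taps` is a time-sorted list of (minutes_since_midnight, direction) pairs
    for a single card. The two parameters *are* the business rules: how long
    an open journey may stay open, and whether journeys that never tapped out
    still count as trips.
    """
    trips, open_since = 0, None
    for minute, direction in taps:
        if direction == "in":
            if open_since is not None and count_unclosed:
                trips += 1                      # previous journey never closed
            open_since = minute
        elif open_since is not None:
            if minute - open_since <= max_gap_minutes:
                trips += 1                      # a "proper" closed trip
            elif count_unclosed:
                trips += 1                      # closed, but implausibly long
            open_since = None
    if open_since is not None and count_unclosed:
        trips += 1
    return trips

taps = [(480, "in"), (510, "out"), (600, "in"), (900, "out"), (1020, "in")]
print(count_trips(taps, max_gap_minutes=120, count_unclosed=False))  # 1
print(count_trips(taps, max_gap_minutes=240, count_unclosed=True))   # 3
```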

Furthermore, the choices that you made hours, days, or months ago while processing the data are likely to have already determined whether you can answer the questions you want to ask. A truly flexible analytical system requires you to design the right processing framework and put in the time to experiment, test, and (finally) to endlessly re-run parts of your ETL process in order to get something that is ‘right’. It can require weeks or even months of investment to get to the point where you really ‘get’ the data!

Signature of Humanity

Properly ‘big data’ approaches that rely on NoSQL techniques – such as Hadoop/BigTable/MapReduce – are explicitly designed to speed up this iterative, exploratory process, but they are not going to give you an understanding of the underlying data on their own. And the risks of not ‘getting’ the data are severe: a student can catch out top economists working in Excel with public data, but this same verification process is impossible with closely guarded big data sets to which only a group or two of researchers might have access. And it doesn’t help that the people roped into supplying this data to bolshy researchers are usually the network engineers whose principal interest is in ensuring that the network doesn’t fail catastrophically, not in ensuring that everything is delivered on spec to a third party.

That’s partly why I have often encountered projects where the data specification did not remotely match the data eventually supplied. Sometimes it was a mistake in communication or in understanding, but sometimes it was an issue of which even our clients were unaware: a misconfiguration, or an improperly documented configuration/option in a licensed system. There’s a reason that companies are able to make money providing ‘data integration services’, and even then some proportion of all data (you hope that it’s less than 15%) is simply discarded as unusable or corrupt.
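In practice that last step looks something like the sketch below, a deliberately simplified validation pass with made-up field checks: every record that fails the agreed specification has to go somewhere, and ‘somewhere’ is usually a reject pile whose size nobody likes to talk about.

```python
def split_against_spec(records):
    """Partition incoming records into usable rows and rejects.

    The checks stand in for whatever the data specification actually demands:
    required fields present, values in range, identifiers well formed, and so on.
    """
    REQUIRED_FIELDS = {"card_id", "station", "timestamp", "direction"}

    usable, rejects = [], []
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing or record.get("direction") not in ("in", "out"):
            rejects.append(record)  # unusable or corrupt: quietly set aside
        else:
            usable.append(record)
    return usable, rejects

raw_records = [  # a stand-in for whatever the supplier actually sent
    {"card_id": "A1", "station": "Bank", "timestamp": "08:01", "direction": "in"},
    {"card_id": "A1", "station": "Oval", "timestamp": "08:25", "direction": "out"},
    {"card_id": "B2", "station": "Bank", "timestamp": "08:02"},                          # missing field
    {"card_id": "C3", "station": "???", "timestamp": "08:03", "direction": "sideways"},  # corrupt value
]
usable, rejects = split_against_spec(raw_records)
print(f"Discarded {len(rejects) / len(raw_records):.0%} of the feed")  # 50% here; you hope for far less
```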

None of this is ‘sexy’ big data work, but getting to grips with it is absolutely essential to reaching conclusions that are informed by the strengths and weaknesses of the data. And the point of all of these ‘little secrets’ is not that big data-led work doesn’t have an enormous role to play in the future of many types of research – or in the emergence of a ‘digital turn’ in geography – but that we need to be much more careful than we have been so far in circumscribing the results and setting out the implications of our findings. If I didn’t believe passionately that big data is a very real augmentation of our capacity to do great research then I wouldn’t have invested so much of my time in finding research collaborators and partners who agree with me.

I feel very strongly that spotting and quantifying ‘the pattern’ is often just the very first step – especially when it comes to work dealing with society – in a much longer and less glamorous process that rarely gets the attention of journals or newspapers. We have a long tradition of good research design and execution, and it would be a shame to see it thrown aside because working with behavioural data is faster and cheaper. At the end of the day, behavioural data can give you the ‘what’, but in the case of social science research you still need to get out and speak to people to understand the ‘why’.