{"id":266,"date":"2013-08-09T11:14:35","date_gmt":"2013-08-09T11:14:35","guid":{"rendered":"http:\/\/www.reades.com\/?p=266"},"modified":"2013-08-09T11:14:35","modified_gmt":"2013-08-09T11:14:35","slug":"big-data-little-secrets-2","status":"publish","type":"post","link":"http:\/\/www.reades.com\/wp\/?p=266","title":{"rendered":"Big Data&#8217;s Little Secrets (Part 2)"},"content":{"rendered":"<p>\t\t\t\tIn my <a href=\"http:\/\/www.reades.com\/2013\/05\/31\/big-data-little-secret\/\">previous post<\/a> I looked at some of the issues affecting the extent to which \u2018big data\u2019 gives a reliable picture of the world around us. In this post I want to take you through one of the least sexy\u2014but most important\u2014parts: the data itself. My point, again, is not to suggest that big data is fatally flawed, but to call into question some of the easy assumptions upon which we rely when working with this type of data, and the universality of the conclusions that we can draw from this type of research.<\/p>\n<p><!--more--><\/p>\n<p><b>Operational &amp; Analytical Data<\/b><\/p>\n<p>When you really dig into the data sources that are typically used in big data research, the first thing that you need to understand is that you\u2019re not looking at the \u2018real\u2019 output of the system. The extent of the difference between what is actually happening \u2018live\u2019 and what comes through the \u2018cleaning\u2019 process into the hands of a data scientist can be difficult to grasp. Don\u2019t we just take the raw data and run it through our magical analytical algorithm? The simple, truthful answer is \u2018no\u2019.<\/p>\n<p>Based on many years working on ETL systems (Extract, Transform &amp; Load) for a range of clients and research partners, I am frequently surprised that what I see on my monthly billing statement bears <i>any<\/i> resemblance at all to what I think it should be based on my usage. 
The fact that you get a credit card statement, a mobile phone statement, and even an electricity bill that are even mostly accurate is one of the wonders of modern technology every bit as miraculous as WiFi or air travel.<\/p>\n<p>As I alluded to in the previous post, there are enormous differences between Call Data Records (from the mobile phone billing system) and Handovers (from the network management system). The divergent scales of the two data sources are a good indicator of this but, more subtly, you need to remember that the role of the billing system is to ensure that customers are billed for their calls and texts, and that the role of the network logs is to record how well the network is performing. To start asking either for locational data about individuals is to force them to do something <i>other<\/i> than what they were designed for.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/www.tfl.gov.uk\/businessandpartners\/syndication\/default.aspx\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-287\" alt=\"TfL Developers' Area\" src=\"http:\/\/www.reades.com\/wp-content\/uploads\/2013\/08\/Screen-Shot-2013-08-09-at-12.08.25.png\" width=\"520\" height=\"370\" \/><\/a><\/p>\n<p>To see why this matters, let\u2019s take a more straightforward case: tap-ins and tap-outs for London\u2019s Oyster charging system and the capacity they give us to build a picture of Origin\/Destination (O\/D) flows. 
Here is a short list of just some of the travel-related events (there are, of course, all sorts of other non-travel events as well) that can show up in the\u00a0<a href=\"http:\/\/www.tfl.gov.uk\/businessandpartners\/syndication\/16493.aspx\">public data feed<\/a>\u00a0generated by the Oyster Card charging system:<\/p>\n<ul>\n<li>You tap into a station only to discover that it\u2019s too crowded and decide to take another route by tapping out of the same station.<\/li>\n<li>You tap out of a station only to discover you\u2019ve made a terrible mistake and that you should have gone two more stops and tap back in to the same station.<\/li>\n<li>Your card is not properly read by the machine so you have to tap it against the reader multiple times.<\/li>\n<li>Your ticket is not properly stored on the card so you have to ask a staff member to tap you out.<\/li>\n<li>You accidentally tap someone else out of the system by following too closely behind them. Or you accidentally tap someone else in to the system by following too closely behind them.<\/li>\n<li>You disembark from one bus because it\u2019s too crowded and board a different one on the same route and so appear to board the same bus twice.<\/li>\n<li>You use an intermediate validator when travelling on the Overground even though you are travelling through Zone 1.<\/li>\n<li>You don\u2019t use an intermediate validator when travelling by Overground even though you aren\u2019t travelling through Zone 1.<\/li>\n<li>You use an Out-of-Station Interchange (OSI) such as the one between Euston and Euston Square and so appear to be exiting, but your ticket is still valid and there is no charge for entering Euston Square.<\/li>\n<li>You exit at a station where OSIs are allowed but it actually is the end of your journey so the OSI is invalidated automatically by the system the next time you tap in somewhere else.<\/li>\n<li>You use a validator to move between National Rail and the TfL Network. 
Or vice versa.<\/li>\n<li>You tap out of a mainline station, but began your journey outside the Oyster charging zone.<\/li>\n<li>You tap in at an Oyster-enabled station with your Oyster pass, travel outside the zone using your rail card, returning later to a different Oyster-enabled station.<\/li>\n<li>You exit the Tube at, say, Waterloo using your Oyster, take a train towards Deptford (using a National Rail ticket), and then tap on to a bus many miles away using your Oyster again.<\/li>\n<li>The gates are locked open and so you can\u2019t tap out.<\/li>\n<\/ul>\n<p>The list\u00a0<a href=\"http:\/\/www.oyster-rail.org.uk\/\">goes on<\/a>. Now multiply all of this by the number of different types of tickets (Pay-As-You-Go, Weekly, etc.) and the number of different Zones (adding in Extension Permits and such) where these events can happen, and you begin to have a sense of just how complex this system actually is. Plus, of course, there is the fact that you never have to tap out of a bus or tram.<\/p>\n<p>What you need to understand is that the Oyster system cannot ultimately tell the difference between <i>any<\/i> of these: it is simply trying to figure out if your ticket is valid and if the barriers should open, or remain shut. There are rules that the gate uses to assess whether a ticket is valid, then the response is logged and off you go. 
Oyster is a <i>ticketing<\/i> system, not a transit survey system, and so to try to ask an O\/D question is <i>immediately<\/i> to try to make it do something for which it was not configured.<\/p>\n<p><b>Really Understanding the Data<\/b><\/p>\n<p>So to ask a seemingly simple question about origin and destination flows (especially since you can\u2019t track magnetic tickets, which are used by certain types of travellers; see: Sampling Issues) is actually to raise a whole load of subsidiary questions about the extent to which the data is able to answer the question you are asking and about the way in which the data was processed to make it relevant for your research.<\/p>\n<p>Here\u2019s a simple example: if you use an intermediate validator at, for instance, Highbury &amp; Islington when switching between Tube and Overground then this is useful route choice information, but it\u2019s not an origin or a destination as most of us would understand it. But you can\u2019t just say \u201cOK, well we\u2019ll store one intermediate station if it comes up in the data\u201d because: 1) there is no limit to the number of intermediate validations that can come up in a single journey; 2) there\u2019s not one record coming through in the data that gives you the full story; and 3) some people use the validators to exit the station!<\/p>\n<p>The issue is that you are now looking at business rules. Since the definition of a trip adopted by the analysts may not be that of the system itself, there is (inevitably) an imperfect match between the two. 
Better still, the definition adopted by one analyst may not be the same as that adopted by another: that\u2019s why Transport for London and I have different answers to the question \u201cHow many people changed their behaviour during the Olympics?\u201d It\u2019s not that either of us is necessarily wrong (though it\u2019s certainly possible that I am), but that our choice of rules or definitions dramatically impacts our results.<\/p>\n<p>Furthermore, the choices that you made hours, days, or months ago while processing the data are likely to have already determined whether you can answer the questions you want to ask. A truly flexible analytical system requires you to design the right processing framework and put in the time to experiment, test, and (finally) to endlessly re-run parts of your ETL process in order to get something that is \u2018right\u2019. It can require weeks or even months of investment to get to the point where you really \u2018get\u2019 the data!<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/senseable.mit.edu\/signature-of-humanity\/\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-288\" alt=\"Signature of Humanity\" src=\"http:\/\/www.reades.com\/wp-content\/uploads\/2013\/08\/Screen-Shot-2013-08-09-at-12.12.45.png\" width=\"626\" height=\"328\" \/><\/a><\/p>\n<p>Properly \u2018big data\u2019 approaches that rely on NoSQL techniques \u2013 such as Hadoop\/BigTable\/MapReduce \u2013 are explicitly designed to speed up this iterative, exploratory process but they are\u00a0<i>not<\/i>\u00a0going to actually give you an understanding of the underlying data.\u00a0And the risks of\u00a0not\u00a0getting the data are severe: a student can\u00a0<a href=\"http:\/\/www.bbc.co.uk\/news\/magazine-22223190\">catch out top economists<\/a>\u00a0working in Excel with public data, but this same verification process is impossible with closely guarded big data sets to which only a group or two of researchers might have access. 
And it doesn\u2019t help that the people roped into supplying this data to bolshy researchers are usually the network engineers whose principal interest is in ensuring that the network doesn\u2019t fail catastrophically, not in ensuring that everything is delivered on-spec to a third party.<\/p>\n<p>That\u2019s partly why I have <i>often<\/i> encountered projects where the data specification did not remotely match the data eventually supplied. Sometimes it was a mistake in communication or in understanding, but sometimes it was an issue of which even our clients were unaware: a misconfiguration, or an improperly documented configuration\/option in a licensed system. There\u2019s a reason that companies are able to make money providing \u2018data integration services\u2019, and even then some proportion of <i>all<\/i> data (you hope that it\u2019s less than 15%) is simply discarded as unusable or corrupt.<\/p>\n<p>None of this is \u2018sexy\u2019 big data work, but getting to grips with it is absolutely essential to reaching conclusions that are informed by the strengths and weaknesses of the data. And the point of all of these \u2018little secrets\u2019 is not that big data-led work doesn\u2019t have an enormous role to play in the future of many types of research \u2013 or in the emergence of a &#8216;<a href=\"http:\/\/www.publiclysited.com\/geographys-digital-turn\/\">digital turn<\/a>&#8216; in geography \u2013 but that we need to be much more careful than we have been so far in circumscribing the results and setting out the implications of our findings. 
If I didn\u2019t believe passionately that big data is a very real augmentation of our capacity to do great research then I wouldn\u2019t have invested so much of my time in finding research collaborators and partners who agree with me.<\/p>\n<p>I feel very strongly that spotting and quantifying \u2018the pattern\u2019 is often just the very first step \u2013 especially when it comes to work dealing with society \u2013 in a\u00a0<a href=\"http:\/\/arxiv.org\/abs\/1301.1674\">much longer and less glamorous process<\/a>\u00a0that rarely gets the attention of journals or newspapers. We have a long tradition of good research design and execution, and it would be a shame to see it thrown aside because working with behavioural data is faster and cheaper. At the end of the day, behavioural data can give you the \u2018what\u2019, but in the case of social science research you still need to get out and speak to people to understand the \u2018why\u2019.\t\t<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my previous post I looked at some of the issues affecting the extent to which \u2018big data\u2019 gives a reliable picture of the world around us. In this post I want to take you through one of the least sexy\u2014but most important\u2014parts: the data itself. 
My point, again, is not to suggest that big [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":286,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14,4],"tags":[25,36,67,101],"class_list":["post-266","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data","category-fp","tag-big-data-2","tag-data-2","tag-mobile-phones","tag-social-science"],"_links":{"self":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts\/266","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=266"}],"version-history":[{"count":0,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts\/266\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=266"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=266"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=266"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}