{"id":420,"date":"2021-12-08T11:07:52","date_gmt":"2021-12-08T11:07:52","guid":{"rendered":"http:\/\/www.reades.com\/?p=420"},"modified":"2026-01-25T17:33:58","modified_gmt":"2026-01-25T17:33:58","slug":"spatial-data-science-a-few-thoughts-for-students","status":"publish","type":"post","link":"http:\/\/www.reades.com\/wp\/?p=420","title":{"rendered":"(Spatial) Data Science: A Few Thoughts for Students"},"content":{"rendered":"\n\t\t\t\t\n<p>I was recently approached by an American Masters student with the following email:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>I am working on a course assignment that requires me to contact a Data Science professional to gain insights into the industry and profession&#8230;  Would you be willing to answer a few questions about your work?<\/p><cite>Student Email<br><\/cite><\/blockquote>\n\n\n\n<p>Here&#8217;s what I had to say about my work&#8230; <\/p>\n\n\n\n<hr class=\"wp-block-separator is-style-wide\"\/>\n\n\n\n<p>Having put some effort into answering the (good) questions that followed, I figured it was worthwhile sharing my responses\u2014not because I think I have <em>the<\/em> answers, but because the <em>differences<\/em> between my and other people&#8217;s answers might be useful for students to reflect upon <em>regardless<\/em> of where they are at in their studies. As always, my advice is worth what you paid for it&#8230;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) What sparked your interest in the field of spatial data science?\u00a0<\/h3>\n\n\n\n<p>I\u2019ve sort of covered this elsewhere online in my bio(s), but it boils down to the fact that when I got into the field large data sets were becoming available to study cities (mobile, transit smart card, etc.) but very few planners or geographers were geared up (then) to work with these types of data. 
<\/p>\n\n\n\n<p>So I saw an opportunity to study something I was really interested in (cities) using skills that I enjoyed deploying (coding) and which were hard to find on the ground amongst researchers (geographers and planners). Although we called it &#8216;database mining&#8217; back then, and lacked many of the cool algorithms and tools available to use now, the nature of the &#8216;work&#8217; hasn&#8217;t changed all that much. <\/p>\n\n\n\n<p>As well, even in the late 2000s hardware had already progressed to the point where you didn\u2019t need \u00a320,000 computers in order to process the data\u2014I no longer call any of what I do now \u2018big data\u2019 if speaking with a CS person because I can do it all on a high-spec laptop even if it sometimes takes a day to run. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2) In your view, how does the field of data science compare and expand upon traditional academic disciplines such as geography and economics?<\/h3>\n\n\n\n<p>You can find <em>at least<\/em> two quite contrasting views on this:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><a href=\"https:\/\/onlinelibrary.wiley.com\/doi\/10.1111\/gec3.12403\"><em>Geography and computers: Past, present, and future<\/em><\/a> and also <em><a href=\"https:\/\/onlinelibrary.wiley.com\/doi\/full\/10.1111\/gean.12194\">Geographic Data Science<\/a><\/em> \u2014 if you don&#8217;t have access to these as an academic then check the Institutional Repositories such as <a href=\"https:\/\/kclpure.kcl.ac.uk\/portal\/en\/persons\/jonathan-reades(1b1fb620-3cb0-4c5c-a863-24dc9b1b9e95)\/publications.html\">KCL Pure<\/a> as well as <a href=\"https:\/\/scholar.google.co.uk\/citations?user=L_S18IEAAAAJ&amp;hl=en&amp;oi=ao\">Google Scholar<\/a>.<\/li><li><em><a href=\"https:\/\/onlinelibrary.wiley.com\/doi\/full\/10.1111\/gec3.12537\">Why geographic data science is not a science<\/a><\/em> \u2014 obviously not my view, but poses some good questions that I hope to answer one of these 
days.<\/li><\/ol>\n\n\n\n<p>So I\u2019d probably define Spatial Data Science or Geographic Data Science mostly in terms of creating a readily-identifiable interface between disciplines both for those <em>inside<\/em>\u00a0the discipline and for those\u00a0<em>outside<\/em> of it: in this case it\u2019s principally geography and computer science. You could <em>also<\/em> think of it as a spatially-aware approach to data science; one that is informed both by spatial analytical techniques and by geographic theory so as to ensure that geography is a first-class citizen in the analysis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3) How do you approach a data science research project?<\/h3>\n\n\n\n<p>If you mean, where do the projects originate, then I\u2019d say it\u2019s the same way as I\u2019d approach any other research project, which is to say: it depends on the origins of the project. By knowing a fair bit about particular areas of research from reading the academic lit I am always on the lookout <a href=\"https:\/\/doi.org\/10.1177\/0042098018789054\">for data and methods<\/a> (IR version <a href=\"https:\/\/kclpure.kcl.ac.uk\/portal\/files\/97604699\/Author_s_final_version.pdf\">here<\/a>) that could extend our understanding of this area. <\/p>\n\n\n\n<p>For example, I\u2019m interested in housing markets, but here in the UK we lack detailed data on the characteristics of sold properties\u2026 there\u2019s another data set out there that does have this data but wasn\u2019t intended for housing market analysis\u2026 so can we fuse these two data sets in order to better understand the market? 
The answer is <a href=\"https:\/\/ucl.scienceopen.com\/hosted-document?doi=10.14324\/111.444\/ucloe.000019\">definitely yes<\/a> (and see also <a href=\"https:\/\/onlinelibrary.wiley.com\/doi\/full\/10.1111\/gean.12287\">this<\/a>), and this data can now inform <a href=\"https:\/\/www.liverpooluniversitypress.co.uk\/journals\/article\/60464\">other research<\/a> (accessible version <a href=\"http:\/\/cp-cloudpublish-public.s3.amazonaws.com\/p6\/5f75dcf52bdcd.pdf\">here<\/a>).<\/p>\n\n\n\n<p>But as I\u2019ve developed a \u2018presence\u2019 in this area <a href=\"https:\/\/journals.sagepub.com\/doi\/10.1177\/0038026120906790\">other researchers<\/a> have begun to approach me in order to tackle questions they\u2019ve not been able to solve but for which they have a sense that a \u2018data science\u2019 approach could help. <\/p>\n\n\n\n<p>So as a\u00a0<em>researcher<\/em> (not an industrial analyst or modeller) I look for gaps between what we know and what novel data\/methods make accessible. 
But I think it\u2019s important that it\u2019s not just \u201cOh, give me your data and I\u2019ll give you an answer\u201d: everything is informed by the fact that I already understand <em>something<\/em> of the problem domain.<\/p>\n\n\n\n<p>If you mean, what kind of tools do I use when developing a project, then it\u2019s very much the data science workflow: Git\/GitHub, Docker, JupyterLab, Python (conda, sklearn, etc.), R (rarely for me, others use it all the time), Postgres\/PostGIS (because of spatial support), Markdown with LaTeX\/BibTeX, and the classic ETL (Extract, Transform and Load) pipeline for working with data.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4) How do you prepare and gather information relevant to a research project that you are working on?<\/h3>\n\n\n\n<p>I\u2019d divide that into information about the context (literature\u2014often in a folder on Dropbox and shared with collaborators with references going into a .bib file) and information about the techniques (tools\/methods\/data): the amount of preparation and information gathering needed depends on how much of each of these branches is \u2019novel\u2019 (to me). <\/p>\n\n\n\n<p>The former is typically a normal literature review process, just using \u2018modern\u2019 tools such as Markdown and BibTeX in place of full-on LaTeX or Word+EndNote to keep track of what I find. The latter is often a mix of Twitter\/Google\/notes in GitHub repos, other people\u2019s notebooks\u2026 whatever seems relevant, really, often tracked via the README file in a repo or in individual notebooks.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5) How important are communication skills in your field?<\/h3>\n\n\n\n<p>I\u2019d say essential, because you are dealing in abstracts (millions or billions of rows of data and difficult issues of interpretation\/subjectivity) that often mean little to &#8216;non-quantitative\u2019 people. 
Big numbers seem to give confidence, but in most cases what they really call for is more humility and a greater appreciation of the uncertainty that adheres to real-world data and data collection systems. And the &#8216;non-quants&#8217; are often the people who know the most about the problem, so you need to use communication skills to present things clearly and simply, even if they\u2019re not. You could, of course, also abuse those skills to \u2018big up\u2019 and obfuscate so as to avoid scrutiny. <\/p>\n\n\n\n<p>At one extreme of the communications spectrum are the students who have a &#8216;Limitations&#8217; section that is so self-flagellating and raises such monstrous issues of validity that it seems to call into question the whole purpose of the research in the first place. At the other end are those who say (as was said to me by a large company&#8217;s in-house data science team): &#8220;We have all the data, why do we need a model?&#8221; Somewhere in between is the right balance of &#8220;Look, there are some questions here about the data or our approach that should be factored into our next steps, but from the evidence that we have it looks like&#8230;&#8221;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6) What type of communication skills are important and why? Do you have examples of communication breakdowns you can share?<\/h3>\n\n\n\n<p>My\u00a0<em>sense<\/em>\u00a0is that people tend to think that Data Science communication is about data visualisation, but to my mind that\u2019s like saying the only part of your journey that matters is the last 5<em>mi<\/em>. Listening is pretty bloody important: how do you know if you\u2019re working on the right problem to begin with? How do you know if your \u2018answer\u2019 is going to be relevant to the question? 
<\/p>\n\n\n\n<p>I see a lot of PhD students spend way too much time trying to find the ideal solution when all they need is a workable one: I\u2019ve changed the way that I supervise as a result and we now follow a more \u2018Agile\u2019 approach of regular, small deliverables. This seems to help.<\/p>\n\n\n\n<p>There are near-legendary Agent-Based Models that take a month to run since they are simulating most of a city at the household level: I can&#8217;t help wondering if anyone even <em>uses<\/em> the output from these models in the way it was intended. All models are wrong, but some are useful: useful because they allow us to explore scenarios; useful because they give an answer in a meaningful timeframe; useful because they point to boundary-conditions that need attention.<\/p>\n\n\n\n<p>This is all much clearer in the corporate world, but parachuting in a \u2018data scientist\u2019 who is interested only in the algorithms, not the data (where it came from, how it was produced, etc) or the end-user is basically guaranteeing either failure or reinvention of the wheel. <\/p>\n\n\n\n<p>Regardless of audience, writing skills are important: a choice metaphor, explaining a complex issue in layperson\u2019s terms, selecting the evidence and justifying those selections.<\/p>\n\n\n\n<p>Many years ago I was consulting to a large company: we had a very good relationship with their CRM team but had been so successful that their much larger marketing and IT teams worked together to dismember the smaller CRM group. IT announced that they didn\u2019t see the value of our consultancy work and would bring our custom data integration toolset and platform in-house to save money. 
<\/p>\n\n\n\n<p>We were many weeks into this migration process (handing over documentation, answering queries\u2026) when one of their analysts blurted out in a meeting \u201cwait, so you\u2019re doing this every single\u00a0<em>night<\/em>?\u201d It was clear they hadn\u2019t remotely understood what we (or their own CRM people) had been saying the whole way through. That didn\u2019t stop them proceeding with the project and pretending they\u2019d replicated our functionality internally, but they hadn\u2019t. <\/p>\n\n\n\n<p><em>Lesson<\/em>: politics matters more than anything else in most (large) organisations. We were \u2018right\u2019 and had a better product about which we (thought we) clearly communicated the benefits, but we couldn\u2019t win the argument because it wasn\u2019t really about the product. So I&#8217;d say that there were multiple types of communication failure here but that&#8217;s because it was the unspoken requirements that really mattered. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">7) What technologies, software, and skill sets do you recommend for students looking to enter the field?<\/h3>\n\n\n\n<p>I think it would be near-impossible to enter the field without knowledge of Python\u00a0<em>or<\/em>\u00a0R. Being functional in both is better still since workplaces tend to use one or the other preferentially and bringing in a &#8216;switch hitter&#8217; never hurts. I think academics have a slight preference for R while corporations have a slight preference for Python but both will use whichever language gets the job done. <\/p>\n\n\n\n<p>After that, knowledge of database design and Postgres\/PostGIS is an important bonus\u2014other databases (e.g. MySQL) have come a long way but PostGIS is still the gold standard for spatial I think and would allow you to support QGIS users or to make prettier \/ more complex outputs via QGIS (I think, a bit unfairly, of QGIS as being like Photoshop for maps). 
Ability to use Git\/GitHub to manage code, fork and pull, rollback, and test would presumably be seen as very important too. <\/p>\n\n\n\n<p>Obviously a tech-company like Google is looking for something different from an analytics-focussed consulting company: I personally would rather see someone who could effectively use StackOverflow (<em>note<\/em>: I don&#8217;t just mean copy+paste) to get things done than someone who could sketch a b-tree on a whiteboard but has no idea how to actually look at real-world data (which is always messy and requires attention to detail), but that\u2019s because I\u2019m not writing the tools, I\u2019m using them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">8) Aside from academic resources, what resources do you use (if any) to stay up to date on the software and technical advancements in the field?<\/h3>\n\n\n\n<p>Twitter and Medium (e.g. Towards Data Science) are my main sources for \u2018<em>what<\/em> is happening in field X\u2019. I use StackOverflow all the time to update specific coding knowledge (\u2018<em>how<\/em> to do X\u2019). I also use Pocket to collect information on a topic I am planning to research. Medium and Pocket both suggest related articles that support further learning.\u00a0<\/p>\n\n\n\n<p>Networking is also essential: meeting up (face to face or electronically\u2014even via WhatsApp or Twitter DMs) is a good way to find out what\u2019s on the horizon, talk out ideas, get advice, etc. In my bit of academia it\u2019s very collegial and there\u2019s lots of willingness to share. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">9) What are some of the emerging trends currently taking place in the field of spatial data science?<\/h3>\n\n\n\n<p>Spatially-aware ML seems like a big deal to me. Most ML ignores space as a \u2019special\u2019 category of data even though you can find it embedded in distance-based tools like k-Nearest Neighbours, DBSCAN, and so forth. 
It may not matter that much for many applications (someone tested Random Forests against GWR and didn\u2019t find much difference) but it might matter a lot for some.<\/p>\n\n\n\n<p>I\u2019d say that textual data is also going to matter: NLP hasn\u2019t been much-considered beyond basic sentiment analysis of Twitter. This is starting to change (see: Elizabeth Delmelle\u2019s work on Charlotte and Emmanouil Tranos et al.\u2019s stuff with the Internet Archive) but my guess is that, because NLP is hard and computationally intensive, it\u2019s been neglected.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">10) How do you envision that the field and its associated methods and technologies will evolve moving into the future?<\/h3>\n\n\n\n<p>See above.<\/p>\n\n\n\n<p><\/p>\n\t\t","protected":false},"excerpt":{"rendered":"<p>\t\t\t\tWhat do you say when someone asks: &#8220;I&#8217;m working on a course assignment&#8230; to gain insights into the [data science] industry&#8221;?\t\t<\/p>\n","protected":false},"author":1,"featured_media":431,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,10],"tags":[],"class_list":["post-420","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-coding","category-teaching"],"_links":{"self":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts\/420","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=420"}],"version-history":[{"count":1,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts\/420\/
revisions"}],"predecessor-version":[{"id":432,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts\/420\/revisions\/432"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/media\/431"}],"wp:attachment":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=420"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=420"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=420"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}