(Spatial) Data Science: A Few Thoughts for Students

I was recently approached by an American Masters student with the following email:

I am working on a course assignment that requires me to contact a Data Science professional to gain insights into the industry and profession… Would you be willing to answer a few questions about your work?
Student Email

Here’s what I had to say about my work…

Having put some effort into answering the (good) questions that followed, I figured it was worthwhile sharing my responses: not because I think I have the answers, but because the differences between my answers and other people’s might be useful for students to reflect upon, regardless of where they are in their studies. As always, my advice is worth what you paid for it…

1) What sparked your interest in the field of spatial data science?

I’ve sort of covered this elsewhere online in my bio(s), but it boils down to the fact that when I got into the field large data sets were becoming available to study cities (mobile, transit smart card, etc.) but very few planners or geographers were geared up (then) to work with these types of data.

So I saw an opportunity to study something I was really interested in (cities) using skills that I enjoyed deploying (coding) and which were hard to find on the ground amongst researchers (geographers and planners). Although we called it ‘database mining’ back then, and lacked many of the cool algorithms and tools available to use now, the nature of the ‘work’ hasn’t changed all that much.

As well, even in the late 2000s hardware had already progressed to the point where you didn’t need £20,000 computers in order to process the data—I no longer call any of what I do now ‘big data’ if speaking with a CS person because I can do it all on a high-spec laptop even if it sometimes takes a day to run.

2) In your view, how does the field of data science compare and expand upon traditional academic disciplines such as geography and economics?

You can find at least two quite contrasting views on this:

Geography and computers: Past, present, and future and also Geographic Data Science — if you don’t have access to these as an academic then check the Institutional Repositories such as KCL Pure as well as Google Scholar.
Why geographic data science is not a science — obviously not my view, but poses some good questions that I hope to answer one of these days.

So I’d probably define Spatial Data Science or Geographic Data Science mostly in terms of creating a readily-identifiable interface between disciplines, both for those inside the discipline and for those outside of it: in this case it’s principally geography and computer science. You could also think of it as a spatially-aware approach to data science; one that is informed both by spatial analytical techniques and by geographic theory so as to ensure that geography is a first-class citizen in the analysis.

3) How do you approach a data science research project?

If you mean, where do the projects originate, then I’d say it’s the same way as I’d approach any other research project, which is to say: it depends on the origins of the project. Because I know a fair bit about particular areas of research from reading the academic literature, I am always on the lookout for data and methods (IR version here) that could extend our understanding of those areas.

For example, I’m interested in housing markets, but here in the UK we lack detailed data on the characteristics of sold properties… there’s another data set out there that does have this data but wasn’t intended for housing market analysis… so can we fuse these two data sets in order to better-understand the market? The answer is definitely yes (and see also this), and this data can now inform other research (accessible version here).
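To make the idea of ‘fusing’ two data sets a bit more concrete, here is a minimal sketch. It is not the method used in the research mentioned above: it simply assumes that both sources carry a postcode and a free-text address, and that a crude key (postcode plus leading house number) is enough to link them. The records, field names, and the `address_key` and `fuse` helpers are all invented for illustration.

```python
import re

def address_key(postcode: str, address: str) -> tuple[str, str]:
    """Build a crude join key: postcode without spaces plus the leading house number."""
    pc = re.sub(r"\s+", "", postcode).upper()
    match = re.match(r"\s*(\d+[A-Za-z]?)", address)
    number = match.group(1).upper() if match else ""
    return pc, number

# Hypothetical records: one source has prices, the other has property characteristics.
sales = [
    {"postcode": "AB1 2CD", "address": "12 High Street", "price": 250000},
    {"postcode": "ab12cd", "address": "14 High St", "price": 310000},
    {"postcode": "EF3 4GH", "address": "Flat 2, 7 Mill Lane", "price": 180000},
]
characteristics = [
    {"postcode": "AB1 2CD", "address": "12, HIGH STREET", "floor_area_m2": 85},
    {"postcode": "AB1 2CD", "address": "14 High Street", "floor_area_m2": 102},
]

def fuse(sales, characteristics):
    """Attach floor area to each sale where the keys agree; keep the rest for review."""
    lookup = {address_key(r["postcode"], r["address"]): r for r in characteristics}
    matched, unmatched = [], []
    for sale in sales:
        key = address_key(sale["postcode"], sale["address"])
        extra = lookup.get(key)
        # A key with no house number is too weak to trust, so treat it as unmatched.
        if extra is None or key[1] == "":
            unmatched.append(sale)
        else:
            matched.append({**sale, "floor_area_m2": extra["floor_area_m2"]})
    return matched, unmatched

matched, unmatched = fuse(sales, characteristics)
```

In practice most of the effort goes into the unmatched pile (flats, named houses, typos), which is exactly where attention to detail and knowledge of how each data set was produced start to matter.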

But as I’ve developed a ‘presence’ in this area other researchers have begun to approach me in order to tackle questions they’ve not been able to solve but for which they have a sense that a ‘data science’ approach could help.

So as a researcher (not an industrial analyst or modeller) I look for gaps between what we know and what novel data/methods make accessible. But I think it’s important that it’s not just “Oh, give me your data and I’ll give you an answer”: everything is informed by the fact that I already understand something of the problem domain.

If you mean, what kind of tools do I use when developing a project, then it’s very much the data science workflow: Git/GitHub, Docker, JupyterLab, Python (conda, sklearn, etc.), R (rarely for me, others use it all the time), Postgres/PostGIS (because of spatial support), Markdown with LaTeX/BibTeX, and the classic ETL (Extract, Transform and Load) pipeline for working with data.
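For anyone who hasn’t met an ETL pipeline before, the shape of one can be sketched in a few lines. This is an illustration only: it uses Python’s built-in `csv` and `sqlite3` modules, with an in-memory SQLite database standing in for Postgres/PostGIS, and the station data, table name, and function names are all made up.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in real life this would be a file or an API response.
RAW_CSV = """id,name,lon,lat
1,Station A,-0.1276,51.5072
2,Station B,,51.5200
3,Station C,-0.0900,51.5100
"""

def extract(text):
    """Extract: read the raw rows as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast types and drop rows with missing or malformed coordinates."""
    clean = []
    for row in rows:
        try:
            clean.append(
                (int(row["id"]), row["name"].strip(), float(row["lon"]), float(row["lat"]))
            )
        except ValueError:
            continue  # e.g. an empty longitude; a real pipeline would log this
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stations "
        "(id INTEGER PRIMARY KEY, name TEXT, lon REAL, lat REAL)"
    )
    conn.executemany("INSERT INTO stations VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
count = conn.execute("SELECT COUNT(*) FROM stations").fetchone()[0]
```

Keeping the three stages as separate functions means each can be re-run and checked on its own, which helps when the raw data turns out to be messier than expected.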

4) How do you prepare and gather information relevant to a research project that you are working on?

I’d divide that into information about the context (literature, often in a folder on Dropbox and shared with collaborators, with references going into a .bib file) and information about the techniques (tools/methods/data): the amount of preparation and information gathering needed depends on how much of each of these branches is ‘novel’ (to me).

The former is typically a normal literature review process, just using ‘modern’ tools such as Markdown and BibTeX in place of full-on LaTeX or Word+EndNote to keep track of what I find. The latter is often a mix of Twitter/Google/notes in GitHub repos, other people’s notebooks… whatever seems relevant, really, often tracked via the README file in a repo or in individual notebooks.

5) How important are communication skills in your field?

I’d say essential, because you are dealing in abstracts (millions or billions of rows of data and difficult issues of interpretation/subjectivity) that often mean little to ‘non-quantitative’ people. Big numbers seem to give confidence, but in most cases what they really call for is more humility and a greater appreciation of the uncertainty that adheres to real-world data and data collection systems. And the ‘non-quants’ are often the people who know the most about the problem, so you need to use communications skills to present things clearly and simply, even if they’re not. You could, of course, also abuse those skills to ‘big up’ and obfuscate so as to avoid scrutiny.

At one extreme of the communications spectrum are the students who have a ‘Limitations’ section that is so self-flagellating and raises such monstrous issues of validity that it seems to call into question the whole purpose of the research in the first place. At the other end are those who say (as was said to me by a large company’s in-house data science team): “We have all the data, why do we need a model?” Somewhere in between is the right balance of “Look, there are some questions here about the data or our approach that should be factored into our next steps, but from the evidence that we have it looks like…”

6) What type of communication skills are important and why? Do you have examples of communication breakdowns you can share?

My sense is that people tend to think that Data Science communications is about data visualisation, but to my mind that’s like saying the only part of your journey that matters is the last five miles. Listening is pretty bloody important: how do you know if you’re working on the right problem to begin with? How do you know if your ‘answer’ is going to be relevant to the question?

I see a lot of PhD students spend way too much time trying to find the ideal solution when all they need is a workable one: I’ve changed the way that I supervise as a result and we now follow a more ‘Agile’ approach of regular, small deliverables. This seems to help.

There are near-legendary Agent-Based Models that take a month to run since they are simulating most of a city at the household level: I can’t help wondering if anyone even uses the output from these models in the way it was intended. All models are wrong, but some are useful: useful because they allow us to explore scenarios; useful because they give an answer in a meaningful timeframe; useful because they point to boundary-conditions that need attention.

This is all much clearer in the corporate world, but parachuting in a ‘data scientist’ who is interested only in the algorithms, not the data (where it came from, how it was produced, etc) or the end-user is basically guaranteeing either failure or reinvention of the wheel.

Regardless of audience, writing skills are important: a choice metaphor, explaining a complex issue in layperson’s terms, selecting the evidence and justifying those selections.

Many years ago I was consulting to a large company: we had a very good relationship with their CRM team but had been so successful that their much larger marketing and IT teams worked together to dismember the smaller CRM group. IT announced that they didn’t see the value of our consultancy work and would bring our custom data integration toolset and platform in-house to save money.

We were many weeks into this migration process (handing over documentation, answering queries…) when one of their analysts blurted out in a meeting “wait, so you’re doing this every single night?” It was clear they hadn’t remotely understood what we (or their own CRM people) had been saying the whole way through. That didn’t stop them proceeding with the project and pretending they’d replicated our functionality internally, but they hadn’t.

Lesson: politics matters more than anything else in most (large) organisations. We were ‘right’ and had a better product about which we (thought we) clearly communicated the benefits, but we couldn’t win the argument because it wasn’t really about the product. So I’d say that there were multiple types of communication failure here but that’s because it was the unspoken requirements that really mattered.

7) What technologies, software, and skill sets do you recommend for students looking to enter the field?

I think it would be near-impossible to enter the field without knowledge of Python or R. Being functional in both is better still since workplaces tend to use one or the other preferentially and bringing in a ‘switch hitter’ never hurts. I think academics have a slight preference for R while corporations have a slight preference for Python but both will use whichever language gets the job done.

After that, knowledge of database design and Postgres/PostGIS is an important bonus—other databases (eg MySQL) have come a long way but PostGIS is still the gold standard for spatial I think and would allow you to support QGIS users or to make prettier / more complex outputs via QGIS (I think, a bit unfairly, of QGIS as being like Photoshop for maps). Ability to use Git/GitHub to manage code, fork and pull, rollback, and test would presumably be seen as very important too.

Obviously a tech-company like Google is looking for something different from an analytics-focussed consulting company: I personally would rather see someone who could effectively use StackOverflow (note: I don’t just mean copy+paste) to get things done than someone who could sketch a b-tree on a whiteboard but has no idea how to actually look at real-world data (which is always messy and requires attention to detail), but that’s because I’m not writing the tools, I’m using them.

8) Aside from academic resources, what resources do you use (if any) to stay up to date on the software and technical advancements in the field?

Twitter and Medium (eg. Towards Data Science) are my main sources for ‘what is happening in field X’. I use StackOverflow all the time to update specific coding knowledge (‘how to do X’). I also use Pocket to collect information on a topic I am planning to research. Medium and Pocket both suggest related articles that support further learning.

Networking is also essential: meeting up (face to face or electronically—even via WhatsApp or Twitter DMs) is also a good way to find out what’s on the horizon, talk out ideas, get advice, etc. In my bit of academia it’s very collegial and there’s lots of willingness to share.

9) What are some of the emerging trends currently taking place in the field of spatial data science?

Spatially-aware ML seems like a big deal to me. Most ML ignores space as a ‘special’ category of data even though you can find it embedded in distance-based tools like k-Nearest Neighbours, DBSCAN, and so forth. It may not matter that much for many applications (someone tested Random Forests against GWR and didn’t find much difference) but it might matter a lot for some.
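One simple way to make an otherwise aspatial model a little more ‘spatially aware’ is to hand it a spatial lag: for each observation, the average value among its nearest neighbours. The sketch below is only an illustration of that idea, not a recommendation of any particular method. It assumes planar (projected) coordinates so that straight-line distance is meaningful, and the points, values, and the `spatial_lag` function are invented.

```python
import math

def spatial_lag(points, values, k=2):
    """For each point, average the values of its k nearest neighbours (excluding itself)."""
    lags = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours = [j for _, j in dists[:k]]
        lags.append(sum(values[j] for j in neighbours) / len(neighbours))
    return lags

# Hypothetical data: two small clusters of locations with different value levels.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
values = [100.0, 110.0, 120.0, 300.0, 310.0, 320.0]

lags = spatial_lag(points, values, k=2)

# The lag can then sit alongside the coordinates as an extra feature for any model.
features = [(x, y, lag) for (x, y), lag in zip(points, lags)]
```

This brute-force version compares every pair of points, so it is fine for a toy example but would need a spatial index for anything large. If the target itself is used to build the lag, as here, some care is also needed to avoid leaking information between training and test observations.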

I’d say that textual data is also going to matter: NLP hasn’t been much-considered beyond basic sentiment analysis of Twitter. This is starting to change (see: Elizabeth Delmelle’s work on Charlotte and Emmanouil Tranos et al.’s stuff with the Internet Archive) but my guess is that, because NLP is hard and computationally intensive, it’s been neglected.

10) How do you envision that the field and its associated methods and technologies will evolve moving into the future?

See above.