{"id":422,"date":"2026-01-25T15:48:11","date_gmt":"2026-01-25T15:48:11","guid":{"rendered":"http:\/\/www.reades.com\/wp\/?p=422"},"modified":"2026-01-25T17:27:22","modified_gmt":"2026-01-25T17:27:22","slug":"last-night-a-db-saved-my-life-part-1","status":"publish","type":"post","link":"http:\/\/www.reades.com\/wp\/?p=422","title":{"rendered":"Last night a DB saved my life (Part 1)"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Every so often, a technology comes along that is magical in how well it works, and transformative in terms of its impact on your work. Wifi was one of those: I <em>still<\/em> remember the first time I used it and the computer that I used it on. There was no going back to cat-5 at my desk. Using DuckDB for the first time had that same magical feeling: which is why, over the past 12\u201318 months, DuckDB + Parquet has replaced Pandas, Dask <em>and<\/em> ad-hoc Postgres instances in most of my research workflow.<\/p>\n\n\n\n<figure class=\"wp-block-pullquote\"><blockquote><p>DuckDB has revolutionised my workflow in the &#8216;in-between&#8217; area where most of us spend most of our time: getting the data into a format where it&#8217;s useful. <\/p><\/blockquote><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For me, DuckDB deals with a class of really important but really annoying research data processing tasks that previously required either: a) the patience of a saint; or b) the massive overhead of setting up a database. In other words, DuckDB has revolutionised my workflow in that &#8216;in-between&#8217; area where most of us actually spend most of our time: getting the data into a format where it&#8217;s useful. To be clear, this is written from a research perspective: it&#8217;s for exploratory work, &#8216;large-but-not-big&#8217; data, reproducibility, and minimal infrastructure overhead.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>TL;DR<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">DuckDB:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Gives you disposable database infrastructure<\/li>\n\n\n\n<li>Turns your Parquet files into database tables<\/li>\n\n\n\n<li>Helps you use SQL as transparent workflow documentation<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How DuckDB changes workflow<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">What makes DuckDB so special? Two things: 1) there is no long-running server since queries execute either fully in-memory or directly over the specified files; and 2) it uses Parquet files natively<sup>[1]<\/sup>. Together, these allow you do the work of a fully-fledged database server without the need to ever set one up.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[1]. <em>You can create native DuckDB database files, but for my research workflows querying Parquet directly increases shareability and removes a layer of overhead that I don\u2019t need.<\/em><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">The Benefits of Parquet<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Parquet files (technically, Parquet+Arrow) have three major advantages over anything I&#8217;ve used before:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>They are columnar: you only retrieve the columns that use. A CSV file requires you to scan row-by-row even if you only want to read a couple of the <em>columns<\/em>. With parquet, can read in the 1st, 21st, and 195th columns of the file directly <em>as if<\/em> they are the only attributes in your data set. So your data loads and can be accessed much more quickly.<\/li>\n\n\n\n<li>They are compressed: your data is compressed and stored with rich column-level metadata so that it occupies the minimum amount possible in your computer&#8217;s memory. Things that might have previously maxed out the RAM on my laptop now barely register. This is a real boon when you&#8217;re using virtual machines (e.g. Docker, Podman) on a laptop.<\/li>\n\n\n\n<li>Each parquet file can be treated as a table in a database: if you want to give someone access to one of your &#8216;tables&#8217;, just send them the parquet file. If you want to give them a copy of your database, stick all of the parquet files on a network share (as we did) and everyone now has access.<\/li>\n\n\n\n<li>Parquet files that share the same format can be treated as a single table: I think this is less well-known, but so long as the layout of the files are the same, you can query multiple parquet files as if they were part of the same table.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">From this some other pretty amazing things follow:<\/p>\n\n\n\n<ol start=\"5\" class=\"wp-block-list\">\n<li>You can replace complex and time-consuming load+link steps in Pandas\/Python with a single SQL statement using LEFT\/RIGHT\/INNER JOIN commands.<\/li>\n\n\n\n<li>So long as you embed the SQL workflow in your Python using the <code>duckdb<\/code> library, you&#8217;ll never be left wondering how you generated a particular table or view since you can see each step directly in your repository workflow.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Every so often, a technology comes along that is magical in how well it works, and transformative in terms of its impact on your work. Wifi was one of those: I *still* remember the first time I used it and the computer that I used it on. There was no going back to cat-5 at my desk. Using DuckDB for the first time had that same magical feeling: which is why, over the past 12\u201318 months, DuckDB + Parquet has replaced Pandas, Dask *and* ad-hoc Postgres instances in most of my research workflow. <\/p>\n","protected":false},"author":1,"featured_media":430,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,3,4,9,11],"tags":[36,120,121],"class_list":["post-422","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-coding","category-data","category-fp","category-research","category-tech","tag-data-2","tag-duckdb","tag-parquet"],"_links":{"self":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts\/422","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=422"}],"version-history":[{"count":2,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts\/422\/revisions"}],"predecessor-version":[{"id":425,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/posts\/422\/revisions\/425"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=\/wp\/v2\/media\/430"}],"wp:attachment":[{"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=422"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=422"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.reades.com\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=422"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}