Small Data

Small data appears to be a very exciting movement, one that is shifting the Overton window away from Big Data and toward much simpler, cheaper solutions.

Its main drivers are the following:

    1. organizations don’t use that much data
    2. hardware is getting really, really good

1. Orgs don’t use much data

> Of queries that scan at least 1 MB, the median query scans about 100 MB. The 99.9th percentile query scans about 300 GB. Analytic databases like Snowflake and Redshift are “massively parallel processing” systems, but 99.9% of real-world queries could run on a single large node. I did the analysis for this post using DuckDB, and it can scan the entire 11 GB Snowflake query sample on my Mac Studio in a few seconds. The small data hypothesis is true.

src: https://www.fivetran.com/blog/how-do-people-use-snowflake-and-redshift
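For flavor, here’s roughly what that kind of analysis looks like with DuckDB’s Python API - a minimal sketch, where the file name and the `bytes_scanned` column are hypothetical stand-ins for whatever schema the actual Snowflake/Redshift query sample uses:

```python
import duckdb

# Hypothetical query-log sample; file and column names are illustrative,
# not the actual schema used in the Fivetran analysis.
q = """
    SELECT
        quantile_cont(bytes_scanned, 0.5)   AS median_bytes,
        quantile_cont(bytes_scanned, 0.999) AS p999_bytes
    FROM 'query_sample.parquet'
    WHERE bytes_scanned >= 1e6   -- only queries scanning at least 1 MB
"""
print(duckdb.sql(q))
```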

When we think about new database architectures, we’re hypnotized by scaling limits. If it can’t handle petabytes, or at least terabytes, it’s not in the conversation. But most applications will never see a terabyte of data, even if they’re successful. We’re using jackhammers to drive finish nails.

> As an industry, we’ve become absolutely obsessed with “scale”. Seemingly at the expense of all else, like simplicity, ease of maintenance, and reducing developer cognitive load.

src: https://adriano.fyi/posts/2023-09-24-choose-postgres-queue-technology

But My Elasticity?

A common argument is that small data doesn’t scale, so if your app suddenly goes viral, it breaks. That’s true in theory, but in practice it most likely doesn’t apply.

Anyway, some people say it’s a good principle to design/test for ~10x your current scale. That doesn’t sound like a bad idea, but there’s one big caveat: how much headroom 2x, 5x or 10x actually buys you depends heavily on your company’s current growth rate.

See how many years of consistent growth it takes to reach 10x your current scale:

| annual growth | years to hit 10× scale |
| --- | --- |
| 10 % | 24.16 |
| 25 % | 10.32 |
| 50 % | 5.68 |
| 75 % | 4.11 |
| 100 % | 3.32 |
| 150 % | 2.51 |
| 200 % | 2.10 |
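The table above is just compound growth solved for time: years to 10x = ln(10) / ln(1 + g). A tiny sketch that reproduces it:

```python
import math

# Years of compound growth at annual rate g needed to reach 10x the current scale.
for g in (0.10, 0.25, 0.50, 0.75, 1.00, 1.50, 2.00):
    years = math.log(10) / math.log(1 + g)
    print(f"{g:>4.0%} annual growth -> {years:.2f} years to 10x")
```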

2. Modern Hardware is (super) good 👑

In my experience, a lot of senior engineers don’t seem to realize how much faster and cheaper hardware has become.[^1]

SSDs

A 2022-era NVMe SSD (an ~8 TB Samsung PM1743, costing around $1,000) can do 2.5 million random 4 KB reads per second and 13 GB/s sequential reads.

The newest 2025-era SSDs (Micron 9650) are said to roughly double that: ~28 GB/s sequential reads and 5.5 million random reads per second.

Just 10 years ago, a 2015-era Samsung 950 PRO could do 2.5 GB/s sequential reads and 300k random reads per second, and it capped out at 512 GB of capacity (~$350).

So here’s what changed:

  • price went down ~5.6x per TB ($700/TB vs $125/TB)
  • you can store 30x more on a single disk (512 GB vs 15.36 TB)
  • each disk is 11x faster in sequential reads (2.5 GB/s vs 28 GB/s) and 18x faster in random reads (300k vs 5.5 million)
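To make those numbers concrete, here’s a rough back-of-envelope sketch of full-scan times at PM1743-class sequential throughput - assuming a purely I/O-bound scan that actually hits the advertised rate, which real queries rarely do:

```python
# Back-of-envelope: how long a full sequential scan takes at modern SSD speeds.
# Assumes the workload is purely I/O-bound at the advertised throughput.
seq_read_gb_s = 13  # 2022-era Samsung PM1743, sequential reads
for table_gb in (11, 100, 1_000):
    print(f"{table_gb:>5} GB table -> ~{table_gb / seq_read_gb_s:.1f} s to scan")
```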

CPUs

Modern CPUs are equally absurd. The newest 2025-era AMD EPYC “Turin Dense” (~$15,000) has 192 cores (!!!) and 384 threads. You can install 2x of these on a motherboard, reaching 384 cores / 768 threads for ~$30k.

It’s hard to compare CPUs, especially 10 years apart, but here’s a rough attempt. 10 years ago, a comparable monster setup would have been 8x Intel Xeon E7-8890 v3 (~$7k each), giving you 144 cores and 288 threads for ~$56k. This is an awfully naive comparison, because an 8-socket Xeon rig introduces a ton of additional costs in cooling, power, the server chassis, and rarer, more exotic motherboards and interconnects. But even if we stay hyper-conservative and ignore all of that, the difference is staggering.

So here’s what changed:

  • core count went up 2.6x (144 vs 384)
  • price went down (at least) 5x per core ($389 vs $78)
  • each Turin core is also probably 2x-2.5x faster than the Intel one due to architectural improvements over the years
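The per-core prices in that list are just total system cost divided by core count; a quick sketch of the arithmetic, using the rough prices quoted above and ignoring the extra chassis/cooling/interconnect costs of the Xeon rig:

```python
# Rough cost-per-core comparison from the figures quoted above.
xeon_2015  = {"cores": 8 * 18,  "price": 8 * 7_000}    # 8x Xeon E7-8890 v3
turin_2025 = {"cores": 2 * 192, "price": 2 * 15_000}   # 2x EPYC "Turin Dense"

for name, s in (("2015 8x Xeon", xeon_2015), ("2025 2x Turin", turin_2025)):
    print(f"{name}: {s['cores']} cores at ~${s['price'] / s['cores']:.0f}/core")
```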

Parting Message 👋

Distributed systems are more overkill than ever.

You should be thinking about which technologies you know well, are good enough, and are the least toilsome for the task at hand.

And you should know that modern hardware can most likely handle your workload just fine.



Footnotes

[^1]: I don’t blame them - I didn’t know either! Everything is so well-abstracted behind the cloud today that it’s perfectly reasonable not to care. It’s easy not to re-check your priors after learning something a decade ago and never being challenged on it.