Small Data

Small Data is a very exciting movement, one that is shifting the Overton window away from Big Data and toward much simpler, cheaper solutions.

Its main drivers are the following:

    1. organizations don’t use that much data
    2. hardware is getting really, really good

1. Orgs don’t use much data

Of queries that scan at least 1 MB, the median query scans about 100 MB. The 99.9th percentile query scans about 300 GB. Analytic databases like Snowflake and Redshift are “massively parallel processing” systems, but 99.9% of real world queries could run on a single large node. I did the analysis for this post using DuckDB, and it can scan the entire 11 GB Snowflake query sample on my Mac Studio in a few seconds. The small data hypothesis is true.

src: https://www.fivetran.com/blog/how-do-people-use-snowflake-and-redshift
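To give a flavor of what that single-node analysis looks like, here’s a minimal sketch using DuckDB’s Python API. The file and column names (query_logs.parquet, bytes_scanned) are hypothetical stand-ins; the actual Fivetran/Snowflake sample has its own schema.

```python
# Minimal sketch of a single-node "how big are queries really?" analysis.
# Assumes a hypothetical Parquet file of query logs with a bytes_scanned column.
import duckdb

result = duckdb.sql("""
    SELECT
        quantile_cont(bytes_scanned, 0.5)   AS p50_bytes,
        quantile_cont(bytes_scanned, 0.999) AS p999_bytes
    FROM 'query_logs.parquet'
    WHERE bytes_scanned >= 1000000          -- only queries scanning >= 1 MB
""")
print(result)
```

No cluster, no warehouse - one process scanning one file on one machine.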

When we think about new database architectures, we’re hypnotized by scaling limits. If it can’t handle petabytes, or at least terabytes, it’s not in the conversation. But most applications will never see a terabyte of data, even if they’re successful. We’re using jackhammers to drive finish nails.

> As an industry, we’ve become absolutely obsessed with “scale”. Seemingly at the expense of all else, like simplicity, ease of maintenance, and reducing developer cognitive load.

src: https://adriano.fyi/posts/2023-09-24-choose-postgres-queue-technology

Apache Kafka is frequently chosen as the go-to tool for pub/sub messaging, yet most of its users push only kilobytes to a few megabytes per second. Deploying Kafka to serve a workload that’s at most 5 MB/s in and 10 MB/s out is overkill. Not only that, it’s also expensive: some Kafka vendors would charge you $120k/year for that. I can get the same performance out of a single unoptimized (!!!) Postgres instance running on a beefy EC2 instance, and it’d cost… $17k/year.

src: [my own experience]
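If “just use Postgres” sounds hand-wavy, here’s a minimal sketch of a queue consumer built on SELECT ... FOR UPDATE SKIP LOCKED - one common pattern for replacing a broker at modest throughput, not the only one. The table and column names (jobs, id, payload, status) and the connection string are hypothetical.

```python
# Minimal sketch of a Postgres-backed job queue consumer.
# Table/column names and the connection string are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumed connection details

def handle(payload):
    print("processing", payload)  # placeholder for real work

def consume_one():
    with conn:  # commits on success, rolls back on exception
        with conn.cursor() as cur:
            # SKIP LOCKED lets many consumers poll concurrently
            # without blocking on each other's in-flight rows.
            cur.execute("""
                SELECT id, payload FROM jobs
                WHERE status = 'pending'
                ORDER BY id
                LIMIT 1
                FOR UPDATE SKIP LOCKED
            """)
            row = cur.fetchone()
            if row is None:
                return False  # queue is empty
            job_id, payload = row
            handle(payload)
            cur.execute("UPDATE jobs SET status = 'done' WHERE id = %s", (job_id,))
    return True
```

At a few MB/s, a loop like this on a single well-sized instance is usually plenty.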

But My Elasticity?

A common argument is that small data doesn’t scale, so if your app suddenly goes viral, it breaks. That’s true in theory, but in practice it most likely doesn’t apply.

Some people say it’s a good principle to design and test for ~10x your current scale. That’s not a bad idea, but there’s one big caveat: how much headroom a 2x, 5x or 10x margin actually buys you depends heavily on your company’s growth rate.

Here’s how many years of consistent growth it takes to reach 10x your current scale:

| annual growth | years to hit 10× scale |
| --- | --- |
| 10% | 24.16 y |
| 25% | 10.32 y |
| 50% | 5.68 y |
| 75% | 4.11 y |
| 100% | 3.32 y |
| 150% | 2.51 y |
| 200% | 2.10 y |
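These numbers are just compounding: at an annual growth rate g, it takes log(10) / log(1 + g) years to grow 10x. A quick way to reproduce the table:

```python
# Years of compounding at annual rate g needed to reach 10x current scale.
import math

for g in (0.10, 0.25, 0.50, 0.75, 1.00, 1.50, 2.00):
    years = math.log(10) / math.log(1 + g)
    print(f"{g:>4.0%} annual growth -> {years:.2f} years to 10x")
```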

2. Modern Hardware is (super) good 👑

In my experience, a lot of senior engineers don’t seem to realize how much faster and cheaper hardware has become.¹

You can fit far more on a single node than you might expect if your mental model is 10 years old.

SSDs

A 2022-era NVMe SSD (the Samsung PM1743, around $1,000 for ~8 TB) can do 2.5 million random 4 KB reads per second and 13 GB/s sequential reads.

The newest 2025-era SSDs (Micron P9650) are said to roughly double that: ~28 GB/s sequential reads and 5.5 million random reads per second.

Just 10 years ago, the 2015-era Samsung 950 PRO could do 2.5 GB/s sequential reads and 300k random reads per second, and it capped out at 512 GB of capacity (~$350).

So here’s what changed:

  • price per TB went down ~5.6x ($700/TB vs $125/TB)
  • a single disk stores 30x more (512 GB vs 15.36 TB, the PM1743’s largest capacity)
  • each disk is 11x faster in sequential reads (2.5 GB/s vs 28 GB/s) and 18x faster in random reads (300k vs 5.5 million IOPS)
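The multipliers above are just the raw ratios of the figures quoted earlier; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the SSD deltas quoted above (2015 vs 2022/2025 figures).
price_per_tb_2015, price_per_tb_2022 = 700, 125   # USD per TB
capacity_2015_gb, capacity_2025_gb = 512, 15_360  # GB per disk
seq_2015, seq_2025 = 2.5, 28                      # GB/s sequential reads
rand_2015, rand_2025 = 300_000, 5_500_000         # random read IOPS

print(f"price per TB:     {price_per_tb_2015 / price_per_tb_2022:.1f}x cheaper")  # ~5.6x
print(f"capacity:         {capacity_2025_gb / capacity_2015_gb:.0f}x larger")     # 30x
print(f"sequential reads: {seq_2025 / seq_2015:.1f}x faster")                     # ~11x
print(f"random reads:     {rand_2025 / rand_2015:.1f}x faster")                   # ~18x
```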

CPUs

Modern CPUs are equally absurd. The newest 2025-era AMD EPYC “Turin Dense” (~$15,000) has 192 cores (!!!) and 384 threads.

It’s hard to compare CPUs, especially 10 years apart, but here’s a rough attempt. Ten years ago, a comparable monster CPU would have been the Intel Xeon E7-8890 v3, with 18 cores & 36 threads. It cost about half as much, at ~$7,000.

So here’s what changed:

  • core count went up 10.6x (18 vs 192)
  • price per core went down ~5x ($388 vs $78)
  • each Turin core is also probably 2x-2.5x faster than the Intel one, thanks to a decade of architectural improvements
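Putting those figures together per socket (the 2x-2.5x per-core factor is the rough estimate above, not a measured benchmark):

```python
# Rough per-socket CPU comparison using only the figures quoted above.
# The 2.0x-2.5x per-core speedup is a rough estimate, not a measured benchmark.
old_cores, new_cores = 18, 192
old_price, new_price = 7_000, 15_000  # approximate list prices, USD

print(f"core count:     {new_cores / old_cores:.1f}x")                                  # ~10.7x
print(f"price per core: ${old_price / old_cores:.0f} -> ${new_price / new_cores:.0f}")  # ~$389 -> ~$78
for per_core_speedup in (2.0, 2.5):
    total = (new_cores / old_cores) * per_core_speedup
    print(f"per-socket throughput at {per_core_speedup}x per core: ~{total:.0f}x")      # ~21x-27x
```

In other words, a single modern socket plausibly does the work of 20+ of its 2015 counterparts.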

Parting Message 👋

I’m not suggesting you run everything on one node - you still need a minimum of distribution for high availability and durability. But I am suggesting that specialized, complex distributed systems are more overkill than ever.

You should be thinking about which technologies you know well, are good enough, and are the least toilsome for the task at hand.

And you should know that modern hardware can most likely handle your workload just fine.


References

Here are a bunch of links I find interesting regarding the Small Data movement:

Footnotes

  1. I don’t blame them - I didn’t know either! Everything is so well abstracted behind the cloud today that it’s perfectly reasonable not to care. It’s easy not to revisit your priors when you learned something a decade ago and have never been challenged on it since.