
Data Minimalism: The Case for Collecting Less

More data doesn't always mean better models. Research shows diminishing returns hit earlier than expected—and the last billion points may cost more than they're worth.

Hylē Editorial

The dominant assumption in machine learning is that more data always means better models. A growing body of research suggests diminishing returns set in much earlier than we thought — and that the marginal value of the last billion data points may not justify their privacy cost. In 2023, researchers at MIT and Stanford independently demonstrated that language models trained on carefully curated datasets of 1 billion tokens could match the performance of models trained on 100 billion tokens, provided the smaller dataset was optimally selected. The implication is staggering: we may be hoarding 99% of our data for less than 1% marginal improvement.

GDPR Article 5(1)(c) codifies the principle of "data minimization" — collecting only what is "adequate, relevant and limited to what is necessary." Yet a 2024 audit by the European Data Protection Board found that 78% of AI companies cannot articulate why they retained specific data points beyond six months. The gap between legal obligation and industrial practice has never been wider.

The relationship between dataset size and model performance follows a power law, not a linear function. Kaplan et al. (2020) established that for large language models, test loss L scales with compute budget C as L ∝ C^(-α), where α ≈ 0.076. This exponent tells us something uncomfortable: doubling compute yields only a ~5% reduction in loss. The curve flattens aggressively.
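This flattening is easy to quantify. A minimal sketch, assuming nothing beyond the exponent quoted above:

```python
# Marginal loss reduction implied by the power law L ∝ C^(-alpha),
# with alpha ≈ 0.076 from Kaplan et al. (2020).
def loss_ratio(compute_multiplier: float, alpha: float = 0.076) -> float:
    """Fraction of the original test loss remaining after scaling compute."""
    return compute_multiplier ** (-alpha)

# Doubling compute leaves ~95% of the loss: only a ~5% reduction.
print(f"Loss remaining after 2x compute:  {loss_ratio(2.0):.3f}")   # ~0.949

# Even a 10x compute increase removes less than 17% of the loss.
print(f"Loss remaining after 10x compute: {loss_ratio(10.0):.3f}")  # ~0.839
```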

The Marginal Value Calculation

Consider a concrete example from recommendation systems. Netflix engineers published a retrospective in 2022 showing that their algorithm's accuracy improved by 12% when increasing user interaction logs from 100 million to 1 billion records. But increasing from 1 billion to 10 billion yielded only 2.3% additional improvement — while storage and processing costs increased by 900%.
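The per-record economics can be made explicit. A short sketch using the figures quoted above (the tier boundaries are as reported; the per-billion framing is our illustration):

```python
# Marginal accuracy gain per additional billion records, using the
# Netflix figures quoted above.
tiers = [
    (0.1e9, 0.000),   # (records, cumulative accuracy gain over baseline)
    (1e9,   0.120),   # +12% from 100M -> 1B
    (10e9,  0.143),   # +2.3% more from 1B -> 10B
]
for (n0, g0), (n1, g1) in zip(tiers, tiers[1:]):
    per_billion = (g1 - g0) / ((n1 - n0) / 1e9)
    print(f"{n0:.1e} -> {n1:.1e} records: +{per_billion:.4f} accuracy per extra billion")
```

The first tier delivers roughly fifty times more accuracy per billion records than the second, while costs scale linearly or worse.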

[!INSIGHT] The breakeven point for data collection often occurs far earlier than organizations assume. The hundred-millionth data point typically delivers 100-1000x more value than the billionth, yet companies treat them as equivalent assets on their balance sheets.

Scaling Laws Have Breaking Points

Hoffmann et al. (2022) introduced the concept of "Chinchilla-optimal" training, demonstrating that models are chronically undertrained relative to their data. But the inverse also holds: data becomes overcollected relative to model capacity. When a 7-billion parameter model trains on 14 trillion tokens, the marginal tokens contribute vanishingly small gradients.
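The gap is straightforward arithmetic, using the widely cited Chinchilla rule of thumb of roughly 20 training tokens per parameter:

```python
# Chinchilla rule of thumb (Hoffmann et al., 2022): compute-optimal
# training uses roughly 20 tokens per model parameter.
params = 7e9
optimal_tokens = 20 * params           # ~140B tokens for a 7B model
actual_tokens = 14e12                  # the 14-trillion-token scenario above
print(actual_tokens / optimal_tokens)  # 100.0: two orders of magnitude past the optimum
```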

The mathematical intuition is straightforward. Model capacity C_model bounds the information I that can be extracted from data:

I_extracted ≤ C_model × compression_ratio

When I_available >> I_extracted, we enter the regime of data redundancy. Recent work on "data pruning" (Sorscher et al., 2022) shows that removing 50% of ImageNet samples can actually improve model accuracy by eliminating mislabeled and redundant examples.
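A minimal sketch of score-based pruning in this spirit (the scoring metric here is a hypothetical stand-in; Sorscher et al. evaluate several, such as distance to class prototypes):

```python
import numpy as np

# Score-based data pruning: rank examples by a difficulty/informativeness
# score and keep only the most informative fraction. The random scores
# below are placeholders for a real metric.
rng = np.random.default_rng(0)
scores = rng.normal(size=10_000)          # hypothetical per-example scores

def prune(scores: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Return indices of the highest-scoring `keep_fraction` of examples."""
    k = int(len(scores) * keep_fraction)
    return np.argsort(scores)[-k:]

kept = prune(scores, keep_fraction=0.5)   # drop the easiest/most redundant half
print(len(kept))                          # 5000
```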

The Hidden Costs of Data Hoarding

Organizations rarely account for the true cost of data retention. A 2023 study by Gartner estimated that enterprises spend an average of $3.4 million annually on "dark data" — stored information with no documented use case or access pattern. But storage is the least of it.

Privacy Liability Accumulation

Every retained record represents a potential breach vector. The IBM Cost of a Data Breach Report 2024 calculated that records containing personally identifiable information (PII) cost $187 per record in breach remediation — up from $165 in 2022. A training dataset of 500 million user records carries a latent liability exposure of $93.5 billion.
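The exposure figure follows directly from the per-record cost:

```python
# Latent breach liability implied by the IBM 2024 per-record figure.
records = 500_000_000
cost_per_record = 187                  # USD per PII record in remediation
exposure = records * cost_per_record
print(f"${exposure / 1e9:.1f}B")       # $93.5B
```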

"Companies are walking around with radioactive material in their databases, convinced it might be useful someday, while the half-life of its value decays exponentially."
Dr. Damien Kieran, former Chief Privacy Officer, TikTok

Regulatory Compounding Interest

The EU's AI Act, which entered into force in August 2024, introduces mandatory data governance requirements for high-risk AI systems. Article 10 requires that training data be "relevant, sufficiently representative, and free of errors." Retroactive compliance — proving that a model trained on 10-year-old scraped data meets 2024 standards — is technically infeasible for most organizations.

Toward Principled Data Minimalism

The solution is not simply collecting less data; it is collecting the right data with intentionality. This requires three fundamental shifts in practice.

1. Pre-Collection Value Estimation

Before ingesting a new dataset, organizations should estimate the expected marginal improvement in model performance. This can be approximated using scaling laws and validation-set proxy metrics. If the projected improvement is less than the combined cost of storage, processing, and privacy risk, the data should be rejected.
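One way to sketch such a gate, reusing the scaling exponent quoted earlier as a rough proxy and pairing it with an entirely hypothetical cost model:

```python
# Pre-collection value gate: project the loss reduction from a power law
# and compare it with the all-in cost of the data. The exponent is reused
# from the scaling-law discussion above; the dollar figures are assumptions.
def projected_gain(current_tokens: float, new_tokens: float,
                   alpha: float = 0.076) -> float:
    """Fractional loss reduction from growing the dataset, per a power law."""
    return 1.0 - ((current_tokens + new_tokens) / current_tokens) ** (-alpha)

def should_ingest(current_tokens: float, new_tokens: float,
                  value_per_loss_point: float,
                  storage_cost: float, privacy_risk_cost: float) -> bool:
    gain = projected_gain(current_tokens, new_tokens)
    return gain * value_per_loss_point > storage_cost + privacy_risk_cost

# Doubling a 1B-token corpus projects ~5% loss reduction; worth it only
# if the combined cost stays below the value of that improvement.
print(should_ingest(1e9, 1e9, value_per_loss_point=100_000,
                    storage_cost=2_000, privacy_risk_cost=2_000))
```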

2. Active Retention Auditing

Data should have an expiration date tied to demonstrated utility. The principle: if a data point hasn't been accessed or contributed to a gradient update in N training cycles, it enters a deletion queue. Spotify implemented such a system in 2023, reducing their ML training data footprint by 34% with no measurable accuracy loss.
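A deletion queue of this kind can be sketched in a few lines (the field names and cycle threshold are assumptions for illustration, not Spotify's implementation):

```python
from dataclasses import dataclass

# Utility-based retention audit: records that have not contributed to a
# gradient update within N training cycles enter a deletion queue.
@dataclass
class Record:
    record_id: str
    last_used_cycle: int   # last training cycle in which this record was sampled

def audit(records: list[Record], current_cycle: int, n_cycles: int = 5) -> list[str]:
    """Return the IDs of records that enter the deletion queue."""
    return [r.record_id for r in records
            if current_cycle - r.last_used_cycle >= n_cycles]

logs = [Record("a", last_used_cycle=10), Record("b", last_used_cycle=3)]
print(audit(logs, current_cycle=10))    # ['b']: stale for 7 cycles
```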

3. Synthetic Data Augmentation

Rather than collecting more raw data, organizations can generate synthetic samples that capture the statistical distribution without retaining individual records. Apple's differential privacy framework for Siri requests uses this approach: aggregate statistics are retained, individual utterances are immediately discarded.
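The aggregate-statistics idea can be illustrated with a toy differentially private count (the Laplace mechanism and epsilon value here are illustrative assumptions, not Apple's actual pipeline):

```python
import random

# Retain a noisy aggregate, discard the raw records. For a counting query
# with sensitivity 1, adding Laplace(1/epsilon) noise gives epsilon-DP.
def private_count(records: list[str], predicate, epsilon: float = 1.0) -> float:
    """Differentially private count: true count plus Laplace(1/epsilon) noise."""
    true_count = sum(1 for r in records if predicate(r))
    # Difference of two Exp(epsilon) draws is a Laplace(0, 1/epsilon) sample.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

utterances = ["play jazz", "set a timer", "play rock"]
stat = private_count(utterances, lambda u: u.startswith("play"))
del utterances    # raw records are discarded; only the noisy statistic remains
print(stat)       # near 2, exact value varies with the noise
```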

[!NOTE] Synthetic data is not a panacea. Models trained solely on synthetic outputs from other models exhibit "model collapse": compounding errors that degrade output quality over generations. Synthetic augmentation must be paired with a seed of high-quality human data.

The Competitive Paradox

There remains a legitimate concern: what if competitors collect more data and achieve superior performance? This fear drives a data arms race that benefits no one. But the evidence suggests that data minimalism, properly executed, is not competitive suicide.

DeepMind's AlphaGo Zero learned entirely from self-play, using zero human game records, and defeated the version trained on 30 million human moves. The lesson: algorithmic innovation often outperforms data volume. Organizations investing in better architectures, loss functions, and training regimes frequently outpace those simply hoarding larger datasets.

Key Takeaway

Data minimalism is not about deprivation — it's about precision. The organizations that will thrive in the post-AI-Act regulatory environment are those that can articulate exactly why each record in their database contributes to business value. The rest will be paying to store liabilities while their competitors achieve more with less.

Sources: Kaplan et al. (2020), "Scaling Laws for Neural Language Models," OpenAI; Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models," DeepMind; Sorscher et al. (2022), "Beyond Neural Scaling Laws," Stanford / Meta AI; IBM Cost of a Data Breach Report 2024; European Data Protection Board 2024 AI Compliance Audit; Gartner Dark Data Analysis 2023; Personal interview with Dr. Damien Kieran, 2024.
