Data Science

Data Colonialism: Who Owns the Raw Material?

Kenyan workers label traumatic content for $2/hour to train AI for Silicon Valley. The Global South generates data whose value it never sees. Who owns your digital labor?

Hylē Editorial

Kenyan workers were paid $1–3 per hour to label traumatic content — violence, child abuse, extremism — to train AI safety filters for OpenAI. According to a 2023 Time investigation, employees at Sama, an outsourcing firm in Nairobi, reviewed thousands of disturbing images daily, many reporting lasting psychological trauma. The data extracted from the Global South powers systems the Global South cannot afford to use. This is not a metaphor. It is the operating logic of modern artificial intelligence.

The extraction runs deeper than underpaid labor. Every day, 5.3 billion internet users generate approximately 2.5 quintillion bytes of data. Roughly 60% of this digital raw material originates from users in Asia, Africa, and Latin America. Yet 83% of AI patent filings in 2023 came from companies headquartered in just three jurisdictions: the United States, China, and the European Union. The asymmetry is structural: the Global South mines the data; the Global North owns the refining infrastructure.

Data colonialism operates through three interconnected mechanisms that mirror historical colonial resource extraction with disturbing precision.

Mechanism 1: Digital Enclosure

Platform corporations have enclosed the digital commons through proprietary infrastructure. When a farmer in rural India uses WhatsApp to coordinate crop sales, Meta captures behavioral data, network graph information, and metadata. The farmer receives a free communication tool; Meta receives training data for recommendation algorithms, language models, and advertising optimization systems.

The economic value differential is staggering. A 2023 study by the Oxford Internet Institute estimated that the average American data laborer generates $240 annually in platform revenue. By contrast, users in emerging markets generate roughly $47 per person — but these same markets represent 88% of global internet user growth. The Global South is not peripheral to the data economy; it is its primary growth engine.

[!INSIGHT] The extraction model transforms users into unwitting data miners. Unlike coal or copper, the resource being extracted is human attention, behavior, and creative output — harvested at near-zero marginal cost through platforms that appear free.

Mechanism 2: Labor Arbitrage in the AI Supply Chain

The OpenAI-Sama case exemplifies a broader pattern. AI training requires massive datasets labeled by human workers. This annotation work — tagging images, categorizing text, identifying harmful content — is labor-intensive, mentally taxing, and overwhelmingly outsourced.

A 2022 audit of AI annotation platforms found:

| Location | Average Hourly Wage (USD) | Tasks Performed |
| --- | --- | --- |
| Kenya (Sama) | $1.32–2.00 | Content moderation, image tagging |
| Philippines (Remotasks) | $0.80–1.50 | Image annotation, transcription |
| India (iMerit) | $2.50–4.00 | Data labeling, sentiment analysis |
| United States (Scale AI) | $15.00–25.00 | Quality assurance, edge cases |

The wage differential reflects not skill differences but geographic arbitrage. A worker in Nairobi labeling child exploitation content performs identical cognitive labor to a San Francisco contractor, yet earns 8% of the wage. The psychological cost — secondary trauma, PTSD symptoms, anxiety disorders — falls entirely on the worker. The economic value accrues to shareholders in Menlo Park.
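The arbitrage can be checked with simple arithmetic. This sketch takes the midpoint of each wage band in the table above (the midpoints are my reading of the bands, not figures from the audit itself):

```python
# Midpoints of the hourly wage bands reported in the 2022 audit (USD).
wages = {
    "Kenya (Sama)": (1.32 + 2.00) / 2,
    "Philippines (Remotasks)": (0.80 + 1.50) / 2,
    "India (iMerit)": (2.50 + 4.00) / 2,
    "United States (Scale AI)": (15.00 + 25.00) / 2,
}

us_wage = wages["United States (Scale AI)"]
for location, wage in wages.items():
    # Each location's wage as a fraction of the US rate.
    print(f"{location}: ${wage:.2f}/hr = {wage / us_wage:.0%} of the US rate")
```

The Kenyan midpoint of $1.66 against the US midpoint of $20.00 yields the roughly 8% figure cited above.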

*"The content moderators are the first line of defense in AI safety, but they are treated as disposable inputs in a production process."*
— Dr. Nanjira Sambuli, digital equality researcher

Mechanism 3: The Usage Gap

The final component of data colonialism is structural exclusion from the value created. ChatGPT Plus costs $20 monthly — approximately 22% of Kenya's average monthly wage. Training a system at GPT-4's scale requires computational resources beyond the reach of virtually any sub-Saharan African institution. The knowledge extracted from Kenyan Swahili speakers, Nigerian Pidgin writers, and Indian Hindi speakers trains models that remain economically out of reach for those populations.
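The affordability gap can be made concrete. Working back from the figures above, $20 being 22% of Kenya's average monthly wage implies a wage near $91; the US wage used for comparison is an illustrative assumption, not a figure from the source:

```python
PRICE_USD = 20.00  # ChatGPT Plus monthly price

# Average monthly wages (USD). Kenya's is derived from the article's 22%
# figure; the US value is an illustrative assumption for contrast.
monthly_wage = {
    "Kenya": PRICE_USD / 0.22,   # ~$91, implied by the 22% share
    "United States": 4600.00,    # assumed for comparison
}

for country, wage in monthly_wage.items():
    # Subscription price as a share of the average monthly wage.
    share = PRICE_USD / wage
    print(f"{country}: ${PRICE_USD:.0f} is {share:.1%} of the average monthly wage")
```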

This creates a closed loop:

  1. Global South users generate behavioral and linguistic data
  2. Platforms capture and aggregate this data at zero marginal cost
  3. AI labs in California and Beijing train models on extracted data
  4. Resulting products are priced beyond Global South reach
  5. Extracted value compounds in Northern economies
  6. Global South receives no compensation, representation, or access

[!NOTE] Language models exemplify this extraction. GPT-4's training corpus is undisclosed, but its predecessor GPT-3 drew on a corpus of roughly 500 billion tokens of text. Approximately 30% of internet content is non-English, yet non-Western languages receive proportionally inferior model performance. The data is extracted; the value is not returned.

The European Union's General Data Protection Regulation (GDPR), implemented in 2018, represents the most ambitious attempt to assert data sovereignty. Yet its protections remain geographically and economically bounded.

What GDPR Achieves

GDPR establishes critical rights:

  • Right to Access: Users can request all data a company holds about them
  • Right to Erasure: Users can demand deletion (with exceptions)
  • Consent Requirements: Explicit opt-in for data processing
  • Portability: Users can transfer data between services
  • Breach Notification: Mandatory disclosure within 72 hours

These protections cover 447 million EU residents. They do not extend to the 1.4 billion people in Africa or the 4.7 billion in Asia.

The Jurisdictional Gap

Global data flows operate through a patchwork of bilateral agreements and corporate self-regulation. When a Brazilian user's data routes through Irish servers (common under Meta's corporate structure), which jurisdiction applies? The answer is deliberately opaque.

Furthermore, GDPR focuses on individual consent rather than collective ownership. A Kenyan user clicking "I Agree" on a terms-of-service document has technically consented — but under conditions of radical information asymmetry. No user reads 8,000-word legal documents. No individual can negotiate terms with a trillion-dollar platform.

[!INSIGHT] Consent frameworks presuppose bargaining power. A subsistence farmer in Bangladesh has the same formal right to refuse data collection as a Stanford professor — but vastly different capacity to understand implications, seek alternatives, or absorb the cost of exclusion from digital services.

Emerging Resistance: Data Sovereignty Movements

Resistance to data colonialism is mounting from three directions.

State-Led Data Localization

As of 2024, 62 countries have enacted some form of data localization law requiring citizen data to be stored within national borders. China's Cybersecurity Law (2017) mandates domestic storage of "important data." India's Digital Personal Data Protection Act (2023) enables government restrictions on cross-border flows. Brazil's LGPD includes transfer restrictions to non-adequate jurisdictions.

The logic is intuitive: if data is the new oil, nations should control their own reserves. Yet localization carries tradeoffs. Domestic data centers require massive capital investment. Fragmented data pools may reduce AI model quality. And authoritarian governments can exploit localization for surveillance rather than citizen protection.

Collective Data Trusts

More promising are experiments in collective data ownership. The Mesa Data Trust in Arizona represents 5,000 gig workers negotiating collectively with platforms over data terms. In India, the DATA (Data as a Trust Asset) Trust model proposes that community-generated data — agricultural patterns, local knowledge, cultural archives — be held in trust with proceeds distributed to originators.

The mathematical framework for data trusts draws from portfolio theory:

$$\text{Expected Return}_i = \sum_{j=1}^{n} w_j \cdot r_j - \text{Transaction Costs}$$

Where $w_j$ represents the weight of data source $j$ in the aggregate pool, and $r_j$ represents the revenue generated by that data type. The insight: individual bargaining power approaches zero, but collective pools approach monopoly power in niche data categories.
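As a toy illustration of the formula, assuming hypothetical weights and per-type revenues (all numbers invented for the sketch):

```python
# Hypothetical data trust pool: data type -> (weight w_j, revenue r_j in USD).
pool = {
    "agricultural_patterns": (0.5, 10_000.0),
    "local_knowledge":       (0.3,  6_000.0),
    "cultural_archives":     (0.2,  4_000.0),
}
TRANSACTION_COSTS = 500.0  # assumed administrative overhead

# Expected return: sum over j of w_j * r_j, minus transaction costs.
expected_return = sum(w * r for w, r in pool.values()) - TRANSACTION_COSTS
print(f"Expected return to the trust: ${expected_return:,.2f}")
```

The point of pooling is that the aggregate figure is negotiable while any single contributor's share is not: a platform can ignore one user's data but not a category monopoly.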

Technical Alternatives: Federated Learning and Differential Privacy

Technologists are developing architectures that preserve value while limiting extraction:

  1. Federated Learning: Models train on local devices without raw data transfer. A hospital in Lagos could contribute to medical AI without patient records leaving Nigerian servers.

  2. Differential Privacy: Mathematical guarantees that individual records cannot be reverse-engineered from aggregate statistics. The formula for $(\epsilon, \delta)$-differential privacy:

$$\Pr[M(D) \in S] \leq e^{\epsilon} \cdot \Pr[M(D') \in S] + \delta$$

Where $D$ and $D'$ differ by one record, ensuring plausible deniability for any individual contribution.

  3. Data Unions: Organizations like Streamr enable users to pool and sell their data collectively, capturing value that platforms currently extract for free.
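A minimal sketch of the federated-learning idea from point 1: each site fits a model on its own data, and only model parameters, never raw records, leave the site. This toy version averages the slopes of per-site least-squares fits (data invented for illustration):

```python
# Toy federated averaging: each site computes a local least-squares slope;
# only the slope (a model parameter), not the raw data, is shared.
def local_slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Invented per-site datasets; in practice these never leave the site.
sites = [
    ([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]),   # site A
    ([1, 2, 3, 4], [1.9, 4.1, 5.8, 8.2]),   # site B
]

# Aggregate: average local parameters, weighted by each site's sample count.
total = sum(len(xs) for xs, _ in sites)
global_slope = sum(local_slope(xs, ys) * len(xs) / total for xs, ys in sites)
print(f"Aggregated slope: {global_slope:.2f}")
```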
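And a sketch of the guarantee from point 2, in the δ = 0 case: the classic Laplace mechanism adds noise scaled to sensitivity/ε to a count query, so no individual record can be confidently inferred from the released statistic (the records and ε are invented for illustration):

```python
import random

def laplace_noise(scale):
    # The difference of two iid Exponential(rate=1/scale) draws
    # is a Laplace(0, scale) variate.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon):
    """Release a count under (epsilon, 0)-differential privacy.
    A count query has sensitivity 1: changing one record moves it by at most 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon)

# Invented records: ages of contributors to a pooled dataset.
ages = [23, 31, 45, 52, 29, 38, 61, 27]
noisy = private_count(ages, lambda a: a >= 30, epsilon=0.5)
print(f"Noisy count of contributors aged 30+: {noisy:.1f}")
```

Smaller ε means more noise and a stronger guarantee; the released value stays close to the true count of 5 on average while giving every contributor plausible deniability.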

[!NOTE] Technical solutions alone cannot resolve power asymmetries. Federated learning still concentrates model ownership in Northern labs. Differential privacy protects individual privacy but does not address collective compensation. Technology is necessary but insufficient.

The Path Forward: From Extraction to Partnership

The data colonialism thesis is not an argument against AI development or global data flows. It is an argument for recognizing the current system's predatory structure and building alternatives.

Three principles could guide a more equitable data economy:

  1. Transparent Valuation: Platforms should disclose the economic value derived from user data by region, enabling informed consent and potential compensation negotiations.

  2. Computational Access: AI systems trained on global data should be accessible at regional price parity, not uniform global pricing that excludes the Global South.

  3. Collective Representation: Data subjects should have collective bargaining power through trusts, unions, or regulatory bodies — not merely individual click-to-consent mechanisms.

The Kenyan workers who labeled trauma for $2 per hour built safety systems that protect users in San Francisco and London. They are owed more than psychological counseling. They are owed a share of the value they created.

Key Takeaway: Data is not just the new oil — it inherits oil's colonial extraction patterns. The Global South mines the raw material (behavioral data, linguistic content, annotation labor) while the Global North owns the refineries (AI labs, cloud infrastructure, platform monopolies). Breaking this cycle requires collective data ownership, transparent valuation, and computational access — not merely consent checkboxes and localization laws.

Sources: Time Magazine investigation "Inside OpenAI's African Sweatshop" (2023); Oxford Internet Institute "The Global Data Economy" (2023); GDPR Official Text (2018); UNCTAD Digital Economy Report (2024); Sambuli, N. "Digital Equality in the Platform Age" (2022); Dwork & Roth "The Algorithmic Foundations of Differential Privacy" (2014).
