
Data Infrastructure

Data is the raw material of intelligence.

01 / Thesis

“Every AI application sits downstream of a data pipeline. When that pipeline drifts, breaks, or delivers stale data, the intelligence built on top of it fails in ways that are often silent and costly.”

The AI era has a raw material problem. Foundation models are commoditizing; the bottleneck is clean, structured, timely data. Hyperscaler capital expenditure of $660-690 billion is creating demand for AI-ready data estates. The ETL market is projected to reach $18.6 billion by 2030. Data labeling is projected to grow from $3.7 billion to $17 billion by 2030. Streaming TAM exceeds $100 billion, according to Confluent estimates.

The convergence of AI adoption and data infrastructure demand is creating a generational investment opportunity. Every enterprise that deploys AI must first solve its data problem. The pipelines must be reliable. The quality must be measurable. The freshness must match the application's latency requirements. This is not a nice-to-have. It is a prerequisite.

We invest in the data infrastructure layer: the pipelines, quality contracts, streaming systems, and labeling platforms that sit upstream of every AI application. The model layer will continue to commoditize. The data layer will continue to appreciate.

02 / Landscape

Current landscape and key trends

03 / Sub-verticals

Where we invest within data infrastructure

Data Contracts and Quality

Monte Carlo, Soda, Great Expectations, and Acceldata ($106 million raised) monitor data quality. Sifflet and Anomalo compete in anomaly detection. Data contracts formalize expectations between producers and consumers; a minimal sketch follows below. As AI applications come to depend on data quality, the monitoring layer becomes a gating function for production deployment.

[Monte Carlo] [Soda] [Acceldata]
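
To make the contract idea concrete, here is a minimal sketch of an executable data contract. It is written from scratch rather than against any vendor's API; the schema, field names, and thresholds are hypothetical.

```python
# Illustrative only: a hand-rolled data contract check, not any specific vendor's API.
# The schema, field names, and thresholds below are hypothetical.
from dataclasses import dataclass


@dataclass
class ContractViolation:
    field: str
    reason: str


# The "contract": expectations the producer promises and downstream consumers rely on.
CONTRACT = {
    "order_id":   {"type": int,   "nullable": False},
    "amount_usd": {"type": float, "nullable": False, "min": 0.0},
    "country":    {"type": str,   "nullable": True},
}


def check_batch(rows: list[dict]) -> list[ContractViolation]:
    """Validate a batch of records against the contract before it ships downstream."""
    violations = []
    for i, row in enumerate(rows):
        for field, rules in CONTRACT.items():
            value = row.get(field)
            if value is None:
                if not rules["nullable"]:
                    violations.append(ContractViolation(field, f"row {i}: null not allowed"))
                continue
            if not isinstance(value, rules["type"]):
                violations.append(ContractViolation(field, f"row {i}: expected {rules['type'].__name__}"))
            elif "min" in rules and value < rules["min"]:
                violations.append(ContractViolation(field, f"row {i}: below minimum {rules['min']}"))
    return violations


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "amount_usd": 19.99, "country": "DE"},
        {"order_id": 2, "amount_usd": -5.00, "country": None},  # violates the minimum rule
    ]
    for v in check_batch(batch):
        print(f"{v.field}: {v.reason}")
```

Commercial platforms layer scheduling, lineage, alerting, and anomaly detection on top of checks like these; the contract itself is simply a machine-readable promise that can fail a pipeline.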

Streaming

Confluent (acquired by IBM for $11 billion) validated the category. Redpanda offers a Kafka-compatible alternative and claims up to 10x lower latency. Tinybird, Materialize, and RisingWave compete in real-time analytics. WarpStream provides a cost-optimized streaming layer built on object storage. Real-time data is becoming the default architecture for AI-powered applications.

[Confluent] [Redpanda] [Tinybird]

Catalogs and Governance

Databricks Unity Catalog is becoming the default for Databricks users. Collibra, Alation, Atlan, and Secoda compete in standalone data cataloging and governance. As regulatory requirements grow and data estates expand, the catalog layer becomes essential for compliance and discoverability.

[Unity Catalog] [Atlan] [Secoda]

Synthetic Data

Gretel.ai and Mostly AI generate synthetic datasets for privacy-preserving AI training. Synthesis AI focuses on computer vision training data. The market is projected to grow from $0.5 billion to $2.7 billion by 2030. As real data becomes scarce and regulated, synthetic data provides a scalable alternative for model training.

[Gretel.ai] [Mostly AI]

Labeling and Annotation

Scale AI is the category leader, with a $20 billion+ valuation and $870 million in 2024 revenue. Labelbox ($190 million raised, HIPAA and SOC 2 compliant) serves regulated industries. Surge AI and SuperAnnotate compete on quality and specialization. Meta's $14.3 billion investment in Scale AI signals that labeling is a strategic bottleneck for frontier AI development.

[Scale AI] [Labelbox] [Surge AI]

04 / Signals

“Rows processed is a better signal than early ARR.”

Workflow gravity

The tool becomes a gating check in the production process. When data cannot move to production without passing through your system, you have workflow gravity that creates structural retention.
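
As an illustration of that gating behavior, the sketch below shows a quality gate wired into a deploy step; the function names and check logic are hypothetical stand-ins, not a real product's interface.

```python
# Illustrative only: a data quality gate as a blocking step in a deployment pipeline.
# load_staged_batch and run_quality_checks are hypothetical stand-ins.
import sys


def load_staged_batch() -> list[dict]:
    # Stand-in for reading the batch that is about to be promoted to production.
    return [{"order_id": 1, "amount_usd": 19.99, "country": "DE"}]


def run_quality_checks(rows: list[dict]) -> bool:
    # Stand-in for a check suite; returns True only if every check passes.
    return all(row.get("amount_usd", 0) >= 0 for row in rows)


if __name__ == "__main__":
    if not run_quality_checks(load_staged_batch()):
        print("data quality gate failed: blocking promotion to production")
        sys.exit(1)  # a non-zero exit fails the CI/CD step, so the data never ships
    print("data quality gate passed")
```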

Open-source origins with commercial conversion

MotherDuck raised $100 million to build a commercial service on top of DuckDB. Disciplined open-source-to-commercial conversion is the most capital-efficient go-to-market motion in data infrastructure. Community adoption reduces CAC. Commercial features capture value.

Protocol-layer potential

Defining interfaces and standards, not just building products. The companies that set the standard for how data contracts are enforced or how streaming APIs work will own the layer, not just occupy it.

AI-native architecture

Designed for unstructured data, vector embeddings, and semantic drift from the ground up. Data infrastructure built for the SQL era needs to be rebuilt for the AI era. The incumbents are adapting. The challengers are building natively.

Operator credibility in founding team

Former data platform leads who built internal systems at scale. The best data infrastructure founders have operational scars from managing petabyte-scale pipelines. They build for the failure modes they have lived through.

Usage at pipeline boundary

Rows processed, pipelines monitored, and data quality checks executed matter more than early ARR. Usage at the pipeline boundary is the leading indicator that the tool has become infrastructure rather than an experiment.

[Pipelines] [Data Contracts] [Streaming] [Governance] [Synthetic Data] [Labeling]

Building data infrastructure?

Get in Touch