Data is the raw material of intelligence.
“Every AI application sits downstream of a data pipeline. When that pipeline drifts, breaks, or delivers stale data, the intelligence built on top of it fails in ways that are often silent and costly.”
The AI era has a raw-material problem. Foundation models have commoditized; the bottleneck is clean, structured, timely data. Hyperscaler capital expenditure of $660–690 billion creates demand for data-estate readiness. The ETL market is projected to reach $18.6 billion by 2030. Data labeling is projected to grow from $3.7 billion to $17 billion over the same period. Streaming TAM exceeds $100 billion, per Confluent's estimates.
The convergence of AI adoption and data infrastructure demand is creating a generational investment opportunity. Every enterprise that deploys AI must first solve its data problem. The pipelines must be reliable. The quality must be measurable. The freshness must match the application's latency requirements. This is not a nice-to-have. It is a prerequisite.
We invest in the data infrastructure layer: the pipelines, quality contracts, streaming systems, and labeling platforms that sit upstream of every AI application. The model layer will continue to commoditize. The data layer will continue to appreciate.
01. Fivetran and dbt Labs merged in an all-stock transaction with roughly $600 million in combined revenue. The consolidation validates the thesis that the data pipeline layer is maturing into essential infrastructure. The combined entity owns the extract-load-transform workflow end to end, creating the first integrated data platform at scale.
02. Confluent's Q3 2025 revenue hit $298.5 million, with Cloud representing 56% of subscription revenue. IBM's $11 billion acquisition validates real-time streaming as critical enterprise infrastructure. The thesis: batch processing is insufficient for AI applications that require real-time context. Streaming is becoming the default data architecture.
03. dbt model contracts, Soda, and Monte Carlo are extending data quality guarantees to unstructured data. Data contracts formalize the interface between data producers and consumers, creating accountability in the data supply chain. As AI applications come to depend on data quality, contracts become the enforcement mechanism.
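To make that mechanism concrete, here is a minimal, hypothetical sketch of contract enforcement in plain Python. The ColumnSpec and enforce_contract names are illustrative, not the API of dbt, Soda, or Monte Carlo; the point is that a producer's batch is validated against a declared schema before consumers ever see it.

```python
from dataclasses import dataclass

# Hypothetical sketch only: these names are illustrative, not a vendor API.
@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: str          # expected Python type name, e.g. "int"
    nullable: bool = False

CONTRACT = [
    ColumnSpec("order_id", "int"),
    ColumnSpec("amount_usd", "float"),
    ColumnSpec("shipped_at", "str", nullable=True),
]

def enforce_contract(rows: list[dict]) -> None:
    """Reject a producer's batch before consumers ever see it."""
    specs = {c.name: c for c in CONTRACT}
    for i, row in enumerate(rows):
        missing = specs.keys() - row.keys()
        if missing:
            raise ValueError(f"row {i}: missing columns {sorted(missing)}")
        for name, spec in specs.items():
            value = row[name]
            if value is None:
                if not spec.nullable:
                    raise ValueError(f"row {i}: {name} is null but declared non-nullable")
            elif type(value).__name__ != spec.dtype:
                raise ValueError(
                    f"row {i}: {name} expected {spec.dtype}, got {type(value).__name__}"
                )

enforce_contract([{"order_id": 1, "amount_usd": 19.5, "shipped_at": None}])  # passes
```

Failing the check loudly at the producer boundary is what creates the accountability: the broken batch never propagates downstream.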
04. Scale AI generated $870 million in 2024 revenue and is tracking toward $2 billion in 2025 at a $20 billion valuation. Meta's investment signals that data labeling and curation are strategic priorities for the largest AI companies. The data labeling market is projected to grow from $3.7 billion to $17 billion by 2030 as AI training-data demands compound.
The plumbing that moves data from source to destination. The Fivetran-dbt Labs merger created an integrated extract-load-transform platform. Airbyte provides the open-source alternative with 350+ connectors. dlt offers a lightweight Python library for pipeline-as-code. Informatica continues to serve the enterprise segment. The pipeline layer is where data reliability begins.
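As a flavor of what pipeline-as-code means in practice, here is a minimal sketch using dlt's public API; the orders resource and the pipeline, destination, and dataset names are made up for illustration, with duckdb standing in for a real warehouse.

```python
import dlt

# Illustrative resource: the source is ordinary Python that yields rows.
@dlt.resource(table_name="orders", write_disposition="append")
def orders():
    yield [
        {"order_id": 1, "amount_usd": 19.5},
        {"order_id": 2, "amount_usd": 42.0},
    ]

pipeline = dlt.pipeline(
    pipeline_name="demo_pipeline",
    destination="duckdb",   # placeholder; swap for a warehouse in production
    dataset_name="demo_data",
)
load_info = pipeline.run(orders())  # dlt handles schema inference, loading, and state
print(load_info)
```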
Monte Carlo, Soda, Great Expectations, and Acceldata ($106 million raised) monitor data quality. Sifflet and Anomalo compete in anomaly detection. Data contracts formalize expectations between producers and consumers. As AI applications depend on data quality, the monitoring layer becomes a gating function for production deployment.
Confluent (acquired by IBM for $11 billion) validated the category. Redpanda offers a Kafka-compatible alternative claiming up to 10x lower latency. Tinybird, Materialize, and RisingWave compete in real-time analytics. WarpStream provides a cost-optimized streaming layer. Real-time data is becoming the default architecture for AI-powered applications.
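For a sense of how applications touch this layer, here is a minimal producer sketch against a Kafka-compatible broker (the protocol Redpanda and WarpStream also speak), using the open-source confluent-kafka Python client; the broker address, topic, and payload are placeholders.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Fires once the broker acknowledges (or rejects) the message.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}] @ offset {msg.offset()}")

producer.produce(
    "orders",
    key="order-1",
    value=b'{"order_id": 1, "amount_usd": 19.5}',
    callback=on_delivery,
)
producer.flush()  # block until outstanding messages are delivered
```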
Databricks Unity Catalog is becoming the default for Databricks users. Collibra, Alation, Atlan, and Secoda compete in standalone data catalogs and governance. As regulatory requirements grow and data estates expand, the catalog layer becomes essential for compliance and discoverability.
Gretel.ai and Mostly AI generate synthetic datasets for privacy-preserving AI training. Synthesis AI focuses on computer vision training data. The market is projected to grow from $0.5 billion to $2.7 billion by 2030. As real data becomes scarce and regulated, synthetic data provides a scalable alternative for model training.
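Mechanically, the simplest version of the idea is to fit distributions on real data and sample fresh rows from them. The deliberately naive, single-column sketch below is only an illustration; production systems use far richer generative models and formal privacy guarantees, and the "real" values here are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Fit a simple distribution on real values, then sample synthetic rows
# that preserve aggregate shape without copying any real record.
real_amounts = np.array([12.0, 19.5, 42.0, 7.25, 88.0])  # stand-in data
mu, sigma = real_amounts.mean(), real_amounts.std()

synthetic_amounts = rng.normal(mu, sigma, size=1_000).clip(min=0.0)
print(f"real mean={mu:.2f}, synthetic mean={synthetic_amounts.mean():.2f}")
```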
Scale AI, at a $20 billion+ valuation, is the category leader with $870 million in 2024 revenue. Labelbox ($190 million raised; HIPAA- and SOC 2-compliant) serves regulated industries. Surge AI and SuperAnnotate compete on quality and specialization. Meta's $14.3 billion investment in Scale AI signals that labeling is a strategic bottleneck for frontier AI development.
“Rows processed is a better signal than early ARR.”
The tool becomes a gating check in the production process. When data cannot move to production without passing through your system, you have workflow gravity that creates structural retention.
MotherDuck raised $100 million to commercialize DuckDB. Disciplined open-source-to-commercial conversion is the most capital-efficient go-to-market motion in data infrastructure. Community adoption reduces CAC. Commercial features capture value.
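Part of why that motion works here is DuckDB's in-process, zero-setup model, which makes the first community touch nearly frictionless; a minimal sketch (table contents are placeholders):

```python
import duckdb

# DuckDB runs in-process with nothing to deploy; connect() with no
# arguments gives an in-memory database.
con = duckdb.connect()
con.execute("CREATE TABLE orders AS SELECT * FROM range(5) t(order_id)")
print(con.sql("SELECT count(*) AS n FROM orders").fetchall())  # [(5,)]
```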
Defining interfaces and standards, not just building products. The companies that set the standard for how data contracts are enforced or how streaming APIs work will own the layer, not just occupy it.
Designed for unstructured data, vector embeddings, and semantic drift from the ground up. Data infrastructure built for the SQL era needs to be rebuilt for the AI era. The incumbents are adapting. The challengers are building natively.
Former data platform leads who built internal systems at scale. The best data infrastructure founders have operational scars from managing petabyte-scale pipelines. They build for the failure modes they have lived through.
Rows processed, pipelines monitored, and data quality checks executed matter more than early ARR. Usage at the pipeline boundary is the leading indicator that the tool has become infrastructure rather than an experiment.
Building data infrastructure?