Data is the raw material of intelligence.
“Every AI application sits downstream of a data pipeline. When that pipeline drifts, breaks, or delivers stale data, the intelligence built on top of it fails in ways that are often silent and costly.”
The AI era has a raw-material problem. Foundation models have commoditized; the bottleneck is clean, structured, timely data. Hyperscaler capital expenditure of $660–690 billion creates demand for data-estate readiness. The ETL market is projected to reach $18.6 billion by 2030. Data labeling is projected to grow from $3.7 billion to $17 billion over the same period. Streaming TAM exceeds $100 billion, per Confluent's estimates.
The convergence of AI adoption and data infrastructure demand is creating a generational investment opportunity. Every enterprise that deploys AI must first solve its data problem. The pipelines must be reliable. The quality must be measurable. The freshness must match the application's latency requirements. This is not a nice-to-have. It is a prerequisite.
We invest in the data infrastructure layer: the pipelines, quality contracts, streaming systems, and labeling platforms that sit upstream of every AI application. The model layer will continue to commoditize. The data layer will continue to appreciate.
01. Fivetran and dbt Labs merged in an all-stock transaction with roughly $600 million in combined revenue. The consolidation validates the thesis that the data pipeline layer is maturing into essential infrastructure. The combined entity owns the extract-load-transform workflow end to end, creating the first integrated data platform at scale.
02. Confluent's Q3 2025 revenue hit $298.5 million, with Cloud representing 56% of subscription revenue. IBM's $11 billion acquisition validates real-time streaming as critical enterprise infrastructure. The thesis: batch processing is insufficient for AI applications that require real-time context. Streaming is becoming the default data architecture.
03. dbt model contracts, Soda, and Monte Carlo are extending data quality guarantees to unstructured data. Data contracts formalize the interface between data producers and consumers, creating accountability in the data supply chain. As AI applications come to depend on data quality, contracts become the enforcement mechanism.
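To make that mechanism concrete, here is a minimal, hypothetical sketch of contract enforcement in plain Python. The ColumnSpec and enforce_contract names are illustrative, not the API of dbt, Soda, or Monte Carlo; the point is that a producer's batch is validated against a declared schema before consumers ever see it.

```python
from dataclasses import dataclass

# Hypothetical sketch only: these names are illustrative, not a vendor API.
@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: str          # expected Python type name, e.g. "int"
    nullable: bool = False

CONTRACT = [
    ColumnSpec("order_id", "int"),
    ColumnSpec("amount_usd", "float"),
    ColumnSpec("shipped_at", "str", nullable=True),
]

def enforce_contract(rows: list[dict]) -> None:
    """Reject a producer's batch before consumers ever see it."""
    specs = {c.name: c for c in CONTRACT}
    for i, row in enumerate(rows):
        missing = specs.keys() - row.keys()
        if missing:
            raise ValueError(f"row {i}: missing columns {sorted(missing)}")
        for name, spec in specs.items():
            value = row[name]
            if value is None:
                if not spec.nullable:
                    raise ValueError(f"row {i}: {name} is null but declared non-nullable")
            elif type(value).__name__ != spec.dtype:
                raise ValueError(
                    f"row {i}: {name} expected {spec.dtype}, got {type(value).__name__}"
                )

enforce_contract([{"order_id": 1, "amount_usd": 19.5, "shipped_at": None}])  # passes
```

Failing the check loudly at the producer boundary is what creates the accountability: the broken batch never propagates downstream.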
04. Scale AI generated $870 million in 2024 revenue and is tracking toward $2 billion in 2025 at a $20 billion valuation. Meta's investment signals that data labeling and curation are strategic priorities for the largest AI companies. The data labeling market is projected to grow from $3.7 billion to $17 billion by 2030 as AI training-data demands compound.
The plumbing that moves data from source to destination. The Fivetran-dbt Labs merger created an integrated extract-load-transform platform. Airbyte provides the open-source alternative with 350+ connectors. dlt offers a lightweight Python library for pipeline-as-code. Informatica continues to serve the enterprise segment. The pipeline layer is where data reliability begins.
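As a flavor of what pipeline-as-code means in practice, here is a minimal sketch using dlt's public API; the orders resource and the pipeline, destination, and dataset names are made up for illustration, with duckdb standing in for a real warehouse.

```python
import dlt

# Illustrative resource: the source is ordinary Python that yields rows.
@dlt.resource(table_name="orders", write_disposition="append")
def orders():
    yield [
        {"order_id": 1, "amount_usd": 19.5},
        {"order_id": 2, "amount_usd": 42.0},
    ]

pipeline = dlt.pipeline(
    pipeline_name="demo_pipeline",
    destination="duckdb",   # placeholder; swap for a warehouse in production
    dataset_name="demo_data",
)
load_info = pipeline.run(orders())  # dlt handles schema inference, loading, and state
print(load_info)
```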
Monte Carlo, Soda, Great Expectations, and Acceldata ($106 million raised) monitor data quality. Sifflet and Anomalo compete in anomaly detection. Data contracts formalize expectations between producers and consumers. As AI applications depend on data quality, the monitoring layer becomes a gating function for production deployment.
Confluent (acquired by IBM for $11 billion) validated the category. Redpanda offers a Kafka-compatible alternative claiming up to 10x lower latency. Tinybird, Materialize, and RisingWave compete in real-time analytics. WarpStream provides a cost-optimized streaming layer. Real-time data is becoming the default architecture for AI-powered applications.
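For a sense of how applications touch this layer, here is a minimal producer sketch against a Kafka-compatible broker (the protocol Redpanda and WarpStream also speak), using the open-source confluent-kafka Python client; the broker address, topic, and payload are placeholders.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Fires once the broker acknowledges (or rejects) the message.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}] @ offset {msg.offset()}")

producer.produce(
    "orders",
    key="order-1",
    value=b'{"order_id": 1, "amount_usd": 19.5}',
    callback=on_delivery,
)
producer.flush()  # block until outstanding messages are delivered
```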
Databricks Unity Catalog is becoming the default for Databricks users. Collibra, Alation, Atlan, and Secoda compete in standalone data catalogs and governance. As regulatory requirements grow and data estates expand, the catalog layer becomes essential for compliance and discoverability.
Gretel.ai and Mostly AI generate synthetic datasets for privacy-preserving AI training. Synthesis AI focuses on computer vision training data. The market is projected to grow from $0.5 billion to $2.7 billion by 2030. As real data becomes scarce and regulated, synthetic data provides a scalable alternative for model training.
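Mechanically, the simplest version of the idea is to fit distributions on real data and sample fresh rows from them. The deliberately naive, single-column sketch below is only an illustration; production systems use far richer generative models and formal privacy guarantees, and the "real" values here are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Fit a simple distribution on real values, then sample synthetic rows
# that preserve aggregate shape without copying any real record.
real_amounts = np.array([12.0, 19.5, 42.0, 7.25, 88.0])  # stand-in data
mu, sigma = real_amounts.mean(), real_amounts.std()

synthetic_amounts = rng.normal(mu, sigma, size=1_000).clip(min=0.0)
print(f"real mean={mu:.2f}, synthetic mean={synthetic_amounts.mean():.2f}")
```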
Scale AI, at a $20 billion+ valuation, is the category leader with $870 million in 2024 revenue. Labelbox ($190 million raised; HIPAA- and SOC 2-compliant) serves regulated industries. Surge AI and SuperAnnotate compete on quality and specialization. Meta's $14.3 billion investment in Scale AI signals that labeling is a strategic bottleneck for frontier AI development.
“Rows processed is a better signal than early ARR.”
The tool becomes a gating check in the production process. When data cannot move to production without passing through your system, you have workflow gravity that creates structural retention.
MotherDuck raised $100 million to commercialize DuckDB. Disciplined open-source-to-commercial conversion is the most capital-efficient go-to-market motion in data infrastructure. Community adoption reduces CAC. Commercial features capture value.
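Part of why that motion works here is DuckDB's in-process, zero-setup model, which makes the first community touch nearly frictionless; a minimal sketch (table contents are placeholders):

```python
import duckdb

# DuckDB runs in-process with nothing to deploy; connect() with no
# arguments gives an in-memory database.
con = duckdb.connect()
con.execute("CREATE TABLE orders AS SELECT * FROM range(5) t(order_id)")
print(con.sql("SELECT count(*) AS n FROM orders").fetchall())  # [(5,)]
```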
Defining interfaces and standards, not just building products. The companies that set the standard for how data contracts are enforced or how streaming APIs work will own the layer, not just occupy it.
Designed for unstructured data, vector embeddings, and semantic drift from the ground up. Data infrastructure built for the SQL era needs to be rebuilt for the AI era. The incumbents are adapting. The challengers are building natively.
Former data platform leads who built internal systems at scale. The best data infrastructure founders have operational scars from managing petabyte-scale pipelines. They build for the failure modes they have lived through.
Rows processed, pipelines monitored, and data quality checks executed matter more than early ARR. Usage at the pipeline boundary is the leading indicator that the tool has become infrastructure rather than an experiment.
Building data infrastructure?