Pre-Seed · Confidential

Cortex

The Compression Layer for AI

30x smaller models. Same intelligence.

Gaurav Gandhi gaurav@nonlinear.technology March 2026

01

AI's biggest cost isn't training.
It's running.

Training (one-time)Inference (ongoing)

GPT-4 class~$100M~$1B+/year

Enterprise (10 models)~$50M~$500M/year

80% of inference cost is moving weight bytes from memory — not computing.

14 GBloaded per token (7B model)

140 GBloaded per token (70B model)

500 GBloaded per token (250B model)

Now multiply by every domain specialist an enterprise needs: medical, legal, finance, support, compliance...

100 specialized 7B models = 1.4 TB of weights.

On a single GPU? Impossible.

The industry spends $50B+/year on inference infrastructure. Most of it is moving bytes, not doing math.

02

Existing solutions fall short.

ApproachCompressionQualityFundamental Limit

Quantization (INT4)4xReasoning degradesCan't go below 4 bits

Pruning2-3xRemoves capacityAccuracy drops fast

Distillation1xSmaller model is still that sizeLost expressiveness

LoRA adapters1xAdds on topStill loads full 14 GB base

Prior weight sharing2-3xModerate quality lossPost-hoc, structural limits

The gap: No existing method achieves >4x compression while preserving downstream task quality on reasoning benchmarks.

We close that gap. By 10-20x.

03

What we built.

Proprietary training method producing models with 8-57x fewer stored parameters while maintaining full performance on standard benchmarks.

Standard 7BCompact 7BReduction

Model storage (FP16)14.0 GB749 MB19x

Model storage (INT4)3.5 GB200 MB18x

100 domain specialists1,400 GB1.2 GB1,170x

Specialist swap timeseconds12 msInstant

Composes with quantization (our compression x INT4 = multiplicative)

No decompression at inference — weights never fully materialized

Compute overhead: ~6% (negligible vs memory savings)

04

168M stored parameters.
Beats a 1.5B standard model.

Both models trained on identical data (5B tokens, OpenWebText). Fair comparison.

ModelStored ParamsHellaSwagARC-EARC-CMMLUGSM8K

1.5B Standard1,410M38.0%43.0%20.4%25.3%1.0%

7B Compact (SFT)168M38.8%46.0%23.6%26.4%1.0%

The compact 7B stores 88% fewer parameters than the 1.5B standard yet beats it on all six benchmarks.

Compression grows with scale

ScaleCompressionStandardCompact

7B31x10 GB329 MB

70B (projected)78x118 GB1.5 GB

250B (projected)99x493 GB5.0 GB

At 250B: Standard needs 7 interconnected chips ($200K). Compact fits on one chip.

05

6 compression levels. 6 benchmarks.
Independently reproducible.

Results at 1.5B scale. These are expected baselines for models trained on 5B tokens — the point is parity at 8-57x compression, not absolute score.

ModelCompressionHellaSwagARC-EARC-CMMLUGSM8K

Standard (baseline)1x38.0%43.0%20.4%25.3%1.0%

Compact8x37.4%40.2%21.2%26.7%0.0%

Compact13x35.6%40.6%20.0%25.8%1.5%

Compact25x32.4%37.2%20.4%22.9%0.5%

Compact57x31.4%39.0%19.4%25.1%0.0%

Bold = matches or beats standard.

At 8x: matches on every benchmark, beats on MMLU (+1.4pp) and ARC-C (+0.8pp)

At 25x: ties Standard on ARC-Challenge — the hardest reasoning test

At 57x: matches Standard on MMLU despite storing 57x fewer parameters

06

Not just language.

Validated across 4 architectures and 3 modalities.

DomainArchitectureMetricStandardCompact (~5x)Delta

Language (1.5B)GPT-NeoXMMLU25.3%26.7%+1.4pp

Language (7B)GPT-NeoXARC-C20.4%22.4%+2.0pp

Language (0.8B)Qwen 3.5HellaSwag25.6%28.6%+3.0pp

VisionViT-SmallTop-1 Acc73.2%72.5%-0.7pp

Image GenDiTFID13.016.3+3.3

This is a universal property of multi-layer transformers.

07

One base model. Unlimited experts.
5 MB each. 100 for $50.

SpecialistDomain BenchmarkGainGeneral Loss

MedicalMedMCQA+8.4 pp-0.8pp

MathGSM8K8x improvement-0.6pp

ScienceMMLU Science+4.0 pp-2.6pp

Traditional

1,400 GB

LoRA

30 GB

Compact

1.2 GB

Same quality. 1,170x less storage. 12 ms hot-swap. Per-request domain routing from one GPU.

08

Why can't someone just replicate this?

Training from scratch required

Converting existing models to compact form fails catastrophically (1,000x worse). Models must be trained with our method from the beginning.

The method is not obvious

Prior work achieves 2-3x. We achieve 8-57x. The gap is 10-20x. Peer-reviewed publication forthcoming.

Trained weights are the product

Training a 70B compact model costs $500K+ in compute. Our pre-trained model zoo is the moat.

Ecosystem lock-in

Cortex SDK + model zoo + specialist marketplace. Once customers build on our format, switching means retraining everything.

We have working 7B models today. Replication from a paper takes 6-12 months minimum.

09

Three layers of revenue.

Cortex Cloud (API)

Multi-tenant inference. 100 specialists on 1 GPU.
$0.10/M tokens (10x cheaper at same quality). 80%+ gross margin.

Model Compression Service

Bring your architecture, we train it compact.
$50K-500K per engagement. 10-20% of original training cost.

Specialist Training Platform

Upload domain data, get a 5 MB specialist module.
Self-service. $50-5K per specialist.

Unit Economics

TraditionalCompact

Base model training$200K$200K

Each additional specialist$200K$50

100 specialists$20M$205K

Inference GPU cost (100 models)$200K/yr (100 GPUs)$2K/yr (1 GPU)

Future: License compact-optimized ASIC pipeline design to chip makers. Royalty per chip. Cycle-accurate simulation validated, 0% overhead.

10

$50B+ AI inference market.
Growing 40% YoY.

$20B

Cloud AI serving

GPU cost is #1 expense. 30x fewer GPUs per model.

$10B

Enterprise multi-model

Need 10-100 specialists. 1 base + 5 MB each.

$8B

On-device AI

Models too large for phones/IoT. 7B in 200 MB.

$5B

AI silicon

Memory bandwidth is the wall. 30x less data to move.

$5B

Regulated industries

Can't send data to cloud. Full model on local hardware.

Beachhead: Enterprise multi-tenant serving in regulated verticals (healthcare, finance, legal) where domain specialists are needed and data can't leave the building.

11

Competitive position.

Max CompressionQuality?Multi-Tenant?

Nonlinear (us)8-57xYes (beats on some)5 MB per specialist

Quantization (GPTQ/AWQ)4xDegrades on reasoningNo

LoRA/QLoRA1xYes160 MB per adapter

ByteDance (LoopLM)2-3xModerate PPL gapNo

Google DeepMind2xPost-hoc, limitedNo

Neural Magic (IBM)2-4xModerateNo

10-20x ahead of nearest competitor. Our compression stacks with quantization: Compact + INT4 = 120x total reduction.

12

Built, not planned.

7B compact model (30x compression)

1.5B models (6 variants, 8-57x)

Cross-modal validation (ViT, DiT, Qwen)

Domain specialists (math, medical, science)

Cortex demo (web UI + live inference)

ASIC pipeline simulator (cycle-accurate)

ESP32 MCU prototype ($5 chip)

Academic paper (ready for arXiv)

Next milestones

Q2 2026Paper on arXiv, 70B training begins

Q3 202670B results (projected 78x), first enterprise pilot

Q4 2026Cortex Cloud API launch, ASIC partnership

Q1 202710 enterprise customers, Series A ready

13

Team.

Gaurav Gandhi — Founder & CEO

PhD in Infobionics (Nonlinear Networks & Chaos Theory) — Budapest, 2008, Summa Cum Laude

Research collaboration with Leon Chua, UC Berkeley (memristor dynamics, IEEE Cover)

Leaders in Innovation Fellow, Royal Academy of Engineering

2 US Patents in computing architectures

Built mLabs developer community (20,000+ members)

20 years applying nonlinear mathematics to real-world systems

Hiring with this round: 2-3 ML engineers (systems + training), 1 product engineer

14

The Ask.

Seed Round · 2026

AllocationAmountOutcome

70B compact model training$500KHeadline: 70B quality in 1.5 GB

MoE experiment (8x7B)$500KDeepSeek V3-class in ~5 GB

Engineering team (3 hires, 18 mo)$1.5MCortex SDK, Cloud API, model zoo

Compute + infrastructure$500KCustomer training, serving

Operations + GTM$500K-1MFirst enterprise customers

Comparable outcomes

MosaicML$1.3Bacq. Databricks

Together AI$1.25Bvaluation

Modular/Mojo$600Mvaluation

OctoAI~$250Macq. NVIDIA

Neural Magic~$200Macq. IBM

15

Why now. Why us.

Why now

Models getting bigger every quarter. Compression value grows with scale.

Inference cost now exceeds training cost industry-wide.

On-device AI exploding. Models are too big. We make them fit.

Multi-tenant demand has no efficient solution today.

Why us

20 years of nonlinear mathematics research led to this — not accidental.

Working models, not theory. Benchmarked across 4 architectures.

10-20x ahead of nearest published work.

Replication takes 6-12 months even with the paper.

1 model per GPU. $30K per deployment. → 100 models per GPU. $300 per deployment.

70B needs a server rack. $200K. → 70B on one chip. $2K.

7B won't fit on a phone. → 7B in 200 MB. Runs everywhere.

New specialist = full retrain. $200K. → New specialist = 5 MB module. $50.

Nature runs the human brain — 100 trillion synaptic connections — on 20 watts. Not by making neurons bigger. By making structure recursive.

We apply the same principle to artificial intelligence.

nonlinear.technology

Investor Materials