Mistral's Leanstral 1.5 found 5 real bugs scanning open-source repos — here's what formal verification in production looks like

Mistral released Leanstral 1.5 yesterday. It's an open-source model trained specifically for Lean 4, the formal verification language used to prove mathematical theorems and verify code correctness. The benchmark numbers look good — miniF2F-valid 74.4%, ProofNet 60.8% — but the real news is the bug-finding run.

Mistral scanned 57 open-source repositories with Leanstral 1.5. The model found 5 previously unknown bugs. Not edge cases in toy problems — real bugs in production Lean codebases that human reviewers missed.

That's the shift. Formal verification used to be academic infrastructure: theorem provers grinding through proof obligations for aerospace firmware or cryptographic protocols. Leanstral 1.5 is the first model we've seen ship as a bug-finding tool you can point at an arbitrary Lean repo and get actionable results.

What formal verification actually is

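Lean 4 is a proof assistant and a programming language. You write code, then you write a formal proof that the code satisfies a specification. Lean's kernel checks the proof mechanically. If the proof checks, the code is correct with respect to the spec for every input, not just the inputs a test suite or fuzzer happened to try. No statistical confidence, just mathematical certainty, with one caveat: the guarantee is only as strong as the spec, and a proof says nothing about properties nobody wrote down.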
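To make that concrete, here is a minimal sketch in Lean 4. The function `clamp` and the theorem `clamp_le_hi` are our own illustrative names, not taken from Mistral's announcement or from any library, and the tactic script assumes a recent toolchain where `omega` ships with core Lean.

```lean
-- Hypothetical example code: force x into the range [lo, hi].
def clamp (lo hi x : Nat) : Nat :=
  if x < lo then lo else if hi < x then hi else x

-- The specification, stated as a theorem: whenever lo ≤ hi,
-- the result never exceeds hi. This holds for all inputs at once.
theorem clamp_le_hi (lo hi x : Nat) (h : lo ≤ hi) : clamp lo hi x ≤ hi := by
  unfold clamp
  -- Case-split on both `if`s, then let `omega` close each
  -- linear-arithmetic goal using the hypotheses in scope.
  split <;> (try split) <;> omega

-- Ordinary evaluation still works alongside the proof.
#eval clamp 2 5 9  -- 5
```

Note what the theorem does not say: it bounds the result from above only. Whether `clamp` also respects the lower bound stays unverified until someone states and proves that property too.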

The problem has always been labor. Writing formal proofs can take 10× to 100× longer than writing the code itself. That's why formal verification stayed in niches where correctness justifies the cost: seL4 (an OS kernel verified in Isabelle/HOL), CompCert (a C compiler verified in Coq), and cryptographic libraries.

Leanstral 1.5 changes the economics. Instead of a human writing the proof, the model proposes proof tactics. Lean still checks everything, so the model can't fake a proof, but the human labor can drop from weeks to hours.
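The division of labor can be sketched like this. A human writes the statement, the proof body starts out as `sorry` (which Lean accepts with a warning), and the model's job is to replace it with tactics that actually check. The theorem names below are hypothetical, and the candidate proof is simply one that Lean accepts, not output captured from Leanstral.

```lean
-- Statement written by a human; the proof is left open.
-- Lean compiles this but warns that the declaration uses `sorry`.
theorem swap_add (a b : Nat) : a + b = b + a := by
  sorry

-- A candidate proof of the same statement. It counts only because Lean
-- checks it: `Nat.add_comm` is a core lemma with exactly this conclusion.
theorem swap_add_checked (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b

-- A false statement cannot be closed by a tactic script that avoids `sorry`
-- and extra axioms, however plausible the script looks. Uncommenting the
-- next two lines produces an error, not a theorem.
-- theorem bogus (a b : Nat) : a + b = a := by
--   omega
```

This is why a wrong suggestion from the model costs time rather than correctness: a bad proof is rejected by the checker instead of slipping through review.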

The bug-finding run

Mistral didn't just benchmark Leanstral 1.5 on academic test sets. They pointed it at Mathlib4 (the standard Lean math library), the Lean 4 compiler itself, and 55 other repos pulled from GitHub. The model generated proofs, flagged inconsistencies, and surfaced 5 bugs that no one had reported.

We don't have the bug details yet — Mistral's announcement doesn't link to specific issues or PRs. But the fact that they ran this scan at all is the signal. Formal verification is moving from "prove this function correct" to "scan this codebase and find what's wrong."

That's a different use case. Most production codebases don't have formal specs. Leanstral isn't replacing unit tests or linters. The pitch is that it finds logic errors those tools miss: the model reads code, infers invariants, and flags violations.

Open weights

Leanstral 1.5 is released under Apache 2.0. You can download the weights, run inference locally, and fine-tune on your own proof corpus. That matters because formal verification pipelines need to be auditable. If you're using a model to verify critical code, you want to pin the exact model version, reproduce the benchmark results, and ideally inspect how it was trained. Open weights cover the first two; whether the training data is published is a separate question.

Mistral has been positioning itself as the open-weights alternative to Anthropic and OpenAI. Leanstral 1.5 is the clearest example yet of what that means in practice. Claude Code and GPT Codex can generate Lean proofs if you prompt them carefully, but you can't audit their training, you can't fine-tune them on domain-specific theorems, and you can't run them airgapped.

What we're deploying

We haven't deployed Leanstral 1.5 for any client yet. Formal verification isn't on the roadmap for SMB voice agents or CRM tooling. But we're watching the model for one specific reason: proof-of-correctness for agent tool calls.

Right now, when an agent calls a tool — book_appointment(date, time, service) — we validate the arguments with Zod schemas and runtime assertions. Those checks catch type errors and out-of-range values, but they don't prove the tool call satisfies the business logic. Leanstral opens the door to formal specs for tool schemas. Define the invariants once, let the model generate proofs that every tool call respects them.
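As a rough sketch of what such a spec could look like, the Lean 4 snippet below models a simplified `book_appointment` argument record. Everything in it is an assumption made for illustration: the `BookingArgs` fields, the 09:00 to 17:00 opening hours, and the `validate` function standing in for a runtime check of the kind a Zod schema performs. It is not our production schema.

```lean
-- Hypothetical arguments for book_appointment; times are minutes since midnight.
structure BookingArgs where
  startMin : Nat
  durationMin : Nat

-- Hypothetical business rule: the appointment fits inside 09:00 to 17:00.
def withinHours (b : BookingArgs) : Prop :=
  540 ≤ b.startMin ∧ b.startMin + b.durationMin ≤ 1020

-- Stand-in for the runtime validator: a Boolean check on the arguments.
def validate (b : BookingArgs) : Bool :=
  decide (540 ≤ b.startMin) && decide (b.startMin + b.durationMin ≤ 1020)

-- Proof obligation: every call the validator accepts satisfies the rule.
theorem validate_sound (b : BookingArgs) (h : validate b = true) :
    withinHours b := by
  simpa [validate, withinHours] using h

#eval validate { startMin := 600, durationMin := 30 }  -- true
#eval validate { startMin := 990, durationMin := 60 }  -- false
```

The theorem ties the runtime check to the stated rule, so the two cannot silently drift apart. It does not show that the rule itself is the right one; that part still has to come from whoever owns the business logic.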

That's still research, not production. But the gap between "academic curiosity" and "deployed next quarter" just narrowed.

Mistral's bug-finding run is the proof of concept. If the model can scan arbitrary Lean repos and surface real bugs, the same approach could plausibly check agent tool calls against a formal spec and surface violations. The question is whether writing formal specs for agent tools is cheaper than the runtime bugs they prevent. For most SMB agents, probably not. For high-stakes workflows such as healthcare scheduling, financial transactions, and logistics coordination, maybe yes.

