
Mozilla used Claude Mythos Preview to find hundreds of Firefox vulnerabilities — here's what changed

Mozilla's security team got early access to Claude Mythos Preview and used it to locate hundreds of real vulnerabilities in Firefox. The write-up is worth reading for what it says about the gap between AI slop and production tooling.

May 8, 2026 · 4 min read
security · claude-mythos · vulnerability-detection · firefox

Mozilla published a detailed write-up yesterday on their use of Claude Mythos Preview to hunt vulnerabilities in Firefox. The results are specific enough to be interesting.

The security team ran Mythos against Firefox's codebase and got back hundreds of bug reports. Most were real. That's the headline, but the context matters more.

The slop problem they cite

Mozilla's post opens with a reference to the AI-generated security report problem that's been plaguing open-source projects for the last year. Automated tools scan repos, generate plausible-sounding vulnerability reports, and maintainers waste hours triaging false positives. The signal-to-noise ratio has been bad enough that some projects now auto-close AI-generated issues.

Mythos changed that math for Mozilla. They write: "Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop. Dealing with reports that look plausibly correct but are wrong imposes a real cost on open source maintainers. Suddenly, the bugs are very good."

The shift happened because Mythos has a longer context window (1M+ tokens) and better code reasoning than the models that produced the slop. Mozilla fed it entire subsystems of Firefox at once. The reports that came back were specific, referenced actual call chains, and pointed to real memory safety issues.

What they actually did

Mozilla's security team used Mythos in two modes. First, they ran it against high-risk areas of the Firefox codebase — media parsers, network stack, graphics subsystems. The model identified several hundred potential vulnerabilities. The team triaged all of them. Most were legitimate.
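Mozilla doesn't publish the harness they used, but the first mode is straightforward to picture: concatenate a subsystem, send one long-context request, parse the findings. Here's a minimal sketch assuming the Anthropic Python SDK; the model id `claude-mythos-preview` and the subsystem path are hypothetical, since no public identifier has been announced.

```python
# Sketch: scan a high-risk subsystem in a single long-context request.
# Assumes the Anthropic Python SDK. The model id "claude-mythos-preview"
# is hypothetical -- no public identifier has been announced.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def scan_subsystem(root: str, extensions=(".cpp", ".h")) -> str:
    """Concatenate an entire subsystem and ask for vulnerability reports."""
    sources = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix in extensions:
            # Tag each file so the report can cite concrete locations.
            sources.append(f"// FILE: {path}\n{path.read_text(errors='ignore')}")

    response = client.messages.create(
        model="claude-mythos-preview",  # hypothetical model id
        max_tokens=8192,
        messages=[{
            "role": "user",
            "content": (
                "Audit the following code for memory safety issues "
                "(use-after-free, out-of-bounds reads/writes, integer "
                "overflow). For each finding, cite the file, the call "
                "chain, and the line where the bug triggers.\n\n"
                + "\n\n".join(sources)
            ),
        }],
    )
    return response.content[0].text


print(scan_subsystem("dom/media"))  # e.g. a media-parsing subsystem
```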

Second, they used Mythos to review patches before merge. A developer would submit a fix for a known vulnerability, and the security team would run the diff through Mythos to check if the fix introduced new issues or missed edge cases. That caught problems that human reviewers missed.
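The write-up doesn't show the tooling for this mode either, but it reduces to piping a diff through the same API. A sketch under the same assumptions (Anthropic SDK, hypothetical model id):

```python
# Sketch: pre-merge patch review. Same assumptions as the previous
# snippet: Anthropic Python SDK, hypothetical "claude-mythos-preview" id.
import subprocess

import anthropic

client = anthropic.Anthropic()


def review_fix(base: str = "origin/main") -> str:
    """Ask whether a security fix introduces new issues or misses edge cases."""
    diff = subprocess.run(
        ["git", "diff", base],
        capture_output=True, text=True, check=True,
    ).stdout

    response = client.messages.create(
        model="claude-mythos-preview",  # hypothetical model id
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "This diff is a fix for a known vulnerability. Does it "
                "introduce new issues or miss edge cases (error paths, "
                "concurrent access, integer edge values)? Be specific "
                "about files and lines.\n\n" + diff
            ),
        }],
    )
    return response.content[0].text
```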

The write-up includes example reports. One involved a use-after-free in the media decoder. Mythos flagged the pattern, traced the object lifecycle across three files, and pointed to the specific line where the dangling pointer could be dereferenced. Mozilla confirmed it, wrote the patch, shipped it.

The gap this exposes

The interesting part is the gap between what Mozilla could do with early Mythos access and what most teams can do with publicly available models. Claude 3.5 Sonnet (the current production model) has a 200K token context window. That's enough for single-file analysis, maybe a few related files, but not enough to trace call chains across a large codebase.

Mythos Preview (and presumably the upcoming public release) has 5× that capacity. The difference shows up in the quality of the reports. A model that can hold more of the system in context can reason about interactions between components, not just isolated functions.

This matters for production deployments because most agentic systems we build have the same problem. A voice agent that only sees the current call transcript can't reason about patterns across all calls this month. A CRM tool that can only load one customer record at a time can't spot trends. Context window size is still a constraint, even as the models improve.
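To put numbers on that constraint: a common rule of thumb is roughly four characters per token for English text, so you can estimate up front whether a corpus fits. The transcript size below is an assumption for illustration, not a measured figure.

```python
# Sketch: order-of-magnitude check on whether a corpus fits a context
# window. The ~4 chars/token heuristic is rough; real tokenizers vary.


def estimated_tokens(chars: int) -> int:
    """Rough token estimate for English text."""
    return chars // 4


# Assumption: 800 call transcripts at ~3,000 characters each.
corpus_chars = 800 * 3_000
tokens = estimated_tokens(corpus_chars)

print(f"~{tokens:,} tokens")                        # ~600,000 tokens
print("fits in 200K window:", tokens <= 200_000)    # False
print("fits in 1M window:  ", tokens <= 1_000_000)  # True
```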

What we'd change if we had Mythos access

We run evals on every agent deployment, but the evals are limited by what we can fit in context. A typical eval run for Goldie (Best Limo NY's voice agent) involves feeding the model 10–15 example transcripts and checking if it handles each scenario correctly. We can't feed it all 800 calls from last month and ask it to find the edge cases.

If we had Mythos-level context, we'd rewrite the eval pipeline. Load every call from the last 90 days, every CRM note, every dispatch log. Ask the model: what patterns did the agent miss? What customer requests didn't get routed correctly? What edge cases show up in real usage but not in our test scenarios?
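Concretely, the whole pipeline could collapse into one query. A sketch under the same assumptions as the earlier snippets, with placeholder directories standing in for wherever the call, CRM, and dispatch records actually live:

```python
# Sketch: whole-corpus eval in one request. Same assumptions as earlier:
# Anthropic Python SDK, hypothetical "claude-mythos-preview" model id.
# The logs/* directories are placeholders for your own data store.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()


def load_records(root: str) -> str:
    """Placeholder loader: read every text log under a directory."""
    return "\n\n".join(
        f"=== {p.name} ===\n{p.read_text(errors='ignore')}"
        for p in sorted(Path(root).glob("*.txt"))
    )


def find_missed_patterns() -> str:
    corpus = "\n\n".join([
        load_records("logs/calls-90d"),   # every call transcript
        load_records("logs/crm-notes"),   # every CRM note
        load_records("logs/dispatch"),    # every dispatch log
    ])
    response = client.messages.create(
        model="claude-mythos-preview",  # hypothetical model id
        max_tokens=8192,
        messages=[{
            "role": "user",
            "content": (
                "These are 90 days of production records for a voice "
                "agent. What patterns did the agent miss? Which customer "
                "requests were routed incorrectly? Which edge cases show "
                "up in real usage but not in scripted test scenarios?\n\n"
                + corpus
            ),
        }],
    )
    return response.content[0].text


print(find_missed_patterns())
```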

That's the same analysis Mozilla did with Firefox. Find the real problems by looking at the whole system, not just the unit tests.

The timing question

Mozilla had early access to Mythos Preview. The public release timeline is unclear. Anthropic announced at Code w/ Claude yesterday that Mythos is coming, but didn't give a date. Mozilla's write-up implies they've been using it for months.

That creates a window where teams with early access (enterprises, research labs, Anthropic's direct customers) can do vulnerability analysis and system-wide debugging that everyone else can't. The gap will close when Mythos ships publicly, but for now it's real.

We're waiting for public access. The use case is obvious. Load VioX OS (our internal tooling platform) into Mythos, ask it to find integration bugs we missed. Load all of Goldie's call logs, find the edge cases. Load every client's deployment config, check for security issues.

Mozilla's write-up is the first detailed public case study of what Mythos can do on a real production codebase. The results are specific enough to be credible. If the public release matches what Mozilla describes, it's going to change how we run evals and debug production systems.


Tell us what you'd like built.

Send us a paragraph about the workflow, phone line, or tool you want built. We'll reply within one business day with a one-page plan, a fixed price, and a delivery date you can put on a calendar.

  • 30-min scoping call, free
  • Written proposal within 48 hours
  • Fixed price before we start
  • Most builds delivered in 2–8 weeks