The Threshold Report - methodology

The Threshold Report tracks one question: what crossed the line from AI demo, rumor, or roadmap into something real?

It is not a hype meter and not an "AGI progress" gauge. It is a ledger. Every point on the index traces to a logged event with a primary source you can click. This page describes the complete scoring system. A model suggests entries, but code assigns the points from the fixed rubric on this page.

The six lanes

Lane	Covers	In plain English
Text	Language models, reasoning, coding, long-context work, research/document analysis, writing, translation	AI is getting better at turning messy instructions into usable work.
Audio	Speech-to-text, text-to-speech, voice agents, real-time translation, music, dubbing	AI is getting better at listening, speaking, translating, and sounding human.
Video	Text/image-to-video, editing, avatars, scene consistency, motion and physics realism	AI video is moving from weird clips toward usable scenes.
Agents	Tool use, computer/browser use, digital task completion, autonomous workflows, cross-modal reasoning that is not primarily physical robotics	AI is moving from answering questions to doing multi-step digital tasks.
Robots	AI-enabled physical robots, including humanoids and non-humanoids, drones, warehouse robots, domestic robots, manipulation, navigation, dexterity, and real-world fleet deployment	AI is moving from screens into physical machines.
Other	Chips, accelerators, inference efficiency, data centers, cooling, power supply, grid deals, energy infrastructure, space or unconventional data centers, networking, memory, storage, safety/security tooling, standards, and other deployment bottlenecks	The enabling layer that determines whether AI becomes cheaper, faster, more reliable, and easier to deploy.

Each lane has its own cumulative index, starting at 100 on the launch date. Lanes are not comparable to each other: Video at 187 versus Text at 162 means Video has logged more scored movement since launch, not that "video AI is more capable than text AI."
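
As a rough illustration of what "cumulative" means here, the sketch below assumes the index is simply the baseline of 100 plus the running sum of that lane's weekly deltas. The additive reading and the function name `index_series` are assumptions for illustration, and the weekly deltas are made up.

```python
from itertools import accumulate

BASELINE = 100  # every lane starts at 100 on the launch date


def index_series(weekly_deltas):
    """Return the lane's index value after each week.

    Assumes the index is additive: baseline plus the running sum of deltas.
    """
    # accumulate(..., initial=BASELINE) yields the baseline first, so drop it.
    return list(accumulate(weekly_deltas, initial=BASELINE))[1:]


# Hypothetical weekly deltas for a single lane.
print(index_series([10, 0, 7, 20]))  # [110, 110, 117, 137]
```

Under this reading, a lane at 187 has logged 87 net points since launch, which is why two lanes' values say nothing about relative capability.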

The one rule that defines the publication

Demos and announcements score zero until they become real under that lane's rule.

For Text, Audio, Video, and Agents, "real" means the public can use the capability today, including paid access or open weights. Demos, papers, waitlists, private previews, and coming-soon posts score zero.

For Robots, the standard is physical-world availability rather than consumer availability. A robot event can score only when the product or capability is orderable, shipping, commercially deployed, in a real customer or partner pilot, or operating in a real-world field deployment. A keynote demo, lab video, staged prototype reveal, or paper-only result scores zero even if it looks impressive.

For Other, the standard is concrete deployment or capacity. An event can score if infrastructure is live, capacity is online, hardware is shipping, a model-serving efficiency improvement is available, a power or data-center deal is signed, construction has started, or a deployment-relevant tool or standard is released for real use. A concept, feasibility study, unsourced rumor, or general roadmap scores zero.

Items that matter but cannot score yet sit on the public "not yet" shelf at 0 points. If they later become usable, deployed, signed, online, shipping, or otherwise concrete, the later event can score.
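
The per-lane rules above could be encoded as a lookup from lane to the statuses that count as "real". This is only a sketch: the status labels are hypothetical names that paraphrase the wording of the rules, and the function `can_score` is not taken from the actual pipeline.

```python
# Hypothetical status labels, paraphrasing the rules for each lane.
PUBLIC_USE = {"publicly_usable", "paid_access", "open_weights"}
ROBOT_REAL = {
    "orderable",
    "shipping",
    "commercially_deployed",
    "customer_or_partner_pilot",
    "field_deployment",
}
OTHER_REAL = {
    "infrastructure_live",
    "capacity_online",
    "hardware_shipping",
    "efficiency_improvement_available",
    "deal_signed",
    "construction_started",
    "tool_or_standard_released",
}

QUALIFYING = {
    "Text": PUBLIC_USE,
    "Audio": PUBLIC_USE,
    "Video": PUBLIC_USE,
    "Agents": PUBLIC_USE,
    "Robots": ROBOT_REAL,
    "Other": OTHER_REAL,
}


def can_score(lane, status):
    """True if an event with this status is 'real' under the lane's rule.

    Any status not listed for the lane (demo, waitlist, paper, roadmap, ...)
    falls through to False, which corresponds to the "not yet" shelf.
    """
    return status in QUALIFYING[lane]


print(can_score("Text", "waitlist"))       # False
print(can_score("Robots", "shipping"))     # True
```

Listing only the qualifying statuses keeps the default at zero: anything the rubric does not explicitly name as real cannot score.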

Event tiers and points

Every qualifying event is classified into one tier. Points are fixed. The tier decides the score, never vibes.

Tier	Points	Definition	Test
Milestone	+10	A genuinely new capability, deployment, or enabling capacity that is usable, deployed, or concrete now	"Did this create a new kind of thing users, operators, customers, or the AI ecosystem can rely on this week?"
Notable	+5	A clear, verified improvement to an existing capability: major quality/reliability jump, frontier-class model release, materially better robot deployment, meaningful infrastructure/cost/power/capacity progress, or tracked public benchmark movement tied to public access	"Is the same task or bottleneck now meaningfully better for real use?"
Minor	+2	Access expansion, significant price cut, open-weights release of near-frontier ability, existing capability reaching a major new platform/language/customer segment, smaller deployed robot progress, or smaller infrastructure efficiency/access improvement	"Can meaningfully more people or operators use what already existed, or use it more cheaply?"
Regression	-3	A shipped capability, deployment, infrastructure path, or access path is withdrawn, materially degraded, delayed, or shown to fail in ways that change what users can rely on	Requires primary evidence

Caps to prevent launch-week inflation

Big weeks should look big, but a coordinated PR blitz should not distort the index.

Per lane per week: at most 1 milestone, 2 notable, and 3 minor events score.
Per lane per week: total delta is capped at +20.
Events past the cap are still logged at 0 points so the record stays complete.
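
A minimal sketch of how the tier points and weekly caps might combine for one lane. Several details are assumptions, since the rules above do not spell them out: events are processed in logged order, regressions have no count cap, and the +20 limit is applied by clipping the week's total rather than by zeroing individual events. The names `score_week`, `WEEKLY_COUNT_CAPS`, and `WEEKLY_DELTA_CAP` are illustrative.

```python
from collections import Counter

# Fixed points per tier, from the tier table.
TIER_POINTS = {"milestone": 10, "notable": 5, "minor": 2, "regression": -3}
# Per lane per week: at most 1 milestone, 2 notable, 3 minor events score.
WEEKLY_COUNT_CAPS = {"milestone": 1, "notable": 2, "minor": 3}
WEEKLY_DELTA_CAP = 20


def score_week(tiers):
    """Score one lane's events for one week.

    `tiers` lists each event's tier in logged order (an assumption).
    Returns (points awarded per event, capped weekly delta).
    """
    seen = Counter()
    awarded = []
    for tier in tiers:
        seen[tier] += 1
        cap = WEEKLY_COUNT_CAPS.get(tier)  # None means no count cap is stated
        over_cap = cap is not None and seen[tier] > cap
        # Events past the count cap are still logged, at 0 points.
        awarded.append(0 if over_cap else TIER_POINTS[tier])
    # Assumption: the +20 limit clips the week's total.
    delta = min(sum(awarded), WEEKLY_DELTA_CAP)
    return awarded, delta


busy_week = ["milestone", "milestone", "notable", "notable",
             "minor", "minor", "minor", "minor"]
print(score_week(busy_week))  # ([10, 0, 5, 5, 2, 2, 2, 0], 20)
```

The count caps alone allow up to 10 + 2×5 + 3×2 = 26 points, so the +20 total cap is what binds in the busiest possible week.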

Weekly delta bands

The week's movement per lane is described in plain words:

Delta	Band
0-2	Quiet week
3-5	Incremental
6-11	Meaningful movement
12-19	Major leap
20 (cap)	Breakthrough week
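
The band table can be read as a simple threshold lookup, sketched here. The table does not define a band for negative weeks (for example, a week with only regressions), so this sketch returns `None` in that case; that choice and the function name `delta_band` are assumptions.

```python
def delta_band(delta):
    """Map a lane's weekly delta to its plain-words band."""
    if delta < 0:
        return None  # assumption: the table defines no band for negative weeks
    if delta <= 2:
        return "Quiet week"
    if delta <= 5:
        return "Incremental"
    if delta <= 11:
        return "Meaningful movement"
    if delta <= 19:
        return "Major leap"
    return "Breakthrough week"  # only reachable at the +20 cap


print(delta_band(7))   # Meaningful movement
print(delta_band(20))  # Breakthrough week
```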

Verification policy

An event only scores if it is backed by a primary source: an official announcement, release notes, model card, pricing page, tracked public leaderboard, customer deployment post, signed-deal announcement, product page, regulatory filing, or similar direct evidence.

Items surfaced only by newsletters, aggregators, or Gmail are logged as unverified at 0 points unless they include a primary link. If a primary source is found later, the entry can be promoted by editing its tier and points; the promotion is visible in git history.
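
One way to picture the unverified-to-scored promotion is sketched here. The `Entry` fields, the helper names, and the example URL are all hypothetical; the real ledger schema is not shown on this page.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Fixed points per tier, from the tier table.
TIER_POINTS = {"milestone": 10, "notable": 5, "minor": 2, "regression": -3}


@dataclass(frozen=True)
class Entry:
    # Hypothetical fields, for illustration only.
    id: str
    tier: Optional[str]
    points: int
    primary_source: Optional[str]
    verified: bool


def log_unverified(entry_id):
    """Log an item that has no primary link yet: unverified, 0 points."""
    return Entry(entry_id, None, 0, None, False)


def promote(entry, tier, primary_source):
    """Return a promoted copy of the entry once a primary source is found."""
    if not primary_source:
        raise ValueError("promotion requires a primary source link")
    return replace(
        entry,
        tier=tier,
        points=TIER_POINTS[tier],
        primary_source=primary_source,
        verified=True,
    )


item = log_unverified("evt-010")
item = promote(item, "notable", "https://example.com/release-notes")
print(item.points, item.verified)  # 5 True
```

Refusing to promote without a link keeps the rule mechanical: points can only appear on an entry together with its primary source.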

The newsletter includes a Sources by segment list at the end of every issue. Claims without source support should not appear.

What we deliberately skip

Funding rounds, hiring news, executive predictions, hype cycles, vague benchmark claims with no public access, "AGI is near/far" discourse, and general roadmap promises. None of that changes what can be used, deployed, or relied on today.

Corrections

The ledger is append-only in spirit: if an entry was wrong, a correction entry is added that references the original id and reverses its points, with a note. The git history of data/ledger.json is the permanent audit trail.
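
A correction entry, as described above, might look like the following sketch. The field names (`id`, `corrects`, `points`, `note`) and the helper `correction_for` are assumptions; the actual structure of data/ledger.json is not documented here.

```python
def correction_for(original, correction_id, note):
    """Build a correction entry that reverses the original entry's points."""
    return {
        "id": correction_id,
        "corrects": original["id"],     # reference to the original entry
        "points": -original["points"],  # reverses the original score
        "note": note,
    }


# Hypothetical ledger with one wrongly scored entry.
ledger = [{"id": "evt-001", "points": 5}]
ledger.append(correction_for(ledger[0], "evt-002", "Feature was waitlist-only."))

print(sum(entry["points"] for entry in ledger))  # 0
```

Because the original entry stays in place and the correction is appended, the net effect on the index is zero while both the mistake and its reversal remain visible.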

The human gate

A model drafts; a human ships. The automation gathers, classifies against this rubric, drafts the issue, and schedules the Kit broadcast. The operator keeps a veto window before send time and can delete the scheduled broadcast in Kit if anything looks wrong.