How the audit works

The honest read on what we see, what we miss, and how it's been calibrated.

The BEC-12 audit is generated by a language model reading your page's public content and scoring it against twelve behavioral economics principles. This page describes that process — including the parts that don't always work — so you can read the result with the right amount of skepticism.

What we read

The page is read as text first.

When you submit a URL, we fetch the page through a markdown extractor (Jina's reader, with a Firecrawl fallback for pages that block bots). The extractor produces a plain-text version of the page — headings, paragraphs, link labels, list items, alt-text where present — and the audit scores against that text.
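
The primary-then-fallback fetch described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the two extractors are passed in as plain callables, so no real Jina or Firecrawl API is assumed, and the function name and the rule "fall back on any error or empty result" are hypothetical.

```python
from typing import Callable

# An extractor takes a URL and returns the page as plain text or markdown.
Extractor = Callable[[str], str]


def extract_page_text(url: str, primary: Extractor, fallback: Extractor) -> tuple[str, str]:
    """Return (text, source), trying the primary extractor first.

    Assumption for illustration: the fallback is used when the primary
    raises or returns only whitespace (for example, a bot-blocked page).
    """
    try:
        text = primary(url)
        if text.strip():
            return text, "primary"
    except Exception:
        # Any primary failure is treated as "blocked" in this sketch.
        pass
    return fallback(url), "fallback"
```

Returning the source alongside the text lets later steps record which extractor produced the read.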

For pages that look visually heavy (lots of images, thin text, known widget vendors, or an obvious case like a section heading followed only by a picture), we also capture a viewport screenshot via Firecrawl. The screenshot is passed alongside the text for the final pass so the model can see what a visitor sees, especially trust badges, testimonial photos, and pricing tables that are sometimes rendered as a single image.
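
A rough sketch of what a "visually heavy" check could look like, assuming the extract is markdown with standard image syntax. The thresholds, the function name, and the vendor list (taken from the examples named on this page) are illustrative assumptions, not the actual signals or values.

```python
import re

# Hypothetical values for illustration only.
KNOWN_WIDGET_VENDORS = ("trustpilot", "calendly")
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")  # markdown image: ![alt](src)


def looks_visually_heavy(markdown: str, raw_html: str = "",
                         min_words: int = 150,
                         max_images_per_100_words: float = 2.0) -> bool:
    """Guess whether a page needs a screenshot pass in addition to its text."""
    images = IMAGE_REF.findall(markdown)
    text_only = IMAGE_REF.sub(" ", markdown)
    words = len(re.findall(r"\w+", text_only))
    if words < min_words and images:
        return True  # thin text with at least one image
    if words and len(images) / words * 100 > max_images_per_100_words:
        return True  # many images relative to the amount of copy
    html = raw_html.lower()
    # Known widget vendors referenced in the page source.
    return any(vendor in html for vendor in KNOWN_WIDGET_VENDORS)
```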

What we can't see

The limits we know about.

Most miscalls trace to gaps between the rendered page and the extracted text. We've calibrated for these, but not all of them are fully solved, and the audit will tell you when it fell back to a text-only read.

Content rendered as images. Pricing tables, comparison grids, or feature matrices that are baked into a single PNG/JPG. When we detect this pattern, the audit either reads it via the screenshot pass or says “rendered as an image — we couldn't interpret its contents” instead of guessing.
JavaScript widgets that don't server-render. Trustpilot embeds, Calendly schedulers, third-party review blocks. We detect known vendors and route through the screenshot pass when their content is missing from the text extract.
Carousel duplication. Testimonial sliders sometimes render adjacent slides in the DOM simultaneously, which makes the same testimonial appear twice in the extract. We dedupe near-duplicate paragraphs and tell the model to be cautious about specific name-to-quote pairings since carousel DOM ordering can scramble attribution.
Interactive states. Hover effects, after-click modals, accordion expansions, paywalls, A/B-tested variants, and personalization the page applies to logged-in users. What we read is the page's default first-paint state to an anonymous visitor.
Per-principle uncertainty. When the audit falls back to text-only on a visually heavy page, the principles most sensitive to that gap (trust badges, social proof photos, founder imagery, layout density, try-before-you-buy configurators) carry a per-card “text-only read” note. Treat those scores as the model's best read of the copy, not a full visual review.
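
The carousel dedupe mentioned in the list above can be sketched as a near-duplicate filter over extracted paragraphs. This is one possible approach using Python's standard `difflib`. The similarity threshold of 0.9 and the function name are assumptions for illustration, not the values or code used in the audit.

```python
from difflib import SequenceMatcher


def dedupe_paragraphs(paragraphs: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each paragraph; drop later near-duplicates."""
    kept: list[str] = []
    normalized_kept: list[str] = []
    for paragraph in paragraphs:
        # Compare on lowercased text with collapsed whitespace.
        normalized = " ".join(paragraph.lower().split())
        if not normalized:
            continue  # skip blank paragraphs
        is_duplicate = any(
            SequenceMatcher(None, normalized, seen).ratio() >= threshold
            for seen in normalized_kept
        )
        if not is_duplicate:
            kept.append(paragraph)
            normalized_kept.append(normalized)
    return kept
```

Keeping the first occurrence preserves page order. It does not recover which name belongs to which quote, which is why the model is still told to be cautious about attribution.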

How we score

Twelve principles, calm ladder, no traffic-light alarms.

Each principle is scored 0–10 against the framework's rubric, with intent-recognition for cases where the principle's absence is a deliberate brand choice (a premium institutional brand legitimately avoids pressure tactics; the audit recognizes that and excludes rather than penalizes).

Status labels are intentionally calm: Standout (9–10) · Strong (7–8) · On track (5–6) · Watch (3–4) · Needs work (0–2). No “Critical.” The lowest rung on the ladder is “Needs work.” The intent is to read like a client report, not a security scanner.

The verdict word at the top and the per-card status pills key off the same ladder, so the visual rhythm matches whether you're skimming or reading deep.
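
As a sketch, the ladder reduces to one lookup that the verdict and the per-card pills could share. The function name is hypothetical, and integer scores are an assumption based on the bands listed above.

```python
def status_label(score: int) -> str:
    """Map a 0-10 principle score to its status label on the ladder."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score >= 9:
        return "Standout"
    if score >= 7:
        return "Strong"
    if score >= 5:
        return "On track"
    if score >= 3:
        return "Watch"
    return "Needs work"
```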

What's been calibrated

The cases that broke us, and how we fixed them.

The model writes faithfully from what it's given: most confidently wrong claims trace to bad input, not hallucination. A few concrete cases that led to shipped fixes:

“The pricing section is visibly empty.” A long-form sales page rendered its entire pricing comparison as one PNG. The extractor preserved the image reference but no text. We now detect any conversion-critical section heading (cost / price / compare / plans / features) followed only by image references and fire the screenshot pass, plus a prompt rule that bans the “section is empty” claim and substitutes “rendered as an image — we couldn't interpret its contents.”
“The same testimonial appears twice in the carousel.” A testimonial slider DOM-rendered adjacent slides simultaneously for transition effects, so the same quote appeared twice in the linear text extract. The audit now dedupes near-duplicate paragraphs at the extraction layer and the model is told to suspect “duplicate content” claims as extraction artifacts unless the page structurally exposes the duplication.
“Mostly positive” with a 4-sentence enumeration. The summary used to list every problem the page had instead of picking the dominant one. We capped the summary at two sentences with explicit GOOD/BAD examples in the prompt, and split the “if you only do one thing” field to enforce a singular action rather than a chained list.
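
The first fix above, detecting a conversion-critical heading followed only by image references, might look roughly like this for a markdown extract. The keyword list comes from the description above. The function name, the regular expressions, and the simple substring matching are assumptions for illustration.

```python
import re

CRITICAL_KEYWORDS = ("cost", "price", "compare", "plans", "features")
HEADING = re.compile(r"^#{1,6}\s+(.*)$")
# A line made up of one or more markdown image references and nothing else.
IMAGE_ONLY = re.compile(r"^(?:!\[[^\]]*\]\([^)]*\)\s*)+$")


def image_only_sections(markdown: str) -> list[str]:
    """Return conversion-critical headings whose body is only image references."""
    flagged: list[str] = []
    heading = None          # text of the section currently being read
    body: list[str] = []    # non-blank lines under that heading

    def flush() -> None:
        # Sections with no body at all are not flagged in this sketch.
        if heading is None or not body:
            return
        critical = any(word in heading.lower() for word in CRITICAL_KEYWORDS)
        if critical and all(IMAGE_ONLY.match(line) for line in body):
            flagged.append(heading)

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            heading, body = match.group(1).strip(), []
        elif line.strip():
            body.append(line.strip())
    flush()  # close the final section
    return flagged
```

Any heading this returns would be a candidate for the screenshot pass rather than a "section is empty" claim.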

The pattern: when the audit speaks confidently about something you can see on the page that it claims isn't there, send the URL — those are diagnosable failures, not opinion disagreements, and they tend to be the same shape across other pages.

Source

It's an LLM. Read it like a competent colleague's first draft.

The audit is generated by Claude Sonnet (Anthropic's mid-tier model) using a structured prompt that requires the model to call back to specific page content for any high score and to give the page the benefit of the doubt when its absence of a technique looks like a deliberate brand choice. Each principle card carries the primary academic citation for the underlying mechanism — that's where the source weight comes from.

For low-stakes calls (which CTA copy to try, where to add a trust band, whether the choice paradox is biting), the audit is well-calibrated. For high-stakes decisions (six-figure pricing changes, full-page rewrites, removing core guarantees), treat the audit as a competent first read and confirm with your own data. Your real conversion numbers are the source of truth; this is a structured second opinion.
