
Your beta users are lying to you — and your synthetic ones won't
"Get it in front of users as fast as possible."
Correct advice. Almost universally misapplied.
Most technical founders hear "get it in front of users" and reach for the same list: their waitlist, their Twitter followers, the three founders from the last cohort who owe them a favor. These people opted in before they saw the product. They're going to be polite about the parts that are broken.
What you actually need is signal from the people who tried your product and left, who evaluated it and said no, and who would warn a friend against it. They know more about your failure modes than your most engaged customers ever will.
The problem: they won't take your Calendly link. They've already moved on.
The three patterns that compound each other
Confirmation bias by selection. You recruit from your waitlist. Your waitlist signed up because they believed the pitch. You interpret enthusiasm as validation. You're asking people who already agree with you to agree again.
Optimizing for the early adopter who isn't your customer. Early adopters tolerate broken onboarding and missing docs. A reliability engineer at a 20-person company and a developer who "just wants to tinker" have different trust thresholds. Build for the tinkerer who stayed and you churn the engineer who should have converted.
The silent abandonment problem. NPS is calculated from people who use your product. It says nothing about people who evaluated it last month and never activated — or activated, hit one silent failure, and quietly canceled. Those are your most important data points. You have no mechanism to reach them.
The sampling frame is the insight engine
This is not a survey. Not a chatbot asking leading questions. Not "ask an LLM to pretend to be your user."
The mechanism: declare a falsifiable hypothesis, define a sampling frame that forces inclusion of people who would never talk to you, run parallel synthetic interviews, score each hypothesis against what came back.
The hypothesis has to be falsifiable. Not "who are my users" — that's a topic. Not "what do users think of our onboarding" — that's a survey question. A falsifiable hypothesis takes a position:
"Do early adopters of multi-agent workflow tools abandon after initial setup because agent coordination failures erode trust faster than successful automations build it?"
The sampling frame mandates minimum representation from every adoption posture: adopters, partial adopters, abandoners, evaluators who rejected, people who've never tried it, people who actively warn others against tools like yours. The last three categories are the ones real user research structurally excludes. They're also the ones who know exactly what's broken.
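In code, the frame amounts to a declared hypothesis plus a minimum quota per adoption posture. Here is a minimal sketch of that structure in Python; the field names and quotas are illustrative assumptions, not any specific tool's API:

```python
from dataclasses import dataclass

# Illustrative structures only, not a specific tool's API.

@dataclass
class Segment:
    posture: str       # adoption posture this segment represents
    min_subjects: int  # minimum representation the frame forces

HYPOTHESIS = (
    "Early adopters of multi-agent workflow tools abandon after initial setup "
    "because agent coordination failures erode trust faster than successful "
    "automations build it."
)

# The frame mandates subjects from every posture, including the three
# that opt-in recruiting structurally excludes.
SAMPLING_FRAME = [
    Segment("adopter", 2),
    Segment("partial_adopter", 2),
    Segment("abandoner", 2),
    Segment("evaluated_and_rejected", 2),
    Segment("never_tried", 2),
    Segment("warns_others_against", 2),
]
```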
Running this against a multi-agent tool hypothesis produced four behavioral archetypes:
- Audit-Burdened Reliability Engineers — treat silent partial correctness as strictly worse than a crash
- Solo Operators Retreating to Single-LLM Scripts — tried multi-agent orchestration, concluded it multiplies hallucination surface without leverage, went back to single calls
- Pre-Refusal Pattern-Matchers — recognize the silent-failure pattern from prior exposure, won't trial without a correctness story
- Hedged Partial Adopters — quarantine the tool to low-risk workflows, keep paying, never expand
These aren't demographics. They're behavioral archetypes defined by relationship to failure.
The decision report output: "Abandonment is almost always triggered by plausible-but-wrong output cosigned by a downstream agent, not by crashes."
Hypothesis verdicts, archetype profiles, prioritized interview questions — 30 minutes, under $2.
What Playwright can't tell you
In 2025, backend functionality ships at near-zero cost. The bottleneck isn't building — it's knowing whether what you built serves the person you built it for.
Playwright doesn't have goals. It doesn't feel friction. It passes tests on pages that would cause a real user to close the tab.
Behavioral persona evaluation uses the same browser automation substrate — but instead of testing "does the button click," it tests: "Does Priya — a reliability engineer who won't trust a green checkmark over a probabilistic output — achieve her goal in this session?"
I converted the four behavioral archetypes into canonical eval personas — each with mindset, constraint profile, failure mode, and AI adoption history — then ran them as browser agents against the live product. They navigated the onboarding flow, selected goals, described their first tasks, tried to reach the core product.
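For a sense of what one persona-driven session could look like, here is a rough sketch using Playwright's Python API. The persona values, the URL, and the stubbed-out scoring step are illustrative assumptions, not the actual harness:

```python
from dataclasses import dataclass
from playwright.sync_api import sync_playwright

@dataclass
class EvalPersona:
    name: str
    mindset: str           # how they interpret what they see
    constraint: str        # what blocks trust or adoption
    failure_mode: str      # what makes them walk away
    adoption_history: str  # prior experience with AI tooling

priya = EvalPersona(
    name="Priya",
    mindset="Reliability engineer; treats silent partial correctness as worse than a crash",
    constraint="Needs compliance documentation before any procurement conversation",
    failure_mode="Abandons on the first green checkmark she can't verify",
    adoption_history="Adopts LLM tooling only where outputs are independently checkable",
)

def run_persona_session(persona: EvalPersona, start_url: str) -> str:
    """Open the flow as this persona and return what they saw.
    A real harness would let an agent choose actions and then score the
    session against the persona's goal; that step is omitted here."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(start_url)
        visible_text = page.inner_text("body")  # what the persona is reacting to
        browser.close()
    return visible_text

# transcript = run_persona_session(priya, "https://example.com/onboarding/goals")
```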
What they found was not what the research had predicted.
What the agents actually found
The research predicted: trust erosion at agent coordination failures — the moment a multi-agent pipeline produces a plausible-but-wrong output.
The eval found: trust erosion happening at a different inflection point entirely.
Priya navigated 11 pages deep: through /onboarding/goals, /onboarding/provider, /onboarding/first-task, into the authenticated product — Projects, Your Team, Workflows, Library. The navigation structure was intuitive. What she found inside:
- No SOC2, HIPAA, or BAA documentation visible anywhere in the UI — a hard blocker for regulated-enterprise procurement
- Your Team page rendering a loading spinner with no agents to inspect
- Workflow creation returning a 500 error — no Anthropic key configured, nothing executes
Maya (solo operator) got into /workflows/new?flow=brief. Her assessment: "I could see how someone would input a task like 'Triage incoming support tickets' and have the system generate a plan. That's the core value prop and it's conceptually sound."
She never got to use it.
The kill shot: after completing onboarding and clicking "Create a Plan," the app redirects to /login and discards all progress.
This happens after maximum user investment — 5 onboarding steps, a described workflow, a primary CTA click. The session dies with nothing saved. Two of four personas hit it independently:
"I just spent 10 minutes filling out my workflow and it dumped me back to login? If it loses my work I'm gone."
"I clicked 'Create a Plan' after filling in my first task and got sent back to login. That's not a feature gap — that's a broken flow."
A third finding, from Priya on the homepage:
"Wait — this says 'scan subscriptions for charges'? I came here for research workflow automation. Am I on the wrong site?"
The homepage headline described a subscription scanner. Every new user evaluating a multi-agent automation platform read it and questioned whether they'd reached the right product.
Playwright's verdict on all three: ✅ page renders, ✅ button exists, ✅ form submits.
Behavioral eval scores: overall 3–4/10. Trust: 1–2/10. Efficiency: 1–2/10. Would pay: 0/4. High retention risk: 4/4.
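For contrast, here is roughly what a conventional end-to-end check over the same flow looks like. The URL and button label are placeholders; the point is what the assertions never ask:

```python
# A test in this shape goes green on the exact flow the personas flagged as broken.
from playwright.sync_api import Page, expect

def test_create_plan_flow(page: Page):
    page.goto("https://example.com/onboarding/first-task")  # hypothetical URL
    expect(page.get_by_role("heading").first).to_be_visible()      # "page renders"
    button = page.get_by_role("button", name="Create a Plan")
    expect(button).to_be_visible()                                  # "button exists"
    button.click()                                                  # "form submits"
    # Nothing here asserts the user lands in their workspace with work saved,
    # so a redirect to /login that discards everything still passes.
```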
Top three backlog outputs:
- Fix session loss on "Create a Plan" — users who complete onboarding must land in their workspace, not at /login
- Homepage headline — the copy describes a subscription scanner; every persona who saw it questioned whether they'd reached the right product
- Pre-populated sandbox project — every persona independently asked for proof-of-output before committing credentials or OAuth access
The research → eval loop
1. State a falsifiable hypothesis (15 min)
2. Run synthetic research: 12 subjects, 4 segments (~30 min, ~$2)
3. Get hypothesis verdicts + behavioral archetypes
4. Bridge archetypes to eval personas (automated)
5. Run behavioral eval against your running app
6. Get prioritized backlog from 4 behavioral angles
7. Recruit real users — targeted questions, not open-ended discovery
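Wired together, the loop reads roughly like the stub below. Every function is a hypothetical placeholder standing in for a step, not a real API:

```python
from dataclasses import dataclass

# Hypothetical placeholders for each step of the loop, not a real API.

@dataclass
class Findings:
    verdicts: dict             # hypothesis -> "supported" / "contradicted"
    archetypes: list           # behavioral archetypes from the synthetic run
    interview_questions: list  # targeted questions that feed step 7

def run_synthetic_research(hypothesis: str, subjects: int = 12) -> Findings:
    # Steps 1-3: parallel synthetic interviews, scored against the hypothesis.
    return Findings(
        verdicts={hypothesis: "supported"},
        archetypes=["audit_burdened_engineer", "solo_operator"],
        interview_questions=["What did you try before? Why didn't it stick?"],
    )

def bridge_to_eval_personas(archetypes: list) -> list:
    # Step 4: behavioral archetype -> canonical eval persona.
    return [f"persona:{a}" for a in archetypes]

def run_behavioral_eval(personas: list, app_url: str) -> list:
    # Steps 5-6: persona-driven browser sessions -> prioritized backlog.
    return [f"{p} hit session loss on 'Create a Plan'" for p in personas]

findings = run_synthetic_research("Coordination failures erode trust faster than wins build it.")
backlog = run_behavioral_eval(bridge_to_eval_personas(findings.archetypes), "https://example.com")
# findings.interview_questions then drives step 7 with real users.
```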
Step 7 is where the shift happens. Instead of scheduling open-ended discovery calls with whoever takes the Calendly link, you recruit against specific archetypes (find someone who matches the Solo Operator pattern) and ask specific questions the research generated:
- "Walk me through the last time a multi-agent tool produced a plausible-but-wrong output."
- "What did you try before? Why didn't it stick?"
- "If this worked perfectly, what would you be able to do that you can't today?"
If 2 of 3 real users contradict a synthetic finding, the finding is wrong. If 2 of 3 confirm it, you have a validated assumption — in 48 hours instead of 6 weeks.
The constraints
A language model simulating user behavior doesn't know your specific market, your NPS distribution, or your support tickets. It generates confident-sounding outputs about uncertain things. Every session produces a SYNTHETIC-DATA-NOTICE.md: "Treat every finding as a hypothesis to test, not a conclusion to act on."
The right use: synthetic research maps the hypothesis space fast enough to act before committing three months of engineering to the wrong direction. Real users prove whether the map is accurate.
On timing
Real user testing is expensive in calendar time, structurally biased toward people who opted in, and opaque about failure modes — churned users don't explain why they left.
Synthetic focus groups are cheap, fast, and structurally include the users who would never accept an interview request.
Use synthetic focus groups to decide what to validate. Use real users to validate it.
The most important user research is with people who don't want to talk to you. Synthetic simulation is how you hear from them first — before the product is public, before the pitch deck is final, before anyone knows you're building it.