What shipping a WebAuthn flow actually looks like

RETROSPECTIVE. This post describes historical decisions and past state. Current behavior may differ.

Phase 1 piece #4 and Phase 1b-C were both marked complete based on unit tests. The day I tried to actually drive a real browser ceremony against the live cluster, five bugs cascaded out across the CLI, the server, and the embedded JavaScript.

This is a record of what they were, where they lived, and why every single one of them had passing tests.

Bug 1: bearer token never attached

The four WebAuthn HTTP helpers in internal/kpm/webauthn.go checked client.token directly without calling ensureAuth(ctx) first. client.token is the in-memory field. ensureAuth is what reads the persisted session from ~/.kpm/sessions/current.json and loads it into memory if the field is empty.

Every authenticated request died with a 401 before the browser opened.

The unit tests passed because they hand-set the token field directly on a freshly constructed client. No test exercised the “freshly built binary + session already persisted on disk from a previous login” path — which is the only path a real user ever hits. The test was testing the test setup, not the code.

Fix: add ensureAuth(ctx) at the top of each WebAuthn helper, the same way every other authenticated command does it.
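As a rough illustration of the fix and of the path the unit tests never exercised, here is a minimal sketch. The Client type, the session file shape, beginRegistration, and freshBinaryHeader are all hypothetical stand-ins, not kpm's actual code; only the ensureAuth name and the idea of loading the persisted token come from the description above.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// Client is a hypothetical stand-in for kpm's client type.
type Client struct {
	token       string // in-memory bearer token
	sessionPath string // persisted session file (assumed to be a JSON document)
}

// session is an assumed on-disk shape; the real file may differ.
type session struct {
	Token string `json:"token"`
}

// ensureAuth loads the persisted token into memory when the field is empty.
func (c *Client) ensureAuth(ctx context.Context) error {
	if c.token != "" {
		return nil
	}
	if err := ctx.Err(); err != nil {
		return err
	}
	raw, err := os.ReadFile(c.sessionPath)
	if err != nil {
		return fmt.Errorf("read session: %w", err)
	}
	var s session
	if err := json.Unmarshal(raw, &s); err != nil {
		return fmt.Errorf("decode session: %w", err)
	}
	if s.Token == "" {
		return errors.New("session file has no token")
	}
	c.token = s.Token
	return nil
}

// beginRegistration is a hypothetical WebAuthn helper. It returns the
// Authorization header value it would attach to the outgoing request.
func (c *Client) beginRegistration(ctx context.Context) (string, error) {
	if err := c.ensureAuth(ctx); err != nil { // the call that was missing
		return "", err
	}
	return "Bearer " + c.token, nil
}

// freshBinaryHeader reproduces the untested path: a session already on disk
// and a freshly constructed client whose token field is still empty.
func freshBinaryHeader(token string) (string, error) {
	dir, err := os.MkdirTemp("", "kpm-session")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "current.json")
	raw, err := json.Marshal(session{Token: token})
	if err != nil {
		return "", err
	}
	if err := os.WriteFile(path, raw, 0o600); err != nil {
		return "", err
	}

	c := &Client{sessionPath: path} // token deliberately left unset
	return c.beginRegistration(context.Background())
}

func main() {
	header, err := freshBinaryHeader("example-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(header)
}
```

A test shaped like freshBinaryHeader, which never touches the token field directly, is the kind that would have failed before the fix.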

Bug 2: browser URL used 127.0.0.1

WebAuthn RPID validation requires that the configured RPID be equal to the page’s effective domain or a registrable domain suffix of it. The RPID was localhost. The browser URL was http://127.0.0.1:<port>/....

The host 127.0.0.1 satisfies neither condition for an RPID of localhost. The WebAuthn spec is unambiguous about this: the ceremony cannot complete from an origin whose host violates RPID matching. I caught this by re-reading the spec while debugging Bug 1, not by observing the failure directly. Because the bearer fix landed in the same patch as the localhost URL rewrite, I never saw the failure mode this bug would have produced. The spec violation was real and the fix was cheap, so I shipped it on principle.

The listener still binds to 127.0.0.1 (binding to 0.0.0.0 would widen the attack surface for a local-only flow). But the URL handed to the browser needs the host rewritten to localhost. The fix is a small helper that swaps the host component before the URL reaches the browser opener.
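A host-swapping helper along these lines might look like the sketch below. The function name browserURL and the example path are assumptions for illustration; the real helper in kpm may be shaped differently. It only rewrites loopback IP hosts and leaves everything else untouched.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// browserURL rewrites a loopback-IP host (for example 127.0.0.1) to
// "localhost" while keeping the scheme, port, path, and query intact.
// URLs whose host is not a loopback IP are returned unchanged.
func browserURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	ip := net.ParseIP(u.Hostname())
	if ip == nil || !ip.IsLoopback() {
		return u.String(), nil
	}
	if port := u.Port(); port != "" {
		u.Host = net.JoinHostPort("localhost", port)
	} else {
		u.Host = "localhost"
	}
	return u.String(), nil
}

func main() {
	// The port and path here are made up for the example.
	out, err := browserURL("http://127.0.0.1:49152/webauthn/register")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

The listener address is never touched; only the string handed to the browser opener changes.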

Bug 3: the embedded JS assumed a flat options object

The server returns a JSON object shaped like {"publicKey": {...}}. The embedded HTML/JavaScript was written as if the options object were flat — accessing opts.challenge, opts.rp, opts.user directly.

opts.challenge was undefined. Calling .replace() on undefined threw a TypeError before the biometric prompt ever appeared. The browser console had the error; the terminal running kpm webauthn register showed nothing useful.

Two-line fix in the embedded JavaScript: detect the envelope and unwrap it before any field access — if (opts && opts.publicKey) opts = opts.publicKey;. The rest of the script works unchanged. The embedded template was written by someone (me) who had read the go-webauthn docs but not checked the actual wire format the library emits.

Bug 4: duplicate TokenService inside one server

The rebase against origin/main accidentally re-introduced a duplicate auth.NewTokenService call in the server startup path. Each call generates its own random HMAC key internally — two TokenService instances meant two different keys.

The auth handler that mints tokens at /auth/session was wired to one instance. The middleware that validates tokens on every authenticated request was wired to the other. Tokens minted at login could not be verified anywhere else. Every authenticated request returned 401 with “invalid or expired session token” — moments after a successful login that had been served by the same pod.

I had assumed the bug was multi-pod state. Three replicas, each with its own random key, client requests landing on different pods. So I scaled the deployment down to one replica to test that hypothesis. The bug persisted at one replica. That is how I found the duplicate instantiation — the within-pod state was the inconsistent part, not the cross-pod state.

The rebase diff was forty lines across three files. The duplicate NewTokenService was four lines and I did not notice it during the conflict resolution. The fix: delete the second call. One instance, constructed once, passed as a dependency to both consumers.
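The failure is easy to reproduce in isolation. The sketch below uses a toy TokenService that generates a random HMAC key per instance; it is an assumption-laden stand-in for auth.NewTokenService, whose real signature and token format are not shown here. Minting with one instance and verifying with another fails, while sharing one instance succeeds.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// TokenService is a toy stand-in: each instance owns its own random key.
type TokenService struct {
	key []byte
}

// NewTokenService generates a fresh random HMAC key on every call.
func NewTokenService() (*TokenService, error) {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return nil, err
	}
	return &TokenService{key: key}, nil
}

// sign returns the hex-encoded HMAC-SHA256 of subject under this instance's key.
func (s *TokenService) sign(subject string) string {
	mac := hmac.New(sha256.New, s.key)
	mac.Write([]byte(subject))
	return hex.EncodeToString(mac.Sum(nil))
}

// Mint returns "subject.signature" (a made-up format for this example).
func (s *TokenService) Mint(subject string) string {
	return subject + "." + s.sign(subject)
}

// Verify recomputes the signature with this instance's key and compares.
func (s *TokenService) Verify(token string) bool {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return false
	}
	subject, sig := token[:i], token[i+1:]
	return hmac.Equal([]byte(sig), []byte(s.sign(subject)))
}

// roundTrip mints with one service and verifies with another, mirroring
// the login handler and the auth middleware.
func roundTrip(minter, verifier *TokenService) bool {
	return verifier.Verify(minter.Mint("alice"))
}

// demo reports whether verification succeeds with a shared instance and
// with two separate instances.
func demo() (bool, bool, error) {
	a, err := NewTokenService()
	if err != nil {
		return false, false, err
	}
	b, err := NewTokenService()
	if err != nil {
		return false, false, err
	}
	return roundTrip(a, a), roundTrip(a, b), nil
}

func main() {
	shared, split, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("shared instance verifies:", shared)
	fmt.Println("separate instances verify:", split)
}
```

Constructing the service once and passing the same pointer to both the handler and the middleware is what removes the second key.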

Bug 5: /register/finish got the wrong envelope

The server’s registration completion handler was passing kpm’s entire request envelope to protocol.ParseCredentialCreationResponseBytes. go-webauthn’s parse function expects only the credential JSON that the browser produced from its PublicKeyCredential, which kpm carries in the envelope’s response field.

Every browser-completed ceremony died at the finish endpoint with “Parse error for Registration” — a go-webauthn error that surfaces the JSON parse failure but not which field was wrong.

Fix: extract req.Response from the envelope before calling the parse function. The handler needed one additional struct field and one dereference to get the shape right.
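The unwrapping step can be sketched with json.RawMessage, which keeps the inner payload as raw bytes so it can be handed on untouched. The envelope shape here is an assumption: only a Response field is mentioned above, and the session_id field, the JSON tag names, and the credentialBytes helper are hypothetical. In the real handler the returned bytes would go to protocol.ParseCredentialCreationResponseBytes.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// finishRequest is an assumed shape for kpm's request envelope.
type finishRequest struct {
	SessionID string          `json:"session_id"` // hypothetical field
	Response  json.RawMessage `json:"response"`   // browser credential, kept raw
}

// credentialBytes decodes the envelope and returns only the inner payload,
// which is what the downstream parser should receive.
func credentialBytes(body []byte) ([]byte, error) {
	var req finishRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	if len(req.Response) == 0 || string(req.Response) == "null" {
		return nil, errors.New("envelope has no response field")
	}
	return req.Response, nil
}

func main() {
	// A trimmed-down example body; a real credential has more fields.
	body := []byte(`{"session_id":"s1","response":{"id":"abc","type":"public-key"}}`)
	inner, err := credentialBytes(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(inner))
}
```

Rejecting an envelope with a missing response field up front also gives a clearer error than the generic parse failure described above.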

What the pattern is

Every one of these bugs sat under passing unit-test coverage. None of them was caught by those tests.

The bugs lived in the interfaces between layers:

  • The seam between the persisted-session disk file and the in-memory Client struct.
  • The seam between Go’s net/url rewriting and the browser’s WebAuthn domain validator.
  • The seam between go-webauthn’s wire-format JSON and the embedded JavaScript’s expectations.
  • The seam between two independent initializations of the same constructor inside one process.
  • The seam between kpm’s request envelope and go-webauthn’s expected input shape.

Unit tests within each layer were clean. The bugs only existed at the points where two layers touched. Integration testing — actually building the binary, running the server, opening the browser, completing the ceremony — is where they appeared.

The operational conclusion is unglamorous: mark a feature complete against a live integration test, not against a unit-test green light. A test that constructs a client and sets the token field directly is not testing what a user experiences when they run the binary. It is testing what happens when the client is set up the way the test author expected it to be set up. Those are different things, and they produce the same green checkmark.

All five bugs are fixed. The flow works end-to-end. Part 9 covers what that looks like from the outside (the airport-recovery scenario). The v0.5 unification later added admin tooling (admin inviteuser + enroll --invite) as an ergonomic layer for the exact “invite a user or enroll a new machine tied to my userspace” use cases, while preserving the secure WebAuthn + bootstrap foundation described here.