Integrating OpenAI without making your product feel bolted-on

Most AI features fail not because the model is wrong, but because the UI is. They sit in a sidebar nobody opens, behind a button that screams demo. The features that earn their pixels start with the user’s task and walk backwards to the model — never the other way around.

Design from the user task backwards

When a team adds OpenAI to a product, the conversation usually starts with what the model can do: “It can summarize,” “it can rewrite,” “it can answer questions.” That’s the wrong end. The right end is the slow thing the user is already trying to do, and the step a model could remove from it. If your customers spend forty seconds drafting a weekly status update, the AI feature is “draft my weekly status,” not “summarize.” The verb belongs to them, not the model.

Concretely: every AI feature I’ve shipped that stuck has had three properties. It has a name that’s a user’s verb. It lives in the flow where the user is already doing that task. And it produces output the user can edit in place. None of them needed a separate “AI assistant” panel. The good ones make the AI invisible — what stays visible is the user’s task, slightly faster.

Latency budgets and skeleton states

GPT-4-class calls regularly take 3–8 seconds. That’s a lifetime in a UI built around 100ms responses. You have three real options:

Stream: best for long-form generation, where the user reads the text as it appears.
Skeleton: best for short outputs that fill an existing slot. The field shows a loading state, then the content appears.
Background: best when the output isn’t urgent. The user moves on, and a notification arrives when the result is ready.
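
To make the streaming option concrete, here is a minimal sketch in Python. It assumes the model client hands you an iterable of text chunks; `stream_into` and `paint` are hypothetical names for illustration, and `paint` stands in for whatever updates the field the user is looking at.

```python
from typing import Callable, Iterable


def stream_into(chunks: Iterable[str], paint: Callable[[str], None]) -> str:
    """Repaint the target field after every chunk so the user can start reading.

    `chunks` is assumed to be whatever your model client yields while
    streaming; `paint` is a hypothetical callback that redraws the field
    with the text received so far.
    """
    text = ""
    for chunk in chunks:
        text += chunk
        paint(text)  # the user sees partial output immediately
    return text  # full text, ready for the user to edit in place
```

The point of the design is that the first paint happens after the first chunk, not after the last one, so perceived latency is the time to the first chunk.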

Whichever you pick, instrument the latency from the user’s perspective, not the model’s. The “model returned in 4.2s” metric is useless if your code does another second of post-processing before painting. The number that matters is “time from the user’s click to a usable result on screen.”
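
A rough sketch of that measurement, in Python for brevity. All of the names here (`timed_request`, `call_model`, `post_process`, `render`) are hypothetical placeholders for your own code; the only idea being shown is that the clock starts at the click and stops after the result is on screen.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class LatencyReport:
    model_seconds: float  # click until the model call returned
    total_seconds: float  # click until a usable result was rendered


def timed_request(
    prompt: str,
    call_model: Callable[[str], str],    # hypothetical: your model call
    post_process: Callable[[str], str],  # hypothetical: cleanup, formatting
    render: Callable[[str], None],       # hypothetical: paints the result
    clock: Callable[[], float] = time.perf_counter,
) -> LatencyReport:
    clicked_at = clock()
    raw = call_model(prompt)
    model_done_at = clock()
    render(post_process(raw))
    painted_at = clock()  # stop only after the user can see the result
    return LatencyReport(
        model_seconds=model_done_at - clicked_at,
        total_seconds=painted_at - clicked_at,
    )
```

Logging both numbers side by side shows how much of the wait is yours rather than the model’s; `total_seconds` is the one to put on the dashboard.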

Eval, don’t trust

The reason most AI features regress silently is that nobody set up evaluations. The output looks fine on the demo input, ships, then a customer pastes input the team never tried, and the response is wrong in a way that erodes trust permanently. The cheap fix: a spreadsheet of 30 representative inputs, a column of expected outputs, and a script that runs them on every prompt change. In my experience it catches roughly 80% of regressions without any ML infrastructure.
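
The script can be very small. The sketch below assumes the spreadsheet is exported as CSV with `input` and `expected` columns (my column names, not a standard), and that `generate` is a hypothetical function wrapping your prompt and model call. The default check, a case-insensitive “expected text appears in the output,” is also an assumption; swap in whatever comparison fits your feature.

```python
import csv
from typing import Callable, Iterable


def load_cases(lines: Iterable[str]) -> list[dict[str, str]]:
    """Parse CSV lines with `input` and `expected` columns (assumed names)."""
    return list(csv.DictReader(lines))


def contains_expected(expected: str, actual: str) -> bool:
    """Loose default check: expected text appears in the output, ignoring case."""
    return expected.strip().lower() in actual.strip().lower()


def run_evals(
    cases: list[dict[str, str]],
    generate: Callable[[str], str],  # hypothetical: prompt + model call
    passes: Callable[[str, str], bool] = contains_expected,
) -> list[dict[str, str]]:
    """Run every case and return the ones that failed, with the actual output."""
    failures = []
    for case in cases:
        actual = generate(case["input"])
        if not passes(case["expected"], actual):
            failures.append(
                {"input": case["input"], "expected": case["expected"], "actual": actual}
            )
    return failures
```

In practice you would open the exported file, pass the file object to `load_cases`, and fail the build when `run_evals` returns a non-empty list. Printing each failure’s `actual` next to its `expected` is usually enough to see what a prompt change broke.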

This is the discipline most teams skip — and the one most worth keeping. Without it, every prompt tweak is a roll of the dice. With it, the team can iterate on prompts the way they iterate on code: with a green/red signal that means something.

When the AI feature is invisible, latency is honest, and you have an eval set you trust, the product stops feeling like it has AI in it and starts feeling like the AI is the product. That’s the bar. If you want an opinionated walkthrough of building this kind of integration into a real SaaS, I run one every quarter for product teams shipping their first AI feature.