How Claude-style streaming UI is actually built

Claude recently updated its interactive UI so that an interactive page can be streamed directly into the chat box.

At the same time I came across a project called Generative-UI-MCP. The author's idea is straightforward: use the MCP protocol to clone the whole “Claude can generate interactive UI” trick.

The project itself is not complicated, but it breaks down something that is usually hard to explain: what is actually new about Claude's interactive UI; why it is not as simple as “the AI wrote some front-end code for you”; and whether the first thing that has to be solved behind it is a rendering problem or a protocol problem.

I later walked through the whole process again against a real SSE streaming log. Comparing the two makes it clearer: at its core, Claude-style streaming UI is not “the model outputs HTML” but “the model continuously outputs a UI protocol that the host can consume reliably”.

This article is about exactly that.

It is a continuously interactive UI page

The really new part of Claude's interactive UI / Artifacts is not “the AI generated a page”; people have been doing that for a long time. The new part is that the generated thing stays usable: it can keep interacting with the model, mount tools, and update its state.

This is completely different from “the AI wrote a piece of front-end code and you copy it out and run it”.

In the old way, the model is only the starting point, and its role ends once generation is done. In the new way, the model is a continuous participant in the UI session: user actions in the interface can flow back to the model, and the new content the model returns can partially update the interface, back and forth.

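The closed loop looks like this: the model generates a widget, the host renders it, the user acts on it, the action flows back to the model, and the model's new output partially updates the interface, which the user can act on again.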

Only when this cycle can run is the UI truly “interactive”. A piece of HTML with a few buttons is not interactive; a widget whose events can flow back to the model is.

A real streaming output makes this very clear

If you look at a real SSE streaming output, you will find that the model does not spit out a complete page at once, but continuously outputs different types of content blocks in the stream. The front-end receives and assembles them piece by piece, and then hands them to the corresponding tool for rendering.

After disassembling, there are roughly five steps.

Step 1: Load the UI generation spec first

At the beginning, the model does not generate a widget directly, but calls a tool similar to visualize:read_me, with very short input parameters:

JSON

{
  "modules": ["diagram", "interactive"]
}

This step is critical. It shows that before the model really starts to “draw the interface”, it first fetches a runtime UI specification. In other words, generation is not unconstrained; the model must first learn which rules apply this time.

Step 2: The tool returns a whole design system and streaming constraints

There are several particularly critical sections in this returned content.

First, module description:

TEXT


Call read_me again with the modules parameter to load detailed guidance:

- `diagram` — SVG flowcharts, structural diagrams, illustrative diagrams

- `mockup` — UI mockups, forms, cards, dashboards

- `interactive` — interactive explainers with controls

- `chart` — charts, data analysis, geographic maps (Chart.js, D3 choropleth)

- `art` — illustration and generative art

Then role definition:

TEXT


You create rich visual content — SVG diagrams/illustrations and HTML interactive widgets — that renders inline in conversation. The best output feels like a natural extension of the chat.

Next are its most critical constraints:

TEXT


### Philosophy

- Seamless: Users shouldn't notice where claude.ai ends and your widget begins.

- Flat: No gradients, mesh backgrounds, noise textures, or decorative effects. Clean flat surfaces.

- Compact: Show the essential inline. Explain the rest in text.

- Text goes in your response, visuals go in the tool.

There are also order rules specifically for streaming rendering:

TEXT


### Streaming

Output streams token-by-token. Structure code so useful content appears early.

- HTML: <style> (short) → content HTML → <script> last.

- SVG: <defs> (markers) → visual elements immediately.

- Prefer inline style="..." over <style> blocks.

- Gradients, shadows, and blur flash during streaming DOM diffs. Use solid flat fills instead.

On the surface, this looks like a design specification; but from a runtime perspective, it is more like a generation protocol that “lets the model output stable UI messages”.

It does several things:

Specifies what should appear in the tool and what should appear in natural language
Specifies the order in which code should be streamed out
Specifies which visual effects will destroy the streaming experience, so they are forbidden
Specifies that components must adapt to the host environment, such as CSS variables, dark mode, controlled script capabilities

In other words, this is not “giving the model some aesthetic advice”, but drawing a narrow track for the model.

The real key is not HTML, but the structure of the model output

Then the model starts to call another tool, such as visualize:show_widget. This part of the stream is most easily misunderstood, because it looks like a bunch of fragments:

TEXT


event: content_block_delta

data: {"type":"content_block_delta","index":2,"delta":{"type":"input_json_delta","partial_json":"-2-2L"}}

Taken alone, such fragments are almost unreadable. They are not garbled, though; they are pieces of the tool call parameters. The host stitches together the partial_json fragments that keep arriving under the same block index, and finally restores a complete JSON document.
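The stitching step can be sketched in a few lines of Python. This is a minimal sketch, not the host's actual code: it assumes the stream carries `content_block_delta` events with `input_json_delta` fragments as shown above, plus a `content_block_stop` event that marks the end of a block. The names `assemble_tool_inputs` and `make_delta`, and the sample fragments, are made up for illustration.

```python
import json


def assemble_tool_inputs(sse_lines):
    """Join input_json_delta fragments per block index; parse when the block stops."""
    buffers = {}    # block index -> list of partial_json fragments
    completed = {}  # block index -> parsed tool input
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data:"):])
        if event.get("type") == "content_block_delta":
            delta = event["delta"]
            if delta.get("type") == "input_json_delta":
                buffers.setdefault(event["index"], []).append(delta["partial_json"])
        elif event.get("type") == "content_block_stop":
            index = event["index"]
            if index in buffers:
                # Only at this point is the JSON expected to be closed.
                completed[index] = json.loads("".join(buffers.pop(index)))
    return completed


def make_delta(index, fragment):
    """Hypothetical helper that builds one sample SSE data line."""
    payload = {"type": "content_block_delta", "index": index,
               "delta": {"type": "input_json_delta", "partial_json": fragment}}
    return "data: " + json.dumps(payload)


# Made-up fragments; a boundary can fall anywhere, even in the middle of a key.
sample = [
    "event: content_block_delta",
    make_delta(2, '{"title": "ui_ic'),
    make_delta(2, 'ons_outline", "i_have'),
    make_delta(2, '_seen_read_me": true}'),
    'data: {"type": "content_block_stop", "index": 2}',
]
result = assemble_tool_inputs(sample)
```

The design point is that nothing is parsed until the block is closed: any single fragment, such as `-2-2L` above, is meaningless on its own.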

In this case, after reassembly it looks like:

JSON

{
  "title": "ui_icons_outline",
  "loading_messages": [
    "Sketching icon paths...",
    "Adding hover magic...",
    "Lining up the grid..."
  ],
  "i_have_seen_read_me": true,
  "widget_code": "..."
}

Every field here is interesting.

title is the identifier of this widget. loading_messages is not decoration, but turns “waiting” into a perceivable generation process. i_have_seen_read_me is like a state confirmation, indicating that the model generated under the premise of having read the spec.

And the real interface is all put into widget_code.

This step reveals the core fact of streaming UI: the model does not directly output the final page, but outputs a UI message that the host can consume.

Why the `Generative-UI-MCP` project looks small but is valuable

I originally thought that to clone Claude's interactive UI, a lot of things were needed: custom renderer, state management, complete front-end runtime, component library, DSL.

It turns out that the core of Generative-UI-MCP is extremely minimal:

A load_ui_guidelines tool that loads UI generation specs on demand
A system prompt resource that injects the most basic output constraints in advance

There is no large and comprehensive component system, nor complex DSL.

This trade-off actually illustrates the problem very well: to clone Claude's interactive UI, the first thing to solve is not “how to render”, but “how to make the model output a structure that the host can stably consume”. Rendering is a later matter.

So this is first a protocol problem, not a rendering problem

One striking trait of Claude's interactive UI is that widgets appear in a stable, predictable form. The output does not arrive as a code block one time and as natural language mixed with HTML the next.

To achieve this, you cannot rely on the model's “self-discipline”; you can only rely on a protocol.

What Generative-UI-MCP exposes is exactly this layer:

Widgets must be wrapped with a dedicated fence
The fence must contain structured JSON
The widget_code field contains HTML or SVG
Explanatory text must be written outside the widget block
Multiple widgets must be split into multiple blocks
The output order must be suitable for streaming rendering

These constraints stacked up are essentially already close to a lightweight UI message protocol.

What the host does is essentially route between these messages.
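As a rough illustration of treating the widget block as a message rather than free-form output, a host could validate each payload before routing it. The sketch below is an assumption, not the project's code: the required field set (`title`, `widget_code`) is inferred from the parameters shown earlier, and `validate_widget_message` is a hypothetical name.

```python
import json

# Assumed minimum field set; the real project may require more or fewer fields.
REQUIRED_FIELDS = ("title", "widget_code")


def validate_widget_message(raw):
    """Return (message, errors); errors is empty when the payload is usable."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(message, dict):
        return None, ["payload must be a JSON object"]
    errors = []
    for field in REQUIRED_FIELDS:
        value = message.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty field: {field}")
    code = message.get("widget_code")
    # widget_code is supposed to hold HTML or SVG, so it should start with a tag.
    if isinstance(code, str) and code.strip() and not code.lstrip().startswith("<"):
        errors.append("widget_code should be HTML or SVG markup")
    return message, errors


ok_message, ok_errors = validate_widget_message(
    '{"title": "demo", "widget_code": "<svg></svg>"}')
_, prose_errors = validate_widget_message(
    '{"title": "demo", "widget_code": "Here is your chart"}')
_, broken_errors = validate_widget_message('{"title": "demo"')
```

A payload that fails validation can be shown as plain text or dropped; either way the renderer only ever receives messages of one known shape.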

Why the spec must be loaded on demand, not stuffed into the prompt all at once

The clone project splits the UI spec into several modules: interactive, chart, mockup, diagram, art, and loads what is needed.

This is not just to save tokens, but more importantly: different UI types have different constraints.

Charts have chart rules
Forms have form rules
Mockups have mockup rules
Diagrams have diagram rules
Art is another generation method

If you stuff them all into one big prompt, the model will be polluted by many irrelevant constraints. The value of on-demand loading is that only when a certain type of UI is really going to be generated, the model gets that type of rules, and the output is more stable.

This is the same idea as the real stream, which first passes:

JSON

{
  "modules": ["diagram", "interactive"]
}
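On-demand loading can be sketched as a lookup that returns the shared core plus only the requested modules. This is an illustrative stand-in, not the implementation of `visualize:read_me` or of the project's `load_ui_guidelines` tool: the function name `load_guidelines_for` and the placeholder rule texts are assumptions; only the five module names come from the spec above.

```python
# Placeholder texts; the real guidance for each module is much longer.
CORE_RULES = "core design system (applies to all modules)"
MODULE_RULES = {
    "diagram": "diagram rules placeholder",
    "mockup": "mockup rules placeholder",
    "interactive": "interactive rules placeholder",
    "chart": "chart rules placeholder",
    "art": "art rules placeholder",
}


def load_guidelines_for(modules):
    """Return the core rules plus only the requested module sections."""
    unknown = [name for name in modules if name not in MODULE_RULES]
    if unknown:
        raise ValueError(f"unknown modules: {unknown}")
    # dict.fromkeys removes duplicates while keeping the requested order.
    sections = [CORE_RULES] + [MODULE_RULES[name] for name in dict.fromkeys(modules)]
    return "\n\n".join(sections)


spec_text = load_guidelines_for(["diagram", "interactive"])
```

With this shape, a request for a diagram never carries chart or form rules into the context, which is the point the section makes.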

The difficulty of streaming is never “faster”, but “framing while streaming”

Many people's understanding of streaming stays at “display faster”. But from the real output and the clone project, it can be seen that what the host really needs to solve is the parser.

The host cannot just print tokens one by one. It must know:

Whether the current content is plain text or a widget block
Whether it is at the beginning or in the middle of a block
Whether the JSON is fully closed
When text can be displayed directly
When to enter collection mode
When to hand the complete widget_code to the renderer

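The whole process is roughly like this: a chunk arrives, the parser decides whether it belongs to plain text or to a widget block, plain text is displayed right away, widget fragments are collected until the JSON closes, and only then is the complete widget_code handed to the renderer.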

The feeling in Claude that widgets “emerge naturally” is not magic; technically, it is the parser splitting the stream into frames while it is still arriving.
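The framing logic described in this section can be sketched as a small state machine. Everything specific here is an assumption: the source does not say what the widget fence looks like, so the markers `:::widget` and `:::` are hypothetical, as is the class name `StreamFramer`. For simplicity the sketch frames on complete lines, which means text is released one line at a time rather than token by token.

```python
import json

OPEN_FENCE = ":::widget"   # hypothetical opening marker
CLOSE_FENCE = ":::"        # hypothetical closing marker


class StreamFramer:
    """Split a chunked stream into ("text", str) and ("widget", dict) events."""

    def __init__(self):
        self.pending = ""        # tail of the stream that has no newline yet
        self.collecting = False  # True while inside a widget block
        self.widget_lines = []

    def feed(self, chunk):
        events = []
        self.pending += chunk
        while "\n" in self.pending:
            line, self.pending = self.pending.split("\n", 1)
            if not self.collecting and line.strip() == OPEN_FENCE:
                self.collecting = True           # enter collection mode
                self.widget_lines = []
            elif self.collecting and line.strip() == CLOSE_FENCE:
                self.collecting = False          # block closed: parse and emit
                events.append(("widget", json.loads("\n".join(self.widget_lines))))
            elif self.collecting:
                self.widget_lines.append(line)   # middle fragment of a block
            else:
                events.append(("text", line))    # plain text: show immediately
        return events


framer = StreamFramer()
events = []
chunks = [
    "Here is a panel.\n:::wid",                      # fence split across chunks
    'get\n{"title": "demo", "widget_co',
    'de": "<div>hi</div>"}\n:::\nClick an icon.\n',
]
for chunk in chunks:
    events.extend(framer.feed(chunk))
```

Buffering the unfinished tail in `pending` is what lets the framer cope with a fence marker or a JSON key that is cut in half by a chunk boundary.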

Why even the order of `<defs>`, `style`, shadows and other details must be managed

When you first see such specs, it is easy to feel that they are too detailed:

In SVG, <defs> must precede graphics
In HTML, style first, script last
Try to avoid gradients, shadows, blur

But these rules are not about aesthetics, but about whether every frame the user sees is valid.

The reason is simple:

If <defs> has not arrived yet and graphics come out first, markers and clipPath will render incorrectly at first and only later correct themselves
If style arrives too late, the user will first see an unstyled UI and then see the styles suddenly applied
Gradients, blur, and shadows are more prone to cross-frame inconsistency while the DOM is being patched during streaming

Claude's widget output rarely has obvious jitter, not only because the model is stronger, but also because this set of constraints suppresses the instability of intermediate states.
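These ordering rules are mechanical enough that a host could lint generated code against them. The following is a rough sketch under stated assumptions: `check_stream_order` is a hypothetical helper, the tag detection is a simple regular expression rather than a real HTML parser, and the list of flagged effects is only a sample of what the spec forbids.

```python
import re

# A sample of effects the spec says flash during streaming DOM diffs.
FLASHY = ("gradient", "box-shadow", "drop-shadow", "blur(")


def check_stream_order(widget_code):
    """Return a list of problems; an empty list means the order looks stream-friendly."""
    problems = []
    code = widget_code.lower()
    # Drop script bodies so JS such as "i<n" is not mistaken for a tag.
    without_js = re.sub(r"<script\b.*?</script>", "<script>", code, flags=re.S)
    tags = re.findall(r"<([a-z][a-z0-9]*)", without_js)

    # HTML: a <style> block, if present, should be the first element.
    if "style" in tags and tags[0] != "style":
        problems.append("<style> should come before content")
    # HTML: once the first <script> appears, only scripts may follow.
    if "script" in tags and any(t != "script" for t in tags[tags.index("script"):]):
        problems.append("<script> should come last")
    # SVG: <defs>, if present, should be the first child of <svg>.
    if "svg" in tags and "defs" in tags:
        children = tags[tags.index("svg") + 1:]
        if children and children[0] != "defs":
            problems.append("<defs> should come first inside <svg>")
    problems.extend(f"avoid {word} while streaming" for word in FLASHY if word in code)
    return problems


good = ('<style>.g{display:grid}</style><div class="g"><svg><defs>'
        '<marker id="a"></marker></defs><path d="M0 0"/></svg></div>'
        '<script>for(let i=0;i<n;i++){}</script>')
bad = ('<script>init()</script><div style="box-shadow:0 0 4px #000">x</div>'
       '<style>div{color:red}</style>')
good_problems = check_stream_order(good)
bad_problems = check_stream_order(bad)
```

A check like this cannot prove that every intermediate frame is valid, but it catches the orderings that are known to produce a visible flash.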

From a real widget, what kind of front-end this method favors

In that real output, the model finally generated an interactive panel of “25 common UI line icons”. It displays icons by category, clicking an icon highlights it, and feedback appears at the bottom.

From the generated widget_code, several very clear trade-offs can be seen.

First, the layout is very simple: the core is a stable grid rather than complex responsive tricks.

Second, the styling is extremely light. It is based entirely on the CSS variables provided by the host, with no hard-coded colors, so it adapts to dark mode naturally.

Third, the icons are inline SVG with no image resources, so they are easy to stream and easy to recolor in hover and active states.

Fourth, the JS is very short. It handles local interaction only: no complex state management, no network requests, and no framework.

This shows that such streaming UI is more like an “instant interactive shell in conversation”, not a complete front-end application. Complex logic is left to the model, local interaction stays on the front-end.

It is very suitable for:

Icon panels
Comparison cards
Lightweight filters
Small charts
Interactive explainers
Embedded mockups

But not very suitable for:

Super complex business forms
Large multi-page applications
State-heavy back-office systems
Heavily real-time collaborative editors

Because its advantage is instant generation, instant embedding, instant interaction, not a long-running large application shell.

In the real stream, there is another important signal at the end: text and UI must divide labor

After the tool finishes rendering the widget, the system returns another prompt:

TEXT


Content rendered and shown to the user. Please do not duplicate the shown content in text because it's already visually represented.

This prompt is very valuable. It tells the model explicitly: whatever has already been rendered should not be repeated.

The natural language that the model then adds is also very restrained, only doing three things:

Summarize what this widget is
Tell the user how to operate
Prompt the user what else they can ask the model to do next

This and the earlier readme sentence:

TEXT


Text goes in your response, visuals go in the tool.

form a closed loop.

In other words, Claude-style streaming UI not only “can render widgets”, but also manages the responsibility boundary between text and visuals.

The parts that `Generative-UI-MCP` can't see are actually the hardest to productize

Honestly, after reading this clone project, it becomes clearer what parts of the original system cannot be filled by protocol alone.

1. Sandbox

The HTML/JS generated by the model cannot run unsandboxed. There must be iframe isolation, a CDN allowlist, limits on script capabilities, and resource permission boundaries.

Otherwise, a single malicious script generated by the model is enough to put the host at risk.
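One small piece of this, the CDN allowlist, can be illustrated with a host-side pre-check. The four hosts below are the ones named in the spec text reproduced in the appendix; everything else is an assumption, including the function name `disallowed_hosts` and the sample URLs. According to that spec the real enforcement is done by CSP inside the sandbox, so a check like this would only be an early warning, not the security boundary.

```python
import re
from urllib.parse import urlparse

# Hosts listed in the spec's "CDN allowlist (CSP-enforced)" rule.
ALLOWED_HOSTS = {"cdnjs.cloudflare.com", "esm.sh", "cdn.jsdelivr.net", "unpkg.com"}


def disallowed_hosts(widget_code):
    """Return hosts referenced by src attributes that are not on the allowlist."""
    urls = re.findall(r"""\bsrc\s*=\s*["']([^"']+)["']""", widget_code, flags=re.I)
    blocked = []
    for url in urls:
        host = urlparse(url).hostname  # None for relative URLs
        if host and host not in ALLOWED_HOSTS and host not in blocked:
            blocked.append(host)
    return blocked


# Made-up URLs for illustration only.
sample_code = (
    '<script src="https://cdnjs.cloudflare.com/ajax/libs/example/1.0.0/example.min.js"></script>'
    '<img src="https://evil.example/x.png">'
    '<script src="//tracker.example/t.js"></script>'
)
blocked = disallowed_hosts(sample_code)
```

Only `src` attributes are inspected here because the spec says ordinary `<a href>` links are intercepted by the host's link-confirmation dialog instead of being loaded.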

2. Action protocol

What happens after the user clicks cannot be left to the model writing onclick handlers and deciding the logic freely. In a more mature design, the host first defines a unified action schema, such as:

filter_changed
submit_form
request_refresh
select_item

The widget only sends actions, the host decides whether to handle locally, call tools, or ask the model again.
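A host-side router for such a schema might look like the sketch below. The four action names come from the list above; the mapping of each action to a route (`local`, `tool`, `model`), the message shape with `type` and `payload` keys, and the name `route_action` are all assumptions made for illustration.

```python
# Assumed routing table: which layer handles each action type.
ROUTES = {
    "filter_changed": "local",   # pure UI state, no model round trip
    "select_item": "local",
    "request_refresh": "tool",   # fetch fresh data through a tool
    "submit_form": "model",      # needs the model to think
}


def route_action(action):
    """Decide where an action goes; unknown action types are rejected."""
    name = action.get("type")
    if name not in ROUTES:
        return {"route": "rejected", "reason": f"unknown action: {name!r}"}
    return {"route": ROUTES[name], "type": name, "payload": action.get("payload", {})}


local = route_action({"type": "filter_changed", "payload": {"category": "arrows"}})
to_model = route_action({"type": "submit_form", "payload": {"email": "a@example.com"}})
rejected = route_action({"type": "delete_everything"})
```

Because the widget can only name an action, not execute logic, the decision about tools and model calls stays with the host.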

3. Incremental patch

When Claude updates widgets across multiple turns, it often does not regenerate the whole block but updates only part of it. This requires the host to maintain state, and also requires the model to know when to return a patch and when to return a full replacement.

The biggest gap between demo and product-level experience is probably here.
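To make the patch-versus-replace distinction concrete, here is a deliberately simplified sketch. It is entirely hypothetical: the source does not describe Claude's update format, so the `mode` field, the split of widget state into `widget_code` and `data`, and the name `apply_update` are assumptions chosen only to show the two paths.

```python
def apply_update(state, update):
    """Return the new widget state after a "replace" or "patch" update."""
    mode = update.get("mode")
    if mode == "replace":
        # Full replacement: new markup, fresh data.
        return {"widget_code": update["widget_code"],
                "data": dict(update.get("data", {}))}
    if mode == "patch":
        if state is None:
            raise ValueError("cannot patch a widget that was never rendered")
        # Partial update: keep the markup, merge only the changed keys.
        merged = dict(state["data"])
        merged.update(update.get("data", {}))
        return {"widget_code": state["widget_code"], "data": merged}
    raise ValueError(f"unknown update mode: {mode!r}")


first = apply_update(None, {
    "mode": "replace",
    "widget_code": "<div id='list'></div>",
    "data": {"filter": "all", "selected": None},
})
second = apply_update(first, {"mode": "patch", "data": {"filter": "arrows"}})
```

Even in this toy form, the host has to remember the previous state and reject a patch that has nothing to apply to, which is exactly the bookkeeping a demo can skip and a product cannot.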

The one sentence really worth remembering

The biggest takeaway from Generative-UI-MCP is not learning some new trick, but realizing more clearly:

Building interactive UI is never about solving rendering first, but about solving protocol first.

Once the protocol is stable, the rest follows:

Streaming parser
Widget rendering
Sandbox execution
Event flow-back
Tool mounting
Incremental update

Claude has basically made this whole chain work, so it doesn't feel like a demo. Generative-UI-MCP open-sourced the first part of the chain, and for the first time the subject becomes something that can be understood, discussed, and taken apart.

Looking back at those seemingly fragmented streaming snippets, especially the recurring input_json_delta, widget_code, tool_use, they are no longer just noise. They are exactly the traces left by this entire generative UI protocol at runtime.

Appendix: The original prompts that actually appeared in this stream

The following is not a cleaned-up template, but the original prompts and spec text restored from the real streaming output.

1. Module selection

JSON

{
  "modules": [
    "diagram",
    "interactive"
  ]
}

2. Spec text returned by `visualize:read_me`

TEXT


# Imagine — Visual Creation Suite



## Modules

Call read_me again with the modules parameter to load detailed guidance:

- `diagram` — SVG flowcharts, structural diagrams, illustrative diagrams

- `mockup` — UI mockups, forms, cards, dashboards

- `interactive` — interactive explainers with controls

- `chart` — charts, data analysis, geographic maps (Chart.js, D3 choropleth)

- `art` — illustration and generative art

Pick the closest fit. The module includes all relevant design guidance.



**Complexity budget — hard limits:**

- Box subtitles: ≤5 words. Detail goes in click-through (`sendPrompt`) or the prose below — not the box.

- Colors: ≤2 ramps per diagram. If colors encode meaning (states, tiers), add a 1-line legend. Otherwise use one neutral ramp.

- Horizontal tier: ≤4 boxes at full width (~140px each). 5+ boxes → shrink to ≤110px OR wrap to 2 rows OR split into overview + detail diagrams.



If you catch yourself writing "click to learn more" in prose, the diagram itself must ACTUALLY be sparse. Don't promise brevity then front-load everything.



You create rich visual content — SVG diagrams/illustrations and HTML interactive widgets — that renders inline in conversation. The best output feels like a natural extension of the chat.



## Core Design System



These rules apply to ALL use cases.



### Philosophy

- **Seamless**: Users shouldn't notice where claude.ai ends and your widget begins.

- **Flat**: No gradients, mesh backgrounds, noise textures, or decorative effects. Clean flat surfaces.

- **Compact**: Show the essential inline. Explain the rest in text.

- **Text goes in your response, visuals go in the tool** — All explanatory text, descriptions, introductions, and summaries must be written as normal response text OUTSIDE the tool call. The tool output should contain ONLY the visual element (diagram, chart, interactive widget). Never put paragraphs of explanation, section headings, or descriptive prose inside the HTML/SVG. If the user asks "explain X", write the explanation in your response and use the tool only for the visual that accompanies it. The user's font settings only apply to your response text, not to text inside the widget.



### Streaming

Output streams token-by-token. Structure code so useful content appears early.

- **HTML**: `<style>` (short) → content HTML → `<script>` last.

- **SVG**: `<defs>` (markers) → visual elements immediately.

- Prefer inline `style="..."` over `<style>` blocks — inputs/controls must look correct mid-stream.

- Keep `<style>` under ~15 lines. Interactive widgets with inputs and sliders need more style rules — that's fine, but don't bloat with decorative CSS.

- Gradients, shadows, and blur flash during streaming DOM diffs. Use solid flat fills instead.



### Rules

- No `<!-- comments -->` or `/* comments */` (waste tokens, break streaming)

- No font-size below 11px

- No emoji — use CSS shapes or SVG paths

- No gradients, drop shadows, blur, glow, or neon effects

- No dark/colored backgrounds on outer containers (transparent only — host provides the bg)

- **Typography**: The default font is Anthropic Sans. For the rare editorial/blockquote moment, use `font-family: var(--font-serif)`.

- **Headings**: h1 = 22px, h2 = 18px, h3 = 16px — all `font-weight: 500`. Heading color is pre-set to `var(--color-text-primary)` — don't override it. Body text = 16px, weight 400, `line-height: 1.7`. **Two weights only: 400 regular, 500 bold.** Never use 600 or 700 — they look heavy against the host UI.

- **Sentence case** always. Never Title Case, never ALL CAPS. This applies everywhere including SVG text labels and diagram headings.

- **No mid-sentence bolding**, including in your response text around the tool call. Entity names, class names, function names go in `code style` not **bold**. Bold is for headings and labels only.

- The widget container is `display: block; width: 100%`. Your HTML fills it naturally — no wrapper div needed. Just start with your content directly. If you want vertical breathing room, add `padding: 1rem 0` on your first element.

- Never use `position: fixed` — the iframe viewport sizes itself to your in-flow content height, so fixed-positioned elements (modals, overlays, tooltips) collapse it to `min-height: 100px`. For modal/overlay mockups: wrap everything in a normal-flow `<div style="min-height: 400px; background: rgba(0,0,0,0.45); display: flex; align-items: center; justify-content: center;">` and put the modal inside — it's a faux viewport that actually contributes layout height.

- No DOCTYPE, `<html>`, `<head>`, or `<body>` — just content fragments.

- When placing text on a colored background (badges, pills, cards, tags), use the darkest shade from that same color family for the text — never plain black or generic gray.

- **Corners**: use `border-radius: var(--border-radius-md)` (or `-lg` for cards) in HTML. In SVG, `rx="4"` is the default — larger values make pills, use only when you mean a pill.

- **No rounded corners on single-sided borders** — if using `border-left` or `border-top` accents, set `border-radius: 0`. Rounded corners only work with full borders on all sides.

- **No titles or prose inside the tool output** — see Philosophy above.

- **Icon sizing**: When using emoji or inline SVG icons, explicitly set `font-size: 16px` for emoji or `width: 16px; height: 16px` for SVG icons. Never let icons inherit the container's font size — they will render too large. For larger decorative icons, use 24px max.

- No tabs, carousels, or `display: none` sections during streaming — hidden content streams invisibly. Show all content stacked vertically. (Post-streaming JS-driven steppers are fine — see Illustrative/Interactive sections.)

- No nested scrolling — auto-fit height.

- Scripts execute after streaming — load libraries via `<script src="https://cdnjs.cloudflare.com/ajax/libs/...">` (UMD globals), then use the global in a plain `<script>` that follows.

- **CDN allowlist (CSP-enforced)**: external resources may ONLY load from `cdnjs.cloudflare.com`, `esm.sh`, `cdn.jsdelivr.net`, `unpkg.com`. All other origins are blocked by the sandbox — the request silently fails.



### CSS Variables

**Backgrounds**: `--color-background-primary` (white), `-secondary` (surfaces), `-tertiary` (page bg), `-info`, `-danger`, `-success`, `-warning`

**Text**: `--color-text-primary` (black), `-secondary` (muted), `-tertiary` (hints), `-info`, `-danger`, `-success`, `-warning`

**Borders**: `--color-border-tertiary` (0.15α, default), `-secondary` (0.3α, hover), `-primary` (0.4α), semantic `-info/-danger/-success/-warning`

**Typography**: `--font-sans`, `--font-serif`, `--font-mono`

**Layout**: `--border-radius-md` (8px), `--border-radius-lg` (12px — preferred for most components), `--border-radius-xl` (16px)

All auto-adapt to light/dark mode. For custom colors in HTML, use CSS variables.



**Dark mode is mandatory** — every color must work in both modes:

- In SVG: use the pre-built color classes (`c-blue`, `c-teal`, `c-amber`, etc.) for colored nodes — they handle light/dark mode automatically. Never write `<style>` blocks for colors.

- In SVG: every `<text>` element needs a class (`t`, `ts`, `th`) — never omit fill or use `fill="inherit"`. Inside a `c-{color}` parent, text classes auto-adjust to the ramp.

- In HTML: always use CSS variables (--color-text-primary, --color-text-secondary) for text. Never hardcode colors like color: #333 — invisible in dark mode.

- Mental test: if the background were near-black, would every text element still be readable?



### sendPrompt(text)

A global function that sends a message to chat as if the user typed it. Use it when the user's next step benefits from Claude thinking. Handle filtering, sorting, toggling, and calculations in JS instead.



### Links

`<a href="https://...">` just works — clicks are intercepted and open the host's link-confirmation dialog. Or call `openLink(url)` directly.



## When nothing fits

Pick the closest use case below and adapt. When nothing fits cleanly:

- Default to editorial layout if the content is explanatory

- Default to card layout if the content is a bounded object

- All core design system rules still apply

- Use `sendPrompt()` for any action that benefits from Claude thinking

3. Real parameters of `visualize:show_widget`

JSON

{
  "title": "ui_icons_outline",
  "loading_messages": [
    "Sketching icon paths...",
    "Adding hover magic...",
    "Lining up the grid..."
  ],
  "i_have_seen_read_me": true,
  "widget_code": "<style>...</style><div>...</div><script>...</script>"
}

4. Real prompt after tool rendering

TEXT


Content rendered and shown to the user. Please do not duplicate the shown content in text because it's already visually represented.



[This tool call rendered an interactive widget in the chat. The user can already see the result — do not repeat it in text or with another visualization tool.]
