Where do agents live?

2026 · 06 · 1,500 words

Since 2022, when people think about AI, and more recently when they think about agents, they picture a chatbox. And what could be more natural than that? Language is the first human API, the one we have all been training on since birth, and we got so good at it that we now reach for it everywhere. ChatGPT, Claude, Gemini, DeepSeek: whichever one you open, it's the same page and the same kind of interaction, a box that waits for your words and answers in kind. But a box is only one place an agent can live, and where it lives turns out to decide what a good one even looks like.

Another article about the next revolution of chatboxes.

Rest assured, this is not a rant about how every interface looks the same these days, and it isn't an article where I reinvent the chatbox. I have no interest in doing either, and I'll happily admit that chatboxes are very good at the one thing they were built to do: turning your words into more words.

But is that really all we want our agents to do?

Where agents already live

A chatbox is a home, but it's a small one. The agent sits in a column off to the side, waits for a message, and replies; it never actually touches the thing you're working on. Yet agents already live in much stranger places than that. In Claude Code they live down in your filesystem, editing files you never watch them open and leaving the result behind as if nothing had happened. In a browser agent they live inside the page itself, clicking and scrolling exactly where you would have clicked and scrolled.

Each of these homes changes what a good agent even looks like, and there's a pattern hiding in the difference: the deeper an agent sits inside its environment, the less you notice it working at all. So what happens if you give one the richest environment we have? I mean a canvas: a space with position, proximity, structure, and things you can physically pick up and move around.

What an agent could actually do

Agents are capable of far more than replying: calling tools lets them execute code, compute things, and reach into your documents or your settings. If you set out to put one inside a spatial interface, the first intuition is almost always the same. You make the agent walk over to the document it wants to work on, pop up a little dialog bubble, wait for the user to answer its question, play a small "thinking..." animation, and then drop back into idle once it's finished.

It looks alive, and that's the appeal: a little character wandering around your canvas, busy on your behalf. It's also exhausting to watch, and it's slow. The best agents, it turns out, are invisible. They tag a document the moment it lands on the canvas, and they answer your questions a beat before you think to ask them. The work simply appears, and most of the time the result is the only thing the agent ever says.

[Interactive demo: "drop a document anywhere". Labels: inbox, mockup.fig, invoice.pdf, data.csv, contract.pdf]
Drop a document anywhere on the canvas. The agent reads what it is, then tags and colors it where it lands.

This is the quiet end of the spectrum, where the action itself is the trace. A tag is cheap, obvious, and trivial to undo, so there is nothing to ask about and nothing worth announcing. Doing it is just faster than describing it.

When invisible stops being safe

But not everything is a tag. Some actions are loud: reorganising a whole board, merging two clusters, drawing a connection that quietly implies a decision, throwing out the one thing that didn't fit. The further an action sits from "trivially undoable," the more it costs you to be wrong about it.

So the real rule here is reversibility. If something is cheap and reversible, the agent should just do it and let the result speak for itself. If it's expensive, or genuinely hard to walk back, the agent should show its intent before it commits to anything: a ghost of the change, sitting there and waiting for your nod.
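
As a rough sketch of that rule, and only a sketch: the Python below sorts an action into "just do it" or "show a ghost first". The `Action` fields, the two scores, and both thresholds are my own placeholders for illustration; nothing here is a measured value or a real API.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPLY = "apply"      # do it silently; the result is the trace
    PREVIEW = "preview"  # show a ghost of the change and wait for a nod


@dataclass(frozen=True)
class Action:
    name: str
    undo_cost: float   # hypothetical 0..1 score: how hard it is to walk back
    blast_radius: int  # hypothetical: how many canvas objects it touches


# Illustrative thresholds, assumed for this sketch only.
MAX_SILENT_UNDO_COST = 0.2
MAX_SILENT_OBJECTS = 1


def decide(action: Action) -> Decision:
    """Apply silently only when the action is both cheap and reversible."""
    cheap = action.blast_radius <= MAX_SILENT_OBJECTS
    reversible = action.undo_cost <= MAX_SILENT_UNDO_COST
    return Decision.APPLY if cheap and reversible else Decision.PREVIEW
```

With these made-up numbers, tagging one document (`undo_cost=0.05`, one object) comes back as `APPLY`, while reorganising a forty-object board comes back as `PREVIEW`. The point of the `and` is that either kind of cost alone is enough to make the agent show its intent first.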

[Interactive demo. Labels: button, login, deploy, modal, cache, logs, toast, queue, build]
Reorganizing is hard to undo, so the agent does it for real but waits: keep it, or discard to put everything back.

Why not just confirm everything?

Because confirmation is just narration wearing a disguise. Ask the user to approve the tag, then the connection, then the recolor, every small certainty along the way, and you have quietly rebuilt the chatbox out of dialog boxes. The whole point is to spend the user's attention only in the places where being wrong is actually expensive.

Showing what it sees

There's a second thing an agent can do on a canvas that it could never manage in a column of text: it can show you that it understands the space. Not by announcing "I see three clusters here," but by making those three clusters glow softly when you reach toward them, or by drawing the line it thinks connects two notes, lightly, before you have even asked whether they belong together.

This is the open question I don't yet have a clean answer to. What is the visual language of an agent's attention? Maybe it's highlighting, maybe grouping, maybe a transient overlay that fades the moment you look away. Whatever it turns out to be, it has to share the canvas with your own work without ever shoving that work aside.

Underneath all of it sits a stack of context that the agent reads from, narrowest to widest: the object you just touched, the rest of the canvas around it, the data living inside the app, and finally the open web. The same gesture can mean completely different things depending on what's in that stack: drop a CSV onto the canvas and it charts, drop a contract and it tags the parties, drop a screenshot and it pulls the text straight out. The behaviour isn't hardcoded anywhere; it's chosen from what the thing actually is and what happens to be near it.
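
Here is one way that selection could look, sketched in Python under heavy assumptions. It models only the two narrowest layers of the stack (the object and its neighbours), the `kind` labels are hypothetical, and the lookup table is a stand-in for the agent's own judgment, not a claim that the mapping should be hardcoded. The "nearby chart" rule is invented purely to show a neighbour changing the outcome.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CanvasObject:
    name: str
    kind: str  # hypothetical labels: "csv", "contract", "screenshot", "chart"


# The three pairings named in the text. A real agent would infer these;
# the table only stands in for that inference.
BY_KIND = {
    "csv": "chart it",
    "contract": "tag the parties",
    "screenshot": "extract the text",
}


def choose_behaviour(
    touched: CanvasObject, nearby: Sequence[CanvasObject]
) -> Optional[str]:
    """Pick a behaviour from what the object is, then from what is near it."""
    # Narrowest layer: the object you just touched.
    behaviour = BY_KIND.get(touched.kind)
    # Next layer out: the rest of the canvas. This rule is invented for illustration.
    if touched.kind == "csv" and any(o.kind == "chart" for o in nearby):
        behaviour = "add to the nearby chart"
    return behaviour
```

Dropping `data.csv` on an empty canvas yields "chart it"; dropping the same file next to an existing chart yields "add to the nearby chart". Same gesture, different meaning, and the only thing that changed was the neighbourhood.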

All of it, at once

None of these moves really lives alone, so here they all are in one place. You select an object, and the agent offers the single action the environment believes that thing is for. You ask for it, however the surface happens to let you ask, and the agent takes a beat to think. Then one of two things happens. If the work is cheap and obvious, it simply does it: a bubble shows its hand for a second, just the one word for what it's about to commit to, and then the result is the part that stays. The linked notes glow, a chart or a summary drops onto the canvas, the scattered board pulls itself into a tidy grid.

But some objects don't have a single obvious answer. A screenshot could become text, or a note, or a search for other things that look like it, and nothing about the file itself tells you which one you actually wanted. When the call is genuinely yours to make, the agent doesn't guess, and it certainly doesn't reorganise your whole canvas just to start a conversation. It asks you, once, in a thin bar floating at the bottom of the screen.
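
The branch described above is small enough to sketch. This is a minimal Python illustration, assuming the agent has already produced a list of candidate actions for the selected object; `Step`, `next_step`, and the mode names are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Step:
    mode: str                 # "do", "ask", or "idle" (hypothetical names)
    options: Tuple[str, ...]  # the action to run, or the choices to offer


def next_step(candidates: Sequence[str]) -> Step:
    """One obvious answer: do it. Several: ask once. None: stay quiet."""
    if not candidates:
        return Step("idle", ())
    if len(candidates) == 1:
        return Step("do", (candidates[0],))
    # The call is the user's: surface every option in the bar, a single time.
    return Step("ask", tuple(candidates))
```

A CSV with the single candidate "chart" goes straight to `do`. A screenshot with three candidates (text, note, similar-image search) comes back as one `ask` carrying all three, which is what the bar would show.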

[Interactive demo: "select an object, then press play". Objects: notes.md, data.csv, tasks.md, shot.png, clip.url, idea. Bar text: "the agent asks here when the call is yours"]
Select an object and press play. Clear work just happens; the result is the only thing it says. When the choice is yours, the agent surfaces it in the bar instead of guessing.

Notice where the talking went. It didn't disappear; it stopped being the default. The chat bar is still there, hovering quietly over the work, and it opens from either side: the agent surfaces it when there's a decision only you can make, and you can call it up yourself the moment words are the quickest way to say what you want. Language is still the most powerful thing either of you has; it just stopped being the only thing in the room. That's the real inversion: the box that used to hold the entire interaction is now one surface among many, reached for when it suits the moment instead of because nothing else is there.

Show, don't tell

Figma has gotten closer to this than anyone. Their new agent lives on the canvas instead of in a sidebar, and it reads your actual component library instead of guessing at what might be in one. Here is how they announced it:

Now we're going further, with a Figma agent available directly on the canvas and in the left rail.

That's exactly the right instinct: native to the space, and aware of the elements already in it. And yet notice how the same announcement tells you to use it: "Ask Figma to summarize feedback, identify themes, and turn input into next steps." "Chat with Figma to update typography across a file." Every example is a sentence you say to it, and their own demos play out the same way each time: a prompt goes in, a result comes back, and the agent's half of the exchange is words.

Figma's agent, prompted to explore style: “Give me 3 style options for this design: one that's organic, one modern, and one retro.”
Figma's agent, prompted to act on feedback: “Sort these comments by theme. Create a new rev of my design that incorporates the feedback for this profile.”

It still mostly talks. You prompt it, it narrates back, it tells you what it just did.

So here is the rule from storytelling that every spatial interface should quietly steal: show, don't tell. A text agent can only ever tell you things. "I tagged your document." "I found three related notes." "I reorganised your board." A spatial agent can show you instead. The tag is simply there. The notes are already glowing. The board is already grouped and waiting for your nod. The discomfort you feel around text-based agents was never that they're bad at language, because they are genuinely great at language. It's that the medium is fighting the message. You built yourself a space, and then sat a narrator down in the corner to describe that space back to you.

Where they actually live

Not in a box bolted onto the side. They live in the canvas itself, mostly out of sight, doing the cheap and reversible work without ever starting a conversation about it, and surfacing only to show you three things: what they are about to change, what they already understand about your space, and the one question they genuinely can't answer for you. That last one is the only time the chat bar speaks on its own, though it stays there for you to open yourself, any time language is simply the cleaner way to say what you mean.

So no, I didn't reinvent the chatbox. I moved it off center and kept it a keystroke away for the moments when words really are the best way through, then handed the rest of the screen back to your work. The best agent you'll ever use is the one you'll barely remember was there at all.