Accessibility model
Most software treats accessibility as a compliance pass at the end: ship the feature, then check whether a screen reader can use it. Autonomy inverts that order. Accessibility state is read before an agent decides how to act, and that state selects the agent’s behavior, not the other way around. This page explains why that ordering drives the design rather than being bolted on afterward.
The north star
Autonomy’s guiding rule is repeated at every layer, from product strategy down to the skill files agents load at session start: disabled users should not have to adapt to agents; agents should adapt to the user’s real assistive technology, permissions, preferences, and pace. Concretely, an agent working through Autonomy is expected to check what the user’s setup actually is before choosing how to proceed, every time, rather than assuming a sighted mouse-and-keyboard user on the other end.
Assistive modes: the mechanism
Autonomy exposes accessibility state as a runtime read (accessibility_state)
and lets an agent select an explicit mode based on it, rather than burying the
adaptation in prose the agent might skip:
| Mode | Selected when | Changes |
|---|---|---|
| voiceover_aware | VoiceOver is running | Prefer AX labels/roles/values and focused navigation over pointer clicks; announce before irreversible actions |
| switch_safe | Switch Control is running | Avoid surprise pointer jumps; prefer keyboard focus and AX actions; pause before disorienting focus changes |
| keyboard_first | No AT-specific mode wins | Prefer keyboard shortcuts and focus movement over mouse movement; batch low-risk steps, still ask before destructive ones |
| speech_enabled | Microphone/Speech Recognition/Personal Voice are authorized | Speech input and output become available; a non-speech fallback stays available when permissions are denied or undetermined |
| visual_verification | Screen Recording is authorized | Screenshots/OCR become a legitimate fallback for verification; never the default read path |
| high_consent | The task is authenticated, destructive, spends money, shares data, or changes settings | Explicit consent is required regardless of any other mode |
A mode is a runtime decision an agent makes explicitly, not an assumption baked into a tool. The same tool call behaves differently depending on which mode is active, because the mode changes how the agent gets there — AX read versus pointer click, announced versus silent, verified versus assumed.
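The table above can be read as a small decision function. The following is a minimal sketch, not Autonomy’s implementation: the `AccessibilityState` and `Task` field names are hypothetical, the three speech permissions are collapsed into one boolean for brevity, and it assumes modes can stack (the table only states this explicitly for high_consent).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessibilityState:
    # Hypothetical field names; the real accessibility_state shape may differ.
    voiceover_running: bool = False
    switch_control_running: bool = False
    speech_authorized: bool = False  # simplification of three separate permissions
    screen_recording_authorized: bool = False


@dataclass(frozen=True)
class Task:
    # Risk attributes taken from the high_consent row of the table.
    authenticated: bool = False
    destructive: bool = False
    spends_money: bool = False
    shares_data: bool = False
    changes_settings: bool = False


def select_modes(state: AccessibilityState, task: Task) -> list[str]:
    """Return the active modes, read from state before any action is taken."""
    modes: list[str] = []
    if state.voiceover_running:
        modes.append("voiceover_aware")
    if state.switch_control_running:
        modes.append("switch_safe")
    if not modes:
        # No AT-specific mode won, so fall back to keyboard-first behavior.
        modes.append("keyboard_first")
    if state.speech_authorized:
        modes.append("speech_enabled")
    if state.screen_recording_authorized:
        modes.append("visual_verification")
    risky = any((task.authenticated, task.destructive, task.spends_money,
                 task.shares_data, task.changes_settings))
    if risky:
        # Consent is required regardless of which other modes are active.
        modes.append("high_consent")
    return modes
```

For example, a VoiceOver user asking for a destructive action would get both voiceover_aware and high_consent under these assumptions, so the agent announces and asks before acting.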
Mental model
Accessibility state is read once, up front, and shapes every downstream choice. That is the opposite of a generic agent that clicks first and discovers the user’s setup only through failure, if it discovers it at all.
Screen-off use as the proving ground
The clearest test of this model is a screen-off session: a blind user working with a coding agent with no visual monitoring at all. The agent registers a spoken identity, sends frequent and concise spoken updates through one deterministic audio channel, explains what it’s checking and what it found, and asks for a decision whenever consent or intent is required. A local, redacted “last N events” timeline lets that same user ask “what happened recently?” without depending on a cloud trace — because a screen-off user needs a way to review history that doesn’t require having watched it happen.
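One way to picture the local “last N events” timeline is a bounded buffer that redacts each entry before storing it. This is an illustrative sketch only: the `EventTimeline` class, its method names, and the redaction pattern are assumptions, and the source does not describe how Autonomy actually redacts or stores events.

```python
import re
from collections import deque

# Hypothetical redaction rule: mask values in key=value pairs that look secret.
SECRET_PATTERN = re.compile(r"(token|password|secret)=\S+", re.IGNORECASE)


class EventTimeline:
    """Keeps only the most recent events, in memory, already redacted."""

    def __init__(self, max_events: int = 20) -> None:
        # deque with maxlen silently drops the oldest entry when full.
        self._events: deque[str] = deque(maxlen=max_events)

    def record(self, summary: str) -> None:
        # Redact before storing so the raw value is never kept.
        self._events.append(SECRET_PATTERN.sub(r"\1=[redacted]", summary))

    def recent(self, n: int = 5) -> list[str]:
        """Answer "what happened recently?" with up to n events, oldest first."""
        if n <= 0:
            return []
        return list(self._events)[-n:]
```

Because nothing leaves the process, a screen-off user can ask for recent history and have it read back without any dependency on a cloud trace.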
Low-vision verification is the visual-side analog: a user running Zoom,
increased contrast, reduced motion, or reduced transparency asks an agent to
verify an on-screen result. The agent reads display preferences from
accessibility state first, only uses visual_verification once Screen
Recording is actually authorized, and reports what changed in text and
structured state — not color alone, since color-only status is exactly the
kind of signal a low-vision or colorblind setup can miss.
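The ordering in that paragraph (semantic read first, visual check only with permission, result reported as text and structured state) could look roughly like the sketch below. The function, the `VerificationReport` type, and the `ax_semantic` method label are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VerificationReport:
    method: str  # "ax_semantic" or "visual_verification" (labels are assumptions)
    field: str
    before: str
    after: str

    def as_text(self) -> str:
        # The change is stated in words, never conveyed by color alone.
        return (f"{self.field} changed from {self.before} to {self.after} "
                f"(checked via {self.method})")


def verify_change(field: str, before: str, after: str, *,
                  ax_readable: bool,
                  screen_recording_authorized: bool) -> Optional[VerificationReport]:
    """Pick the verification path; return None when no path is permitted."""
    if ax_readable:
        method = "ax_semantic"  # AX and semantic state come first
    elif screen_recording_authorized:
        method = "visual_verification"  # fallback, gated on the permission
    else:
        return None  # caller should report the outcome as unverified
    return VerificationReport(method, field, before, after)
```

Returning a structured report alongside the sentence lets a low-vision user consume the result through whatever channel their setup prefers.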
The pitfall: delivery is not perception
Spoken output is where the accessibility model is strictest, because it’s the easiest place to overclaim. A tool call succeeding only proves a delivery channel was used — it does not prove the user heard it. Autonomy’s evidence model keeps these distinct on purpose:
A successful announcement call records voiceover_direct_requested,
tts_fallback_used, or speech_queue_entered. It never records “the user
heard this.” The perception status upgrades only when the user explicitly
confirms it. Treat user_heard_unverified as the honest default, not a bug.
The same discipline applies more broadly: everything the runtime reports is tagged as observed, inferred, user-declared, unknown, or not measured. An agent (or a document) that collapses those into one undifferentiated “it worked” is making a claim the evidence doesn’t support.
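A sketch of how delivery and perception might be kept as separate fields follows. The status strings and the five evidence tags come from this page; the `AnnouncementRecord` class, the `user_heard_confirmed` value, and the choice of "not measured" as the default tag are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    OBSERVED = "observed"
    INFERRED = "inferred"
    USER_DECLARED = "user-declared"
    UNKNOWN = "unknown"
    NOT_MEASURED = "not measured"


# Delivery outcomes named on this page; none of them implies perception.
DELIVERY_STATUSES = {
    "voiceover_direct_requested",
    "tts_fallback_used",
    "speech_queue_entered",
}


@dataclass
class AnnouncementRecord:
    delivery: str
    perception: str = "user_heard_unverified"  # honest default
    perception_evidence: Evidence = Evidence.NOT_MEASURED  # assumed default tag

    def __post_init__(self) -> None:
        if self.delivery not in DELIVERY_STATUSES:
            raise ValueError(f"unknown delivery status: {self.delivery}")

    def confirm_heard(self) -> None:
        # Only an explicit user confirmation upgrades the perception status.
        self.perception = "user_heard_confirmed"  # hypothetical status name
        self.perception_evidence = Evidence.USER_DECLARED
```

Keeping the two fields apart means a successful tool call can never silently become a claim that the user heard something.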
Common misconceptions
- “Accessible labels are enough.” A control having a label is necessary but not sufficient — grouping, focus order, announced state changes, and verified outcomes all matter, which is why mode selection exists instead of a single “accessible: yes/no” flag.
- “Louder or more frequent narration is more accessible.” Concise, relevant updates beat a firehose; a screen-off user pays a real time cost for every spoken word.
- “Visual verification is the default check.” It’s a fallback, gated on an actual permission grant — AX and semantic state come first.
For how high-risk actions get gated once a mode is selected, see
safety classes. For the daemon mechanics behind
accessibility_state and mode selection, see the runtime.
To try a screen-off session yourself, see
your first screen-off task.