Accessibility model

Most software treats accessibility as a compliance pass at the end: ship the feature, then check whether a screen reader can use it. Autonomy inverts that order. Accessibility state is read before an agent decides how to act, and that state is what selects the agent’s behavior — not the other way around. This page is the mental model for why that ordering is the design driver, not an add-on.

The north star

Autonomy’s guiding rule, repeated at every layer from product strategy down to the skill files agents load at session start: disabled users should not have to adapt to agents — agents should adapt to the user’s real assistive technology, permissions, preferences, and pace. Concretely, that means an agent working through Autonomy is expected to check what the user’s setup actually is before choosing how to proceed, every time, rather than assuming a sighted mouse-and-keyboard user on the other end.

Assistive modes: the mechanism

Autonomy exposes accessibility state as a runtime read (accessibility_state) and lets an agent select an explicit mode based on it, rather than burying the adaptation in prose the agent might skip:

| Mode | Selected when | Changes |
| --- | --- | --- |
| voiceover_aware | VoiceOver is running | Prefer AX labels/roles/values and focused navigation over pointer clicks; announce before irreversible actions |
| switch_safe | Switch Control is running | Avoid surprise pointer jumps; prefer keyboard focus and AX actions; pause before disorienting focus changes |
| keyboard_first | No AT-specific mode wins | Prefer keyboard shortcuts and focus movement over mouse movement; batch low-risk steps, still ask before destructive ones |
| speech_enabled | Microphone/Speech Recognition/Personal Voice are authorized | Speech input and output become available; a non-speech fallback stays available when permissions are denied or undetermined |
| visual_verification | Screen Recording is authorized | Screenshots/OCR become a legitimate fallback for verification; never the default read path |
| high_consent | The task is authenticated, destructive, spends money, shares data, or changes settings | Explicit consent is required regardless of any other mode |

A mode is a runtime decision an agent makes explicitly, not an assumption baked into a tool. The same tool call behaves differently depending on which mode is active, because the mode changes how the agent gets there — AX read versus pointer click, announced versus silent, verified versus assumed.
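The selection rules in the table can be sketched as a small function. This is an illustration only: the `AccessibilityState` fields, the trait names, and the assumption that modes can stack (for example, voiceover_aware together with high_consent) are hypothetical and not taken from Autonomy's actual schema. The three speech-related permissions are also collapsed into one flag for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessibilityState:
    """Hypothetical snapshot of an accessibility_state read (field names assumed)."""
    voiceover_running: bool = False
    switch_control_running: bool = False
    speech_authorized: bool = False  # assumed: mic, speech recognition, Personal Voice
    screen_recording_authorized: bool = False


# Hypothetical labels for the task properties that trigger high_consent.
HIGH_CONSENT_TRAITS = frozenset(
    {"authenticated", "destructive", "spends_money", "shares_data", "changes_settings"}
)


def select_modes(state, task_traits=frozenset()):
    """Return the list of active modes for one task, in table order."""
    modes = []
    if state.voiceover_running:
        modes.append("voiceover_aware")
    if state.switch_control_running:
        modes.append("switch_safe")
    if not modes:
        # No AT-specific mode applies, so fall back to keyboard_first.
        modes.append("keyboard_first")
    if state.speech_authorized:
        modes.append("speech_enabled")
    if state.screen_recording_authorized:
        modes.append("visual_verification")
    if HIGH_CONSENT_TRAITS & set(task_traits):
        # Applies regardless of whichever other modes were selected.
        modes.append("high_consent")
    return modes
```

Under these assumptions, a VoiceOver user asking for a destructive action would get both voiceover_aware and high_consent, while a user with no assistive technology running would get keyboard_first alone.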

Mental model

Accessibility state is read up front, before the agent acts, and it shapes every downstream choice. That is the opposite of a generic agent, which clicks first and discovers the user’s setup through failure, if it discovers it at all.

Screen-off use as the proving ground

The clearest test of this model is a screen-off session: a blind user working with a coding agent with no visual monitoring at all. The agent registers a spoken identity, sends frequent and concise spoken updates through one deterministic audio channel, explains what it’s checking and what it found, and asks for a decision whenever consent or intent is required. A local, redacted “last N events” timeline lets that same user ask “what happened recently?” without depending on a cloud trace — because a screen-off user needs a way to review history that doesn’t require having watched it happen.
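A local, bounded timeline of this kind could look roughly like the sketch below. The class name, the default size, and the email-only redaction rule are assumptions made for illustration; the source does not describe how Autonomy stores or redacts events.

```python
import re
from collections import deque

# Hypothetical redaction rule: mask anything shaped like an email address.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text):
    """Replace email-like substrings before the event is stored."""
    return _EMAIL.sub("[redacted]", text)


class LocalTimeline:
    """In-memory 'last N events' log; old entries fall off automatically."""

    def __init__(self, max_events=20):
        self._events = deque(maxlen=max_events)

    def record(self, summary):
        # Redact at write time so unredacted text is never retained.
        self._events.append(redact(summary))

    def recent(self, n=5):
        """Answer 'what happened recently?' with up to n events, oldest first."""
        if n <= 0:
            return []
        return list(self._events)[-n:]
```

Redacting on write rather than on read is a design choice in this sketch: it means nothing sensitive sits in the buffer even if the timeline is later spoken aloud or exported.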

Low-vision verification is the visual-side analog: a user running Zoom, increased contrast, reduced motion, or reduced transparency asks an agent to verify an on-screen result. The agent reads display preferences from accessibility state first, only uses visual_verification once Screen Recording is actually authorized, and reports what changed in text and structured state — not color alone, since color-only status is exactly the kind of signal a low-vision or colorblind setup can miss.
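As a rough sketch of that ordering, the functions below pick a read path and then describe the result in text plus structured fields rather than color. The function names, the `"unavailable"` value, and the report keys are hypothetical.

```python
def verification_path(ax_readable, screen_recording_authorized):
    """Choose how to verify: AX first, visual only when actually authorized."""
    if ax_readable:
        return "ax"
    if screen_recording_authorized:
        return "visual_verification"
    return "unavailable"


def describe_change(field, before, after, path):
    """Report a change as text and structured state, never as color alone."""
    verified = path != "unavailable"
    note = f"verified via {path}" if verified else "not verified"
    return {
        "field": field,
        "before": before,
        "after": after,
        "verified": verified,
        "verified_via": path if verified else None,
        "text": f"{field} changed from {before!r} to {after!r} ({note})",
    }
```

For example, `describe_change("Build status", "failing", "passing", "ax")` yields a sentence a screen reader or a magnified display can present directly, with no dependence on a red or green indicator.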

The pitfall: delivery is not perception

Spoken output is where the accessibility model is strictest, because it’s the easiest place to overclaim. A tool call succeeding only proves a delivery channel was used — it does not prove the user heard it. Autonomy’s evidence model keeps these distinct on purpose:

A successful announcement call means voiceover_direct_requested, tts_fallback_used, or speech_queue_entered, not “the user heard this.” The claim upgrades to “heard” only on explicit user confirmation. Treat user_heard_unverified as the honest default, not a bug.
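One way to keep delivery and perception apart in code is to model them as separate fields, as in the sketch below. The three delivery values and user_heard_unverified come from the text above; the class names and the `user_confirmed_heard` value are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Delivery(Enum):
    """What the announcement call can actually prove."""
    VOICEOVER_DIRECT_REQUESTED = "voiceover_direct_requested"
    TTS_FALLBACK_USED = "tts_fallback_used"
    SPEECH_QUEUE_ENTERED = "speech_queue_entered"


class Perception(Enum):
    """Whether the user heard it; unverified unless the user says so."""
    USER_HEARD_UNVERIFIED = "user_heard_unverified"
    USER_CONFIRMED_HEARD = "user_confirmed_heard"  # hypothetical name


@dataclass
class AnnouncementEvidence:
    delivery: Delivery
    perception: Perception = Perception.USER_HEARD_UNVERIFIED

    def confirm_heard(self):
        """Call only when the user explicitly confirms they heard the message."""
        self.perception = Perception.USER_CONFIRMED_HEARD
```

Because the default is user_heard_unverified, a caller has to do something deliberate to claim more than delivery, which matches the intent that a successful call alone never upgrades the claim.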

The same discipline applies more broadly: everything the runtime reports is tagged as observed, inferred, user-declared, unknown, or not measured. An agent (or a document) that collapses those into one undifferentiated “it worked” is making a claim the evidence doesn’t support.
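The five tags can be carried alongside each claim so they are never collapsed by accident. The sketch below is illustrative: the tag names follow the text, while the `Claim` type and the `summarize` helper are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    OBSERVED = "observed"
    INFERRED = "inferred"
    USER_DECLARED = "user-declared"
    UNKNOWN = "unknown"
    NOT_MEASURED = "not measured"


@dataclass(frozen=True)
class Claim:
    statement: str
    provenance: Provenance


def summarize(claims):
    """Render each claim with its tag instead of a single 'it worked'."""
    return [f"[{c.provenance.value}] {c.statement}" for c in claims]
```

A report built this way might read "[observed] file saved" next to "[not measured] user heard the confirmation", which keeps the weaker claim visibly weaker.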

Common misconceptions

  • “Accessible labels are enough.” A control having a label is necessary but not sufficient — grouping, focus order, announced state changes, and verified outcomes all matter, which is why mode selection exists instead of a single “accessible: yes/no” flag.
  • “Louder or more frequent narration is more accessible.” Concise, relevant updates beat a firehose; a screen-off user pays a real time cost for every spoken word.
  • “Visual verification is the default check.” It’s a fallback, gated on an actual permission grant — AX and semantic state come first.

For how high-risk actions get gated once a mode is selected, see safety classes. For the daemon mechanics behind accessibility_state and mode selection, see the runtime. To try a screen-off session yourself, see your first screen-off task.
