OS Level Actions: Closing the Last-Mile Gap in Browser Automation for AI Agents
TL;DR
- AWS Bedrock AgentCore Browser now exposes OS Level Actions via the InvokeBrowser API, letting AI agents control native system UI (dialogs, prompts, context menus) as well as the browser DOM.
- Agents operate in an action → screenshot → model reasoning → next action loop: send one OS-level mouse/keyboard action, capture a full-desktop screenshot (base64 PNG), run vision-model reasoning, and decide the next step.
- This removes a frequent production blocker for AI automation but requires careful governance: IAM least privilege, screenshot redaction/ephemeral handling, observability, and performance trade-offs.
Why this matters for business
AI agents automating web workflows often fail at system dialogs, print windows, certificate pickers, and other OS-rendered UI that sits outside the browser DOM. Playwright and the Chrome DevTools Protocol (CDP) can manipulate the page, but not the OS-layer modal that suddenly appears and stops the process. OS Level Actions remove that last-mile friction. For enterprises, that means fewer manual handoffs, more reliable end-to-end automation, and broader classes of workflows that can be fully automated, including customer support flows, back-office reconciliations, and system integrations.
“The web automation layer has a hard boundary — native OS-rendered UI is invisible to CDP and Playwright, creating real-world blockers in production.”
Concrete examples (mini case studies)
Print dialog — dismiss without a human
A workflow calls window.print() and a native print dialog appears. Previously the agent could see the dialog in a screenshot but had no way to click Cancel. Now: the agent captures the desktop, a vision model identifies the Cancel button coordinates, and the agent issues a mouseClick via InvokeBrowser to dismiss the dialog—no human required.
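Here is a minimal sketch of that recovery in Python with boto3. The client name and the InvokeBrowser, screenshot, and mouseClick names come from this post; the request/response field names, the payload shape, and the find_cancel_button helper are illustrative assumptions, not the documented schema:

```python
import boto3

# Data-plane client named in this post; the request fields below are
# illustrative assumptions, not the documented InvokeBrowser schema.
client = boto3.client("bedrock-agentcore")

def find_cancel_button(image_b64: str) -> tuple[int, int]:
    """Placeholder for a vision-model call (e.g. Bedrock InvokeModel)
    that returns the Cancel button's (x, y) in viewport pixels."""
    raise NotImplementedError

def dismiss_print_dialog(session_id: str) -> None:
    # Observe: screenshot is the only action that returns image data.
    shot = client.invoke_browser(
        sessionId=session_id,             # assumed field name
        action={"name": "screenshot"},    # assumed payload shape
    )
    # Reason: locate the native dialog's Cancel button on the desktop image.
    x, y = find_cancel_button(shot["imageData"])  # assumed response field
    # Act: one OS-level click dismisses the dialog, no human needed.
    client.invoke_browser(
        sessionId=session_id,
        action={"name": "mouseClick", "x": x, "y": y},  # assumed shape
    )
```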
Certificate picker during secure login
An enterprise SSO flow triggers a certificate chooser. The agent detects the picker, selects the correct certificate by coordinates identified by a vision model, and continues the login flow—preventing a failed automation run that would otherwise need manual intervention.
File-open dialogs and permission prompts
File selection and OS permission modals stop headless automation cold. With OS Level Actions, agents can open, navigate, and confirm or cancel these native dialogs, which is critical for workflows that bridge web and desktop behavior.
How it works: the action → screenshot → model reasoning loop
At a high level the pattern is intentionally simple:
- Invoke one OS-level action (mouse or keyboard) via InvokeBrowser.
- When the agent needs to observe the result, capture a full-desktop screenshot (base64 PNG).
- Send the screenshot to a vision model (Bedrock InvokeModel or similar).
- Parse the model output (UI element coordinates or intent) and issue the next action.
- Repeat until the workflow completes.
Supported actions include: mouseClick, mouseMove, mouseDrag, mouseScroll, keyType, keyPress, keyShortcut, screenshot. Non-screenshot actions return SUCCESS or FAILED (with an error); the screenshot action returns base64-encoded PNG image data.
Coordinate mapping is tied to the browser session viewport you create when starting the session (for example, a 1920×1080 viewport means x in 0–1919 and y in 0–1079). Out-of-range coordinates will raise a ValidationException.
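Because out-of-range coordinates are rejected, it is cheap to validate model-suggested points client-side before sending them. A tiny guard, assuming the 1920×1080 viewport from the example above:

```python
# Viewport chosen at session start; the service validates against it,
# so fail fast locally instead of waiting for a ValidationException.
VIEWPORT_W, VIEWPORT_H = 1920, 1080

def validate_point(x: int, y: int) -> tuple[int, int]:
    """Reject vision-model coordinates that fall outside the viewport."""
    if not (0 <= x < VIEWPORT_W and 0 <= y < VIEWPORT_H):
        raise ValueError(f"({x}, {y}) is outside the {VIEWPORT_W}x{VIEWPORT_H} viewport")
    return x, y
```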
Integration and architecture notes
- Two clients coordinate the flow: bedrock-agentcore-control (control plane) and bedrock-agentcore (data plane).
- Sessions are identified by the x-amzn-browser-session-id header so OS actions target the correct browser instance.
- Playwright and CDP remain the tools for DOM automation; InvokeBrowser fills the gap for OS-rendered UI that those tools cannot control.
- Vision-model decisions typically use Bedrock InvokeModel or other vision-capable models (Amazon Nova Act is one integration pattern noted by the team).
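Client setup is ordinary boto3; the two service names below are the ones listed above, and the region is only an example:

```python
import boto3

REGION = "us-west-2"  # example region

# Control plane: create and manage browser resources.
control = boto3.client("bedrock-agentcore-control", region_name=REGION)

# Data plane: start/stop browser sessions and send InvokeBrowser actions;
# actions are routed to a session via the x-amzn-browser-session-id header.
data = boto3.client("bedrock-agentcore", region_name=REGION)
```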
Engineer note: The API is deliberately explicit: one action per request simplifies auditing and reduces surprising side-effects. Screenshot is the only action that returns image data so vision-model input is a discrete event you can control and log.
Pseudocode sequence (practical pattern)
- Create browser resource and start browser session with viewport size.
- Loop:
- InvokeBrowser(action = mouseClick/mouseMove/…)
- If you need to observe the result: InvokeBrowser(action = screenshot) → base64 image
- Send image to vision model → parse coordinates or next-intent
- Decide next action (repeat or finish)
- Stop browser session and clean up resources.
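The same sequence as hedged Python. The operation names follow this post, but the request/response field names and the decide() helper are assumptions for illustration, not the documented API:

```python
import boto3

client = boto3.client("bedrock-agentcore")

def decide(image_b64: str) -> dict | None:
    """Placeholder around a vision model (e.g. Bedrock InvokeModel):
    returns the next action, e.g. {"name": "mouseClick", "x": 640, "y": 400},
    or None when the workflow is complete."""
    raise NotImplementedError

def run_workflow(session_id: str, max_steps: int = 20) -> None:
    """One action -> screenshot -> reasoning iteration per loop pass."""
    for _ in range(max_steps):
        # Observe: screenshot is the only action that returns image data.
        shot = client.invoke_browser(
            sessionId=session_id,            # assumed field name
            action={"name": "screenshot"},   # assumed payload shape
        )
        # Reason: let the vision model pick the next OS-level action.
        step = decide(shot["imageData"])     # assumed response field
        if step is None:
            break  # workflow complete
        # Act: exactly one action per request keeps the audit trail clean.
        result = client.invoke_browser(sessionId=session_id, action=step)
        if result.get("status") == "FAILED": # non-screenshot actions
            raise RuntimeError(result.get("error", "OS-level action failed"))
```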
Production considerations: trade-offs and mitigations
Security & compliance
Screenshots capture the entire desktop and can expose sensitive information. Essential mitigations:
- IAM least privilege: grant only bedrock-agentcore:InvokeBrowser, StartBrowserSession, StopBrowserSession, and related permissions to specific roles (a policy sketch follows this list).
- Make screenshots ephemeral: redact PII before storage, or run inference inline and discard images immediately.
- Encrypt and audit all screenshot and model invocation events; retain logs for compliance but mask sensitive pixels where possible.
- Apply data handling policies when sending images to external models or services.
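On the IAM side, a least-privilege statement could look like the sketch below. The action names are the ones listed above; the resource ARN is a placeholder to scope down for your account:

```python
# Least-privilege policy statement, written as a Python dict for readability.
# The Resource ARN is a placeholder; restrict it to your browser resources.
BROWSER_AGENT_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock-agentcore:InvokeBrowser",
                "bedrock-agentcore:StartBrowserSession",
                "bedrock-agentcore:StopBrowserSession",
            ],
            "Resource": "arn:aws:bedrock-agentcore:*:111122223333:*",
        }
    ],
}
```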
Performance & cost
Each action→screenshot→model loop adds latency and inference cost. To manage that:
- Use smaller, task-specific models for binary or coordinate detection when possible.
- Batch inference or cache common UI detections to avoid repeated full-model calls (a caching sketch follows this list).
- Choose sampling frequency: not every action needs a screenshot—balance risk vs. speed.
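One simple caching shape: key the vision-model result on a hash of the screenshot bytes, so byte-identical frames (a static dialog polled across iterations) skip a model call. A minimal sketch; detect() stands in for your vision-model inference:

```python
import hashlib

_detections: dict[str, tuple[int, int]] = {}

def detect(image_bytes: bytes) -> tuple[int, int]:
    """Placeholder for the full vision-model call."""
    raise NotImplementedError

def cached_detect(image_bytes: bytes) -> tuple[int, int]:
    """Reuse a prior detection when the screenshot has not changed."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _detections:
        _detections[key] = detect(image_bytes)
    return _detections[key]
```

Exact-byte equality is a coarse key, since a cursor blink or clock tick changes the hash; a perceptual hash over just the dialog region is a natural refinement.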
Reliability & OS variation
Native UI varies across OS versions, localizations, and accessibility settings. Robustness strategies:
- Fine-tune or augment vision models with screenshots from target environments and locales.
- Maintain UI templates and fallback heuristics (pixel anchors, relative positions).
- Include retry and fallback flows—if visual detection fails, escalate or route to a human-approved path.
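The retry-then-escalate flow can be a small wrapper; escalate_to_human() is a placeholder for whatever human-in-the-loop channel you operate:

```python
def escalate_to_human(image_bytes: bytes) -> tuple[int, int]:
    """Placeholder: route the frame to a human-approved path."""
    raise NotImplementedError

def detect_with_fallback(capture, detect, retries: int = 3) -> tuple[int, int]:
    """Retry visual detection on fresh screenshots, then escalate.

    capture: callable returning fresh screenshot bytes
    detect:  callable returning coordinates, raising on failure
    """
    for _ in range(retries):
        try:
            # Re-capture each attempt: the dialog may still be rendering.
            return detect(capture())
        except Exception:
            continue
    return escalate_to_human(capture())
```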
Concurrency, focus, and virtualization
When running many sessions in parallel, ensure session isolation so cursor and focus don't bleed between sessions. Recognize that some native behaviors may differ under virtualization; test in the exact managed environment you plan to run in.
Observability & testing
Monitor per-action latency, screenshot sizes, inference time, action success rates, and the number of manual handoffs. Test across OS versions, locales, accessibility modes, and in high-concurrency scenarios.
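Per-action latency is easy to collect with a thin timing wrapper around each step; where the numbers land (CloudWatch, structured logs) depends on your stack:

```python
import logging
import time

log = logging.getLogger("browser-agent")

def timed(label, fn, *args, **kwargs):
    """Log wall-clock latency for one loop step: an OS action,
    a screenshot capture, or a model inference."""
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    finally:
        log.info("%s took %.0f ms", label, (time.perf_counter() - start) * 1e3)
```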
Pilot checklist and KPIs
- Choose a low-risk workflow that previously failed due to native dialogs (print dialog, certificate picker, file-open).
- Instrument: action latency, inference latency, success rate, manual handoffs avoided, cost per workflow.
- Implement security controls: least-privilege IAM, ephemeral screenshots, encryption, and audit logs.
- Run horizontal tests across OS versions and locales; gather failure modes and retrain/tune models as needed.
- Estimate cost and latency at projected scale; explore model optimizations or caching.
Quick FAQ
Will screenshots expose PII?
Yes, they can. Use redaction, ephemeral handling, and strict access controls before storing or sending screenshots to models.
Does this replace DOM automation?
No. Playwright/CDP still handle DOM-level tasks. OS Level Actions extend automation to native UI that lives outside the DOM.
How are sessions tied to actions?
Actions target the correct browser session via the x-amzn-browser-session-id header; create/start the session with the control/data plane clients.
How to get started now
- Clone the AgentCore samples on GitHub (awslabs/agentcore-samples) and run the companion notebook to reproduce examples and patterns.
- Set up the control and data plane clients (bedrock-agentcore-control and bedrock-agentcore), grant the IAM execution role the necessary permissions, and start a browser session with a defined viewport.
- Pilot a single low-risk workflow, instrument metrics, and iterate on vision-model accuracy and cadence.
Key takeaway: OS Level Actions close a practical gap between web and OS automation, enabling AI agents to observe and act across the full desktop surface. That unlocks more reliable enterprise automation—if organizations pair the capability with strong security, observability, and performance practices.
Contributors on this capability include Evandro Franco, Phelipe Fabres, Saurav Das, Yanda Hu, Cristiano Scandura, and Joshua Samuel from the AWS team. Try the samples on GitHub and consider a small pilot to measure real-world ROI and risk before scaling.