Inspiration

The idea for SkyVoice was sparked when Uyen witnessed her father's growing frustration with the digital world. As his vision declined, even simple tasks became a source of immense difficulty. Most modern websites are visual mazes, cluttered with pop-ups and complex sidebars that are a nightmare for traditional screen readers. Our team saw that the problem was not just reading the text; it was the friction of the interface itself. So we built SkyVoice, letting anyone with a visual impairment simply say, "Check if I have any new messages from my doctor" or "Find the grocery store hours," while SkyVoice handles the scrolling, clicking, and filtering on their behalf. SkyVoice is not just a screen reader. It is a digital companion that turns a complex UI into a natural, stress-free conversation.

What it does

SkyVoice is an autonomous AI concierge that transforms complex web applications into intuitive, voice-driven experiences. Instead of forcing users to navigate through menus and buttons, SkyVoice lets them simply speak their intent through our Speak-Then-Act pipeline. This system parses natural conversation into precise browser actions, executing clicks, searches, and data entry on any website via the Nova Act layer.
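As a rough illustration of the Speak-Then-Act idea, a parsed intent might look like the sketch below. The names and schema are our own, not SkyVoice's actual code; in the real pipeline the Nova model performs this mapping and Nova Act executes the resulting action.

```typescript
// Hypothetical shape of a parsed intent; the real SkyVoice schema may differ.
type BrowserAction =
  | { kind: "search"; query: string }
  | { kind: "click"; target: string }
  | { kind: "fill"; field: string; value: string };

// Minimal keyword-based parser standing in for the model-driven step.
function parseIntent(utterance: string): BrowserAction {
  const text = utterance.toLowerCase();
  const searchMatch = text.match(/(?:find|search for|look up)\s+(.+)/);
  if (searchMatch) return { kind: "search", query: searchMatch[1] };
  const clickMatch = text.match(/(?:open|click|go to)\s+(.+)/);
  if (clickMatch) return { kind: "click", target: clickMatch[1] };
  return { kind: "search", query: text }; // fall back to a plain search
}
```

For example, "Find the grocery store hours" becomes a structured search action that the execution layer can carry out without the user touching the page.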

The experience is defined by human-like interaction and high-speed performance. Using Nova Sonic TTS, SkyVoice delivers voice responses with sub-second latency, so the conversation feels immediate and fluid. We integrated smart Voice Activity Detection so the AI understands natural pauses, letting users speak at their own pace without interruption. The system also maintains persistent context through Supabase, remembering user preferences and histories across sessions so the assistant grows more helpful over time.
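The pause-tolerant endpointing can be pictured as a toy energy-based detector: speech only counts as finished after a run of quiet frames, so a natural mid-sentence pause does not trigger a premature response. This is a minimal sketch under our own simplifications, not SkyVoice's actual VAD.

```typescript
// Illustrative endpointing: the turn ends only after `hangoverFrames`
// consecutive quiet frames follow some detected speech.
function detectEndOfUtterance(
  frameEnergies: number[], // one energy value per audio frame
  threshold: number,       // energy level treated as "speech"
  hangoverFrames: number   // quiet frames required before end-of-turn
): number | null {
  let quiet = 0;
  let spoke = false;
  for (let i = 0; i < frameEnergies.length; i++) {
    if (frameEnergies[i] >= threshold) {
      spoke = true;
      quiet = 0; // any speech resets the pause counter
    } else if (spoke && ++quiet >= hangoverFrames) {
      return i; // frame index where the turn is considered finished
    }
  }
  return null; // user is still speaking (or never started)
}
```

Tuning the hangover length is exactly the latency trade-off described above: too short and the AI interrupts, too long and the response no longer feels instant.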

How we built it

We built SkyVoice on the foundation of the Amazon Nova ecosystem. We used Nova Sonic for multimodal reasoning and high-speed, bidirectional voice streaming to keep latency at a minimum. For the critical execution layer, we integrated Nova Act to handle the heavy lifting of autonomous browser interactions, allowing the agent to interpret a visual page and interact with it just as a human would. The backend consists of Node.js, TypeScript, and Python. We containerized the entire stack with Docker and deployed it on AWS ECS/Fargate for scalable, low-latency performance. On the frontend, we developed a sleek Chrome Extension with React and Tailwind CSS that lives in the side panel for constant accessibility. Supabase serves as our PostgreSQL database, managing the session persistence that enables a seamless user experience across the web.
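To show the session-persistence shape, here is a minimal sketch with the storage client injected behind a supabase-js-style interface. The `sessions` table and its columns are our own guesses for illustration, not SkyVoice's actual schema.

```typescript
// Narrow interface covering the one call we need, so any backend that
// exposes from().upsert() can be plugged in (supabase-js matches this shape).
interface SessionStore {
  from(table: string): {
    upsert(row: Record<string, unknown>): Promise<{ error: Error | null }>;
  };
}

// Persist a user's preferences so the assistant remembers them next session.
async function saveSession(
  db: SessionStore,
  userId: string,
  preferences: Record<string, unknown>
): Promise<void> {
  // With supabase-js this would be createClient(...).from("sessions").upsert(...)
  const { error } = await db.from("sessions").upsert({
    user_id: userId,
    preferences,
    updated_at: new Date().toISOString(),
  });
  if (error) throw error;
}
```

Injecting the client keeps the extension and the cloud backend testable in isolation, which matters when state is shared between the two.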

Challenges we ran into

The biggest technical hurdle was managing real-time communication and synchronization. Integrating Voice Activity Detection with bidirectional streaming was especially difficult: we had to ensure the AI would not talk over the user while still responding almost instantly to maintain the feeling of a real concierge. This required fine-tuning the balance between processing speed and conversational awareness.
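That speed-versus-awareness balance can be pictured as a tiny turn-taking state machine. The states and events below are our own simplification for illustration, not the production logic.

```typescript
// Half-duplex turn manager: the agent takes the floor when a reply is
// ready and yields it immediately if the user barges in.
type TurnState = "listening" | "speaking";
type TurnEvent = "userSpeech" | "userSilence" | "responseReady";

function nextTurnState(state: TurnState, event: TurnEvent): TurnState {
  if (state === "listening") {
    // A ready reply takes the floor; anything else keeps us listening.
    return event === "responseReady" ? "speaking" : "listening";
  }
  // While speaking, any detected user speech is a barge-in: yield the floor.
  return event === "userSpeech" ? "listening" : "speaking";
}
```

In this framing, the tuning problem is how aggressively `userSpeech` events are detected: too sensitive and background noise cuts the agent off, too lax and the agent talks over the user.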

We also dedicated significant effort to prompt engineering for the Nova Act engine. Many modern websites are highly dynamic and JavaScript-heavy, which often makes them inaccessible to standard automation tools. Ensuring that the AI could precisely identify and interact with elements on crowded or poorly structured news portals and medical sites was a major hurdle. We iterated constantly to make sure clicks and filters landed exactly where they were intended.

Accomplishments that we're proud of

We are incredibly proud of achieving sub-second response times for our voice interactions. Creating a flow that feels like a natural conversation rather than a robotic command system was a primary goal for our team. It was a major milestone to see the agent successfully navigate a multi-step process, such as filtering specific news topics or checking a secure portal, from a single voice command. This proved that our autonomous browsing layer can truly handle the complexity of the modern web.

What we learned

Through this process, we learned that true accessibility is about agency rather than just description. Making a website readable for a screen reader is only half the battle. Users with visual impairments deserve the same speed and efficiency as everyone else. We also gained deep technical experience in orchestrating Amazon Bedrock and managing complex state between a browser extension and a cloud backend. This project reinforced the importance of building tools that adapt to the user rather than forcing the user to adapt to the tool.

What's next for SkyVoice

We plan to expand SkyVoice to handle multi-tab coordination, allowing the concierge to pull information from one site and use it in another for complex multitasking. We also intend to add visual verification features where the AI can see and describe complex images, graphs, or visual data for the user. Our ultimate goal is to make SkyVoice a universal digital companion that removes the visual barriers of the internet and restores independence to those with vision loss.

Built With

amazon-nova-sonic, nova-act, node.js, typescript, python, docker, amazon-ecs, fargate, react, tailwind-css, supabase, postgresql
