Inspiration
I wanted to build a personal AI assistant that actually works — not just answers questions, but controls my computer, knows my location, sends WhatsApp messages, opens apps, navigates maps, reads my screen, and responds to my voice in real time. Inspired by JARVIS from Iron Man, I built STARK — a voice-first AI Operating System powered by Gemini Live for intelligent multimodal understanding. The inspiration came from a simple frustration: existing voice assistants (Siri, Cortana, Google Assistant) are limited to basic tasks and cannot perform complex agentic actions like reading your screen, identifying people in images, sending messages autonomously, or planning travel step by step.
What it does
STARK is a fully voice-controlled personal AI OS that:
- Speaks and listens continuously: always ready, no wake word needed
- Controls your PC: volume, brightness, screenshots, app launching, scrolling
- Sends WhatsApp messages and makes calls: voice triggered, no typing
- Navigates Google Maps: finds nearby places using real GPS location
- Reads your screen: OCR + Gemini vision to identify text, people, and objects
- Searches the web: a RAG agent architecture drawing on 5 data sources
- Plays Spotify and YouTube: voice controlled
- Answers questions like ChatGPT: detailed, structured, accurate answers
- Health monitoring: water, break, and sleep reminders
- Identifies images: point a camera at anything and STARK tells you what it is
- Plans travel: step-by-step travel guides, visa info, famous places
- Manages contacts, alarms, reminders, and timers by voice
- Active coding partner: monitors your code in real time, tracks recurring errors (usually letting you retry 3–5 times before intervening), then politely guides you or provides the fix
- File management: with your permission, directly edits, adds, or deletes code in your file explorer to keep your workflow clean
- Meeting assistant: during Zoom or other online meetings, STARK listens for questions and, instead of speaking, discreetly displays answers on your screen for you to read, so you stay in the lead without interruption
How we built it
Voice Input → Speech Recognition (Google STT)
    ↓
Intent Router → Tool Selector
    ↓
Agent System (RAG Pipeline)
    ├── Web Search (Google + DuckDuckGo + Wikipedia)
    ├── TMDB (movies)
    ├── Weather API
    └── Location (Windows GPS + IP)
    ↓
Gemini Live / Groq LLM → Answer
    ↓
TTS Output (pyttsx3) + Action Execution
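The intent-routing stage of this pipeline can be sketched roughly as below. This is a minimal illustration, not STARK's actual code: the function names, keyword rules, and placeholder tool outputs are all hypothetical stand-ins for the real router and agents.

```python
# Minimal sketch of the routing loop: transcribed text is classified
# into a tool name, the tool gathers context, and that context would
# then be handed to the LLM and spoken via TTS.

def route_intent(text: str) -> str:
    """Classify a transcribed command into a tool name (keyword-based sketch)."""
    text = text.lower()
    if "weather" in text:
        return "weather"
    if "movie" in text:
        return "tmdb"
    if any(w in text for w in ("near me", "nearby", "where am i")):
        return "location"
    return "web_search"

# Placeholder tools; the real agents call live APIs.
TOOLS = {
    "weather":    lambda q: f"[weather data for: {q}]",
    "tmdb":       lambda q: f"[movie data for: {q}]",
    "location":   lambda q: "[GPS fix via Windows Location API]",
    "web_search": lambda q: f"[search results for: {q}]",
}

def handle(text: str) -> str:
    """Route a command, run the selected tool, and return context for the LLM."""
    tool = route_intent(text)
    return f"{tool}: {TOOLS[tool](text)}"
```

In the real system the returned context feeds Gemini or Groq, and the final answer goes out through the TTS queue.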
Tech Stack:
- Gemini 1.5 Flash: image identification, vision tasks, complex reasoning
- Groq + Llama 3.1: fast text responses
- Python: core system
- SpeechRecognition + pyttsx3: voice I/O
- pyautogui + Selenium: computer control
- Rich: beautiful terminal UI
- pywhatkit: WhatsApp automation
- Windows Location API: real GPS detection
- OpenStreetMap Nominatim: reverse geocoding
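The last step of the location path (GPS coordinates → human-readable address via Nominatim) can be sketched like this. The endpoint and parameters follow Nominatim's public reverse-geocoding API; the sample response is abbreviated and illustrative, not a live result.

```python
import json
from urllib.parse import urlencode

NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"

def build_reverse_url(lat: float, lon: float) -> str:
    """Build a Nominatim reverse-geocode request URL (JSON output)."""
    return NOMINATIM_REVERSE + "?" + urlencode(
        {"lat": lat, "lon": lon, "format": "json"}
    )

def extract_place(payload: str) -> str:
    """Pull the display name out of a Nominatim JSON response body."""
    return json.loads(payload).get("display_name", "unknown location")

# Abbreviated example of a Nominatim response body:
sample = '{"display_name": "Connaught Place, New Delhi, Delhi, India"}'
```

Note that a real request should also send a descriptive User-Agent header, as Nominatim's usage policy requires identifying the application.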
Challenges we ran into
- TTS blocking the voice loop: pyttsx3 on Windows must run on the main thread. Solved with threading queues.
- Wrong location from IP: the ISP reports a different city. Solved with the Windows GPS chip API via PowerShell.
- Web search returning 0 results: DuckDuckGo scraping is unreliable. Built the 5-source RAG agent instead.
- AI hallucinating facts: the LLM invented wrong movie names and wrong ages. Added a strict no-hallucination prompt plus Wikipedia verification.
- Speech-to-text errors: "STARK" heard as "dark" or "torque". Added an auto-correction layer.
- Selenium Chrome conflicts: WebDriver kept crashing. Replaced it with webbrowser + pyautogui.
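One way to realize the threading-queue fix for the TTS problem is sketched below. All speech requests go through a queue consumed by a single dedicated worker, so the recognition loop never blocks on the engine. The `speak_fn` parameter is a stand-in for pyttsx3's `engine.say()` / `engine.runAndWait()` pair; the class name is illustrative.

```python
import queue
import threading

class SpeechWorker:
    """Serialize all TTS calls onto one worker thread via a queue."""

    def __init__(self, speak_fn):
        self._queue = queue.Queue()
        self._speak = speak_fn
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def say(self, text: str) -> None:
        """Non-blocking: enqueue text and return immediately."""
        self._queue.put(text)

    def close(self) -> None:
        self._queue.put(None)          # sentinel: tell the worker to stop
        self._thread.join()

    def _run(self) -> None:
        while True:
            text = self._queue.get()
            if text is None:
                break
            self._speak(text)          # only this thread touches the engine
```

Because only one thread ever calls the engine, utterances come out in order and the listening loop stays responsive.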
Accomplishments that we're proud of
- Built a complete voice-controlled AI OS in Python that actually works
- Real GPS location detection using the Windows Location API
- RAG agent system with caching: the same question is answered instantly from cache
- Gemini vision integration for real-time image identification
- Health monitoring that speaks reminders without blocking the voice loop
- An intent router that correctly classifies 15+ different command types
- WhatsApp messaging and calling by voice, with no manual interaction needed
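The answer cache mentioned above can be sketched as a normalized-key lookup, so trivial rewordings ("Who is X?" vs. "who is x") hit the same entry. Names here are hypothetical; the `compute` callback stands in for the full RAG agent pipeline.

```python
def normalize(question: str) -> str:
    """Lowercase, strip punctuation at the edges, collapse whitespace."""
    return " ".join(question.lower().strip("?!. ").split())

class AnswerCache:
    """Cache agent answers keyed on the normalized question text."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, question, compute):
        key = normalize(question)
        if key not in self._store:
            self._store[key] = compute(question)   # slow path: run the agent
        return self._store[key]                    # fast path: cached answer
```

A repeated question skips the multi-source search entirely, which is what makes cached answers feel instant.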
What we learned
- Agent architecture matters more than model quality: a weak model with good tools beats a strong model with no tools
- Windows system integration (GPS, brightness, volume) requires platform-specific PowerShell commands
- TTS on Windows has strict threading requirements
- RAG with multiple sources gives much better answers than single-source search
- Intent classification should happen before web search, not after
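The last lesson, classifying intent before searching, amounts to a cheap guard in front of the search agent: a system command like "volume up" should never trigger a web round-trip. A minimal sketch, with illustrative keywords rather than STARK's real rules:

```python
# Commands matching these keywords are handled locally; everything else
# may go to the search agent. Keyword list is illustrative only.
SYSTEM_COMMANDS = ("volume", "brightness", "screenshot", "open app")

def needs_web_search(command: str) -> bool:
    """Decide whether a command requires the web-search agent at all."""
    cmd = command.lower()
    return not any(kw in cmd for kw in SYSTEM_COMMANDS)
```

Running this check first keeps system commands fast and avoids wasted API calls on queries that were never questions.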
What's next for STARK
- Gemini Live streaming: real-time two-way audio conversation
- Face recognition: identify people by face using the camera
- Smart home control: lights, AC, TV via IoT
- Calendar integration: schedule meetings by voice
- Proactive suggestions: STARK notices patterns and suggests actions
- Mobile app: STARK on Android/iOS
- Memory system: remembers preferences and past conversations