Inspiration
I wanted to build a personal AI assistant that actually works — not just answers questions, but controls my computer, knows my location, sends WhatsApp messages, opens apps, navigates maps, reads my screen, and responds to my voice in real time. Inspired by JARVIS from Iron Man, I built STARK — a voice-first AI Operating System powered by Gemini Live for intelligent multimodal understanding. The inspiration came from a simple frustration: existing voice assistants (Siri, Cortana, Google Assistant) are limited to basic tasks and cannot perform complex agentic actions like reading your screen, identifying people in images, sending messages autonomously, or planning travel step by step.
What it does
STARK is a fully voice-controlled personal AI OS that:
- Speaks and listens continuously: always ready, no wake word needed
- Controls your PC: volume, brightness, screenshots, app launching, scrolling
- Sends WhatsApp messages and makes calls: voice triggered, no typing
- Navigates Google Maps: finds nearby places using real GPS location
- Reads your screen: OCR + Gemini vision to identify text, people, and objects
- Searches the web: a RAG agent architecture drawing on 5 data sources
- Plays Spotify and YouTube: voice controlled
- Answers questions like ChatGPT: detailed, structured, accurate answers
- Health monitoring: water, break, and sleep reminders
- Identifies images: point a camera at anything and STARK tells you what it is
- Plans travel: step-by-step travel guides, visa info, famous places
- Manages contacts, alarms, reminders, and timers by voice
- Active coding partner: monitors your code in real time, tracks recurring errors (usually letting you retry 3–5 times before intervening), then politely guides you or provides the fix
- File management: with your permission, directly edits, adds, or deletes code in your file explorer to keep your workflow clean
- Meeting assistant: during Zoom or other online meetings, STARK listens for questions and, instead of speaking, discreetly displays answers on your screen for you to read, so you stay in the lead without interruption
How we built it
Voice Input → Speech Recognition (Google STT)
    ↓
Intent Router → Tool Selector
    ↓
Agent System (RAG Pipeline)
    ├── Web Search (Google + DuckDuckGo + Wikipedia)
    ├── TMDB (movies)
    ├── Weather API
    └── Location (Windows GPS + IP)
    ↓
Gemini Live / Groq LLM → Answer
    ↓
TTS Output (pyttsx3) + Action Execution
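The intent-routing stage of this pipeline can be sketched roughly as below. This is a minimal illustration, not STARK's actual code: the function names, keyword rules, and placeholder tool outputs are all hypothetical stand-ins for the real router and agents.

```python
# Minimal sketch of the routing loop: transcribed text is classified
# into a tool name, the tool gathers context, and that context would
# then be handed to the LLM and spoken via TTS.

def route_intent(text: str) -> str:
    """Classify a transcribed command into a tool name (keyword-based sketch)."""
    text = text.lower()
    if "weather" in text:
        return "weather"
    if "movie" in text:
        return "tmdb"
    if any(w in text for w in ("near me", "nearby", "where am i")):
        return "location"
    return "web_search"

# Placeholder tools; the real agents call live APIs.
TOOLS = {
    "weather":    lambda q: f"[weather data for: {q}]",
    "tmdb":       lambda q: f"[movie data for: {q}]",
    "location":   lambda q: "[GPS fix via Windows Location API]",
    "web_search": lambda q: f"[search results for: {q}]",
}

def handle(text: str) -> str:
    """Route a command, run the selected tool, and return context for the LLM."""
    tool = route_intent(text)
    return f"{tool}: {TOOLS[tool](text)}"
```

In the real system the returned context feeds Gemini or Groq, and the final answer goes out through the TTS queue.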
Tech Stack:
- Gemini 1.5 Flash: image identification, vision tasks, complex reasoning
- Groq + Llama 3.1: fast text responses
- Python: core system
- SpeechRecognition + pyttsx3: voice I/O
- pyautogui + Selenium: computer control
- Rich: beautiful terminal UI
- pywhatkit: WhatsApp automation
- Windows Location API: real GPS detection
- OpenStreetMap Nominatim: reverse geocoding
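The last step of the location path (GPS coordinates → human-readable address via Nominatim) can be sketched like this. The endpoint and parameters follow Nominatim's public reverse-geocoding API; the sample response is abbreviated and illustrative, not a live result.

```python
import json
from urllib.parse import urlencode

NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"

def build_reverse_url(lat: float, lon: float) -> str:
    """Build a Nominatim reverse-geocode request URL (JSON output)."""
    return NOMINATIM_REVERSE + "?" + urlencode(
        {"lat": lat, "lon": lon, "format": "json"}
    )

def extract_place(payload: str) -> str:
    """Pull the display name out of a Nominatim JSON response body."""
    return json.loads(payload).get("display_name", "unknown location")

# Abbreviated example of a Nominatim response body:
sample = '{"display_name": "Connaught Place, New Delhi, Delhi, India"}'
```

Note that a real request should also send a descriptive User-Agent header, as Nominatim's usage policy requires identifying the application.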
Challenges we ran into
- TTS blocking the voice loop: pyttsx3 on Windows must run on the main thread. Solved with threading queues.
- Wrong location from IP: the ISP reports a different city. Solved with the Windows GPS chip API via PowerShell.
- Web search returning 0 results: DuckDuckGo scraping is unreliable. Built the 5-source RAG agent instead.
- AI hallucinating facts: the LLM invented wrong movie names and wrong ages. Added a strict no-hallucination prompt plus Wikipedia verification.
- Speech-to-text errors: "STARK" heard as "dark" or "torque". Added an auto-correction layer.
- Selenium Chrome conflicts: WebDriver kept crashing. Replaced it with webbrowser + pyautogui.
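One way to realize the threading-queue fix for the TTS problem is sketched below. All speech requests go through a queue consumed by a single dedicated worker, so the recognition loop never blocks on the engine. The `speak_fn` parameter is a stand-in for pyttsx3's `engine.say()` / `engine.runAndWait()` pair; the class name is illustrative.

```python
import queue
import threading

class SpeechWorker:
    """Serialize all TTS calls onto one worker thread via a queue."""

    def __init__(self, speak_fn):
        self._queue = queue.Queue()
        self._speak = speak_fn
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def say(self, text: str) -> None:
        """Non-blocking: enqueue text and return immediately."""
        self._queue.put(text)

    def close(self) -> None:
        self._queue.put(None)          # sentinel: tell the worker to stop
        self._thread.join()

    def _run(self) -> None:
        while True:
            text = self._queue.get()
            if text is None:
                break
            self._speak(text)          # only this thread touches the engine
```

Because only one thread ever calls the engine, utterances come out in order and the listening loop stays responsive.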
Accomplishments that we're proud of
- Built a complete voice-controlled AI OS in Python that actually works
- Real GPS location detection using the Windows Location API
- RAG agent system with caching: the same question is answered instantly from cache
- Gemini vision integration for real-time image identification
- Health monitoring that speaks reminders without blocking the voice loop
- An intent router that correctly classifies 15+ different command types
- WhatsApp messaging and calling by voice, with no manual interaction needed
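The answer cache mentioned above can be sketched as a normalized-key lookup, so trivial rewordings ("Who is X?" vs. "who is x") hit the same entry. Names here are hypothetical; the `compute` callback stands in for the full RAG agent pipeline.

```python
def normalize(question: str) -> str:
    """Lowercase, strip punctuation at the edges, collapse whitespace."""
    return " ".join(question.lower().strip("?!. ").split())

class AnswerCache:
    """Cache agent answers keyed on the normalized question text."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, question, compute):
        key = normalize(question)
        if key not in self._store:
            self._store[key] = compute(question)   # slow path: run the agent
        return self._store[key]                    # fast path: cached answer
```

A repeated question skips the multi-source search entirely, which is what makes cached answers feel instant.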
What we learned
- Agent architecture matters more than model quality: a weak model with good tools beats a strong model with no tools
- Windows system integration (GPS, brightness, volume) requires platform-specific PowerShell commands
- TTS on Windows has strict threading requirements
- RAG with multiple sources gives much better answers than single-source search
- Intent classification should happen before web search, not after
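The last lesson, classifying intent before searching, amounts to a cheap guard in front of the search agent: a system command like "volume up" should never trigger a web round-trip. A minimal sketch, with illustrative keywords rather than STARK's real rules:

```python
# Commands matching these keywords are handled locally; everything else
# may go to the search agent. Keyword list is illustrative only.
SYSTEM_COMMANDS = ("volume", "brightness", "screenshot", "open app")

def needs_web_search(command: str) -> bool:
    """Decide whether a command requires the web-search agent at all."""
    cmd = command.lower()
    return not any(kw in cmd for kw in SYSTEM_COMMANDS)
```

Running this check first keeps system commands fast and avoids wasted API calls on queries that were never questions.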
What's next for STARK
- Gemini Live streaming: real-time two-way audio conversation
- Face recognition: identify people by face using the camera
- Smart home control: lights, AC, TV via IoT
- Calendar integration: schedule meetings by voice
- Proactive suggestions: STARK notices patterns and suggests actions
- Mobile app: STARK on Android/iOS
- Memory system: remembers preferences and past conversations