Inspiration

Language learning apps today are built around repetition and memorization: flashcards, fill-in-the-blank, streaks. But the most effective way humans acquire vocabulary is through context: encountering an unfamiliar word in the real world at the moment it's relevant.

Our project, Fluency, is built on that premise and takes it further. Instead of studying language in isolation, you live it. Point your phone at any object around you, and Fluency identifies it, translates it into your target language, lets you hear the pronunciation, and quizzes you, all in real time. It's the language teacher that follows you everywhere, turning your entire environment into an immersive classroom.

What it does

Fluency is a mobile AR language learning app with three core modes:

Scan Mode: Open the camera and Fluency automatically identifies objects every few seconds using Google Cloud Vision. Each detected object appears as a learning card with the English word, a hidden translation you can reveal, and a pronunciation button. You can check yourself in two ways: speak the word aloud or type it. The AI grades your pronunciation in real time and gives specific feedback ("You said 'botela' but it's 'botella'; the double-L in Spanish sounds like 'ya'").

Quiz Mode: After scanning 5+ objects, Otto the Octopus (our mascot) pops up and invites you to test your memory. The quiz has two phases. Phase 1 is quick recall: Otto shows you each English word and you type or speak the translation. Phase 2 is contextual sentences, powered by the Llama 3.3 70B model. The AI generates a practical, real-world sentence using each term you learned (like "Can I order a water bottle?" for "water bottle"), and you translate the full sentence. Both phases include a progressive hint system that reveals letters or words one at a time.
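
The progressive hint system can be sketched as a small pure function (a simplified illustration, not our exact implementation; `progressive_hint` is a hypothetical name). Each hint level uncovers one more letter of every word in the answer:

```python
def progressive_hint(answer: str, level: int) -> str:
    """Reveal the first `level` letters of each word, masking the rest.

    Level 0 shows only blanks; each level uncovers one more letter per word.
    """
    hinted = []
    for word in answer.split():
        shown = word[:level]
        hidden = "_" * max(len(word) - level, 0)
        hinted.append(shown + hidden)
    return " ".join(hinted)

# Hints for the Spanish "botella de agua":
# level 0 -> "_______ __ ____"
# level 1 -> "b______ d_ a___"
```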

My Words: Every object you scan gets collected into a personal word bank. You can review your collection anytime and export your entire word list directly to Quizlet or another flashcard tool of your choice for continued study.

Fluency currently supports six target languages: Spanish, French, Portuguese, Mandarin, Japanese, and Korean.

How we built it

Frontend: React Native with Expo (Expo Go for rapid testing across devices). Expo Router for navigation, expo-camera for the live camera feed, expo-speech for text-to-speech pronunciation, and expo-av for audio recording when users practice speaking.

Backend: FastAPI (Python) server that bridges the app and our AI services.

Object Detection: Google Cloud Vision API's label detection endpoint, returning up to 10 labels per frame. We sort by topicality (contextual relevance to the image) rather than the default confidence score; this surfaces concrete nouns like "fork" instead of generic categories like "kitchen utensil." A stopword filter rejects abstract labels and walks further down the ranked list until a learnable noun is found.

Translation: Google Cloud Translation API v2, called in sequence with Vision within the same backend request. Both services share a single project-scoped API key; one round-trip from the client returns both the detected object and its translation.
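
The shape of that single round-trip, with the Vision and Translation calls abstracted as injected callables (a sketch only; `scan_frame` and the callable signatures are illustrative, not our exact endpoint):

```python
from typing import Callable

def scan_frame(image_bytes: bytes,
               detect: Callable[[bytes], str],
               translate: Callable[[str], str]) -> dict:
    """One backend round-trip: detect the object, then translate its name.

    `detect` wraps the Cloud Vision call and `translate` wraps the
    Translation v2 call; chaining them server-side means the client uploads
    the frame once and receives both results together.
    """
    word = detect(image_bytes)
    return {"word": word, "translation": translate(word)}
```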

AI Quiz & Pronunciation Grading: We used Llama 3.3 70B for sentence generation and answer grading, and Whisper Large V3 Turbo for speech-to-text transcription in the target language. When a user speaks a word or sentence, the audio is sent to Whisper for transcription, then Llama evaluates how close their pronunciation was and provides specific, encouraging feedback.
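
The grading step is essentially prompt construction: the model sees what the learner was supposed to say alongside what Whisper heard. A hedged sketch of that prompt builder (`grading_prompt` is an illustrative name, and the wording is a simplification of our real prompt):

```python
def grading_prompt(target: str, transcript: str, language: str) -> str:
    """Build the prompt sent to Llama after Whisper transcribes the learner.

    The model compares what was heard against the expected word and is asked
    for short, encouraging, pronunciation-specific feedback rather than a
    bare right/wrong verdict.
    """
    return (
        f"A student learning {language} tried to say '{target}'. "
        f"Speech-to-text heard '{transcript}'. "
        "In one or two sentences, say whether the pronunciation was close, "
        "and give one concrete, encouraging tip if it wasn't."
    )
```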

Mascot: Otto the Octopus, with four emotion states (smiling, happy, thinking, upset) that react to quiz performance, animated with React Native's Animated API for floating, bouncing, and shaking effects.

Challenges we ran into

Object detection accuracy was our first major hurdle. Google Cloud Vision's default sorting by confidence score returned useless labels like "liquid" for a water bottle and "electronic device" for a laptop. We discovered that sorting by the topicality field instead, which measures relevance to the image rather than detection confidence, dramatically improved results. We also built a filter list of overly generic terms to skip past.

API rate limiting was our next obstacle. Our initial approach using Gemini for sentence generation hit the free tier's 15 requests-per-minute limit when the quiz tried to generate sentences for all words simultaneously. After burning through our daily quota with failed retries, we pivoted to Groq's API (Llama 3.3 70B), which offers 14,400 requests per day for free and responds faster. We also redesigned our quiz to batch all sentence generation into a single API call instead of one per word.
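
The batching boils down to one prompt covering every scanned word and one JSON reply parsed into per-word sentences. A simplified sketch (function names and exact prompt wording are illustrative):

```python
import json

def batch_sentence_prompt(pairs: list[tuple[str, str]]) -> str:
    """One request for the whole quiz instead of one call per word.

    `pairs` holds (english, translation) for every scanned word; the model
    is asked for a JSON array so a single response covers all of Phase 2.
    """
    word_list = "; ".join(f"{en} ({tr})" for en, tr in pairs)
    return (
        "For each of these words, write one short, practical sentence a "
        "traveler might actually use. Return a JSON array of objects with "
        f'keys "word" and "sentence": {word_list}'
    )

def parse_sentences(response_text: str) -> dict[str, str]:
    """Map each word to its generated sentence from the model's JSON reply."""
    return {item["word"]: item["sentence"] for item in json.loads(response_text)}
```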

Audio session conflicts on iOS were the most frustrating bug. After using the microphone for pronunciation practice, all audio playback through expo-speech became nearly silent. We traced this to iOS's audio session switching between recording mode and playback mode. The recording session was setting the audio route to the earpiece speaker instead of the main speaker, and it persisted even after recording stopped. We went through several iterations of audio mode management before finding a stable fix.

Accomplishments that we're proud of

We're proud that Fluency makes language learning feel accessible, practical, and immediate. Learners can instantly get translations, practice pronunciation, and reinforce words they'll actually need in their daily lives. Technically, we're proud of our fully functional integration of computer vision, translation, speech transcription, and AI-powered feedback into one smooth flow. We're also proud of the architecture: the AI pipeline lives on the backend, the phone only sends camera frames and audio, and the server handles the intelligence. That means swapping in different models, adding languages, or changing the camera source in the future will be seamless.

What we learned

We debated between YOLO, TensorFlow.js, Gemini Vision, and Google Cloud Vision for object detection. We went with Cloud Vision because it offered the lowest latency of the options we tried, and, paired with our topicality sorting, it delivered the specificity we needed for learnable vocabulary. The time we saved on setup went into features that differentiate the app. Switching from Gemini to Llama mid-hackathon was stressful but taught us the value of keeping API calls behind a single backend endpoint that the frontend doesn't care about. We also learned how to leverage Llama to evaluate pronunciation and sentence attempts, providing richer feedback than simple string matching: the model can say "close, but you dropped a syllable" instead of just "incorrect." We saw how that made the learning experience feel supportive rather than punishing.

What's next for Fluency

We built this on mobile because everyone has a phone. But architecturally, the video processing pipeline is completely decoupled from the input source. If you own Meta Ray-Ban glasses, you swap one line (the camera source) and now you're learning Spanish while grocery shopping, walking through a museum, or cooking dinner. No phone in your hand. The glasses see what you see, and your AI teacher whispers the vocabulary in your ear. That's the $300 upgrade that turns a study tool into actual immersion. Meta's Wearables Device Access Toolkit is in developer preview right now, and our backend is already compatible.
