Inspiration

We wanted to run a large language model locally: better privacy, no API costs, no dependency on Big Tech. But our laptops couldn't handle it. The model either refused to load or crawled along at 2 tokens per second, which is basically unusable.

So we went back to ChatGPT and Claude. But that felt wrong too. Every query we send hits a data center running thousands of GPUs at full tilt, 24 hours a day, even at 3am when demand is low. The carbon footprint of centralized AI inference is real and growing. We were stuck choosing between "too weak to run locally" and "burns energy we can't see."

That contradiction is what inspired Colyni.

What We Built

Colyni is a distributed LLM inference network: a compute co-op where contributors share idle GPU power, earn Colyni tokens, and spend those tokens to run models that no single device could handle alone.

At the demo, we ran Qwen2.5-32B, a 32-billion-parameter model, across three MacBooks with a combined 56GB of unified memory. No cloud. No data center. None of the three laptops could run the model alone. Together, they could.

The math is simple: a model at 4-bit quantization requires roughly $$\frac{32 \times 10^9 \times 0.5}{1024^3} \approx 14.9\text{ GB}$$ of memory. Spread across three devices, that becomes tractable on hardware most people already own.
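The arithmetic above can be checked in a couple of lines (4-bit quantization is 0.5 bytes per parameter; this counts weights only and ignores KV cache and activation memory). A minimal sketch:

```python
def model_memory_gb(n_params, bits_per_param=4):
    """Approximate weight memory for a quantized model, in GiB.

    Counts weights only; KV cache and activations add to this in practice.
    """
    return n_params * (bits_per_param / 8) / 1024**3

weights_gb = model_memory_gb(32e9)   # ~14.9 GB for a 32B model at 4-bit
per_device = weights_gb / 3          # ~5 GB per laptop if sharded evenly
```

With an even three-way shard, each laptop only needs to hold about 5GB of weights, well within the unified memory of an ordinary MacBook.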

How We Built It

  • Distributed inference layer handles peer discovery across devices on the local network, automatically sharding the model based on available memory and compute using tensor parallelism
  • FastAPI backend tracks node heartbeats, logs which nodes served each request, and manages the token ledger in SQLite
  • React frontend shows live node status, animates token earnings in real time, and provides a clean interface for submitting prompts and comparing solo vs. cluster performance
  • Token distribution is proportional to the layers each node served: more compute contributed means more earned
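The payout rule in the last bullet can be sketched in a few lines. This is an illustrative simplification (the function and field names are ours, not the actual ledger schema): each request carries a reward pool, and each node's share is its fraction of the transformer layers it served.

```python
def distribute_tokens(layers_served, reward_pool):
    """Split a request's reward in proportion to layers each node served.

    layers_served: dict mapping node_id -> number of layers that node ran.
    reward_pool: total Colyni tokens to pay out for this request.
    """
    total = sum(layers_served.values())
    return {node: reward_pool * n / total for node, n in layers_served.items()}

# e.g. a 64-layer model sharded unevenly across three laptops:
payout = distribute_tokens({"mbp-a": 30, "mbp-b": 20, "mbp-c": 14}, reward_pool=10.0)
```

The node that hosted 30 of 64 layers earns 30/64 of the pool, and the shares always sum to the full reward.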

Challenges

Getting three machines to discover each other on WiFi was the first real wall. The peer discovery protocol uses UDP broadcast, which many managed networks block.
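The mechanism is simple even though flaky networks make it unreliable: each node periodically broadcasts a small JSON announce packet, and every node listens on a shared port to build its peer table. A minimal sketch (port number and packet fields are our illustration, not Colyni's actual wire format):

```python
import json
import socket
import threading

DISCOVERY_PORT = 50321  # hypothetical port for illustration

def listen_for_peers(found, stop):
    """Collect announce packets into found: node_id -> (ip, free_mem_gb)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", DISCOVERY_PORT))
    sock.settimeout(0.5)  # wake periodically to check the stop flag
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(1024)
        except socket.timeout:
            continue
        msg = json.loads(data.decode())
        found[msg["node_id"]] = (addr[0], msg["free_mem_gb"])
    sock.close()

def announce(node_id, free_mem_gb, target="255.255.255.255"):
    """Broadcast this node's presence and spare memory to the LAN."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    payload = json.dumps({"node_id": node_id, "free_mem_gb": free_mem_gb})
    sock.sendto(payload.encode(), (target, DISCOVERY_PORT))
    sock.close()
```

The `free_mem_gb` field is what lets the sharding layer decide how many model layers to place on each peer. When a managed network drops broadcast traffic, the `announce` packets simply never arrive, which is exactly the wall we hit.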

Model download size was the other bottleneck. Qwen2.5-32B at 4-bit is nearly 20GB. We had to pre-download models on each machine before arriving, because pulling them on-site would have eaten hours of hacking time.

What We Learned

The most sustainable GPU is one that already exists. Every laptop sitting idle at night has compute going to waste. Colyni is a bet that you don't need to build new infrastructure to democratize AI, you just need to coordinate what's already there.
