The Problem We Were Actually Solving
The standard way to connect professionals is search: you enter skills, you get a list of people. LinkedIn does this. Every job board does this. It works well enough for recruiting, where you're searching for a specific role with specific requirements.
But founding a startup with someone is not the same as hiring them. You're not looking for someone who has a certain skill — you're looking for someone who complements what you already are. A technical founder doesn't need another technical founder. They need someone who can talk to customers, close deals, or raise money. A business founder doesn't need another person who can build decks — they need someone who can build software.
Keyword search can't solve this. If two users both list "Flutter" in their skills, keyword search pairs them, but they probably shouldn't be paired. A Flutter developer looking for a business co-founder and a Flutter developer looking for a technical co-founder share a keyword, but the first needs something the second cannot offer.
This is the problem Fellow Founder needed to solve: not who has these skills, but who would make this person's founding team stronger.
Why Not Just Tags and Filters?
The honest answer is: we tried tags and filters first.
Version 1 of Fellow Founder had a classic filter system. Users set their role (technical/business/design), their industry focus, their stage (idea/MVP/growth), and their looking-for (what type of co-founder they needed). Filters ran a simple database query. It took five minutes to build.
The problem surfaced immediately in user testing. Two users would match on every filter but have completely incompatible visions. Two users who seemed mismatched on paper would have an extraordinary conversation because of a shared context neither had put in their filter fields.
The thing that actually predicted a good match wasn't structured attributes. It was the unstructured text — the description of what they were building, why they were building it, and what they were looking for in a partner.
That's a language understanding problem. Filters can't solve language understanding. Embeddings can.
System Architecture: The 30-Second Overview
┌─────────────────────────────────────┐
│ Flutter App                         │
│ - Profile UI                        │
│ - Discovery Feed                    │
│ - Real-time Chat (WebSocket)        │
└──────────────┬──────────────────────┘
               │ REST + WebSocket
               ▼
┌─────────────────────────────────────┐
│ FastAPI Backend                     │
│ - Auth & User Management            │
│ - Match Ranking API                 │
│ - WebSocket Connection Manager      │
│ - Background Job Scheduler          │
└──────────┬──────────────────────────┘
           │ Internal HTTP
           ▼
┌─────────────────────────────────────┐
│ AI Microservice                     │
│ - Profile Embedding Generator       │
│ - Cosine Similarity Engine          │
│ - LLM Summarization (on-demand)     │
└─────────────────────────────────────┘

Three services, three responsibilities. Flutter doesn't know the AI service exists. The AI service doesn't know Flutter exists. FastAPI sits in the middle, orchestrating, caching, and serving.
Embeddings: Turning Profiles into Vectors
An embedding is a mathematical representation of text — a list of numbers (a vector) that captures the semantic meaning of the text. Two pieces of text that mean similar things produce vectors that are close together in space. Two pieces of text that mean different things produce vectors far apart.
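To make "close together" concrete, here is a minimal sketch of cosine similarity, the usual closeness measure for embeddings. The three-dimensional vectors are made up for illustration; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Toy "embeddings" (invented numbers, not model output).
sales_bio = [0.9, 0.1, 0.0]
bizdev_bio = [0.8, 0.2, 0.1]
backend_bio = [0.0, 0.2, 0.9]

close = cosine_similarity(sales_bio, bizdev_bio)   # similar meaning: near 1
far = cosine_similarity(sales_bio, backend_bio)    # different meaning: near 0
```

A score near 1 means the two texts point in almost the same direction; a score near 0 means they are unrelated.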
Here's the key insight: complementary profiles are not similar vectors. They're vectors that, when combined, point toward a successful founding team. A technical founder's embedding and a business founder's embedding should produce a high match score, even though the embeddings themselves point in different directions.
We solved this with a two-stage approach:
Stage 1 — Profile embedding: Each user's full profile text (bio, what they're building, what they're looking for) is embedded using a sentence-transformer model. This happens once when the profile is created or updated, and the vector is stored in PostgreSQL with the pgvector extension.
# AI microservice: profile embedding
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

# Load the model once at startup; every request reuses it.
model = SentenceTransformer('all-MiniLM-L6-v2')
app = FastAPI()


class ProfileRequest(BaseModel):
    # Request schema, inferred from the fields used below.
    building_description: str
    looking_for: str
    background: str
    stage: str


# Declared with plain `def` so FastAPI runs the CPU-bound encode() call
# in its threadpool instead of blocking the event loop.
@app.post('/embed')
def embed_profile(profile: ProfileRequest):
    text = (
        f"Building: {profile.building_description}\n"
        f"Looking for: {profile.looking_for}\n"
        f"Background: {profile.background}\n"
        f"Stage: {profile.stage}"
    )
    vector = model.encode(text)
    return {'vector': vector.tolist(), 'dim': len(vector)}

Stage 2 — Complementarity scoring: We don't match similar embeddings. We match embeddings where user A's "looking for" description is semantically close to user B's "background" description. If your "looking for" says "business-minded co-founder who can talk to customers" and my background says "5 years in enterprise sales and BD", those two text chunks have high similarity, and that's a real signal.
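The cross-field comparison can be sketched as follows. This is an illustration under stated assumptions, not the production scorer: it assumes the "looking for" and "background" fields are embedded separately, and it combines the two directions with `min` so a pair only scores well when each side wants what the other offers. `FieldEmbeddings`, `complementarity`, and the two-dimensional vectors are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class FieldEmbeddings:
    """Per-field vectors for one user (assumed layout)."""
    looking_for: Sequence[float]
    background: Sequence[float]


def complementarity(a: FieldEmbeddings, b: FieldEmbeddings) -> float:
    # How well B's background answers what A is looking for, and vice versa.
    a_wants_b = cosine(a.looking_for, b.background)
    b_wants_a = cosine(b.looking_for, a.background)
    # Assumption: a match must work in both directions, so take the weaker one.
    return min(a_wants_b, b_wants_a)


# Toy vectors: axis 0 ~ "business", axis 1 ~ "technical" (invented numbers).
tech_founder = FieldEmbeddings(looking_for=[0.9, 0.1], background=[0.1, 0.9])
biz_founder = FieldEmbeddings(looking_for=[0.2, 0.9], background=[0.9, 0.2])
other_tech = FieldEmbeddings(looking_for=[1.0, 0.1], background=[0.0, 1.0])

good_pair = complementarity(tech_founder, biz_founder)
bad_pair = complementarity(tech_founder, other_tech)
```

The two technical founders have nearly identical backgrounds, yet they score low because neither background answers the other's "looking for" text.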
The FastAPI Backend: Where the Logic Lives
FastAPI was the right choice for this backend, and not just because of Python's ML ecosystem. The async-first design meant our recommendation endpoint — which fires off multiple DB queries and occasionally calls the AI microservice — never blocks. Under load, this matters.
# FastAPI matching endpoint
import json

from fastapi import APIRouter, Depends
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

# User, Match, get_current_user, get_db and the async `redis` client
# (for example a redis.asyncio.Redis instance) are defined elsewhere in the app.
router = APIRouter()


@router.get('/discover')
async def get_matches(
    current_user: User = Depends(get_current_user),
    db: AsyncSession = Depends(get_db),
    limit: int = 20,
):
    # Check Redis cache first. The key includes `limit` so requests with
    # different page sizes never share a cached list.
    cache_key = f'matches:{current_user.id}:{limit}'
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Get pre-computed matches from DB
    matches = await db.execute(
        select(Match)
        .where(Match.user_id == current_user.id)
        .order_by(Match.score.desc())
        .limit(limit)
    )
    result = [match.to_dict() for match in matches.scalars()]

    # Cache for 5 minutes
    await redis.setex(cache_key, 300, json.dumps(result))
    return result

The matching itself doesn't happen at request time. It happens in a background scheduler that runs every 5 minutes, recomputing match scores for users whose profiles changed recently. This is the key architectural decision that made latency acceptable.
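A background recompute loop of this kind might look like the sketch below. It uses in-memory dictionaries in place of the database, and every name in it (`profiles_changed_since`, `recompute_matches`, `scheduler_loop`, the `updated_at` field, the `max_ticks` test hook) is hypothetical; only the 5-minute interval and the "changed profiles only" rule come from the text above.

```python
import asyncio
from datetime import datetime, timedelta, timezone

RECOMPUTE_INTERVAL = timedelta(minutes=5)


def profiles_changed_since(profiles, cutoff):
    """IDs of profiles whose `updated_at` is at or after `cutoff`."""
    return [uid for uid, p in profiles.items() if p["updated_at"] >= cutoff]


def recompute_matches(profiles, scores, changed_ids, score_fn):
    """Rescore every directed pair that involves a changed user."""
    for uid in changed_ids:
        for other in profiles:
            if other == uid:
                continue
            scores[(uid, other)] = score_fn(profiles[uid], profiles[other])
            scores[(other, uid)] = score_fn(profiles[other], profiles[uid])
    return scores


async def scheduler_loop(profiles, scores, score_fn,
                         interval=RECOMPUTE_INTERVAL, max_ticks=None):
    """Run the recompute on a fixed interval; `max_ticks` exists for testing."""
    last_run = datetime.min.replace(tzinfo=timezone.utc)  # first tick scores everyone
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        started = datetime.now(timezone.utc)
        changed = profiles_changed_since(profiles, last_run)
        recompute_matches(profiles, scores, changed, score_fn)
        last_run = started
        ticks += 1
        await asyncio.sleep(interval.total_seconds())
```

Because the request path only reads the stored scores, a slow scoring function lengthens the job but never the response.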
Flutter: Consuming the API and WebSocket
On the Flutter side, two parallel data flows run simultaneously: REST for the main content, WebSocket for real-time events.
The discovery feed is a REST call — the backend returns a list of pre-computed matches, sorted by score. No real-time needed here.
Real-time is used for: new match notifications, message delivery in chat, and profile view alerts. These fire through a WebSocket connection that the app maintains while in the foreground.
// WebSocket service, managed as a Riverpod provider
import 'dart:async';
import 'dart:convert';
import 'dart:io';

class WebSocketService {
  WebSocket? _socket;
  bool _reconnecting = false;
  final _eventController = StreamController<SocketEvent>.broadcast();

  Stream<SocketEvent> get events => _eventController.stream;

  Future<void> connect(String token) async {
    try {
      _socket = await WebSocket.connect(
        '${AppConfig.wsBaseUrl}/ws?token=$token',
      );
    } catch (_) {
      // The handshake itself can fail (e.g. no network); retry instead of giving up.
      _reconnect(token);
      return;
    }
    _socket!.listen(
      (data) => _eventController.add(SocketEvent.fromJson(jsonDecode(data))),
      onDone: () => _reconnect(token),
      onError: (_) => _reconnect(token),
    );
  }

  void _reconnect(String token) {
    // onError may be followed by onDone; schedule only one retry at a time.
    if (_reconnecting) return;
    _reconnecting = true;
    Future.delayed(const Duration(seconds: 3), () {
      _reconnecting = false;
      connect(token);
    });
  }
}

The auto-reconnect on onDone and onError is critical. Mobile connections drop constantly: background switches, weak signal, app suspension. Without reconnect logic, users would miss real-time events after any interruption. With it, the reconnect is invisible to the user.
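The service above retries after a fixed 3 seconds. A common refinement, not something the original app is stated to do, is exponential backoff with jitter, so a long outage does not cause a retry every 3 seconds and many clients do not reconnect in lockstep. The schedule is language-agnostic; it is sketched in Python here and would need porting to Dart. All parameter values are assumptions.

```python
import random


def backoff_delays(base=3.0, factor=2.0, cap=60.0, jitter=0.2, rng=random.random):
    """Yield an endless sequence of reconnect delays in seconds.

    Delays grow as base * factor**attempt, are capped at `cap`, and are
    randomised by +/- `jitter` (a fraction of the delay).
    """
    attempt = 0
    while True:
        delay = min(cap, base * factor ** attempt)
        yield delay * (1 + jitter * (2 * rng() - 1))
        if delay < cap:
            attempt += 1  # stop growing the exponent once the cap is reached


def reset_after_success():
    """Start a fresh schedule after a connection succeeds."""
    return backoff_delays()
```

On each failed attempt the client would take the next value from the generator; after a successful connection it would start a new generator so the next drop begins again at the base delay.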
In the UI, match notifications are surfaced through Riverpod's stream providers:
@riverpod
Stream<MatchNotification> matchNotifications(MatchNotificationsRef ref) {
  final wsService = ref.watch(webSocketServiceProvider);
  return wsService.events
      .where((e) => e.type == 'new_match')
      .map((e) => MatchNotification.fromMap(e.data));
}

// In any widget:
ref.listen(matchNotificationsProvider, (_, notification) {
  ScaffoldMessenger.of(context).showSnackBar(
    SnackBar(content: Text('New match: ${notification.userName}')),
  );
});

Latency: From 800ms to 120ms
The first version of the recommendation endpoint took 800ms to return results. On a phone, an 800ms wait between tap and content feels like an eternity. Here's exactly what was causing it and how we fixed it.
Problem 1: Computing matches at request time. The original implementation ran the similarity calculation on every API call. Fix: pre-compute matches in a background job every 5 minutes, store results in DB. Request time drops to a simple DB read.
Problem 2: No caching. Even with pre-computed results, every request hit the database. Fix: Redis cache with 5-minute TTL. Cache hit rate reached 87% within the first week. Average response time for a cache hit: 12ms.
Problem 3: Returning too much data. The initial response included full user profiles in the match list. Fix: return only the data needed for the card view (name, photo, one-line summary, score). Full profile loads on demand when the user taps a card.
After these three fixes, p50 latency dropped from 800ms to 42ms and p95 from 2.1s to 180ms. The app felt instant.
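The third fix, trimming the payload, amounts to projecting each full profile onto the handful of fields the card needs. A minimal sketch, with hypothetical field names (`photo_url`, `summary`, and so on are not taken from the real schema):

```python
from dataclasses import asdict, dataclass


@dataclass
class MatchCard:
    """Only what the discovery card renders."""
    user_id: int
    name: str
    photo_url: str
    summary: str
    score: float


def to_card(profile: dict, score: float) -> dict:
    """Project a full profile onto the card fields, dropping everything else."""
    card = MatchCard(
        user_id=profile["id"],
        name=profile["name"],
        photo_url=profile["photo_url"],
        summary=profile["summary"],
        score=round(score, 3),
    )
    return asdict(card)


full_profile = {
    "id": 7,
    "name": "Ada",
    "photo_url": "https://example.com/ada.jpg",
    "summary": "Backend engineer building a logistics tool.",
    "bio": "A much longer free-text biography...",
    "work_history": ["..."],
}
card = to_card(full_profile, 0.91234)
```

The full profile stays behind a separate endpoint and is fetched only when the user opens a card.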
Cost: From $500/month to $80/month
The first production deployment was expensive. The AI microservice was calling an LLM API for every match computation. With 500+ daily active users each getting matches recomputed multiple times per day, API costs were running $500+/month.
Three changes fixed this:
1. Switch from LLM to embedding models for matching. LLM calls ($0.002 per 1k tokens) replaced with sentence-transformer inference ($0 on our own server, once the model is loaded). Matching accuracy was nearly identical — embeddings capture semantic meaning just as well as LLM responses for this use case.
2. LLM calls only for profile summarization. We kept LLM calls for one thing: generating a 2-sentence summary of each profile for display in the discovery feed. This runs once when a profile is created or meaningfully updated — not on every match computation.
3. Batched embedding updates. Instead of re-embedding a profile the moment it's updated, we batch updates and process them every 15 minutes. 95% of users don't notice the delay. The 5% who update their profile and immediately check their matches see slightly stale results, which is acceptable.
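The batching in point 3 can be sketched as a small "dirty set" that collects profile edits and embeds them together on a timer. `EmbeddingBatcher` and its methods are hypothetical names; `embed_batch` stands in for whatever function turns a list of texts into a list of vectors.

```python
class EmbeddingBatcher:
    """Collect profile edits and embed them in one batch per flush."""

    def __init__(self, embed_batch):
        self._embed_batch = embed_batch  # callable: list[str] -> list of vectors
        self._dirty = {}                 # user_id -> latest profile text

    def mark_dirty(self, user_id, text):
        # Repeated edits before a flush overwrite each other, so each user
        # is embedded at most once per batch.
        self._dirty[user_id] = text

    def flush(self):
        """Embed everything pending; return {user_id: vector}."""
        if not self._dirty:
            return {}
        pending, self._dirty = self._dirty, {}
        ids = list(pending)
        vectors = self._embed_batch([pending[uid] for uid in ids])
        return dict(zip(ids, vectors))
```

A scheduler would call `flush()` every 15 minutes and write the returned vectors to storage. With sentence-transformers, `model.encode` accepts a list of strings, so it can serve as `embed_batch` directly.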
The Lesson About Premature AI Optimisation
The biggest mistake in the first version wasn't the latency or the cost. It was over-engineering the AI layer before we had evidence it was needed.
I spent three weeks building a custom fine-tuned model before we had any users to validate whether the base model worked. The fine-tuned model performed about 6% better in synthetic tests. In production, with real users, the difference was immeasurable — users couldn't tell the difference between the fine-tuned and base model recommendations.
Three weeks of engineering for zero measurable user benefit.
The lesson I took from this: measure user outcomes, not model metrics. A recommendation system's accuracy on a benchmark dataset is not the same as whether users are finding good co-founders. Ship the simpler model. Measure real outcomes. Optimise based on what you can actually measure.
This is premature AI optimisation, and it's as real as premature code optimisation. Build the simple thing. Prove it works. Then improve it.
What I'd Build Differently Today
Looking back on the Fellow Founder AI stack, three things I'd change:
1. pgvector from day one. We started with cosine similarity computed in Python. Adding pgvector to PostgreSQL meant we could query the 1000 nearest vectors in SQL directly — no Python, no roundtrip to the AI service. This change alone simplified the architecture by removing an entire service call from the hot path.
2. Structured output from LLMs. Our LLM summarization prompt returns free text. Parsing it requires handling edge cases, hallucinations, and format variations. Using a structured output format (JSON mode, or function calling) would have eliminated an entire class of bugs in the parsing layer.
3. A/B test the recommendation algorithm earlier. We optimised for similarity scores for 6 months before A/B testing whether higher similarity scores actually correlated with users having better conversations. They did — but knowing that earlier would have validated the approach and let us focus optimisation on the things that actually mattered.
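For point 2, the idea is to ask the model for JSON and validate it strictly, so a malformed response is rejected instead of being patched up by the parser. The sketch below covers only the validation side; the two-field schema (`headline`, `detail`) is invented for illustration and is not the app's actual summary format.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("headline", "detail")  # hypothetical schema


class SummaryFormatError(ValueError):
    """Raised when the model's output does not match the expected schema."""


@dataclass
class ProfileSummary:
    headline: str
    detail: str


def parse_summary(raw: str) -> ProfileSummary:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SummaryFormatError("response is not valid JSON") from exc
    if not isinstance(data, dict):
        raise SummaryFormatError("response must be a JSON object")
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            raise SummaryFormatError(f"missing or empty field: {field}")
    return ProfileSummary(headline=data["headline"].strip(),
                          detail=data["detail"].strip())
```

A caller can catch `SummaryFormatError` and retry the LLM call or fall back to a truncated bio, rather than showing a half-parsed summary.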
Frequently Asked Questions
What embedding model does Fellow Founder use for matching?
We use the all-MiniLM-L6-v2 sentence-transformer model for profile embeddings. It's lightweight (80MB), fast (under 50ms per embedding on CPU), and produces 384-dimensional vectors that capture semantic meaning well enough for our use case. We self-host it to avoid per-call API costs.
Why FastAPI over Node.js or Django for the AI backend?
FastAPI is async by default, integrates naturally with Python ML libraries (numpy, sentence-transformers, scikit-learn), and auto-generates OpenAPI docs. For any backend that involves ML inference, Python is the practical choice — the ecosystem advantage is decisive. FastAPI specifically is faster than Django for async workloads and less verbose than raw ASGI frameworks.
How does the Flutter app handle WebSocket reconnection?
The WebSocket service listens to onDone and onError callbacks and triggers an automatic reconnect after 3 seconds. The reconnect loop is transparent to the user. The service is managed as a Riverpod provider, so the connection lifecycle is tied to the app state and does not require manual management in individual screens.
What is the latency of the recommendation API in production?
P50 latency is around 42ms and P95 is around 180ms. About 87% of requests are served from the cache; cache misses fall through to a DB read and take 80-120ms. Match computation itself happens in a background job every 5 minutes, so it is never on the request path.
How do you store and query embedding vectors?
Vectors are stored in PostgreSQL using the pgvector extension, which adds a native vector column type and supports nearest-neighbor queries in SQL: exact by default, and approximate (ANN) once an index is added. This eliminated the need for a separate vector database and keeps the infrastructure simple: one PostgreSQL instance handles both structured user data and vector similarity queries.
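A nearest-neighbor query against pgvector might look like the sketch below. The `<=>` operator is pgvector's cosine distance, so `1 - distance` gives cosine similarity. The table and column names (`profile_embeddings`, `user_id`, `embedding`) are assumptions, and the `%(name)s` placeholders follow the style used by psycopg.

```python
def to_pgvector_literal(vector):
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(format(float(x), ".6g") for x in vector) + "]"


# Hypothetical schema: profile_embeddings(user_id, embedding vector(384)).
NEAREST_SQL = """
SELECT user_id,
       1 - (embedding <=> %(query)s::vector) AS cosine_similarity
FROM profile_embeddings
WHERE user_id <> %(user_id)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s
"""


def nearest_query_params(user_id, vector, limit=1000):
    """Parameters for NEAREST_SQL; pass both to cursor.execute()."""
    return {
        "query": to_pgvector_literal(vector),
        "user_id": user_id,
        "limit": limit,
    }


params = nearest_query_params(42, [0.1, 0.2, 0.3], limit=20)
```

Without an index this is an exact scan. For approximate search, pgvector offers HNSW and IVFFlat indexes; for cosine distance the index would be created with the `vector_cosine_ops` operator class.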
What's the monthly cost of running the AI stack?
Currently around $80/month for 500+ daily active users. Costs are: self-hosted embedding inference ($0 — runs on a $30/mo VPS), LLM API calls for profile summarization (~$15/mo), PostgreSQL with pgvector ($35/mo), Redis ($10/mo), and miscellaneous bandwidth. The key cost reduction was switching from LLM-per-match to embedding-per-match.