The Problem We Were Actually Solving

The standard way to connect professionals is search: you enter skills, you get a list of people. LinkedIn does this. Every job board does this. It works well enough for recruiting, where you're searching for a specific role with specific requirements.

But founding a startup with someone is not the same as hiring them. You're not looking for someone who has a certain skill — you're looking for someone who complements what you already are. A technical founder doesn't need another technical founder. They need someone who can talk to customers, close deals, or raise money. A business founder doesn't need another person who can build decks — they need someone who can build software.

Keyword search can't solve this. If two users both list "Flutter" in their skills, keyword search pairs them, but they probably shouldn't be paired. A Flutter developer looking for a business co-founder and a Flutter developer looking for a technical co-founder want different things, and the shared keyword says nothing about whether either one fills the other's gap.
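To make the failure concrete, here is a minimal sketch of keyword matching as a Jaccard overlap on skill tags. The profiles, field names, and scoring function are illustrative assumptions, not Fellow Founder's actual schema or search code.

```python
# Hypothetical profiles; the field names are illustrative only.
dev_a = {"skills": {"flutter", "dart", "firebase"}, "looking_for": "business"}
dev_b = {"skills": {"flutter", "dart", "python"}, "looking_for": "technical"}
sales = {"skills": {"enterprise sales", "fundraising"}, "looking_for": "technical"}


def keyword_score(p, q):
    """Jaccard overlap of skill keywords: |intersection| / |union|."""
    union = p["skills"] | q["skills"]
    if not union:
        return 0.0
    return len(p["skills"] & q["skills"]) / len(union)


dev_vs_dev = keyword_score(dev_a, dev_b)    # 2 shared skills out of 4 -> 0.5
dev_vs_sales = keyword_score(dev_a, sales)  # no shared skills -> 0.0
```

The two Flutter developers score 0.5. The developer and the salesperson, who each offer what the other is looking for, score 0.0.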

This is the problem Fellow Founder needed to solve: not who has these skills, but who would make this person's founding team stronger.

Why Not Just Tags and Filters?

The honest answer is: we tried tags and filters first.

Version 1 of Fellow Founder had a classic filter system. Users set their role (technical/business/design), their industry focus, their stage (idea/MVP/growth), and what type of co-founder they were looking for. The filters ran a simple database query, and it took five minutes of development time to get working.

The problem surfaced immediately in user testing. Two users would match on every filter but have completely incompatible visions. Two users who seemed mismatched on paper would have an extraordinary conversation because of a shared context neither had put in their filter fields.

The thing that actually predicted a good match wasn't structured attributes. It was the unstructured text — the description of what they were building, why they were building it, and what they were looking for in a partner.

That's a language understanding problem. Filters can't solve language understanding. Embeddings can.

System Architecture: The 30-Second Overview

┌─────────────────────────────────────┐
│           Flutter App               │
│  - Profile UI                       │
│  - Discovery Feed                   │
│  - Real-time Chat (WebSocket)       │
└──────────────┬──────────────────────┘
               │  REST + WebSocket
               ▼
┌─────────────────────────────────────┐
│         FastAPI Backend             │
│  - Auth & User Management           │
│  - Match Ranking API                │
│  - WebSocket Connection Manager     │
│  - Background Job Scheduler         │
└──────────┬──────────────────────────┘
           │  Internal HTTP
           ▼
┌─────────────────────────────────────┐
│       AI Microservice               │
│  - Profile Embedding Generator      │
│  - Cosine Similarity Engine         │
│  - LLM Summarization (on-demand)    │
└─────────────────────────────────────┘

Three services, three responsibilities. Flutter doesn't know the AI service exists. The AI service doesn't know Flutter exists. FastAPI sits in the middle, orchestrating, caching, and serving.

Embeddings: Turning Profiles into Vectors

An embedding is a mathematical representation of text — a list of numbers (a vector) that captures the semantic meaning of the text. Two pieces of text that mean similar things produce vectors that are close together in space. Two pieces of text that mean different things produce vectors far apart.
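"Close together" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below uses toy 3-dimensional vectors rather than real model output, purely to show the arithmetic.

```python
import math


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|). 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)


# Toy vectors standing in for embeddings (real ones have hundreds of dimensions).
a = [1.0, 0.0, 0.0]
b = [2.0, 0.0, 0.0]  # same direction as a, different length
c = [0.0, 1.0, 0.0]  # orthogonal to a

same_direction = cosine_similarity(a, b)
unrelated = cosine_similarity(a, c)
```

Because the score depends only on direction, `a` and `b` count as identical in meaning even though their lengths differ.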

Here's the key insight: complementary profiles do not have similar vectors. A technical founder's profile and a business founder's profile should produce a high match score even though their embeddings point in different directions, so ranking by raw profile-to-profile similarity would surface the wrong people.

We solved this with a two-stage approach:

Stage 1 — Profile embedding: Each user's full profile text (bio, what they're building, what they're looking for) is embedded using a sentence-transformer model. This happens once when the profile is created or updated, and the vector is stored in PostgreSQL with the pgvector extension.

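# AI microservice: profile embedding
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

# Loaded once at startup; all-MiniLM-L6-v2 returns 384-dimensional vectors.
model = SentenceTransformer('all-MiniLM-L6-v2')
app = FastAPI()

# Request body; the field types are assumed from how they are used below.
class ProfileRequest(BaseModel):
    building_description: str
    looking_for: str
    background: str
    stage: str

@app.post('/embed')
def embed_profile(profile: ProfileRequest):
    # Plain `def` rather than `async def`: model.encode() is CPU-bound, and
    # FastAPI runs sync handlers in a thread pool instead of on the event loop.
    text = (
        f"Building: {profile.building_description}\n"
        f"Looking for: {profile.looking_for}\n"
        f"Background: {profile.background}\n"
        f"Stage: {profile.stage}"
    )
    vector = model.encode(text)
    return {'vector': vector.tolist(), 'dim': len(vector)}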

Stage 2 — Complementarity scoring: We don't match similar embeddings. We match embeddings where user A's "looking for" description is semantically close to user B's "background" description. If your "looking for" says "business-minded co-founder who can talk to customers" and my background says "5 years in enterprise sales and BD", those two text chunks have high similarity — and that's a real signal.
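A minimal sketch of that cross-field scoring follows. It assumes the "looking for" and "background" texts are embedded separately, and it combines the two directions with `min`; both of those are assumptions made for illustration, as are the `ProfileVectors` type and the toy 2-dimensional vectors.

```python
import math
from dataclasses import dataclass


@dataclass
class ProfileVectors:
    looking_for: list  # embedding of the "looking for" text
    background: list   # embedding of the "background" text


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0


def complementarity(a, b):
    """Score how well a and b fill each other's stated gaps."""
    a_wants_b = cosine(a.looking_for, b.background)
    b_wants_a = cosine(b.looking_for, a.background)
    # Assumption: use the weaker direction so one-sided fits do not rank highly.
    return min(a_wants_b, b_wants_a)


# Toy vectors: axis 0 loosely means "technical", axis 1 "business".
tech = ProfileVectors(looking_for=[0.1, 0.9], background=[0.9, 0.1])
biz = ProfileVectors(looking_for=[0.9, 0.1], background=[0.1, 0.9])
other_tech = ProfileVectors(looking_for=[0.1, 0.9], background=[0.9, 0.1])

good_pair = complementarity(tech, biz)
poor_pair = complementarity(tech, other_tech)
```

The technical and business founders score near 1.0, while the two technical founders, whose whole profiles would look nearly identical, score low.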

The FastAPI Backend: Where the Logic Lives

FastAPI was the right choice for this backend, and not just because of Python's ML ecosystem. The async-first design means our recommendation endpoint, which fires off multiple DB queries and occasionally calls the AI microservice, yields the event loop while it waits on I/O instead of tying up a worker. Under load, this matters.

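# FastAPI matching endpoint
import json

from fastapi import APIRouter, Depends
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

# User, Match, get_current_user, get_db and the async `redis` client are
# defined elsewhere in the application and imported from its own modules.
router = APIRouter()

@router.get('/discover')
async def get_matches(
    current_user: User = Depends(get_current_user),
    db: AsyncSession = Depends(get_db),
    limit: int = 20
):
    # Check Redis cache first. The key includes `limit` so a request is
    # never served a list that was cached for a different limit.
    cache_key = f'matches:{current_user.id}:{limit}'
    cached = await redis.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    # Get pre-computed matches from DB
    matches = await db.execute(
        select(Match)
        .where(Match.user_id == current_user.id)
        .order_by(Match.score.desc())
        .limit(limit)
    )

    result = [match.to_dict() for match in matches.scalars()]

    # Cache for 5 minutes
    await redis.setex(cache_key, 300, json.dumps(result))
    return result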

The matching itself doesn't happen at request time. It happens in a background scheduler that runs every 5 minutes, recomputing match scores for users whose profiles changed recently. This is the key architectural decision that made latency acceptable.
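The scheduler loop can be sketched roughly like this. The profile records, the `recompute` callback, and the function names are hypothetical stand-ins; the real job reads from the database and writes `Match` rows.

```python
import asyncio
from datetime import datetime, timedelta

RECOMPUTE_INTERVAL = timedelta(minutes=5)


def changed_since(profiles, last_run):
    """Return IDs of profiles updated after the previous scheduler pass."""
    return [p["id"] for p in profiles if p["updated_at"] > last_run]


async def scheduler_tick(profiles, last_run, recompute):
    """Run one pass and return the timestamp to use as last_run next time."""
    # Take the timestamp before scanning so edits made during this pass
    # are picked up by the next one instead of being skipped.
    started = datetime.now()
    for user_id in changed_since(profiles, last_run):
        await recompute(user_id)
    return started


async def run_forever(load_profiles, recompute):
    """Long-running loop: one pass, then sleep for the interval."""
    last_run = datetime.min
    while True:
        last_run = await scheduler_tick(load_profiles(), last_run, recompute)
        await asyncio.sleep(RECOMPUTE_INTERVAL.total_seconds())
```

One thing this sketch leaves out: a changed profile also affects the match lists of other users, so a real job needs some way to refresh those as well.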

Flutter: Consuming the API and WebSocket

On the Flutter side, two parallel data flows run simultaneously: REST for the main content, WebSocket for real-time events.

The discovery feed is a REST call — the backend returns a list of pre-computed matches, sorted by score. No real-time needed here.

Real-time is used for: new match notifications, message delivery in chat, and profile view alerts. These fire through a WebSocket connection that the app maintains while in the foreground.

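// WebSocket service, managed as a Riverpod provider
import 'dart:async';
import 'dart:convert';
import 'dart:io';

class WebSocketService {
  WebSocket? _socket;
  bool _reconnectScheduled = false;
  final _eventController = StreamController<SocketEvent>.broadcast();

  Stream<SocketEvent> get events => _eventController.stream;

  Future<void> connect(String token) async {
    try {
      _socket = await WebSocket.connect(
        '${AppConfig.wsBaseUrl}/ws?token=$token',
      );
    } catch (_) {
      // The handshake itself can fail (no network, server down); retry too.
      _reconnect(token);
      return;
    }

    _socket!.listen(
      (data) => _eventController.add(SocketEvent.fromJson(jsonDecode(data))),
      onDone: () => _reconnect(token),
      onError: (_) => _reconnect(token),
    );
  }

  void _reconnect(String token) {
    // onError and onDone can both fire for the same drop; schedule one retry.
    if (_reconnectScheduled) return;
    _reconnectScheduled = true;
    Future.delayed(const Duration(seconds: 3), () {
      _reconnectScheduled = false;
      connect(token);
    });
  }
}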

The auto-reconnect on onDone and onError is critical. Mobile connections drop constantly — background switches, weak signal, app suspension. Without reconnect logic, users would miss real-time events after any interruption. With it, the reconnect is invisible to the user.

In the UI, match notifications are surfaced through Riverpod's stream providers:

@riverpod
Stream<MatchNotification> matchNotifications(MatchNotificationsRef ref) {
  final wsService = ref.watch(webSocketServiceProvider);
  return wsService.events
    .where((e) => e.type == 'new_match')
    .map((e) => MatchNotification.fromMap(e.data));
}

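// In a ConsumerWidget's build method. A stream provider exposes
// AsyncValue<MatchNotification>, so unwrap it before reading fields.
ref.listen(matchNotificationsProvider, (_, next) {
  next.whenData((notification) {
    ScaffoldMessenger.of(context).showSnackBar(
      SnackBar(content: Text('New match: ${notification.userName}')),
    );
  });
});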
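On the server side, the WebSocket Connection Manager from the architecture diagram has to map users to live sockets and push events to them. A rough sketch, assuming one socket per user and an event shape of `{"type": ..., "data": ...}` inferred from the Dart code above; the class and method names are hypothetical. The socket only needs an async `send_json` method, which FastAPI's `WebSocket` provides.

```python
class ConnectionManager:
    """Tracks one live socket per user and pushes JSON events to it."""

    def __init__(self):
        self._sockets = {}  # user_id -> socket object

    def register(self, user_id, socket):
        # A newer connection replaces any older one for the same user.
        self._sockets[user_id] = socket

    def unregister(self, user_id):
        self._sockets.pop(user_id, None)

    async def send_event(self, user_id, event_type, data):
        """Return True if the event was sent, False if the user is offline."""
        socket = self._sockets.get(user_id)
        if socket is None:
            return False
        try:
            await socket.send_json({"type": event_type, "data": data})
            return True
        except Exception:
            # The connection died mid-send; forget it and let the client reconnect.
            self.unregister(user_id)
            return False
```

Events sent while a user is offline are simply dropped here. Whether to queue them for delivery after reconnect is a separate design decision this sketch does not make.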

Latency: From 800ms to 42ms at p50

The first version of the recommendation endpoint took 800ms to return results. On a phone, an 800ms wait between tap and content feels like an eternity. Here's exactly what was causing it and how we fixed it.

Problem 1: Computing matches at request time. The original implementation ran the similarity calculation on every API call. Fix: pre-compute matches in a background job every 5 minutes, store results in DB. Request time drops to a simple DB read.

Problem 2: No caching. Even with pre-computed results, every request hit the database. Fix: Redis cache with 5-minute TTL. Cache hit rate reached 87% within the first week. Average response time for a cache hit: 12ms.

Problem 3: Returning too much data. The initial response included full user profiles in the match list. Fix: return only the data needed for the card view (name, photo, one-line summary, score). Full profile loads on demand when the user taps a card.

After these three fixes, p50 latency dropped from 800ms to 42ms and p95 from 2.1s to 180ms. The app felt instant.
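The third fix, trimming the payload, amounts to projecting each match down to the card fields before serializing. A small sketch; the field names (`photo_url`, `summary`, and the extra profile fields) are hypothetical, chosen to mirror the card view described above.

```python
import json

# Only what the discovery card renders; everything else loads on tap.
CARD_FIELDS = ("name", "photo_url", "summary", "score")


def to_card(match):
    """Project a full match record down to the fields the card view needs."""
    return {field: match[field] for field in CARD_FIELDS}


full_match = {
    "name": "Ada",
    "photo_url": "https://example.com/ada.jpg",
    "summary": "Ex-enterprise sales lead looking for a technical partner.",
    "score": 0.91,
    "bio": "A much longer free-text biography...",
    "building_description": "A long description of the product...",
    "work_history": ["...", "..."],
}

card = to_card(full_match)
bytes_saved = len(json.dumps(full_match)) - len(json.dumps(card))
```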

Cost: From $500/month to $80/month

The first production deployment was expensive. The AI microservice was calling an LLM API for every match computation. With 500+ daily active users each getting matches recomputed multiple times per day, API costs were running $500+/month.

Three changes fixed this:

1. Switch from LLM to embedding models for matching. LLM calls ($0.002 per 1k tokens) replaced with sentence-transformer inference ($0 on our own server, once the model is loaded). Matching accuracy was nearly identical — embeddings capture semantic meaning just as well as LLM responses for this use case.

2. LLM calls only for profile summarization. We kept LLM calls for one thing: generating a 2-sentence summary of each profile for display in the discovery feed. This runs once when a profile is created or meaningfully updated — not on every match computation.

3. Batched embedding updates. Instead of re-embedding a profile the moment it's updated, we batch updates and process them every 15 minutes. 95% of users don't notice the delay. The 5% who update their profile and immediately check their matches see slightly stale results, which is acceptable.
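The batching in item 3 can be sketched as a small queue that keeps only the latest text per user and encodes everything in one call when flushed. `EmbeddingBatcher` and its method names are hypothetical; `encode_batch` stands in for something like `SentenceTransformer.encode`, which accepts a list of texts.

```python
class EmbeddingBatcher:
    """Collects profile edits and embeds them together on flush()."""

    def __init__(self, encode_batch):
        self._encode_batch = encode_batch  # callable: list of str -> list of vectors
        self._pending = {}                 # user_id -> latest profile text

    def mark_dirty(self, user_id, text):
        # Repeated edits within one window overwrite each other,
        # so each user is embedded at most once per flush.
        self._pending[user_id] = text

    def flush(self):
        """Embed all pending profiles in one batch; returns {user_id: vector}."""
        if not self._pending:
            return {}
        user_ids = list(self._pending)
        texts = [self._pending[uid] for uid in user_ids]
        self._pending = {}
        vectors = self._encode_batch(texts)
        return dict(zip(user_ids, vectors))
```

A scheduler would call `flush()` every 15 minutes and write the returned vectors to the database.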

The Lesson About Premature AI Optimisation

The biggest mistake in the first version wasn't the latency or the cost. It was over-engineering the AI layer before we had evidence it was needed.

I spent three weeks building a custom fine-tuned model before we had any users to validate whether the base model worked. The fine-tuned model performed about 6% better in synthetic tests. In production, with real users, the difference was too small to measure: users couldn't tell the fine-tuned and base model recommendations apart.

Three weeks of engineering for zero measurable user benefit.

The lesson I took from this: measure user outcomes, not model metrics. A recommendation system's accuracy on a benchmark dataset is not the same as whether users are finding good co-founders. Ship the simpler model. Measure real outcomes. Optimise based on what you can actually measure.

This is premature AI optimisation, and it's as real as premature code optimisation. Build the simple thing. Prove it works. Then improve it.

What I'd Build Differently Today

Looking back on the Fellow Founder AI stack, three things I'd change:

1. pgvector from day one. We started with cosine similarity computed in Python. Adding pgvector to PostgreSQL meant we could query the 1000 nearest vectors in SQL directly — no Python, no roundtrip to the AI service. This change alone simplified the architecture by removing an entire service call from the hot path.

2. Structured output from LLMs. Our LLM summarization prompt returns free text. Parsing it requires handling edge cases, hallucinations, and format variations. Using a structured output format (JSON mode, or function calling) would have eliminated an entire class of bugs in the parsing layer.

3. A/B test the recommendation algorithm earlier. We optimised for similarity scores for 6 months before A/B testing whether higher similarity scores actually correlated with users having better conversations. They did — but knowing that earlier would have validated the approach and let us focus optimisation on the things that actually mattered.
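For item 1, a nearest-neighbour query with pgvector might look like the sketch below. The table and column names (`profile_embeddings`, `user_id`, `embedding`) are hypothetical. `<=>` is pgvector's cosine distance operator, so `1 - distance` gives cosine similarity.

```python
def to_pgvector_literal(vector):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


# Named parameters in the :name style, as used by SQLAlchemy's text().
NEAREST_SQL = """
SELECT user_id,
       1 - (embedding <=> CAST(:query AS vector)) AS cosine_similarity
FROM profile_embeddings
WHERE user_id <> :user_id
ORDER BY embedding <=> CAST(:query AS vector)
LIMIT :limit
"""


def nearest_params(user_id, query_vector, limit=1000):
    """Build the bind parameters for NEAREST_SQL."""
    return {
        "user_id": user_id,
        "query": to_pgvector_literal(query_vector),
        "limit": limit,
    }


params = nearest_params(42, [0.5, 1, -2])
```

With the complementarity approach described earlier, `query_vector` would be the current user's "looking for" embedding and `embedding` the other users' "background" embeddings, though that pairing is an assumption about how the columns are laid out.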