Building AI-Powered Search for Your App: Vector Search, Hybrid Search, and Semantic Ranking from Scratch
Your users are searching "how to fix the login thing when it's stuck" and your search engine returns zero results because no document contains the phrase "login thing stuck." Meanwhile, there's a perfectly relevant knowledge base article titled "Resolving Authentication Token Expiry Issues" sitting right there, invisible to keyword search.
This is the fundamental failure of traditional search: it matches words, not meaning. And in 2026, users expect search that understands intent.
The good news? Building AI-powered search is no longer a PhD project. With modern embedding models, vector databases, and a few clever patterns, you can build a search system that genuinely understands what users mean, not just what they type.
In this guide, we'll build a production-ready AI search system step by step. We'll start with the basics of vector search, evolve to hybrid search (the sweet spot for most applications), add semantic reranking for precision, and cover the production gotchas that tutorials skip. All with TypeScript code you can actually ship.
The Three Generations of Search
Before diving into code, let's understand where we are and why each generation exists.
Generation 1: Keyword Search (BM25/TF-IDF)
This is what most apps still use. PostgreSQL's tsvector, Elasticsearch's default mode, or even SQL LIKE queries.
```sql
-- The classic approach
SELECT * FROM articles
WHERE to_tsvector('english', title || ' ' || body)
  @@ to_tsquery('english', 'authentication & token & expiry');
```
How it works: Count how many times query terms appear in documents, weight by rarity (IDF), and rank by relevance score.
Where it works great:
- Exact term matching ("ERROR 0x80070005")
- Known-item search (searching for a specific document by name)
- Structured queries with boolean operators
- Domain-specific jargon that embedding models may not understand
Where it fails:
- Synonym handling ("car" vs "automobile" vs "vehicle")
- Intent understanding ("how to make my site faster" should also match "web performance optimization")
- Typo tolerance (though fuzzy matching helps partially)
- Multi-lingual queries
Generation 2: Vector Search (Semantic)
Vector search converts text into numerical representations (embeddings) that capture meaning. Similar concepts end up close together in vector space, regardless of the exact words used.
```typescript
// "fix login issue" and "resolve authentication problem"
// end up as nearby vectors
const embedding1 = await embed("fix login issue");
const embedding2 = await embed("resolve authentication problem");
cosineSimilarity(embedding1, embedding2); // ~0.92 (very similar!)
```
How it works: An embedding model (like OpenAI's text-embedding-3-small or open-source nomic-embed-text) converts text into a high-dimensional vector (typically 256-1536 dimensions). Search becomes finding the nearest neighbors in vector space.
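The `cosineSimilarity` helper used above is just a few lines of arithmetic: the dot product of the two vectors divided by the product of their magnitudes. A minimal sketch (in production the database's distance operator does this for you, as we'll see below):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Values near 1 mean the two texts are semantically close.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
```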
Where it excels:
- Understanding intent behind vague queries
- Cross-lingual search (embeddings transcend language barriers)
- Finding semantically related content even with zero word overlap
Where it struggles:
- Exact keyword matching (ironically!)
- Rare technical terms the embedding model hasn't seen
- Recency bias: embeddings don't know what's "new"
- Filter/facet queries ("articles tagged React published after 2025")
Generation 3: Hybrid Search + Reranking (The 2026 Sweet Spot)
The insight: keyword search and vector search fail in complementary ways. Combine them, and you cover each other's blind spots.
```
User Query
    │
┌───────────────────────────┐
│    Parallel Retrieval     │
│  ┌─────────────────────┐  │
│  │   BM25 (keywords)   │──┼──▶ Top 20 keyword results
│  └─────────────────────┘  │
│  ┌─────────────────────┐  │
│  │  Vector (semantic)  │──┼──▶ Top 20 semantic results
│  └─────────────────────┘  │
└───────────────────────────┘
    │
Reciprocal Rank Fusion (merge + deduplicate)
    │
Top 40 candidates (merged)
    │
LLM Reranker (optional, but powerful)
    │
Final Top 10 results
```
This is what we're building. Let's go.
Step 1: Setting Up Vector Search with pgvector
You don't need a specialized vector database to start. PostgreSQL with the pgvector extension handles millions of vectors with excellent performance and gives you the benefit of keeping everything in one database.
Database Setup
```sql
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the documents table with embedding column
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  content TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  embedding vector(1536),  -- OpenAI text-embedding-3-small dimension
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast approximate nearest neighbor search
-- This is the key to performance at scale
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Also create a full-text search index for BM25-style keyword search
ALTER TABLE documents ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(content, '')), 'B')
  ) STORED;

CREATE INDEX ON documents USING gin(search_vector);
```
Generating Embeddings
```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

async function generateEmbedding(text: string): Promise<number[]> {
  // Truncate to model's max token limit
  const truncated = text.slice(0, 8000);

  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: truncated,
    dimensions: 1536,
  });

  return response.data[0].embedding;
}

// Batch embedding for efficiency (up to 2048 inputs per call)
async function generateEmbeddings(
  texts: string[]
): Promise<number[][]> {
  const batchSize = 100;
  const allEmbeddings: number[][] = [];

  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch.map(t => t.slice(0, 8000)),
      dimensions: 1536,
    });
    allEmbeddings.push(
      ...response.data.map(d => d.embedding)
    );

    // Respect rate limits
    if (i + batchSize < texts.length) {
      await new Promise(r => setTimeout(r, 100));
    }
  }

  return allEmbeddings;
}
```
Basic Vector Search
```typescript
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function vectorSearch(
  query: string,
  limit: number = 10
): Promise<SearchResult[]> {
  const queryEmbedding = await generateEmbedding(query);

  const result = await pool.query(`
    SELECT
      id,
      title,
      content,
      metadata,
      1 - (embedding <=> $1::vector) AS similarity
    FROM documents
    WHERE embedding IS NOT NULL
    ORDER BY embedding <=> $1::vector
    LIMIT $2
  `, [JSON.stringify(queryEmbedding), limit]);

  return result.rows;
}
```
The <=> operator computes cosine distance. We subtract from 1 to get cosine similarity (higher = more similar).
Performance Tuning
With HNSW indexes, there's a critical parameter: ef_search. It controls the trade-off between speed and recall (accuracy).
```sql
-- Default: ef_search = 40 (fast, ~95% recall)
SET hnsw.ef_search = 40;

-- Higher accuracy: ef_search = 100 (~99% recall, 2-3x slower)
SET hnsw.ef_search = 100;

-- For production, set per-query based on use case
```
Benchmarks on 1M documents (1536 dimensions):
| ef_search | Recall@10 | Latency (p50) | Latency (p99) |
|---|---|---|---|
| 40 | 95.2% | 5ms | 15ms |
| 100 | 98.8% | 12ms | 30ms |
| 200 | 99.5% | 25ms | 55ms |
For most applications, ef_search = 100 is the sweet spot.
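In application code, "set per-query" means issuing `SET LOCAL` inside a transaction so the setting doesn't leak to other pooled connections. A sketch against a minimal client interface (the tier names and values here are illustrative, taken from the benchmark table above; error handling and rollback are elided):

```typescript
// Minimal shape of the pg client we need, so this sketch has no
// hard dependency on the pg package
interface QueryClient {
  query(sql: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
}

// Illustrative tiers based on the benchmark table above
function efSearchFor(tier: 'fast' | 'balanced' | 'accurate'): number {
  return { fast: 40, balanced: 100, accurate: 200 }[tier];
}

async function vectorSearchWithRecall(
  client: QueryClient,
  embedding: number[],
  tier: 'fast' | 'balanced' | 'accurate',
  limit = 10
) {
  await client.query('BEGIN');
  // SET LOCAL scopes the setting to this transaction only.
  // Safe to interpolate: the value comes from a fixed map, not user input.
  await client.query(`SET LOCAL hnsw.ef_search = ${efSearchFor(tier)}`);
  const result = await client.query(
    `SELECT id, title, 1 - (embedding <=> $1::vector) AS similarity
     FROM documents ORDER BY embedding <=> $1::vector LIMIT $2`,
    [JSON.stringify(embedding), limit]
  );
  await client.query('COMMIT');
  return result.rows;
}
```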
Step 2: Adding Keyword Search (BM25)
Vector search alone isn't enough. When a user searches for "ERROR-4012" or "RFC 7519", keyword search is objectively better. Let's add BM25-style full-text search.
```typescript
async function keywordSearch(
  query: string,
  limit: number = 10
): Promise<SearchResult[]> {
  // Convert user query to tsquery, handling special characters
  const sanitized = query.replace(/[^\w\s]/g, ' ').trim();
  if (!sanitized) return [];  // to_tsquery errors on an empty string
  const tsQuery = sanitized.split(/\s+/).join(' & ');

  const result = await pool.query(`
    SELECT
      id,
      title,
      content,
      metadata,
      ts_rank_cd(search_vector, to_tsquery('english', $1)) AS rank
    FROM documents
    WHERE search_vector @@ to_tsquery('english', $1)
    ORDER BY rank DESC
    LIMIT $2
  `, [tsQuery, limit]);

  return result.rows;
}
```
Step 3: Hybrid Search with Reciprocal Rank Fusion
Now the magic: combining keyword and vector results. The standard approach is Reciprocal Rank Fusion (RRF), which merges ranked lists without needing to normalize scores from different systems.
How RRF Works
RRF Score = Σ (1 / (k + rank_i))
Where k is a constant (typically 60) and rank_i is the document's position in each result list. A document that appears at rank 1 in both lists gets a higher fused score than one at rank 1 in only one list.
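To make the formula concrete, here is a tiny worked example (unweighted RRF with k = 60): a document ranked 1st by keywords and 3rd by vectors out-scores a document ranked 1st by only one retriever:

```typescript
// Unweighted RRF with the standard constant k = 60
const k = 60;
const rrf = (ranks: number[]) =>
  ranks.reduce((sum, r) => sum + 1 / (k + r), 0);

// Doc A: rank 1 in the keyword list AND rank 3 in the vector list
const docA = rrf([1, 3]); // 1/61 + 1/63 ≈ 0.0323

// Doc B: rank 1 in the vector list only
const docB = rrf([1]); // 1/61 ≈ 0.0164

console.log(docA > docB); // true: appearing in both lists wins
```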
Implementation
```typescript
interface SearchResult {
  id: number;
  title: string;
  content: string;
  metadata: Record<string, unknown>;
  score: number;
}

interface HybridSearchOptions {
  limit?: number;
  keywordWeight?: number;  // 0-1, weight for keyword results
  vectorWeight?: number;   // 0-1, weight for vector results
  rrfK?: number;           // RRF constant, default 60
}

async function hybridSearch(
  query: string,
  options: HybridSearchOptions = {}
): Promise<SearchResult[]> {
  const {
    limit = 10,
    keywordWeight = 0.3,
    vectorWeight = 0.7,
    rrfK = 60,
  } = options;

  // Run both searches in parallel
  const candidateCount = limit * 4;  // Over-fetch for better fusion
  const [keywordResults, vectorResults] = await Promise.all([
    keywordSearch(query, candidateCount),
    vectorSearch(query, candidateCount),
  ]);

  // Build rank maps
  const rrfScores = new Map<number, {
    score: number;
    doc: SearchResult;
  }>();

  // Score keyword results
  keywordResults.forEach((doc, index) => {
    const rank = index + 1;
    const rrfScore = keywordWeight * (1 / (rrfK + rank));
    rrfScores.set(doc.id, { score: rrfScore, doc });
  });

  // Score vector results (add to existing or create new)
  vectorResults.forEach((doc, index) => {
    const rank = index + 1;
    const rrfScore = vectorWeight * (1 / (rrfK + rank));
    const existing = rrfScores.get(doc.id);
    if (existing) {
      existing.score += rrfScore;  // Document appears in both lists: boost it
    } else {
      rrfScores.set(doc.id, { score: rrfScore, doc });
    }
  });

  // Sort by fused score and return top results
  return Array.from(rrfScores.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(({ doc, score }) => ({ ...doc, score }));
}
```
When to Adjust Weights
The keywordWeight and vectorWeight parameters are powerful tuning knobs:
| Use Case | Keyword Weight | Vector Weight | Why |
|---|---|---|---|
| General Q&A | 0.3 | 0.7 | Intent matters more |
| Code search | 0.6 | 0.4 | Exact symbols matter |
| Error lookup | 0.7 | 0.3 | Error codes are exact |
| Conversational | 0.2 | 0.8 | Natural language queries |
| Multi-lingual | 0.1 | 0.9 | Embeddings carry language |
Step 4: Semantic Reranking (The Quality Multiplier)
Hybrid search gets you 80% of the way there. Reranking gets you the last 20%, and often that last 20% is the difference between "good search" and "magic search."
What Reranking Does
Retrieval (vector + keyword) is optimized for recall: casting a wide net. Reranking is optimized for precision: looking at each candidate carefully and scoring how relevant it truly is to the query.
A reranker takes the query and each candidate document as a pair and produces a relevance score. Unlike embeddings (which encode query and document independently), rerankers see both together and can capture fine-grained relevance.
Using a Cross-Encoder Reranker
```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

interface RerankedResult extends SearchResult {
  rerankerScore: number;
  relevanceReason: string;
}

async function rerank(
  query: string,
  candidates: SearchResult[],
  topK: number = 10
): Promise<RerankedResult[]> {
  // Format candidates for the reranker prompt
  const candidateList = candidates
    .map((c, i) => `[${i}] Title: ${c.title}\nContent: ${c.content.slice(0, 500)}`)
    .join('\n\n');

  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 2000,
    messages: [{
      role: 'user',
      content: `You are a search relevance judge. Given a query and candidate documents, score each document's relevance from 0.0 to 1.0.

Query: "${query}"

Candidates:
${candidateList}

Return JSON array: [{"index": 0, "score": 0.95, "reason": "directly answers the query"}, ...]

Score criteria:
- 1.0: Directly and completely answers the query
- 0.7-0.9: Highly relevant, addresses the core intent
- 0.4-0.6: Partially relevant, related topic
- 0.1-0.3: Tangentially related
- 0.0: Not relevant at all

Return ONLY the JSON array, no other text.`,
    }],
  });

  const scores = JSON.parse(
    (response.content[0] as { text: string }).text
  ) as { index: number; score: number; reason: string }[];

  return scores
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(s => ({
      ...candidates[s.index],
      rerankerScore: s.score,
      relevanceReason: s.reason,
    }));
}
```
Dedicated Reranker Models (Cheaper Alternative)
LLM reranking is powerful but expensive. For high-volume search, use a dedicated reranker model:
```typescript
// Using Cohere Rerank (or similar dedicated reranker)
import { CohereClient } from 'cohere-ai';

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function cohereRerank(
  query: string,
  candidates: SearchResult[],
  topK: number = 10
): Promise<RerankedResult[]> {
  const response = await cohere.v2.rerank({
    model: 'rerank-v3.5',  // or 'rerank-v4.0-pro' for latest
    query,
    documents: candidates.map(c => ({
      text: `${c.title}\n${c.content.slice(0, 1000)}`,
    })),
    topN: topK,
  });

  return response.results.map(r => ({
    ...candidates[r.index],
    rerankerScore: r.relevanceScore,
    relevanceReason: '',
  }));
}
```
Cost comparison for 1,000 reranking queries/day (20 candidates each):
| Reranker | Latency | Cost/month |
|---|---|---|
| Claude Sonnet (LLM) | ~800ms | ~$90 |
| Cohere Rerank v4.0 | ~180ms | ~$6 |
| Cohere Rerank v3.5 | ~200ms | ~$5 |
| Jina Reranker v2 | ~150ms | ~$4 |
| Self-hosted (cross-encoder) | ~100ms | Server cost only |
For most applications, a dedicated reranker model is the best choice. Reserve LLM reranking for cases where you need the reasoning capability (e.g., explaining why results are relevant).
Step 5: Putting It All Together
Here's the complete search pipeline as a single, production-ready function:
```typescript
interface SearchConfig {
  limit: number;
  keywordWeight: number;
  vectorWeight: number;
  useReranker: boolean;
  rerankerType: 'llm' | 'cohere' | 'none';
  candidateMultiplier: number;
}

const DEFAULT_CONFIG: SearchConfig = {
  limit: 10,
  keywordWeight: 0.3,
  vectorWeight: 0.7,
  useReranker: true,
  rerankerType: 'cohere',
  candidateMultiplier: 4,
};

async function search(
  query: string,
  config: Partial<SearchConfig> = {}
): Promise<SearchResult[]> {
  const cfg = { ...DEFAULT_CONFIG, ...config };
  const candidateCount = cfg.limit * cfg.candidateMultiplier;

  // Stage 1: Parallel retrieval
  const [keywordResults, vectorResults] = await Promise.all([
    keywordSearch(query, candidateCount),
    vectorSearch(query, candidateCount),
  ]);

  // Stage 2: Reciprocal Rank Fusion
  // (the merge logic from Step 3, extracted into its own function)
  const fused = reciprocalRankFusion(
    keywordResults,
    vectorResults,
    cfg
  );

  // Stage 3: Reranking (optional)
  if (cfg.useReranker && fused.length > 0) {
    const rerankerInput = fused.slice(0, cfg.limit * 2);
    if (cfg.rerankerType === 'llm') {
      return rerank(query, rerankerInput, cfg.limit);
    } else if (cfg.rerankerType === 'cohere') {
      return cohereRerank(query, rerankerInput, cfg.limit);
    }
  }

  return fused.slice(0, cfg.limit);
}
```
Production Considerations
Building the pipeline is the easy part. Making it reliable, fast, and cost-effective at scale is where the real engineering happens.
1. Embedding Freshness
When documents change, their embeddings go stale. You need a strategy:
```typescript
// Option 1: Sync on write (simple, adds write latency)
async function updateDocument(id: number, content: string) {
  const embedding = await generateEmbedding(content);
  await pool.query(`
    UPDATE documents
    SET content = $1, embedding = $2::vector, updated_at = NOW()
    WHERE id = $3
  `, [content, JSON.stringify(embedding), id]);
}

// Option 2: Async embedding queue (recommended for production)
import { Queue } from 'bullmq';

const embeddingQueue = new Queue('embeddings', {
  connection: { host: 'localhost', port: 6379 },
});

async function updateDocumentAsync(id: number, content: string) {
  // Update content immediately
  await pool.query(
    'UPDATE documents SET content = $1, updated_at = NOW() WHERE id = $2',
    [content, id]
  );

  // Queue embedding generation
  await embeddingQueue.add('generate', {
    documentId: id,
    content,
  }, {
    attempts: 3,
    backoff: { type: 'exponential', delay: 1000 },
  });
}
```
2. Query Understanding
Raw user queries often need preprocessing before hitting the search pipeline:
```typescript
async function preprocessQuery(rawQuery: string): Promise<{
  processedQuery: string;
  searchConfig: Partial<SearchConfig>;
}> {
  // 1. Detect if the query is an exact code/error lookup
  const isExactMatch = /^[A-Z]+-\d+$|^ERROR|^0x|^HTTP \d{3}/.test(rawQuery);
  if (isExactMatch) {
    return {
      processedQuery: rawQuery,
      searchConfig: { keywordWeight: 0.9, vectorWeight: 0.1, useReranker: false },
    };
  }

  // 2. Expand abbreviated queries (optional LLM step)
  // "k8s OOM pod restart" -> "Kubernetes out of memory pod restart troubleshooting"

  // 3. Detect language for multi-lingual support
  // Embeddings handle cross-lingual naturally, but BM25 needs language-specific config

  return {
    processedQuery: rawQuery,
    searchConfig: {},
  };
}
```
3. Caching Strategy
Embedding generation is the most expensive operation. Cache aggressively:
```typescript
import { createHash } from 'crypto';
import { Redis } from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);

// Stable, collision-resistant cache key for arbitrary-length text
function simpleHash(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

async function getCachedEmbedding(text: string): Promise<number[] | null> {
  const key = `emb:${simpleHash(text)}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);
  return null;
}

async function cacheEmbedding(text: string, embedding: number[]): Promise<void> {
  const key = `emb:${simpleHash(text)}`;
  await redis.set(key, JSON.stringify(embedding), 'EX', 86400);  // 24h TTL
}

// Wrapper with caching
async function getEmbedding(text: string): Promise<number[]> {
  const cached = await getCachedEmbedding(text);
  if (cached) return cached;

  const embedding = await generateEmbedding(text);
  await cacheEmbedding(text, embedding);
  return embedding;
}
```
4. Monitoring and Quality Measurement
You can't improve what you don't measure. Track these metrics:
```typescript
interface SearchMetrics {
  // Performance
  totalLatencyMs: number;
  embeddingLatencyMs: number;
  retrievalLatencyMs: number;
  rerankLatencyMs: number;

  // Quality (requires user feedback or implicit signals)
  clickThroughRate: number;    // % of searches with a click
  meanReciprocalRank: number;  // average 1/rank of first clicked result
  noResultsRate: number;       // % of searches with 0 results

  // Cost
  embeddingTokensUsed: number;
  rerankerCallsMade: number;
}
```
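A thin timing wrapper is enough to start populating the latency fields. A minimal sketch; `instrumentedSearch` and the stage body are placeholders for your actual pipeline, not part of any library:

```typescript
// Wraps any async stage and records its latency into a metrics object,
// even when the stage throws
async function timed<T>(
  metrics: Record<string, number>,
  key: string,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    metrics[key] = Date.now() - start;
  }
}

// Usage sketch: wrap each pipeline stage the same way
async function instrumentedSearch(query: string) {
  const metrics: Record<string, number> = {};
  const results = await timed(metrics, 'retrievalLatencyMs', async () => {
    // ... run the hybrid search pipeline here; empty array as a stand-in
    return [] as unknown[];
  });
  metrics.noResults = results.length === 0 ? 1 : 0;
  return { results, metrics };
}
```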
5. Scaling Beyond PostgreSQL
pgvector works surprisingly well up to ~5M vectors. Beyond that, consider:
| Scale | Recommendation | Why |
|---|---|---|
| < 100K vectors | pgvector | Keep it simple, same DB |
| 100K - 5M | pgvector + HNSW tuning | Still works, tune m and ef |
| 5M - 50M | Dedicated vector DB | Pinecone, Weaviate, Qdrant |
| 50M+ | Distributed vector DB | Milvus, Vespa, custom |
The migration path from pgvector to a dedicated vector DB is straightforward: the embedding generation and search API stay the same; you just swap the storage/query layer.
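One way to keep that migration cheap is to hide storage behind a small interface from day one. A sketch with a brute-force in-memory implementation standing in for pgvector (the `VectorStore` interface and class names here are illustrative, not from any library):

```typescript
interface VectorStore {
  upsert(id: number, embedding: number[]): Promise<void>;
  nearest(
    embedding: number[],
    limit: number
  ): Promise<{ id: number; similarity: number }[]>;
}

// Brute-force in-memory store: fine for tests and tiny datasets;
// swap in a pgvector- or Qdrant-backed implementation for production
class InMemoryVectorStore implements VectorStore {
  private vectors = new Map<number, number[]>();

  async upsert(id: number, embedding: number[]): Promise<void> {
    this.vectors.set(id, embedding);
  }

  async nearest(embedding: number[], limit: number) {
    const cos = (a: number[], b: number[]) => {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    };
    // Score every stored vector and keep the top `limit`
    return [...this.vectors.entries()]
      .map(([id, v]) => ({ id, similarity: cos(embedding, v) }))
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, limit);
  }
}
```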
Choosing an Embedding Model
The embedding model is the most important decision in your search system. Here's the current landscape:
| Model | Dimensions | Max Tokens | Quality (MTEB) | Cost/1M tokens | Best For |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8191 | 62.3 | $0.02 | Cost-effective default |
| OpenAI text-embedding-3-large | 3072 | 8191 | 64.6 | $0.13 | Highest quality (API) |
| Cohere embed-v4.0 | 256โ1536 | 128,000 | 66.2 | $0.10 | Multi-lingual, multimodal |
| Voyage AI voyage-3 | 256โ2048 | 32,000 | 67.1 | $0.06 | Long documents |
| nomic-embed-text (open) | 64โ768 | 8192 | 62.4 | Free (self-host) | Privacy, no API costs |
| BGE-M3 (open) | 1024 | 8192 | 63.0 | Free (self-host) | Multi-lingual, self-hosted |
Recommendations:
- Starting out: OpenAI `text-embedding-3-small` (cheap, good enough, easy API)
- Multi-lingual: Cohere `embed-v4.0` or BGE-M3
- Privacy-sensitive: `nomic-embed-text` (run locally)
- Maximum quality: Voyage AI `voyage-3`
Important: Once you choose an embedding model, switching later requires re-embedding your entire corpus. Choose carefully, and consider starting with a model that handles your future scale.
Common Pitfalls (and How to Avoid Them)
Pitfall 1: Chunking Too Aggressively
If you split documents into tiny chunks, you lose context. The embedding of "It handles this by caching the response" means nothing without knowing what "it" and "this" refer to.
```typescript
// Bad: fixed 200-token chunks lose context
const chunks = splitByTokenCount(document, 200);

// Better: semantic chunking with overlap
function semanticChunk(text: string): string[] {
  const paragraphs = text.split(/\n\n+/);
  const chunks: string[] = [];
  let current = '';

  for (const para of paragraphs) {
    if (current.length + para.length > 1500) {
      if (current) chunks.push(current);
      current = para;
    } else {
      // Avoid a leading blank line on the first paragraph of a chunk
      current = current ? current + '\n\n' + para : para;
    }
  }
  if (current) chunks.push(current);

  // Add overlap: prepend last sentence of previous chunk
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk;
    const prevLastSentence = chunks[i - 1].split(/\. /).pop();
    return `${prevLastSentence}. ${chunk}`;
  });
}
```
Pitfall 2: Ignoring Metadata Filtering
Vector search should not be your only filter. Pre-filter by metadata before vector search for both performance and relevance:
```sql
-- Bad: search all documents, then filter
SELECT * FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 10;
-- Then filter in application code

-- Good: filter first, then search within the subset
SELECT * FROM documents
WHERE metadata->>'category' = 'engineering'
  AND created_at > NOW() - INTERVAL '90 days'
ORDER BY embedding <=> $1::vector
LIMIT 10;
```
Pitfall 3: Not Testing with Real Queries
Build a test set from actual user queries (from search logs, support tickets, or feedback). Automated metrics like NDCG and MRR are useful, but nothing replaces eyeballing the results for your top 50 queries.
```typescript
// Build a golden test set
const testCases = [
  {
    query: "how to fix the login thing when stuck",
    expectedTopResult: "Resolving Authentication Token Expiry Issues",
    expectedInTop5: ["Auth Troubleshooting Guide", "Session Management"],
  },
  // ... 50 more real queries from your search logs
];

async function evaluateSearch() {
  let hits = 0;
  for (const tc of testCases) {
    const results = await search(tc.query, { limit: 5 });
    if (results.some(r => r.title === tc.expectedTopResult)) {
      hits++;
    }
  }
  console.log(`Recall@5: ${(hits / testCases.length * 100).toFixed(1)}%`);
}
```
Pitfall 4: Not Considering Cold Start
When you launch, you have zero search logs. You don't know what users will search for. Start with a generous keyword weight (0.5/0.5 hybrid) and gradually shift toward vector as you collect query data to tune on.
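One simple way to encode that shift is a weight schedule keyed on how many queries you have logged so far. The thresholds below are illustrative starting points, not tuned values:

```typescript
// Hybrid weights as a function of collected query-log volume.
// Thresholds are illustrative; tune them against your golden test set.
function coldStartWeights(loggedQueries: number): {
  keywordWeight: number;
  vectorWeight: number;
} {
  if (loggedQueries < 1_000) return { keywordWeight: 0.5, vectorWeight: 0.5 };
  if (loggedQueries < 10_000) return { keywordWeight: 0.4, vectorWeight: 0.6 };
  return { keywordWeight: 0.3, vectorWeight: 0.7 };
}

console.log(coldStartWeights(0)); // { keywordWeight: 0.5, vectorWeight: 0.5 }
```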
Conclusion: The Search Stack Decision Tree
Building AI search isn't about choosing one technique; it's about layering them correctly:
- Start with hybrid search (BM25 + vector). This alone beats either individual approach by 15-25% on most benchmarks.
- Add reranking when you need precision. A Cohere Rerank call adds ~200ms and costs pennies, but dramatically improves the top-3 result quality.
- Use pgvector unless you have a specific reason not to. Keeping vectors in your existing PostgreSQL database simplifies everything: ops, transactions, backups, joins.
- Measure relentlessly. Track click-through rates, no-results rates, and build a golden test set from real queries. Without measurement, you're tuning blind.
- Don't over-engineer embeddings on day one. Start with `text-embedding-3-small`, ship it, collect real user queries, and then decide if you need a more powerful (and expensive) model.
The gap between "keyword search" and "AI search" isn't a PhD thesis anymore. With the patterns in this guide, a single developer can build a search system in a weekend that would have taken a dedicated search team a quarter to build five years ago. The tools are mature. The patterns are proven. The only thing left is to build it.