Vector Similarity Search at Scale for AI Agent Memory Retrieval

Dhawal Chheda, AI Leader at Accel4

Deep Comparison: pgvector vs Pinecone vs Weaviate vs Qdrant vs Milvus vs ChromaDB


1. Architecture Overview

| Database | Language | Deployment Model | Index Types | License |
| --- | --- | --- | --- | --- |
| pgvector | C (Postgres extension) | Self-hosted / Managed Postgres | HNSW, IVFFlat | PostgreSQL License |
| Pinecone | Proprietary | Managed SaaS only | Proprietary (serverless) | Proprietary |
| Weaviate | Go | Self-hosted / Cloud / BYOC | HNSW + ACORN, Flat, Dynamic | BSD-3 |
| Qdrant | Rust | Self-hosted / Cloud / Hybrid | HNSW, Sparse Inverted Index | Apache 2.0 |
| Milvus | Go + C++ | Self-hosted / Zilliz Cloud | HNSW, IVF_FLAT, IVF_PQ, DiskANN, GPU CAGRA | Apache 2.0 |
| ChromaDB | Rust core (Python API) | Embedded / Self-hosted / Cloud (early) | HNSW | Apache 2.0 |

2. Query Latency Benchmarks

At 1M Vectors (768 dimensions)

| Database | P95 Latency | P99 Latency | Notes |
| --- | --- | --- | --- |
| Qdrant | 30-40 ms | ~1 ms (small, optimized datasets) | Rust engine; best raw single-node latency |
| Pinecone | 40-50 ms | 7 ms (claimed) | Serverless auto-tuned |
| Milvus | 50-80 ms | <30 ms P95 | Strong with GPU acceleration |
| Weaviate | 50-70 ms | ~50 ms on 768-dim | ACORN filtering adds minimal overhead |
| pgvector | Sub-100 ms | Varies with config | Competitive at this scale with HNSW |
| ChromaDB | Fast for single requests | Degrades under concurrency | 23 s average at 100 concurrent requests vs pgvector's 9.8 s |

Source: TensorBlue 2025 Comparison, Firecrawl 2026 Guide
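The P95/P99 figures above are tail-latency percentiles: the value below which 95% (or 99%) of measured query times fall. A minimal nearest-rank sketch of how such figures are derived from raw per-query timings (the sample latencies here are illustrative, not benchmark data):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-indexed nearest rank
    return ordered[max(rank, 1) - 1]

# Illustrative per-query latencies in ms -- NOT real benchmark numbers
latencies_ms = [14, 12, 15, 13, 16, 18, 90, 17, 14, 200]
p95 = percentile(latencies_ms, 95)  # tail latency is dominated by outliers
```

This is why P95/P99 matter more than averages for agent workloads: a few slow outliers dominate the tail even when the median is fast.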

At 10M Vectors

| Database | Observed QPS | Notes |
| --- | --- | --- |
| Qdrant | 8,000-15,000 (with quantization) | 4x RPS improvement in recent versions |
| Milvus | 10,000-20,000 | Fastest indexing time; GPU acceleration available |
| Pinecone | 5,000-10,000 | Consistent with auto-scaling |
| Weaviate | 3,000-8,000 | Additional overhead from graph features |
| pgvector | Competitive but requires tuning | Index builds consume significant RAM |

At 50M-100M Vectors

This is where databases diverge dramatically:

| Database | Performance at 50M+ | 100M+ Viability |
| --- | --- | --- |
| Milvus | Purpose-built for this scale; proven at billions | Excellent – designed for billion-scale |
| Pinecone | Proven at billions with serverless | Excellent – auto-scales transparently |
| pgvectorscale | 471 QPS at 99% recall (50M) – 11.4x better than Qdrant | Hits walls beyond 100M; relational storage model limits |
| Qdrant | 41.47 QPS at 99% recall (50M) | Performance degrades; best under 50M per node |
| Weaviate | Reports needing more memory/compute above 100M | Viable with enterprise cloud; needs careful sizing |
| ChromaDB | Not designed for this scale | Not viable – single-node focus |

Key finding: pgvectorscale with DiskANN and Statistical Binary Quantization achieves P95 latency 28x lower than Pinecone s1 pods at 99% recall on 50M vectors. However, Postgres’s 8KB page storage model and memory-intensive index builds become limiting factors beyond 100M. Source: Firecrawl 2026
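To make the page-size point concrete: a 1536-dimension float32 vector occupies roughly 6 KB, so barely one vector fits per 8 KB heap page, and the raw vector data alone for 100M vectors exceeds half a terabyte. A back-of-the-envelope sketch (sizes are approximate; pgvector adds per-value headers, and Postgres can TOAST large values out of line):

```python
DIMS = 1536
BYTES_PER_FLOAT32 = 4
PAGE_SIZE = 8 * 1024  # Postgres heap page size

vector_bytes = DIMS * BYTES_PER_FLOAT32          # bytes per raw vector
vectors_per_page = PAGE_SIZE // vector_bytes      # ignoring tuple headers
total_gb = 100_000_000 * vector_bytes / 1024**3   # raw data for 100M vectors

print(vector_bytes, vectors_per_page, round(total_gb))  # 6144 1 572
```

An HNSW index build over data of that size has to touch most of it in memory, which is why index builds, not queries, are usually the first wall teams hit.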


3. Throughput (QPS)

VectorDBBench Standard Results (8-core, 32GB host)

The VectorDBBench project provides standardized comparisons across 15 test cases. Key findings from 2025-2026 runs:

  • Milvus/Zilliz Cloud: Led in indexing speed and maintained strong QPS at moderate dimensions. Performance drops with higher-dimensional embeddings (1536+).
  • Qdrant: Highest RPS and lowest latencies in most scenarios at 1M-10M scale. Claims 4x RPS improvements over previous versions on deep-image-96 dataset.
  • Elasticsearch: 10x slower indexing than Qdrant at 10M+ vectors (5.5 hours vs 32 minutes).
  • Weaviate: “Improved the least since the last benchmark run” per Qdrant’s benchmarks. However, Weaviate’s own Search Mode benchmarks show +17% Success@1 improvement.
  • Pinecone: In Cohere 10M streaming tests, maintained higher QPS and recall throughout the write cycle. Performance improved significantly after ingestion completion.

Source: Qdrant Benchmarks, VectorDBBench GitHub


4. Hybrid Search (Vector + Full-Text)

This is a critical dimension for AI agent memory, where you need both semantic similarity (finding conceptually related memories) and keyword precision (matching specific entities, dates, tool names).

| Database | BM25/Full-Text | Hybrid Approach | Fusion Method | Maturity |
| --- | --- | --- | --- | --- |
| Weaviate | Native BM25F (field-weighted) | Single hybrid() API call | RRF or custom weights | Most mature – BlockMax WAND (GA 2025) makes keyword side 10x faster |
| Milvus | Native Sparse-BM25 (since 2.5) | Multi-vector search combining dense + sparse | RRF, Weighted | Strong – 30x faster than Elasticsearch on BM25 queries (6ms vs 200ms at 1M vectors) |
| Qdrant | Sparse inverted index (since v1.9) | Named vectors: dense HNSW + sparse index | DBSF (score-aware fusion) | Good – server-side IDF since v1.15.2 |
| Pinecone | Sparse-dense vectors | Combined sparse+dense in single query | Built-in fusion | Good for simple cases; less configurable |
| pgvector | PostgreSQL tsvector/tsquery | Separate queries combined in SQL | Manual (application-level RRF) | Functional but not integrated; no single-query hybrid |
| ChromaDB | Basic metadata + full-text search | Where filters + text matching | None (basic) | Limited – not production-grade hybrid |
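The "application-level RRF" noted for pgvector is straightforward to implement: run the vector query and the full-text query separately, then merge the two ranked ID lists with Reciprocal Rank Fusion. A minimal sketch (k=60 is the conventional default constant; the IDs are illustrative):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# IDs returned by a dense vector query and a BM25/tsquery keyword query
dense_hits = ["m3", "m1", "m7"]
keyword_hits = ["m1", "m9", "m3"]
fused = rrf_fuse([dense_hits, keyword_hits])  # ["m1", "m3", "m9", "m7"]
```

RRF only needs ranks, not scores, which is exactly why it works for fusing lists whose raw scores (cosine distance vs BM25) are not comparable.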

Hybrid Search Quality Benchmarks (Weaviate Search Mode)

| Dataset | Success@1 | Recall@5 | nDCG@10 |
| --- | --- | --- | --- |
| BEIR Natural Questions | 0.43 | 0.70 | 0.61 |
| BEIR SciFact | 0.58 | 0.79 | 0.71 |
| BEIR FiQA | 0.45 | 0.43 | 0.45 |

Weaviate’s Search Mode showed +5% to +24% improvement over standard hybrid search across 12 IR benchmarks.
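These are standard IR metrics: Success@1 asks whether the top-ranked hit is relevant, and Recall@5 asks what fraction of the relevant documents appear in the top 5 (nDCG@10 additionally discounts hits by rank position). A minimal sketch of the first two, with illustrative document IDs:

```python
def success_at_1(retrieved, relevant):
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return 1.0 if retrieved and retrieved[0] in relevant else 0.0

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

retrieved = ["d4", "d1", "d9", "d2", "d8"]   # ranked search output
relevant = {"d1", "d2", "d3"}                # ground-truth labels
# success_at_1 -> 0.0 (d4 is not relevant); recall_at_k(..., 5) -> 2/3
```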

For AI agent memory: Hybrid search is essential. Agent memories contain both semantic content (“the user prefers Python”) and precise identifiers (“tool:code_executor”, “session:abc123”). Weaviate and Milvus lead here.

Sources: Weaviate Hybrid Search, Weaviate Search Mode Benchmarking, Milvus Full Text Search, Qdrant Hybrid Search


5. Metadata Filtering Performance

Filtering is the “Achilles heel” of vector search. AI agents need to filter by user_id, session_id, timestamp ranges, memory type, and tool context – often eliminating 90-99% of candidates.

Filtering Strategies by Database

| Database | Strategy | Behavior Under Restrictive Filters |
| --- | --- | --- |
| Qdrant | Adaptive query planner (brute-force or graph filtering based on selectivity) | Maintains stable latency; HNSW with 0.1 ratio filter shows similar latency to unfiltered |
| Weaviate | ACORN (default since v1.34) – multi-hop expansion + random seed entrypoints | Up to 10x improvement with low-correlation filters; ranked #2 on Qdrant's benchmark |
| Milvus | Pre-filtering with bitset + partition key | Fast for high-cardinality attributes; partition key enables O(1) tenant isolation |
| Pinecone | Single-stage integrated filtering | 50% selective filter: 57 ms; 1% filter: 51.6 ms (35% faster than unfiltered 79 ms on 1.2M vectors) |
| pgvector | Post-filtering (no filter push-down into index scan) | "Difference between 50ms and 5 seconds" – iterative scans in 0.8.0 help but fundamental limitation remains |
| ChromaDB | Where-clause metadata filtering | Basic; no advanced query planning |

Critical finding: Engines with in-algorithm filtering (Qdrant, Pinecone, Weaviate ACORN) actually get faster under restrictive filters because they prune the search space. Post-filter systems (pgvector, LanceDB, OpenSearch) spike to 200-300ms P95 under filtering. Source: The Achilles Heel of Vector Search: Filters
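The pre- vs post-filter distinction is easy to demonstrate: a post-filter system ranks everything, takes the top-k, and only then discards non-matching hits, so it can return fewer than k results; a pre-filter system restricts the candidate set before ranking. A toy brute-force sketch (real engines apply the filter inside the HNSW traversal, not over a Python list):

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def post_filter(query, docs, pred, k):
    """Rank all docs, keep top-k, THEN filter: may return fewer than k."""
    top = sorted(docs, key=lambda d: cosine_sim(query, d["vec"]), reverse=True)[:k]
    return [d for d in top if pred(d)]

def pre_filter(query, docs, pred, k):
    """Filter first, then rank only the matching candidates."""
    candidates = [d for d in docs if pred(d)]
    return sorted(candidates, key=lambda d: cosine_sim(query, d["vec"]), reverse=True)[:k]

docs = [
    {"id": 1, "user": "a", "vec": [1.0, 0.0]},
    {"id": 2, "user": "b", "vec": [0.9, 0.1]},
    {"id": 3, "user": "b", "vec": [0.8, 0.2]},
    {"id": 4, "user": "a", "vec": [0.0, 1.0]},
]
is_user_a = lambda d: d["user"] == "a"
# For query [1.0, 0.0] with k=2: post_filter loses doc 4 entirely,
# pre_filter correctly returns both of user a's documents.
```

With a 1% selective filter (the common per-user case in agent memory), a post-filter top-k is almost entirely consumed by other users' documents, which is exactly the recall collapse the article describes.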

Filtered QPS comparison at 10% selectivity:
- Pinecone-p2: ~600 QPS (from ~800 unfiltered)
- Zilliz-AUTOINDEX: ~700 QPS (from ~750 unfiltered)
- pgvector-HNSW: QPS dips below unfiltered (post-filter overhead)
- OpenSearch-HNSW: QPS dips similarly


6. Cost Analysis at Scale

Managed Service Costs (Monthly Estimates, 1536 dimensions)

| Scale | Pinecone Serverless | Weaviate Cloud | Qdrant Cloud | Zilliz Cloud (Milvus) | pgvector (Managed Postgres) | ChromaDB |
| --- | --- | --- | --- | --- | --- | --- |
| 1M vectors | ~$15-25 | ~$45 (minimum) | Free (1GB tier) | Free tier / ~$20 | $10-30 (small instance) | Free (self-hosted) |
| 10M vectors (5M queries/mo) | ~$64 | ~$85 | ~$102 (AWS, no quantization) | ~$89 (serverless) / $114 (dedicated) | ~$100-200 (self-hosted AWS) | Free but single-node only |
| 100M vectors | ~$200-400 storage + significant read costs | Enterprise pricing (contact sales) | ~$660+ self-hosted AWS | $500-1000 self-hosted / Enterprise managed | $300-500 (r6g.2xlarge+) | Not viable |

Pricing Models Breakdown

Pinecone Serverless:
- Storage: $0.33/GB/month
- Read Units: $8.25 per 1M reads
- Write Units: $2.00 per 1M writes
- Tipping point for self-hosting: ~60-80M queries/month
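Plugging those rates into a quick estimate reproduces the ~$64/month figure from the table above. This is a rough sketch only: actual Pinecone read-unit consumption depends on namespace size and query fan-out, so one read unit per query is a simplifying assumption.

```python
DIMS = 1536
BYTES_PER_DIM = 4  # float32

def serverless_estimate(n_vectors, queries_per_month, writes_per_month):
    """Monthly cost from the published per-unit rates (simplified)."""
    storage_gb = n_vectors * DIMS * BYTES_PER_DIM / 1024**3
    storage_cost = storage_gb * 0.33             # $/GB/month
    read_cost = queries_per_month / 1e6 * 8.25   # assumes 1 read unit/query
    write_cost = writes_per_month / 1e6 * 2.00
    return storage_cost + read_cost + write_cost

# 10M vectors, 5M queries/month, 1M writes/month -> roughly $62/month
cost = serverless_estimate(10_000_000, 5_000_000, 1_000_000)
```

Note that reads dominate: at this scale about two-thirds of the bill is query traffic, which is why the self-hosting tipping point is expressed in queries per month rather than vector count.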

Zilliz Cloud (Milvus):
- Serverless: $4 per 1M vCUs
- Storage: $0.04/GB/month (standardized Jan 2026)
- DiskANN advantage: 10x more vectors on SSD vs RAM, dramatically lowering cost at 100M+
- Tiered storage: 87% storage cost reduction announced

Weaviate Cloud:
- $25 per 1M vector dimensions/month
- $45/month minimum
- Predictable pricing – no penalty for query spikes

Qdrant Cloud:
- Free: 1GB forever
- Managed: Starting $0.014/hour
- Compute + Memory + Storage billed hourly

Self-Hosting Cost Crossover

Below 50M vectors: Managed SaaS is cheaper due to hidden DevOps costs.
Above 50-100M vectors: Self-hosting becomes economical if you have Kubernetes expertise.
At 1B+ vectors: Milvus self-hosted with DiskANN on commodity hardware is the most cost-effective.

Sources: Rahul Kolekar Pricing Comparison 2026, Pinecone Pricing, Zilliz Pricing, Qdrant Pricing, Weaviate Pricing


7. Operational Complexity

| Database | Deployment Effort | Scaling Model | Key Operational Concerns |
| --- | --- | --- | --- |
| pgvector | Lowest (existing Postgres) | Vertical only (single-node) | Index builds consume 10+ GB RAM on production DB; no horizontal scaling; HNSW rebuilds cause lock contention; IVFFlat clusters degrade over time, requiring periodic rebuilds |
| Pinecone | Lowest (zero-ops SaaS) | Automatic serverless | No infrastructure to manage; vendor lock-in; limited configuration control; no self-hosting option |
| ChromaDB | Low (pip install, embedded) | Single-node only | Prototyping only; no production HA; no horizontal scaling; experimental distributed mode |
| Qdrant | Low-Medium (single binary, Docker) | Manual sharding + replication | Compact Rust binary; simple Docker deployment; snapshot-based backups; manual scaling decisions |
| Weaviate | Medium (Docker/Kubernetes) | Horizontal with replication | Good documentation; module system adds complexity; BYOC option reduces ops burden |
| Milvus | High (Kubernetes required for distributed) | Full horizontal (disaggregated compute/storage) | Requires etcd + MinIO/S3 + Pulsar/Kafka; multiple microservices; steep learning curve; powerful but complex |

Self-Hosted Kubernetes Complexity

Milvus Distributed on Kubernetes is the most operationally demanding: you must manage etcd clusters, object storage (MinIO), message queues (Pulsar), and the Milvus services themselves. This requires engineers who understand Kubernetes networking, persistent volume claims, and pod disruption budgets. However, Zilliz Cloud eliminates this entirely.

Qdrant and Weaviate sit in the middle: Docker deployment is straightforward, and both offer Helm charts for Kubernetes.

pgvector requires zero additional infrastructure – it is your existing PostgreSQL database. This is its strongest operational argument and why many teams start here.

Source: Scaling Vector Search: Self-Hosted vs Managed, The Case Against pgvector


8. Quantization & Memory Optimization

Quantization is critical for cost control at 100M+ scale:

| Database | Binary Quantization | Scalar Quantization | Product Quantization | Memory Reduction |
| --- | --- | --- | --- | --- |
| Qdrant | Yes + 1.5-bit, 2-bit, asymmetric (2025) | Yes (float32 -> uint8) | No | 4-8x (scalar), 32x (binary), 40x speedup with binary |
| Milvus | Yes | Yes | Yes (IVF_PQ, GPU CAGRA) | Up to 32x; DiskANN moves vectors to SSD |
| pgvector | Yes (up to 64K dims) | Yes (halfvec for half-precision) | No | 2-4x with halfvec; 32x with binary |
| Weaviate | Yes (BQ) | Yes (SQ) | Yes (PQ) | Configurable per collection |
| Pinecone | Automatic (internal) | Automatic | Automatic | Transparent to user |
| ChromaDB | No (limited) | No (limited) | No | Relies on HNSW defaults |

Source: Qdrant Quantization, pgvector GitHub
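A minimal sketch of the scalar-quantization idea from the table: map each float32 component to a uint8, a 4x memory reduction. Production engines pick the quantization range per segment and correct scores with a rescoring pass over original vectors, which this toy version omits.

```python
def scalar_quantize(vec):
    """Map each float component to a uint8 code in [0, 255]."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # guard against constant vectors
    return [round((x - lo) / scale) for x in vec], lo, scale

def dequantize(codes, lo, scale):
    """Recover an approximation of the original vector."""
    return [lo + c * scale for c in codes]

vec = [0.12, -0.40, 0.88, 0.05]
codes, lo, scale = scalar_quantize(vec)
approx = dequantize(codes, lo, scale)
# each code fits in 1 byte vs 4 bytes per float32 -> 4x smaller,
# with per-component error bounded by half a quantization step
```

Binary quantization pushes the same trade further: one bit per dimension (sign only), giving the 32x figures in the table at the cost of recall, which is why it is usually paired with rescoring.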


9. AI Agent Memory: Architecture Recommendations

Modern AI agent memory systems require:
- Episodic memory: Conversation turns, tool calls, observations (high write volume, temporal queries)
- Semantic memory: Facts, user preferences, learned patterns (hybrid search critical)
- Procedural memory: Successful strategies, tool usage patterns (metadata-heavy filtering)

Production approaches combine dense vector search + sparse BM25 keyword matching + metadata filtering (timestamps, user IDs, topics) + cross-encoder reranking, while adding semantic caching to cut costs on repeated queries. Source: AI Agent Memory Architecture
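That combined pipeline can be sketched in miniature with toy in-memory stand-ins for the dense index, keyword index, and semantic cache. Every name and scoring function below is illustrative, not a real client API; a real system would call the chosen database's hybrid query and use a proper embedding-similarity cache.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def keyword_score(terms, text):
    """Toy stand-in for BM25: count of query terms present in the text."""
    words = set(text.lower().split())
    return sum(1 for t in terms if t in words)

def retrieve(query_vec, terms, user_id, memories, cache, k=2):
    """Metadata filter -> dense + keyword rankings -> RRF fusion -> cache."""
    key = (tuple(query_vec), tuple(sorted(terms)), user_id)
    if key in cache:
        return cache[key]                                     # cache hit
    mine = [m for m in memories if m["user_id"] == user_id]   # metadata filter
    dense = sorted(mine, key=lambda m: dot(query_vec, m["vec"]), reverse=True)
    sparse = sorted(mine, key=lambda m: keyword_score(terms, m["text"]), reverse=True)
    scores = {}
    for ranking in (dense, sparse):                           # reciprocal-rank fusion
        for rank, m in enumerate(ranking, start=1):
            scores[m["id"]] = scores.get(m["id"], 0.0) + 1.0 / (60 + rank)
    top = sorted(mine, key=lambda m: scores[m["id"]], reverse=True)[:k]
    cache[key] = top
    return top

memories = [
    {"id": 1, "user_id": "u1", "vec": [1.0, 0.0], "text": "user prefers Python"},
    {"id": 2, "user_id": "u1", "vec": [0.0, 1.0], "text": "tool:code_executor failed"},
    {"id": 3, "user_id": "u2", "vec": [1.0, 0.0], "text": "user prefers Go"},
]
cache = {}
hits = retrieve([0.0, 1.0], ["tool:code_executor"], "u1", memories, cache)
```

The cross-encoder reranking step is deliberately left out here; it would re-score `hits` with a slower, more accurate model before the results reach the agent.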


10. Recommendations by Scale

1M Vectors (Startup / Single-Agent)

Best choice: pgvector or Qdrant

pgvector wins if you already run PostgreSQL – zero additional infrastructure, sub-100ms latency, and your metadata lives alongside your vectors. Qdrant wins if you need better filtering performance and want a dedicated vector engine with a small footprint.

10M Vectors (Production Multi-Agent System)

Best choice: Weaviate or Qdrant Cloud

At this scale, hybrid search becomes essential for agent memory quality. Weaviate’s native BM25F + ACORN filtering makes it the strongest choice for retrieval quality. Qdrant Cloud offers the best pure vector performance if hybrid search is handled at the application layer. Pinecone is compelling if you want zero-ops.

100M Vectors (Enterprise Agent Platform)

Best choice: Milvus/Zilliz Cloud or Pinecone Enterprise

Milvus is purpose-built for this scale with DiskANN, GPU acceleration (50x faster search with CAGRA), and disaggregated storage/compute. Zilliz Cloud removes the operational burden. Pinecone Enterprise offers proven billion-scale with zero ops but at premium cost and vendor lock-in. pgvectorscale is surprisingly competitive at 50M (471 QPS at 99% recall) but hits architectural walls beyond 100M.

1B+ Vectors (Massive-Scale Platform)

Best choice: Milvus (self-hosted with GPU) or Pinecone Enterprise

Only Milvus and Pinecone have proven production deployments at billion-scale. Milvus with 8 DGX H100 GPUs can build an index of 635M 1024-dim vectors in 56 minutes (vs 6.22 days on CPU). Zilliz Cloud’s tiered storage delivers 87% storage cost reduction at this scale.


Summary Matrix

| Criterion | pgvector | Pinecone | Weaviate | Qdrant | Milvus | ChromaDB |
| --- | --- | --- | --- | --- | --- | --- |
| Latency (1M) | Good | Good | Good | Best | Good | Good (single-request) |
| Throughput (10M+) | Moderate | High | Moderate | High | Highest | Poor |
| Hybrid Search | Weak | Basic | Best | Good | Strong | Basic |
| Filtering | Weak (post-filter) | Strong | Strong (ACORN) | Best | Strong | Basic |
| Cost (10M) | Lowest | $64/mo | $85/mo | $102/mo | $89-114/mo | Free |
| Cost (100M) | $300-500/mo | $400+/mo | Enterprise | $660+/mo | $500-1000/mo | N/A |
| Ops Complexity | Lowest | Zero-ops | Medium | Low | High | Lowest |
| Max Proven Scale | ~50-100M | Billions | ~100M | ~50M per node | Billions | ~10M |
| GPU Acceleration | No | N/A (managed) | No | Yes (indexing) | Yes (50x search) | No |
| Agent Memory Fit | Good start | Good (managed) | Best (hybrid) | Good (filtering) | Best (scale) | Prototyping only |
