Vector Similarity Search at Scale for AI Agent Memory Retrieval

Dhawal Chheda, AI Leader at Accel4

Deep Comparison: pgvector vs Pinecone vs Weaviate vs Qdrant vs Milvus vs ChromaDB


1. Architecture Overview

| Database | Language | Deployment Model | Index Types | License |
| --- | --- | --- | --- | --- |
| pgvector | C (Postgres extension) | Self-hosted / Managed Postgres | HNSW, IVFFlat | PostgreSQL License |
| Pinecone | Proprietary | Managed SaaS only | Proprietary (serverless) | Proprietary |
| Weaviate | Go | Self-hosted / Cloud / BYOC | HNSW + ACORN, Flat, Dynamic | BSD-3 |
| Qdrant | Rust | Self-hosted / Cloud / Hybrid | HNSW, Sparse Inverted Index | Apache 2.0 |
| Milvus | Go + C++ | Self-hosted / Zilliz Cloud | HNSW, IVF_FLAT, IVF_PQ, DiskANN, GPU CAGRA | Apache 2.0 |
| ChromaDB | Rust core (Python API) | Embedded / Self-hosted / Cloud (early) | HNSW | Apache 2.0 |

2. Query Latency Benchmarks

At 1M Vectors (768 dimensions)

| Database | P95 Latency | P99 Latency | Notes |
| --- | --- | --- | --- |
| Qdrant | 30-40 ms | ~1 ms (small, optimized datasets) | Rust engine; best raw single-node latency |
| Pinecone | 40-50 ms | 7 ms (claimed) | Serverless auto-tuned |
| Milvus | 50-80 ms | <30 ms P95 | Strong with GPU acceleration |
| Weaviate | 50-70 ms | ~50 ms on 768-dim | ACORN filtering adds minimal overhead |
| pgvector | Sub-100 ms | Varies with config | Competitive at this scale with HNSW |
| ChromaDB | Fast for single requests | Degrades under concurrency | 23 s average at 100 concurrent requests vs pgvector's 9.8 s |

Source: TensorBlue 2025 Comparison, Firecrawl 2026 Guide
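The P95/P99 figures above are tail-latency percentiles: the value below which 95% (or 99%) of measured query times fall. A minimal nearest-rank sketch of how such figures are derived from raw per-query timings (the sample latencies here are illustrative, not benchmark data):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-indexed nearest rank
    return ordered[max(rank, 1) - 1]

# Illustrative per-query latencies in ms -- NOT real benchmark numbers
latencies_ms = [14, 12, 15, 13, 16, 18, 90, 17, 14, 200]
p95 = percentile(latencies_ms, 95)  # tail latency is dominated by outliers
```

This is why P95/P99 matter more than averages for agent workloads: a few slow outliers dominate the tail even when the median is fast.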

At 10M Vectors

| Database | Observed QPS | Notes |
| --- | --- | --- |
| Qdrant | 8,000-15,000 (with quantization) | 4x RPS improvement in recent versions |
| Milvus | 10,000-20,000 | Fastest indexing time; GPU acceleration available |
| Pinecone | 5,000-10,000 | Consistent with auto-scaling |
| Weaviate | 3,000-8,000 | Additional overhead from graph features |
| pgvector | Competitive but requires tuning | Index builds consume significant RAM |

At 50M-100M Vectors

This is where databases diverge dramatically:

| Database | Performance at 50M+ | 100M+ Viability |
| --- | --- | --- |
| Milvus | Purpose-built for this scale; proven at billions | Excellent – designed for billion-scale |
| Pinecone | Proven at billions with serverless | Excellent – auto-scales transparently |
| pgvectorscale | 471 QPS at 99% recall (50M) – 11.4x better than Qdrant | Hits walls beyond 100M; relational storage model limits |
| Qdrant | 41.47 QPS at 99% recall (50M) | Performance degrades; best under 50M per node |
| Weaviate | Reports needing more memory/compute above 100M | Viable with enterprise cloud; needs careful sizing |
| ChromaDB | Not designed for this scale | Not viable – single-node focus |

Key finding: pgvectorscale with DiskANN and Statistical Binary Quantization achieves P95 latency 28x lower than Pinecone s1 pods at 99% recall on 50M vectors. However, Postgres’s 8KB page storage model and memory-intensive index builds become limiting factors beyond 100M. Source: Firecrawl 2026
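To make the page-size point concrete: a 1536-dimension float32 vector occupies roughly 6 KB, so barely one vector fits per 8 KB heap page, and the raw vector data alone for 100M vectors exceeds half a terabyte. A back-of-the-envelope sketch (sizes are approximate; pgvector adds per-value headers, and Postgres can TOAST large values out of line):

```python
DIMS = 1536
BYTES_PER_FLOAT32 = 4
PAGE_SIZE = 8 * 1024  # Postgres heap page size

vector_bytes = DIMS * BYTES_PER_FLOAT32          # bytes per raw vector
vectors_per_page = PAGE_SIZE // vector_bytes      # ignoring tuple headers
total_gb = 100_000_000 * vector_bytes / 1024**3   # raw data for 100M vectors

print(vector_bytes, vectors_per_page, round(total_gb))  # 6144 1 572
```

An HNSW index build over data of that size has to touch most of it in memory, which is why index builds, not queries, are usually the first wall teams hit.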


3. Throughput (QPS)

VectorDBBench Standard Results (8-core, 32GB host)

The VectorDBBench project provides standardized comparisons across 15 test cases. Key findings from 2025-2026 runs:

  • Milvus/Zilliz Cloud: Led in indexing speed and maintained strong QPS at moderate dimensions. Performance drops with higher-dimensional embeddings (1536+).
  • Qdrant: Highest RPS and lowest latencies in most scenarios at 1M-10M scale. Claims 4x RPS improvements over previous versions on deep-image-96 dataset.
  • Elasticsearch: 10x slower indexing than Qdrant at 10M+ vectors (5.5 hours vs 32 minutes).
  • Weaviate: “Improved the least since the last benchmark run” per Qdrant’s benchmarks. However, Weaviate’s own Search Mode benchmarks show +17% Success@1 improvement.
  • Pinecone: In Cohere 10M streaming tests, maintained higher QPS and recall throughout the write cycle. Performance improved significantly after ingestion completion.

Source: Qdrant Benchmarks, VectorDBBench GitHub


4. Hybrid Search (Vector + Full-Text)

This is a critical dimension for AI agent memory, where you need both semantic similarity (finding conceptually related memories) and keyword precision (matching specific entities, dates, tool names).

| Database | BM25/Full-Text | Hybrid Approach | Fusion Method | Maturity |
| --- | --- | --- | --- | --- |
| Weaviate | Native BM25F (field-weighted) | Single hybrid() API call | RRF or custom weights | Most mature – BlockMax WAND (GA 2025) makes keyword side 10x faster |
| Milvus | Native Sparse-BM25 (since 2.5) | Multi-vector search combining dense + sparse | RRF, Weighted | Strong – 30x faster than Elasticsearch on BM25 queries (6ms vs 200ms at 1M vectors) |
| Qdrant | Sparse inverted index (since v1.9) | Named vectors: dense HNSW + sparse index | DBSF (score-aware fusion) | Good – server-side IDF since v1.15.2 |
| Pinecone | Sparse-dense vectors | Combined sparse+dense in single query | Built-in fusion | Good for simple cases; less configurable |
| pgvector | PostgreSQL tsvector/tsquery | Separate queries combined in SQL | Manual (application-level RRF) | Functional but not integrated; no single-query hybrid |
| ChromaDB | Basic metadata + full-text search | Where filters + text matching | None (basic) | Limited – not production-grade hybrid |
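The "application-level RRF" noted for pgvector is straightforward to implement: run the vector query and the full-text query separately, then merge the two ranked ID lists with Reciprocal Rank Fusion. A minimal sketch (k=60 is the conventional default constant; the IDs are illustrative):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# IDs returned by a dense vector query and a BM25/tsquery keyword query
dense_hits = ["m3", "m1", "m7"]
keyword_hits = ["m1", "m9", "m3"]
fused = rrf_fuse([dense_hits, keyword_hits])  # ["m1", "m3", "m9", "m7"]
```

RRF only needs ranks, not scores, which is exactly why it works for fusing lists whose raw scores (cosine distance vs BM25) are not comparable.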

Hybrid Search Quality Benchmarks (Weaviate Search Mode)

| Dataset | Success@1 | Recall@5 | nDCG@10 |
| --- | --- | --- | --- |
| BEIR Natural Questions | 0.43 | 0.70 | 0.61 |
| BEIR SciFact | 0.58 | 0.79 | 0.71 |
| BEIR FiQA | 0.45 | 0.43 | 0.45 |

Weaviate’s Search Mode showed +5% to +24% improvement over standard hybrid search across 12 IR benchmarks.
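These are standard IR metrics: Success@1 asks whether the top-ranked hit is relevant, and Recall@5 asks what fraction of the relevant documents appear in the top 5 (nDCG@10 additionally discounts hits by rank position). A minimal sketch of the first two, with illustrative document IDs:

```python
def success_at_1(retrieved, relevant):
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return 1.0 if retrieved and retrieved[0] in relevant else 0.0

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

retrieved = ["d4", "d1", "d9", "d2", "d8"]   # ranked search output
relevant = {"d1", "d2", "d3"}                # ground-truth labels
# success_at_1 -> 0.0 (d4 is not relevant); recall_at_k(..., 5) -> 2/3
```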

For AI agent memory: Hybrid search is essential. Agent memories contain both semantic content (“the user prefers Python”) and precise identifiers (“tool:code_executor”, “session:abc123”). Weaviate and Milvus lead here.

Sources: Weaviate Hybrid Search, Weaviate Search Mode Benchmarking, Milvus Full Text Search, Qdrant Hybrid Search


5. Metadata Filtering Performance

Filtering is the “Achilles heel” of vector search. AI agents need to filter by user_id, session_id, timestamp ranges, memory type, and tool context – often eliminating 90-99% of candidates.

Filtering Strategies by Database

| Database | Strategy | Behavior Under Restrictive Filters |
| --- | --- | --- |
| Qdrant | Adaptive query planner (brute-force or graph filtering based on selectivity) | Maintains stable latency; HNSW with 0.1 ratio filter shows similar latency to unfiltered |
| Weaviate | ACORN (default since v1.34) – multi-hop expansion + random seed entrypoints | Up to 10x improvement with low-correlation filters; ranked #2 on Qdrant's benchmark |
| Milvus | Pre-filtering with bitset + partition key | Fast for high-cardinality attributes; partition key enables O(1) tenant isolation |
| Pinecone | Single-stage integrated filtering | 50% selective filter: 57 ms; 1% filter: 51.6 ms (35% faster than unfiltered 79 ms on 1.2M vectors) |
| pgvector | Post-filtering (no filter push-down into index scan) | "Difference between 50ms and 5 seconds" – iterative scans in 0.8.0 help but fundamental limitation remains |
| ChromaDB | Where-clause metadata filtering | Basic; no advanced query planning |

Critical finding: Engines with in-algorithm filtering (Qdrant, Pinecone, Weaviate ACORN) actually get faster under restrictive filters because they prune the search space. Post-filter systems (pgvector, LanceDB, OpenSearch) spike to 200-300ms P95 under filtering. Source: The Achilles Heel of Vector Search: Filters
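The pre- vs post-filter distinction is easy to demonstrate: a post-filter system ranks everything, takes the top-k, and only then discards non-matching hits, so it can return fewer than k results; a pre-filter system restricts the candidate set before ranking. A toy brute-force sketch (real engines apply the filter inside the HNSW traversal, not over a Python list):

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def post_filter(query, docs, pred, k):
    """Rank all docs, keep top-k, THEN filter: may return fewer than k."""
    top = sorted(docs, key=lambda d: cosine_sim(query, d["vec"]), reverse=True)[:k]
    return [d for d in top if pred(d)]

def pre_filter(query, docs, pred, k):
    """Filter first, then rank only the matching candidates."""
    candidates = [d for d in docs if pred(d)]
    return sorted(candidates, key=lambda d: cosine_sim(query, d["vec"]), reverse=True)[:k]

docs = [
    {"id": 1, "user": "a", "vec": [1.0, 0.0]},
    {"id": 2, "user": "b", "vec": [0.9, 0.1]},
    {"id": 3, "user": "b", "vec": [0.8, 0.2]},
    {"id": 4, "user": "a", "vec": [0.0, 1.0]},
]
is_user_a = lambda d: d["user"] == "a"
# For query [1.0, 0.0] with k=2: post_filter loses doc 4 entirely,
# pre_filter correctly returns both of user a's documents.
```

With a 1% selective filter (the common per-user case in agent memory), a post-filter top-k is almost entirely consumed by other users' documents, which is exactly the recall collapse the article describes.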

Filtered QPS comparison at 10% selectivity:
- Pinecone-p2: ~600 QPS (from ~800 unfiltered)
- Zilliz-AUTOINDEX: ~700 QPS (from ~750 unfiltered)
- pgvector-HNSW: QPS dips below unfiltered (post-filter overhead)
- OpenSearch-HNSW: QPS dips similarly


6. Cost Analysis at Scale

Managed Service Costs (Monthly Estimates, 1536 dimensions)

| Scale | Pinecone Serverless | Weaviate Cloud | Qdrant Cloud | Zilliz Cloud (Milvus) | pgvector (Managed Postgres) | ChromaDB |
| --- | --- | --- | --- | --- | --- | --- |
| 1M vectors | ~$15-25 | ~$45 (minimum) | Free (1GB tier) | Free tier / ~$20 | $10-30 (small instance) | Free (self-hosted) |
| 10M vectors (5M queries/mo) | ~$64 | ~$85 | ~$102 (AWS, no quantization) | ~$89 (serverless) / $114 (dedicated) | ~$100-200 (self-hosted AWS) | Free but single-node only |
| 100M vectors | ~$200-400 storage + significant read costs | Enterprise pricing (contact sales) | ~$660+ self-hosted AWS | $500-1000 self-hosted / Enterprise managed | $300-500 (r6g.2xlarge+) | Not viable |

Pricing Models Breakdown

Pinecone Serverless:
- Storage: $0.33/GB/month
- Read Units: $8.25 per 1M reads
- Write Units: $2.00 per 1M writes
- Tipping point for self-hosting: ~60-80M queries/month
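Plugging those rates into a quick estimate reproduces the ~$64/month figure from the table above. This is a rough sketch only: actual Pinecone read-unit consumption depends on namespace size and query fan-out, so one read unit per query is a simplifying assumption.

```python
DIMS = 1536
BYTES_PER_DIM = 4  # float32

def serverless_estimate(n_vectors, queries_per_month, writes_per_month):
    """Monthly cost from the published per-unit rates (simplified)."""
    storage_gb = n_vectors * DIMS * BYTES_PER_DIM / 1024**3
    storage_cost = storage_gb * 0.33             # $/GB/month
    read_cost = queries_per_month / 1e6 * 8.25   # assumes 1 read unit/query
    write_cost = writes_per_month / 1e6 * 2.00
    return storage_cost + read_cost + write_cost

# 10M vectors, 5M queries/month, 1M writes/month -> roughly $62/month
cost = serverless_estimate(10_000_000, 5_000_000, 1_000_000)
```

Note that reads dominate: at this scale about two-thirds of the bill is query traffic, which is why the self-hosting tipping point is expressed in queries per month rather than vector count.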

Zilliz Cloud (Milvus):
- Serverless: $4 per 1M vCUs
- Storage: $0.04/GB/month (standardized Jan 2026)
- DiskANN advantage: 10x more vectors on SSD vs RAM, dramatically lowering cost at 100M+
- Tiered storage: 87% storage cost reduction announced

Weaviate Cloud:
- $25 per 1M vector dimensions/month
- $45/month minimum
- Predictable pricing – no penalty for query spikes

Qdrant Cloud:
- Free: 1GB forever
- Managed: Starting $0.014/hour
- Compute + Memory + Storage billed hourly

Self-Hosting Cost Crossover

Below 50M vectors: Managed SaaS is cheaper due to hidden DevOps costs.
Above 50-100M vectors: Self-hosting becomes economical if you have Kubernetes expertise.
At 1B+ vectors: Milvus self-hosted with DiskANN on commodity hardware is the most cost-effective.

Sources: Rahul Kolekar Pricing Comparison 2026, Pinecone Pricing, Zilliz Pricing, Qdrant Pricing, Weaviate Pricing


7. Operational Complexity

| Database | Deployment Effort | Scaling Model | Key Operational Concerns |
| --- | --- | --- | --- |
| pgvector | Lowest (existing Postgres) | Vertical only (single-node) | Index builds consume 10+ GB RAM on production DB; no horizontal scaling; HNSW rebuilds cause lock contention; IVFFlat clusters degrade over time, requiring periodic rebuilds |
| Pinecone | Lowest (zero-ops SaaS) | Automatic serverless | No infrastructure to manage; vendor lock-in; limited configuration control; no self-hosting option |
| ChromaDB | Low (pip install, embedded) | Single-node only | Prototyping only; no production HA; no horizontal scaling; experimental distributed mode |
| Qdrant | Low-Medium (single binary, Docker) | Manual sharding + replication | Compact Rust binary; simple Docker deployment; snapshot-based backups; manual scaling decisions |
| Weaviate | Medium (Docker/Kubernetes) | Horizontal with replication | Good documentation; module system adds complexity; BYOC option reduces ops burden |
| Milvus | High (Kubernetes required for distributed) | Full horizontal (disaggregated compute/storage) | Requires etcd + MinIO/S3 + Pulsar/Kafka; multiple microservices; steep learning curve; powerful but complex |

Self-Hosted Kubernetes Complexity

Milvus Distributed on Kubernetes is the most operationally demanding: you must manage etcd clusters, object storage (MinIO), message queues (Pulsar), and the Milvus services themselves. This requires engineers who understand Kubernetes networking, persistent volume claims, and pod disruption budgets. However, Zilliz Cloud eliminates this entirely.

Qdrant and Weaviate sit in the middle: Docker deployment is straightforward, and both offer Helm charts for Kubernetes.

pgvector requires zero additional infrastructure – it is your existing PostgreSQL database. This is its strongest operational argument and why many teams start here.

Source: Scaling Vector Search: Self-Hosted vs Managed, The Case Against pgvector


8. Quantization & Memory Optimization

Quantization is critical for cost control at 100M+ scale:

| Database | Binary Quantization | Scalar Quantization | Product Quantization | Memory Reduction |
| --- | --- | --- | --- | --- |
| Qdrant | Yes + 1.5-bit, 2-bit, asymmetric (2025) | Yes (float32 -> uint8) | No | 4-8x (scalar), 32x (binary), 40x speedup with binary |
| Milvus | Yes | Yes | Yes (IVF_PQ, GPU CAGRA) | Up to 32x; DiskANN moves vectors to SSD |
| pgvector | Yes (up to 64K dims) | Yes (halfvec for half-precision) | No | 2-4x with halfvec; 32x with binary |
| Weaviate | Yes (BQ) | Yes (SQ) | Yes (PQ) | Configurable per collection |
| Pinecone | Automatic (internal) | Automatic | Automatic | Transparent to user |
| ChromaDB | No (limited) | No (limited) | No | Relies on HNSW defaults |

Source: Qdrant Quantization, pgvector GitHub
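A minimal sketch of the scalar-quantization idea from the table: map each float32 component to a uint8, a 4x memory reduction. Production engines pick the quantization range per segment and correct scores with a rescoring pass over original vectors, which this toy version omits.

```python
def scalar_quantize(vec):
    """Map each float component to a uint8 code in [0, 255]."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # guard against constant vectors
    return [round((x - lo) / scale) for x in vec], lo, scale

def dequantize(codes, lo, scale):
    """Recover an approximation of the original vector."""
    return [lo + c * scale for c in codes]

vec = [0.12, -0.40, 0.88, 0.05]
codes, lo, scale = scalar_quantize(vec)
approx = dequantize(codes, lo, scale)
# each code fits in 1 byte vs 4 bytes per float32 -> 4x smaller,
# with per-component error bounded by half a quantization step
```

Binary quantization pushes the same trade further: one bit per dimension (sign only), giving the 32x figures in the table at the cost of recall, which is why it is usually paired with rescoring.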


9. AI Agent Memory: Architecture Recommendations

Modern AI agent memory systems require:
- Episodic memory: Conversation turns, tool calls, observations (high write volume, temporal queries)
- Semantic memory: Facts, user preferences, learned patterns (hybrid search critical)
- Procedural memory: Successful strategies, tool usage patterns (metadata-heavy filtering)

Production approaches combine dense vector search + sparse BM25 keyword matching + metadata filtering (timestamps, user IDs, topics) + cross-encoder reranking, while adding semantic caching to cut costs on repeated queries. Source: AI Agent Memory Architecture
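That combined pipeline can be sketched in miniature with toy in-memory stand-ins for the dense index, keyword index, and semantic cache. Every name and scoring function below is illustrative, not a real client API; a real system would call the chosen database's hybrid query and use a proper embedding-similarity cache.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def keyword_score(terms, text):
    """Toy stand-in for BM25: count of query terms present in the text."""
    words = set(text.lower().split())
    return sum(1 for t in terms if t in words)

def retrieve(query_vec, terms, user_id, memories, cache, k=2):
    """Metadata filter -> dense + keyword rankings -> RRF fusion -> cache."""
    key = (tuple(query_vec), tuple(sorted(terms)), user_id)
    if key in cache:
        return cache[key]                                     # cache hit
    mine = [m for m in memories if m["user_id"] == user_id]   # metadata filter
    dense = sorted(mine, key=lambda m: dot(query_vec, m["vec"]), reverse=True)
    sparse = sorted(mine, key=lambda m: keyword_score(terms, m["text"]), reverse=True)
    scores = {}
    for ranking in (dense, sparse):                           # reciprocal-rank fusion
        for rank, m in enumerate(ranking, start=1):
            scores[m["id"]] = scores.get(m["id"], 0.0) + 1.0 / (60 + rank)
    top = sorted(mine, key=lambda m: scores[m["id"]], reverse=True)[:k]
    cache[key] = top
    return top

memories = [
    {"id": 1, "user_id": "u1", "vec": [1.0, 0.0], "text": "user prefers Python"},
    {"id": 2, "user_id": "u1", "vec": [0.0, 1.0], "text": "tool:code_executor failed"},
    {"id": 3, "user_id": "u2", "vec": [1.0, 0.0], "text": "user prefers Go"},
]
cache = {}
hits = retrieve([0.0, 1.0], ["tool:code_executor"], "u1", memories, cache)
```

The cross-encoder reranking step is deliberately left out here; it would re-score `hits` with a slower, more accurate model before the results reach the agent.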


10. Recommendations by Scale

1M Vectors (Startup / Single-Agent)

Best choice: pgvector or Qdrant

pgvector wins if you already run PostgreSQL – zero additional infrastructure, sub-100ms latency, and your metadata lives alongside your vectors. Qdrant wins if you need better filtering performance and want a dedicated vector engine with a small footprint.

10M Vectors (Production Multi-Agent System)

Best choice: Weaviate or Qdrant Cloud

At this scale, hybrid search becomes essential for agent memory quality. Weaviate’s native BM25F + ACORN filtering makes it the strongest choice for retrieval quality. Qdrant Cloud offers the best pure vector performance if hybrid search is handled at the application layer. Pinecone is compelling if you want zero-ops.

100M Vectors (Enterprise Agent Platform)

Best choice: Milvus/Zilliz Cloud or Pinecone Enterprise

Milvus is purpose-built for this scale with DiskANN, GPU acceleration (50x faster search with CAGRA), and disaggregated storage/compute. Zilliz Cloud removes the operational burden. Pinecone Enterprise offers proven billion-scale with zero ops but at premium cost and vendor lock-in. pgvectorscale is surprisingly competitive at 50M (471 QPS at 99% recall) but hits architectural walls beyond 100M.

1B+ Vectors (Massive-Scale Platform)

Best choice: Milvus (self-hosted with GPU) or Pinecone Enterprise

Only Milvus and Pinecone have proven production deployments at billion-scale. Milvus with 8 DGX H100 GPUs can build an index of 635M 1024-dim vectors in 56 minutes (vs 6.22 days on CPU). Zilliz Cloud’s tiered storage delivers 87% storage cost reduction at this scale.


Summary Matrix

| Criterion | pgvector | Pinecone | Weaviate | Qdrant | Milvus | ChromaDB |
| --- | --- | --- | --- | --- | --- | --- |
| Latency (1M) | Good | Good | Good | Best | Good | Good (single-request) |
| Throughput (10M+) | Moderate | High | Moderate | High | Highest | Poor |
| Hybrid Search | Weak | Basic | Best | Good | Strong | Basic |
| Filtering | Weak (post-filter) | Strong | Strong (ACORN) | Best | Strong | Basic |
| Cost (10M) | Lowest | $64/mo | $85/mo | $102/mo | $89-114/mo | Free |
| Cost (100M) | $300-500/mo | $400+/mo | Enterprise | $660+/mo | $500-1000/mo | N/A |
| Ops Complexity | Lowest | Zero-ops | Medium | Low | High | Lowest |
| Max Proven Scale | ~50-100M | Billions | ~100M | ~50M per node | Billions | ~10M |
| GPU Acceleration | No | N/A (managed) | No | Yes (indexing) | Yes (50x search) | No |
| Agent Memory Fit | Good start | Good (managed) | Best (hybrid) | Good (filtering) | Best (scale) | Prototyping only |
