
Vector Database Monitoring: Pinecone, Weaviate, pgvector

Webalert Team
May 11, 2026
14 min read


Two years ago, "vector database" was a phrase you only heard from machine learning researchers. Today it sits behind the AI features in your product. Customer support search? Vectors. The "related articles" widget? Vectors. The retrieval step before every LLM call in your RAG pipeline? Vectors.

The vector database has quietly become critical infrastructure. If Pinecone has an incident, your AI features stop working. If pgvector queries get slow, your support search times out. If your Weaviate cluster's index falls out of sync after a re-ingest, users start getting subtly wrong answers — and nothing on your dashboards looks broken.

Vector databases fail in ways traditional database monitoring doesn't catch. The connection is fine, the queries return 200, the latency looks acceptable — and the actual retrieval quality has silently degraded by 30% because an embedding model was updated without re-indexing. Your AI features are now returning worse results, and you have no signal.

This guide covers what to monitor on Pinecone, Weaviate, pgvector, and the other vector stores powering production AI systems, plus the end-to-end RAG pipeline monitoring that catches quality regressions standard uptime checks miss.


Why Vector Databases Need Their Own Monitoring Approach

A traditional database serves SQL queries with predictable correctness. A vector database serves approximate nearest-neighbor (ANN) queries — and "approximate" is the operative word. Two queries with the same input can return slightly different results depending on index state, recall settings, and embedding-model version.

This creates failure modes traditional monitoring misses:

  • Stale embeddings — Your embeddings were generated by model v1.2, but production code now uses model v1.3. Queries return the wrong nearest neighbors because the vectors live in slightly different spaces.
  • Index rebuild blocking queries — A large re-ingest triggers an index rebuild. During rebuild, query latency can 10x with no error response — just slow answers.
  • Recall degradation — As your dataset grows, the index parameters (HNSW ef, IVF nprobe, etc.) that were tuned at 100K vectors start returning worse results at 10M vectors. No errors, just worse retrieval.
  • Dimension mismatches — Code change pushes 1024-dim embeddings into an index built for 1536-dim. Insert fails silently, queries return nothing useful.
  • Ingestion lag — New documents are queued for embedding but the worker is backed up. Users searching for content added five minutes ago don't find it.
  • Cost spikes from index growth — Pinecone and most managed services charge by stored vectors. A bug in your ingestion can 10x your bill overnight.
  • Cluster degradation under bulk ingest — Bulk ingestion of millions of vectors can saturate a Weaviate or Qdrant cluster's I/O for hours, slowing all queries.

Traditional database monitoring catches "is the database up?" and "are queries returning?" For vector databases, those two answers can both be yes while the underlying retrieval system is silently broken.


What to Monitor on Every Vector Database

1) Query Endpoint Uptime

The first layer is straightforward. Set up an HTTP check that issues a simple similarity query against your production index:

  • Pinecone: POST https://{index}.svc.{region}.pinecone.io/query with a small test vector
  • Weaviate: POST https://{cluster}/v1/graphql with a nearVector query
  • pgvector: a SELECT ... ORDER BY embedding <-> '[...]'::vector LIMIT 5 against a known index
  • Qdrant: POST https://{host}/collections/{name}/points/search with a small test vector
  • Chroma: a query call to the running server

Use a stable "canary" vector you control — not real user data — so the expected result is known and you can validate the response body, not just the status code.

Run this check every 1–2 minutes from at least two regions (see Multi-Region Monitoring: Why Location Matters).
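As a concrete sketch, here is a minimal canary check against pgvector in Python, assuming a hypothetical documents table with an embedding column and a canary document you control; the DSN, table name, and IDs are placeholders for your own schema. The same pattern (fixed input vector, validate the expected ID in the response body) applies to the hosted providers' HTTP APIs.

# A minimal canary check against pgvector, assuming a hypothetical `documents`
# table with an `embedding vector(1536)` column and a known canary document.
# The DSN, table name, column names, and canary ID are placeholders.
import json
import sys
import psycopg

DSN = "postgresql://monitor:***@db.example.com/app"    # read-only monitoring role
EXPECTED_ID = "doc-abc-123"                            # known canary document
CANARY_VECTOR = json.load(open("canary_vector.json"))  # fixed vector you control

def run_canary() -> bool:
    query = """
        SELECT id
        FROM documents
        ORDER BY embedding <-> %s::vector
        LIMIT 5
    """
    with psycopg.connect(DSN, connect_timeout=5) as conn:
        rows = conn.execute(query, (json.dumps(CANARY_VECTOR),)).fetchall()
    top_ids = [r[0] for r in rows]
    # Validate the response, not just "the query returned": the canary
    # document must appear in the top-k results.
    return EXPECTED_ID in top_ids

if __name__ == "__main__":
    sys.exit(0 if run_canary() else 1)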

2) Query Latency Distribution

As with LLM APIs, vector DB latency is about the distribution, not the average. A single average hides everything that matters:

  • p50 — Your typical user's retrieval experience
  • p95 — The slow tail; this is where AI features start feeling sluggish
  • p99 — Worst-case retrievals; when this exceeds your timeout, users see errors

Track latency separately for:

  • Read queries (similarity search) — what users feel
  • Write/upsert latency — affects how quickly new content is searchable
  • Filtered queries vs unfiltered — filtered queries (with metadata predicates) often have very different performance characteristics

A vector DB whose p50 looks healthy but whose p99 has tripled is one re-index away from a user-facing incident.
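A minimal sketch of tracking those percentiles per query shape, assuming you already record per-query latencies from application middleware; the bucket names and threshold logic are illustrative, not a library API.

# Track latency percentiles per query shape from recorded per-query timings.
import statistics
from collections import defaultdict

latencies_ms = defaultdict(list)  # keys like "read", "upsert", "read_filtered"

def record(shape: str, elapsed_ms: float) -> None:
    latencies_ms[shape].append(elapsed_ms)

def percentiles(shape: str) -> dict:
    samples = sorted(latencies_ms[shape])
    if len(samples) < 100:
        return {}
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Alert on the tail, not the average: a healthy p50 with a tripled p99
# is exactly the early warning this section describes.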

3) Index Health

Index state is the most provider-specific signal but always worth tracking:

  • Vector count — Sudden drops mean accidental deletes; sudden spikes mean runaway ingestion
  • Index size on disk — Predicts cost and capacity ceiling
  • Index parameter drift — If your ef_construction was 200 in dev but 50 in prod, retrieval quality differs
  • Sharding state — Weaviate, Qdrant, and Pinecone serverless distribute data across shards; one degraded shard affects a subset of queries
  • Replication lag — In multi-replica setups, replica lag means queries to one replica return stale results

For pgvector specifically, watch pg_stat_* views the same way you would for any Postgres table — the index isn't magic; it's an IVFFlat or HNSW index on a Postgres column with the same operational characteristics. See Database Monitoring: MySQL, PostgreSQL, Redis, and Uptime.
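For pgvector, a sketch of pulling the basic index-health numbers from standard Postgres catalogs, assuming a hypothetical documents table with an HNSW index named documents_embedding_idx; adjust the identifiers to your schema.

# Basic pgvector index-health metrics from standard Postgres catalogs.
import psycopg

DSN = "postgresql://monitor:***@db.example.com/app"

def index_health() -> dict:
    with psycopg.connect(DSN) as conn:
        vector_count = conn.execute(
            "SELECT reltuples::bigint FROM pg_class WHERE relname = 'documents'"
        ).fetchone()[0]
        index_bytes = conn.execute(
            "SELECT pg_relation_size('documents_embedding_idx')"
        ).fetchone()[0]
        index_scans = conn.execute(
            "SELECT idx_scan FROM pg_stat_user_indexes "
            "WHERE indexrelname = 'documents_embedding_idx'"
        ).fetchone()[0]
    return {
        "vector_count_estimate": vector_count,  # sudden drops: accidental deletes
        "index_size_bytes": index_bytes,        # predicts cost and capacity ceiling
        "index_scans": index_scans,             # should keep growing if queries use the index
    }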

4) Recall and Retrieval Quality

The hardest and most important signal. The DB can be up, fast, and returning the wrong results. You catch this with a golden dataset:

  • Maintain 50–200 known query/expected-result pairs (an "eval set")
  • Run them against your production index daily (or on every deploy)
  • Track recall@k — what percentage of expected results appear in the top-k retrievals
  • Alert when recall drops by more than 5% from baseline

This is the only way to detect:

  • An embedding model change that silently degrades retrieval
  • An index parameter tweak that lost recall
  • Index corruption that drops some documents from results

Without a golden dataset, you'll only learn about retrieval regressions when customers complain.
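A sketch of that golden-dataset evaluation, assuming a hypothetical search(query, k) client for your production index and an eval file of query/expected-ID pairs; the baseline and alert threshold mirror the numbers above.

# Golden-dataset recall@k evaluation against the production index.
import json

K = 10
BASELINE_RECALL = 0.92   # recorded when the eval set was created
ALERT_DROP = 0.05        # alert if recall falls more than 5 points

def recall_at_k(eval_set: list[dict], search) -> float:
    hits, total = 0, 0
    for case in eval_set:
        retrieved = set(search(case["query"], k=K))
        expected = set(case["expected_ids"])
        hits += len(retrieved & expected)
        total += len(expected)
    return hits / total if total else 0.0

if __name__ == "__main__":
    from my_app.retrieval import search          # your retrieval client (assumed)
    eval_set = json.load(open("golden_queries.json"))
    recall = recall_at_k(eval_set, search)
    print(f"recall@{K} = {recall:.3f}")
    if recall < BASELINE_RECALL - ALERT_DROP:
        raise SystemExit("recall regression: page on-call")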

5) Ingestion Lag

How much time passes between "a document is added" and "the document is findable in queries"?

  • Time-to-searchable — Tag each new document with an ingestion timestamp; query for it periodically; record how long until it appears
  • Queue depth — If you embed asynchronously, watch the queue between "document received" and "vector indexed"
  • Worker count and throughput — How many embeddings per second can your pipeline process?

For RAG applications where freshness matters (support docs, chat history, real-time content), ingestion lag is as important as query latency.
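A sketch of a time-to-searchable probe, assuming hypothetical ingest() and search() wrappers around your own pipeline; it inserts a tagged canary document and polls until the document becomes retrievable.

# Measure end-to-end ingestion lag: insert a canary document, poll until found.
import time
import uuid

def time_to_searchable(ingest, search, timeout_s: int = 600) -> float | None:
    marker = f"canary-{uuid.uuid4()}"
    ingest({"id": marker, "text": f"ingestion canary {marker}"})
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if marker in search(f"ingestion canary {marker}", k=10):
            return time.monotonic() - start   # seconds until findable
        time.sleep(5)
    return None  # never became searchable within the timeout; alert on this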

6) Embedding Model Version

A subtle but devastating failure mode: you update the embedding model in code without re-embedding the existing index. Now every new query is embedded with model v2 but the stored vectors are from model v1. The geometry doesn't match. Retrieval quality collapses.

Monitor:

  • Production embedding model version — Track which model is currently being called for new embeddings
  • Index embedding model version — Track which model produced the vectors currently stored (tag them in metadata)
  • Mismatch alert — Page immediately if production model version doesn't match index version

This is one of the easiest disasters to create and one of the easiest to monitor for.
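A sketch of the mismatch check, assuming you tag each stored vector's metadata with the model that produced it (the embedding_model field name is illustrative) and keep the production model name in config.

# Compare the production embedding model against the models found in a sample
# of stored vector metadata; any divergence means queries and stored vectors
# live in different embedding spaces.
PRODUCTION_EMBEDDING_MODEL = "text-embedding-3-small"   # what new queries use

def check_model_versions(sample_metadata: list[dict]) -> None:
    index_versions = {m.get("embedding_model", "unknown") for m in sample_metadata}
    if index_versions != {PRODUCTION_EMBEDDING_MODEL}:
        # Retrieval quality is collapsing right now; page immediately.
        raise RuntimeError(
            f"embedding model mismatch: production={PRODUCTION_EMBEDDING_MODEL}, "
            f"index={sorted(index_versions)}"
        )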

7) Cost per Query and Storage Cost

Managed vector databases charge in non-obvious ways:

  • Pinecone serverless — charges per read/write operation plus storage
  • Pinecone pods — charges per pod-hour
  • Weaviate Cloud — charges by cluster size
  • AWS Bedrock + OpenSearch — charges for storage, ingestion, and queries separately

Track:

  • Stored vector count and its growth rate, since storage cost scales with it
  • Read/write operation volume against your plan's pricing unit (read/write units, pod-hours, cluster size)
  • Week-over-week cost growth; the alerting thresholds below use a 50% week-over-week trigger

Per-Provider Specifics

Pinecone

  • Status page: status.pinecone.io
  • Serverless vs pods: very different operational profiles. Serverless scales automatically but can have variable latency under bursty load. Pods are predictable but require manual capacity planning.
  • Watch for: throttling on writes during bulk ingest, p99 latency degradation during background optimization
  • Specific metrics: vector_count, index_fullness (for pod-based), read_unit and write_unit consumption (for serverless)
  • Authentication: API key in Api-Key header — monitor with authenticated headers

Weaviate

  • Self-hosted vs Weaviate Cloud: monitor the cluster differently in each case
  • Watch for: shard health imbalance, schema mismatches between modules
  • Specific metrics: nodes status (via /v1/nodes), schema consistency, object count per class, batch import throughput
  • Performance gotcha: hybrid search (BM25 + vector) has very different latency than pure vector search; monitor each query shape separately

pgvector (Postgres extension)

  • Operationally a Postgres database — connection pool exhaustion is your enemy. See Database Monitoring: MySQL, PostgreSQL, Redis, and Uptime.
  • Index types matter: IVFFlat is fast to build but lower recall; HNSW is higher recall but slower to build. Know which you have.
  • Performance gotcha: SET hnsw.ef_search and SET ivfflat.probes are session-level and dramatically affect both latency and recall. Make sure your application sets them on every session (see the sketch after this list).
  • Watch: index build time, autovacuum on vector tables, connection pool utilization
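A sketch of setting those session-level knobs before querying, assuming psycopg and the same hypothetical documents table as earlier; the values are illustrative and trade latency against recall.

# Set the session-level pgvector knobs before querying.
import psycopg

with psycopg.connect("postgresql://app:***@db.example.com/app") as conn:
    conn.execute("SET hnsw.ef_search = 100")      # HNSW indexes
    # conn.execute("SET ivfflat.probes = 10")     # IVFFlat indexes
    rows = conn.execute(
        "SELECT id FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
        ("[0.1, 0.2, 0.3]",),  # placeholder; must match the column's dimension in production
    ).fetchall()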

Qdrant

  • Self-hosted or Qdrant Cloud: similar split as Weaviate
  • Watch for: collection size, optimizer status, segment count (too many segments slows queries)
  • Specific metrics: collection vectors_count, segments_count, indexed_vectors_count

Chroma

  • Single-process for many deployments — your monitoring is the host monitoring
  • Watch: disk usage, process memory, the persistence layer (DuckDB/Parquet/ClickHouse depending on version)

Cloud-managed vector services (AWS Bedrock + OpenSearch and similar)

  • Monitor via the cloud provider's native telemetry, but treat the vector functions as a third-party dependency (see third-party dependency monitoring)
  • Status pages: AWS Health, Google Cloud status, Azure status — and treat them as lagging indicators

End-to-End RAG Pipeline Monitoring

The vector DB is one step in a pipeline. RAG (retrieval-augmented generation) looks like this:

user query
   ↓ embed
query vector
   ↓ retrieve (vector DB)
top-k documents
   ↓ rerank (optional)
final context
   ↓ LLM call
generated answer
   ↓ post-process / validate
response to user

A failure at any single step breaks the whole pipeline. Monitor each step:

  • Embedding step latency and error rate — usually an LLM-provider call
  • Vector retrieval latency, count of results, recall — covered above
  • Rerank latency — if you use a reranker (Cohere, cross-encoder), monitor it as a separate service
  • LLM generation latency, token count, refusal rate — see AI API Monitoring: OpenAI, Anthropic, and Gemini Uptime
  • End-to-end latency — wall clock from "user submits" to "response returned" — this is what the user feels
  • End-to-end quality — golden dataset evaluation of the full pipeline, not just retrieval

The end-to-end quality eval is the single most important signal. It catches everything: embedding regressions, retrieval regressions, reranker bugs, prompt regressions, model drift. Run it on every deploy.
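A sketch of such an end-to-end eval, assuming a hypothetical answer(query) function that runs the full pipeline (embed, retrieve, rerank, generate) and returns the generated text plus the retrieved document IDs; the phrase-match scoring is deliberately crude, and a graded judge is a better long-term choice.

# End-to-end pipeline evaluation on a golden dataset of query/expected pairs.
import json
import time

def run_pipeline_eval(answer, eval_path: str = "golden_pipeline.json") -> dict:
    cases = json.load(open(eval_path))
    latencies, retrieval_hits, answer_hits = [], 0, 0
    for case in cases:
        start = time.monotonic()
        text, retrieved_ids = answer(case["query"])
        latencies.append(time.monotonic() - start)
        retrieval_hits += case["expected_doc_id"] in retrieved_ids
        answer_hits += case["expected_phrase"].lower() in text.lower()
    return {
        "p95_latency_s": sorted(latencies)[max(0, int(len(latencies) * 0.95) - 1)],
        "retrieval_hit_rate": retrieval_hits / len(cases),
        "answer_hit_rate": answer_hits / len(cases),  # crude; a graded judge is better
    }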


Alerting Thresholds That Work

Vector DBs and RAG pipelines have inherent noise. Tune thresholds for signal:

  • Query endpoint uptime check fails 3 consecutive times (3 minutes): page on-call
  • p95 query latency > 2× baseline for 15 minutes: notification channel
  • p95 query latency > 4× baseline for 5 minutes: page on-call
  • Recall@10 on golden dataset drops > 5% from baseline: page (this is a real quality regression)
  • Ingestion lag > 10 minutes for content where freshness matters: notification
  • Vector count drops > 1% in a 1-hour window: page (probably accidental deletes)
  • Cost growth > 50% week-over-week: notification, investigate during business hours
  • Embedding model version mismatch (prod vs index): page immediately

See Alert Fatigue: Notifications That Get Acted On for the principles.
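A sketch of how the baseline-relative latency rules above can be evaluated, assuming one p95 sample per minute and a rolling window; the notify/page plumbing is illustrative rather than any specific alerting product's API.

# Baseline-relative p95 latency alerting with sustained-window rules.
from collections import deque

class LatencyAlert:
    def __init__(self, baseline_p95_ms: float):
        self.baseline = baseline_p95_ms
        self.samples = deque(maxlen=15)   # one p95 sample per minute

    def observe(self, p95_ms: float) -> str | None:
        self.samples.append(p95_ms)
        last5 = list(self.samples)[-5:]
        if len(last5) == 5 and all(s > 4 * self.baseline for s in last5):
            return "page"     # > 4x baseline sustained for 5 minutes
        if len(self.samples) == 15 and all(s > 2 * self.baseline for s in self.samples):
            return "notify"   # > 2x baseline sustained for 15 minutes
        return None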


Setting Up a Vector DB Uptime Check

A minimal but useful uptime check against pgvector via your application's API:

POST https://yourapp.com/api/internal/vector-canary
Authorization: Bearer <monitoring-key>

{
  "query_text": "known canary query",
  "expected_top_result_id": "doc-abc-123"
}

The endpoint runs an actual similarity search on your production index using a known query vector and verifies the expected document appears in the top-k results. Configure the monitor to:

  • Run every 60 seconds from at least two regions
  • Alert if response time > 2 seconds (vector queries should be fast)
  • Alert if response time > 10 seconds as an emergency threshold
  • Validate response body confirms the expected document ID appears
  • Use a separate monitoring API key
  • Don't expose this endpoint publicly without authentication

For REST API endpoint monitoring patterns more broadly, apply the same approach to your other AI-pipeline endpoints.
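A sketch of what the canary endpoint itself might look like, here using Flask; the route and payload fields mirror the example request above, and the search() helper and environment variable are assumptions to replace with your own retrieval client and secret management.

# Internal canary endpoint: runs a real similarity search and only returns 200
# when the expected document is actually retrieved.
import os
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
MONITORING_KEY = os.environ["MONITORING_API_KEY"]   # separate monitoring key

@app.post("/api/internal/vector-canary")
def vector_canary():
    if request.headers.get("Authorization") != f"Bearer {MONITORING_KEY}":
        abort(401)                                   # never expose this unauthenticated
    body = request.get_json(force=True)
    from my_app.retrieval import search              # your retrieval client (assumed)
    top_ids = search(body["query_text"], k=10)
    found = body["expected_top_result_id"] in top_ids
    # Both the status code and the body fail when retrieval quality is off,
    # so the monitor's content validation and uptime check agree.
    return jsonify({"found": found, "top_ids": top_ids}), (200 if found else 500)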


Common Vector Database Pitfalls

  1. No re-index after embedding model change — Production model v2, index from model v1. Catastrophic retrieval quality, no error response.
  2. Filter predicates not indexed — Metadata filters that aren't indexed cause full scans, killing latency at scale.
  3. Bulk ingestion saturating queries — A large re-ingest brings the cluster to its knees while running. Throttle ingestion or do it off-peak.
  4. Forgotten test data in production index — Dev/test vectors get pushed to production indexes. Production queries occasionally return junk results.
  5. Inconsistent metadata schemas — Half your vectors have a tenant_id field, the other half don't. Filter queries return inconsistent results.
  6. Wrong index parameters at scale — Index parameters tuned for 100K vectors are wrong for 10M. Recall silently degrades.
  7. Trusting a 200 response — A vector search that returns "no results" with a 200 is still a failure for your user. Validate response shape, not just status code.
  8. No backup / no point-in-time recovery — Vector indexes are often treated as derived data. Until you lose them. Always have a re-build path from source content.

Vector DB Monitoring Checklist

For every vector database powering production AI features:

  • Query endpoint uptime check every 60 seconds with response body validation
  • Query latency tracked at p50/p95/p99, separately for read/write/filtered queries
  • Index health metrics (vector count, size, shard/segment status)
  • Golden-dataset evaluation runs daily, recall@k tracked over time
  • Ingestion lag measured end-to-end (document added → searchable)
  • Embedding model version monitored on both production code and stored index; mismatch alerts
  • Cost monitoring with weekly growth alerts
  • Provider status page subscribed (lagging indicator only)
  • End-to-end RAG pipeline quality eval on every deploy
  • Backup / re-index path documented and tested
  • Connection pool monitoring (especially for pgvector + Postgres)
  • Alerting tuned: short-window failures notify, sustained failures page

How Webalert Helps Monitor Vector Databases

Webalert monitors vector databases the way they actually need to be monitored — as authenticated APIs powering critical AI features:

  • HTTP monitoring with custom headers — Api-Key for Pinecone, Authorization: Bearer for Weaviate Cloud, custom headers for your own canary endpoints
  • Response time alerts — Catch p95 latency degradation before users feel it
  • Content validation — Verify response bodies contain expected document IDs, not just a 200
  • Multi-region checks — Confirm vector DB reachability from every region your users come from
  • 1-minute check intervals — Detect provider incidents and index degradation within a minute
  • Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
  • Status page — Communicate to your customers when your AI features are degraded
  • 5-minute setup — Add the endpoint, paste your monitoring key, set thresholds, and you're live

See features and pricing for details.


Summary

  • Vector databases have failure modes traditional database monitoring misses: stale embeddings, recall degradation, ingestion lag, dimension mismatches, and silent quality regressions.
  • Monitor query uptime, latency distribution (not just averages), index health, recall against a golden dataset, ingestion lag, and embedding-model version on both the code and the stored index.
  • Each provider has specific quirks — Pinecone serverless vs pods, Weaviate's hybrid-search behavior, pgvector's Postgres operational profile, Qdrant's segment count.
  • Monitor the full RAG pipeline, not just the vector DB step. End-to-end quality evaluation is the single highest-leverage signal.
  • A vector DB returning a 200 with the wrong results is the most common production AI failure mode. Standard uptime checks won't catch it — golden-dataset evaluation will.

The companies winning with production AI aren't the ones with the fanciest models. They're the ones whose retrieval pipelines stay reliable when models change, indexes grow, and ingestion patterns shift. Monitoring is how you get there.


Monitor your vector database before the next silent quality regression

Start monitoring with Webalert →

See features and pricing. No credit card required.
