Pinecone is a managed vector database built for production-scale similarity search. Unlike self-hosted alternatives, it handles infrastructure, replication, and scaling automatically — so you can focus on building instead of operating. In this tutorial you will create a Pinecone index from scratch, generate embeddings with Sentence Transformers, upsert vectors with metadata, run semantic queries, and wire up a full RAG pipeline with the Claude API. By the end you will have a working semantic search application in under 100 lines of Python.
Common use cases for Pinecone: semantic search (find documents by meaning, not keywords), RAG (retrieve relevant context before calling an LLM), recommendation systems (find similar items from user history), and duplicate detection (cluster near-identical records at scale).
Prerequisites
Before you start, make sure you have:
- Python 3.8+ installed
- A free Pinecone account at pinecone.io — the free tier includes 2 GB storage and one serverless index
- Your Pinecone API key from the Pinecone console (Project → API Keys)
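Rather than pasting the key into source files, one option is to read it from an environment variable. The sketch below assumes a variable named PINECONE_API_KEY and a hypothetical helper called load_api_key; both are naming choices made for this example, not part of the tutorial's own code.

```python
import os


def load_api_key(var_name: str = "PINECONE_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Hypothetical helper for illustration; the variable name is an assumption.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

The returned string can then be passed as the api_key argument wherever the examples below use the "YOUR_PINECONE_API_KEY" placeholder.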
Install the required packages:
pip install pinecone sentence-transformers
The pinecone package is the official Python client (v3+). sentence-transformers provides the local embedding model, so no OpenAI key is required.
Create Your First Index
A Pinecone index is a named collection of vectors. You specify the vector dimension and distance metric at creation time — these must match your embedding model. The all-MiniLM-L6-v2 model used in this tutorial produces 384-dimensional vectors.
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
# Create a serverless index (AWS us-east-1 is available on the free tier)
pc.create_index(
    name="demo",
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
# Connect to the index
index = pc.Index("demo")
print(index.describe_index_stats())
The metric parameter controls how similarity is measured. cosine works well for text embeddings because it normalises for vector magnitude. Use dotproduct if your embeddings are already unit-length, or euclidean for image or numeric embeddings.
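To make the difference between the three metrics concrete, here is a small pure-Python sketch. The helper names (dot, norm, cosine, euclidean, normalize) and the toy vectors are illustrative only; this mirrors the standard formulas and is not Pinecone's internal implementation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity: dot product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print(cosine(a, b))     # close to 1.0: magnitude is ignored
print(dot(a, b))        # 28.0: grows with magnitude
print(euclidean(a, b))  # nonzero even though the direction is identical
print(dot(normalize(a), normalize(b)))  # close to 1.0: matches cosine on unit vectors
```

The last line shows why dotproduct is a reasonable choice for unit-length embeddings: once vectors are normalised, the dot product and cosine similarity give the same ranking.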
If the index already exists from a previous run, skip creation with a guard:
existing = [i.name for i in pc.list_indexes()]
if "demo" not in existing:
    pc.create_index(
        name="demo",
        dimension=384,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index("demo")
Generate Embeddings
An embedding is a dense numerical vector that captures the semantic meaning of a piece of text. Two sentences with similar meaning will have vectors close together in the embedding space — even if they share no words.
all-MiniLM-L6-v2 is a fast, accurate model that runs locally with no API key. It maps sentences to 384-dimensional vectors and is a strong default for English-language semantic search.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [
    "Python is a high-level programming language.",
    "Machine learning models require training data.",
    "Vector databases store embeddings for similarity search.",
    "LLMs generate text by predicting the next token.",
    "Pinecone is a managed vector database for production use.",
]
embeddings = model.encode(texts)
print(f"Shape: {embeddings.shape}")  # (5, 384)
model.encode() returns a NumPy array of shape (n_texts, 384). Call .tolist() on each row to convert it to a plain Python list before upserting into Pinecone.
Upsert Vectors
Pinecone stores vectors as records with three fields: a unique id, the values list (the embedding), and optional metadata — a dict of arbitrary key-value pairs that you can filter on at query time.
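Because a malformed record is easier to catch locally than after a failed request, a small check before upserting can help. The following is a minimal sketch of such a check; validate_record is a hypothetical helper written for this example (not part of the Pinecone client), and the default dimension of 384 assumes the all-MiniLM-L6-v2 model used in this tutorial.

```python
def validate_record(record: dict, dimension: int = 384) -> None:
    """Raise ValueError if a record does not have the id/values/metadata shape.

    Hypothetical helper for illustration; it only checks the three fields
    described above and does not contact Pinecone.
    """
    record_id = record.get("id")
    if not isinstance(record_id, str) or not record_id:
        raise ValueError("id must be a non-empty string")

    values = record.get("values")
    if not isinstance(values, list) or len(values) != dimension:
        raise ValueError(f"values must be a list of {dimension} numbers")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        raise ValueError("values must contain only numbers")

    metadata = record.get("metadata", {})
    if not isinstance(metadata, dict):
        raise ValueError("metadata must be a dict")


# Example: a well-formed record passes silently
validate_record({"id": "doc0", "values": [0.0] * 384, "metadata": {"source": "docs"}})
```

A NumPy row that was not converted with .tolist() fails the list check here, which is the same mistake the previous section warns about.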
vectors = [
    {
        "id": f"doc{i}",
        "values": embeddings[i].tolist(),
        "metadata": {"text": texts[i], "source": "docs"},
    }
    for i in range(len(texts))
]
index.upsert(vectors=vectors)
print(index.describe_index_stats())
Upsert means insert-or-update: if a vector with the same id already exists it is replaced, otherwise it is added. Pinecone recommends batching large upserts into chunks of 100 vectors:
def upsert_in_batches(index, vectors, batch_size=100):
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i : i + batch_size]
        index.upsert(vectors=batch)
        print(f"  Upserted {min(i + batch_size, len(vectors))}/{len(vectors)}")
Query Vectors
To search, encode your query with the same model and call index.query(). Pinecone returns the top-k most similar vectors along with their similarity scores and metadata.
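As a mental model, a top-k query can be pictured as scoring every stored vector against the query and keeping the k best. The sketch below does exactly that by brute force on toy two-dimensional vectors. It is an illustration only: brute_force_top_k and the toy records are hypothetical, and a managed service like Pinecone does not need to scan every vector this way.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def brute_force_top_k(query, records, k=3):
    """Score every record against the query and return the k highest-scoring matches."""
    scored = (
        {"id": r["id"], "score": cosine(query, r["values"]), "metadata": r.get("metadata", {})}
        for r in records
    )
    # nlargest returns matches sorted from highest to lowest score
    return heapq.nlargest(k, scored, key=lambda m: m["score"])


toy_records = [
    {"id": "a", "values": [1.0, 0.0], "metadata": {"text": "east"}},
    {"id": "b", "values": [0.0, 1.0], "metadata": {"text": "north"}},
    {"id": "c", "values": [1.0, 1.0], "metadata": {"text": "north-east"}},
]

matches = brute_force_top_k([1.0, 0.1], toy_records, k=2)
for m in matches:
    print(f"Score: {m['score']:.4f} | {m['metadata']['text']}")
```

The query vector points almost due east, so "east" ranks first and "north-east" second, while "north" is dropped because k is 2. The match dictionaries deliberately use the same id, score, and metadata keys that the real query results expose.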
query_text = "What is a vector database?"
query_emb = model.encode([query_text])[0].tolist()
results = index.query(
    vector=query_emb,
    top_k=3,
    include_metadata=True,
)
for match in results["matches"]:
    print(f"Score: {match['score']:.4f} | {match['metadata']['text']}")
The score is the cosine similarity between the query and each result: 1.0 means the two vectors point in the same direction, 0.0 means they are orthogonal. You can add a filter parameter to restrict results by metadata:
results = index.query(
    vector=query_emb,
    top_k=5,
    include_metadata=True,
    filter={"source": {"$eq": "docs"}},
)
Full Semantic Search Example
Here is a complete, self-contained script that indexes 10 short articles and lets you query them by natural language. Copy it into a file, replace the API key, and run it directly.
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
PINECONE_API_KEY = "YOUR_PINECONE_API_KEY"
INDEX_NAME = "semantic-search-demo"
articles = [
    "Pinecone is a managed vector database optimised for high-speed similarity search.",
    "ChromaDB is an open-source vector store that runs locally without any API key.",
    "FAISS is a Facebook AI library for efficient exact and approximate nearest neighbour search.",
    "Sentence Transformers convert text to dense embedding vectors for semantic similarity.",
    "RAG combines retrieval with generation to ground LLM answers in real documents.",
    "Python is the dominant language for machine learning and AI development in 2026.",
    "Claude is Anthropic's family of AI assistants based on Constitutional AI training.",
    "LangChain provides tools for composing LLM pipelines using a pipe-operator syntax.",
    "LlamaIndex specialises in document ingestion and advanced retrieval for RAG systems.",
    "Cosine similarity measures the angle between two vectors, ignoring their magnitude.",
]
# Load model and encode all articles
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(articles)
# Create index if it does not exist
pc = Pinecone(api_key=PINECONE_API_KEY)
existing = [i.name for i in pc.list_indexes()]
if INDEX_NAME not in existing:
    pc.create_index(
        name=INDEX_NAME,
        dimension=384,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index(INDEX_NAME)
# Upsert all articles
vectors = [
    {"id": f"article{i}", "values": embeddings[i].tolist(), "metadata": {"text": articles[i]}}
    for i in range(len(articles))
]
index.upsert(vectors=vectors)
print(f"Indexed {len(articles)} articles\n")
# Query loop
queries = [
"Which vector database works offline?",
"How does semantic search differ from keyword search?",
"What tool helps build RAG pipelines?",
]
for query in queries:
    qvec = model.encode([query])[0].tolist()
    results = index.query(vector=qvec, top_k=2, include_metadata=True)
    print(f"Query: {query}")
    for r in results["matches"]:
        print(f"  [{r['score']:.3f}] {r['metadata']['text']}")
    print()
Pinecone + RAG with Claude
The canonical Pinecone use case in 2026 is retrieval-augmented generation: retrieve the most relevant context from Pinecone, then pass it to an LLM to synthesise an answer. Here is a minimal RAG loop using claude-sonnet-4-6:
import anthropic
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("semantic-search-demo")
model = SentenceTransformer("all-MiniLM-L6-v2")
claude = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from env
def rag_query(question: str, top_k: int = 3) -> str:
    # 1. Embed the question and retrieve relevant chunks
    qvec = model.encode([question])[0].tolist()
    results = index.query(vector=qvec, top_k=top_k, include_metadata=True)
    context = "\n".join(r["metadata"]["text"] for r in results["matches"])
    # 2. Build a grounded prompt
    prompt = (
        f"Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # 3. Call Claude
    message = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
print(rag_query("What is the difference between Pinecone and ChromaDB?"))
Install the Claude SDK: pip install anthropic. For a deeper walkthrough of the RAG pattern including chunking strategies and re-ranking, see the full RAG Tutorial with Python.
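One possible refinement of the prompt-building step is to number each retrieved chunk and cap the total context length, so the model can refer to sources and the prompt stays bounded. The sketch below is an assumption-laden illustration: build_prompt and its max_chars budget are hypothetical, and the matches argument is assumed to have the same shape as results["matches"] in the code above.

```python
def build_prompt(question: str, matches: list, max_chars: int = 2000) -> str:
    """Assemble a grounded prompt from query matches, capped at roughly max_chars of context.

    Hypothetical helper for illustration; each match is expected to carry
    its original text under match["metadata"]["text"].
    """
    lines = []
    used = 0
    for n, match in enumerate(matches, start=1):
        line = f"[{n}] {match['metadata']['text']}"
        if used + len(line) > max_chars:
            break  # stop before the context budget is exceeded
        lines.append(line)
        used += len(line) + 1  # count the newline that joins the lines
    context = "\n".join(lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Example with hand-written matches standing in for a real query result
sample_matches = [
    {"metadata": {"text": "Pinecone is a managed vector database."}},
    {"metadata": {"text": "ChromaDB is an open-source vector store."}},
]
print(build_prompt("How do Pinecone and ChromaDB differ?", sample_matches))
```

Because matches arrive sorted by score, truncating from the end drops the least relevant chunks first.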
Pinecone vs ChromaDB
The two most popular vector stores for Python developers in 2026 take opposite approaches:
- Pinecone — managed cloud service: no infrastructure to run, automatic replication, horizontal scaling. Ideal when you need a production-ready vector store without an ops team.
- ChromaDB — open source, self-hosted: runs in-process (embedded mode) or as a standalone server. Zero cost, full data control, great for local development and privacy-sensitive workloads.
- Scale: Pinecone handles billions of vectors with single-digit millisecond latency. ChromaDB works well up to tens of millions of vectors on a single machine.
- Cost: Pinecone free tier is generous for experimentation; paid tiers start at $70/month. ChromaDB is free to self-host, but you pay for compute and storage.
- Metadata filtering: both support server-side filtering by metadata fields at query time.
- When to choose Pinecone: production SaaS, large datasets, teams without dedicated DevOps, or when you need SLAs and uptime guarantees.
- When to choose ChromaDB: local development, RAG prototypes, air-gapped environments, budget constraints, or when you want to avoid vendor lock-in.
For a hands-on ChromaDB walkthrough, see the ChromaDB Tutorial. For a full comparison of five vector databases including Weaviate, Qdrant, and pgvector, see Vector Databases Comparison (2026).
Pinecone Pricing
Pinecone uses a usage-based pricing model with three tiers as of 2026:
- Free (Starter) — 2 GB storage, 1 serverless index, unlimited queries within the free allocation. No credit card required. Sufficient for prototypes and small projects under ~500k vectors.
- Standard — from $70/month — multiple indexes, higher storage limits, dedicated support SLA, higher query throughput. Suitable for production applications with moderate traffic.
- Enterprise — $100+/month (custom) — private clusters, VPC peering, SSO, custom SLAs, dedicated account management. For large-scale production with compliance requirements.
Query and write costs are measured in read units (RU) and write units (WU). On the free tier you receive a fixed monthly allocation of each. For current numbers, check the Pinecone pricing page directly, as the specifics change with platform updates.
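To sanity-check whether a dataset fits in the 2 GB free allocation, a back-of-the-envelope estimate is often enough. The sketch below assumes each vector value takes 4 bytes (32-bit floats), treats 1 GB as 10^9 bytes, and ignores any index overhead; estimate_capacity and the metadata size are hypothetical inputs for illustration, not figures from Pinecone.

```python
def estimate_capacity(storage_gb: float, dimension: int,
                      metadata_bytes: int = 0, bytes_per_value: int = 4) -> int:
    """Rough count of vectors that fit in storage_gb, ignoring index overhead.

    Assumes 4-byte values and 1 GB = 1e9 bytes; both are assumptions.
    """
    bytes_per_vector = dimension * bytes_per_value + metadata_bytes
    return int(storage_gb * 1e9 // bytes_per_vector)


# 384-dimensional vectors with no metadata: about 1.3 million
print(estimate_capacity(2, 384))

# The same vectors with roughly 2.5 KB of metadata each: about half a million
print(estimate_capacity(2, 384, metadata_bytes=2500))
```

Under these assumptions, the raw vectors are only 1,536 bytes each, so metadata (especially stored source text) tends to dominate. With a few kilobytes of metadata per record, the estimate lands near the ~500k figure quoted for the free tier.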
Summary
- Pinecone is a managed vector database: create an index with pc.create_index(), upsert vectors with metadata, and query by similarity with index.query().
- Use all-MiniLM-L6-v2 from Sentence Transformers for fast, accurate local embeddings with no API key required.
- Batch upserts in chunks of 100 vectors for best performance on large datasets.
- Add a filter parameter to index.query() to restrict results by metadata fields.
- Combine Pinecone with claude-sonnet-4-6 for a production RAG pipeline: retrieve with Pinecone, generate with Claude.
- Choose Pinecone for managed, scalable production deployments; choose ChromaDB for local development and cost-sensitive projects.
- The free tier covers 2 GB and one index — enough to build and test a complete semantic search application.
Further reading: RAG Tutorial with Python — full end-to-end retrieval pipeline; ChromaDB Tutorial — self-hosted alternative with zero setup; Vector Databases Comparison (2026) — Pinecone, Chroma, Weaviate, Qdrant, pgvector side by side.