A vector database stores data as high-dimensional vectors and retrieves it by semantic similarity — not keyword match. This tutorial builds a complete semantic search pipeline from scratch: generate embeddings, store them in a vector database, and query with natural language. You’ll use Python, OpenAI embeddings, and ChromaDB.
If you need a comparison of different vector databases, see Vector Databases Comparison: Pinecone vs Chroma vs Weaviate. For building a full RAG chatbot, see How to Build a RAG Chatbot with Python.
What Is a Vector Database?
When you embed text with a model like text-embedding-3-small, you get a list of 1536 numbers, which you can treat as a point in high-dimensional space. Similar texts produce points that are geometrically close to each other.
A vector database stores those points and answers the question: “Which stored vectors are closest to this query vector?” That’s semantic search — find results by meaning, not keyword.
- Exact search: checks every vector (slow at scale, perfect recall)
- ANN (Approximate Nearest Neighbour): uses indexes like HNSW for millisecond search over millions of vectors
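To make the exact-search option concrete, here is a minimal brute-force sketch in plain Python. The toy three-dimensional vectors and the names `cosine_similarity` and `exact_search` are illustrative assumptions, not part of any library; real embeddings have far more dimensions, and an ANN index such as HNSW exists precisely to avoid this full scan.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


def exact_search(query: list[float], stored: dict, k: int = 2) -> list[tuple]:
    """Score every stored vector against the query and keep the top k."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in stored.items())
    return heapq.nlargest(k, scored)


# Toy 3-dimensional "embeddings" (made-up values for illustration only)
stored_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

top = exact_search([1.0, 0.0, 0.0], stored_vectors, k=2)
for score, key in top:
    print(f"{key}: {score:.3f}")
```

Every query touches every stored vector, so the cost grows linearly with the collection size. That is fine for a few thousand vectors and is the reason approximate indexes are used beyond that.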
Prerequisites
- Python 3.9+
- OpenAI API key (OPENAI_API_KEY)
- Basic Python knowledge
Step 1: Install Dependencies
pip install chromadb openai

chromadb is a local vector database that runs in-process, so no server is needed. openai is the client library used to call the embedding model.
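Before moving on, it can save a confusing stack trace later to confirm that the API key is visible to Python. The helper below is a small sketch; `require_env` is a hypothetical name introduced here, not something the tutorial's libraries provide.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the tutorial")
    return value


try:
    require_env("OPENAI_API_KEY")
    print("OPENAI_API_KEY is set")
except RuntimeError as exc:
    print(f"Setup problem: {exc}")
```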
Step 2: Generate Embeddings
An embedding turns text into a vector. Here’s how to generate one:
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

vector = embed("Python is a popular programming language")
print(f"Dimensions: {len(vector)}")  # 1536

Step 3: Set Up ChromaDB
Create a persistent collection to store documents and their vectors:
import chromadb

# Persistent storage: data survives process restarts
client_db = chromadb.PersistentClient(path="./chroma_store")

# Create (or load) a collection
collection = client_db.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},  # cosine similarity
)
print(f"Collection has {collection.count()} documents")

Step 4: Add Documents
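This step embeds all eight texts in a single request. For a larger corpus you would typically split the inputs across several requests, and the sketch below shows one way to do that in plain Python. The `batched` helper and the batch size of 3 are illustrative assumptions; check the embeddings API documentation for the actual per-request limits.

```python
from typing import Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


texts = [f"document {i}" for i in range(8)]
batches = list(batched(texts, batch_size=3))
print([len(b) for b in batches])  # [3, 3, 2]
```

Each batch would then be passed as `input` to `client.embeddings.create`, with the resulting vectors concatenated in the same order as the texts.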
Insert documents with their embeddings and metadata:
documents = [
    {"id": "1", "text": "Python is a high-level programming language known for its readability."},
    {"id": "2", "text": "JavaScript runs in the browser and is used for web development."},
    {"id": "3", "text": "Rust is a systems language focused on memory safety and performance."},
    {"id": "4", "text": "Machine learning models learn patterns from training data."},
    {"id": "5", "text": "Neural networks are inspired by the structure of the human brain."},
    {"id": "6", "text": "Docker containers package applications with their dependencies."},
    {"id": "7", "text": "PostgreSQL is a powerful open-source relational database."},
    {"id": "8", "text": "Vector databases store embeddings for semantic similarity search."},
]

# Generate embeddings in batch (one request instead of eight)
texts = [d["text"] for d in documents]
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [e.embedding for e in response.data]

# Insert into ChromaDB
# Note: to overwrite existing IDs when re-running this script,
# call collection.upsert(...) with the same arguments instead.
collection.add(
    ids=[d["id"] for d in documents],
    embeddings=vectors,
    documents=texts,
    metadatas=[{"source": "tutorial"} for _ in documents],
)
print(f"Inserted {collection.count()} documents")

Step 5: Semantic Search
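`collection.query` accepts a list of query embeddings and returns one list of hits per query, which is why the search function in this step indexes `[0]`. The sketch below flattens a hand-written dictionary shaped like such a result. The IDs, texts, and distances are made-up placeholders rather than real output, and `flatten_hits` is a hypothetical helper.

```python
def flatten_hits(results: dict, query_index: int = 0) -> list[dict]:
    """Turn the parallel per-query lists into one record per hit."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    return [
        # With cosine distance, similarity = 1 - distance
        {"id": i, "text": d, "score": round(1 - dist, 4)}
        for i, d, dist in zip(ids, docs, dists)
    ]


# Hand-written stand-in for a query result (placeholder values)
fake_results = {
    "ids": [["6", "1"]],
    "documents": [["first matching text", "second matching text"]],
    "distances": [[0.25, 0.5]],
}

for hit in flatten_hits(fake_results):
    print(hit)
```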
Query the collection with a natural language question:
def search(query: str, n_results: int = 3) -> list[dict]:
    query_vector = embed(query)
    results = collection.query(
        query_embeddings=[query_vector],
        n_results=n_results,
        include=["documents", "distances", "metadatas"],
    )
    output = []
    for doc, dist in zip(results["documents"][0], results["distances"][0]):
        output.append({
            "text": doc,
            "score": round(1 - dist, 4),  # cosine: 1=identical, 0=unrelated
        })
    return output

# Test queries
for query in [
    "How do I run apps in isolated environments?",
    "What language is best for systems programming?",
    "How do AI models learn?",
]:
    print(f"\nQuery: {query}")
    for r in search(query, n_results=2):
        print(f" [{r['score']:.3f}] {r['text']}")

Example output:
Query: How do I run apps in isolated environments?
 [0.847] Docker containers package applications with their dependencies.
 [0.612] Python is a high-level programming language known for its readability.

Query: What language is best for systems programming?
 [0.821] Rust is a systems language focused on memory safety and performance.
 [0.614] Python is a high-level programming language known for its readability.

Query: How do AI models learn?
 [0.852] Machine learning models learn patterns from training data.
 [0.789] Neural networks are inspired by the structure of the human brain.

Step 6: Filter by Metadata
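Conceptually, a metadata filter narrows the candidate set, and similarity ranking then orders whatever remains. The plain-Python sketch below mimics that idea for a simple equality condition. It illustrates the concept with made-up scores and does not describe how ChromaDB implements filtering internally; `filter_then_rank` is a hypothetical name.

```python
def filter_then_rank(records: list[dict], where: dict, n_results: int) -> list[dict]:
    """Keep records whose metadata equals every key in `where`, then rank by score."""
    matching = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    return sorted(matching, key=lambda r: r["score"], reverse=True)[:n_results]


# Made-up similarity scores for illustration
records = [
    {"text": "Python web frameworks", "score": 0.81, "metadata": {"language": "python"}},
    {"text": "Python data science", "score": 0.55, "metadata": {"language": "python"}},
    {"text": "Java web frameworks", "score": 0.84, "metadata": {"language": "java"}},
]

hits = filter_then_rank(records, where={"language": "python"}, n_results=2)
print([h["text"] for h in hits])  # ['Python web frameworks', 'Python data science']
```

The Java record has the highest score here but never appears, because the filter removes it before ranking.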
Combine semantic search with metadata filters — useful for multi-tenant apps:
# Add documents with different categories
new_docs = ["Python web frameworks", "Python data science", "Java web frameworks"]
collection.add(
    ids=["doc_a", "doc_b", "doc_c"],
    embeddings=[embed(t) for t in new_docs],
    documents=new_docs,
    metadatas=[
        {"language": "python", "category": "web"},
        {"language": "python", "category": "data"},
        {"language": "java", "category": "web"},
    ],
)

# Search only within Python documents
results = collection.query(
    query_embeddings=[embed("building web applications")],
    n_results=2,
    where={"language": {"$eq": "python"}},  # metadata filter
)
print(results["documents"])  # Only Python results

Step 7: Update and Delete
# Update a document (re-embed the new text)
new_text = "Python 3.12 is a high-level language known for readability and speed."
collection.update(
    ids=["1"],
    embeddings=[embed(new_text)],
    documents=[new_text],
)

# Delete a document
collection.delete(ids=["7"])
print(f"Collection now has {collection.count()} documents")

Complete Pipeline: Index a PDF
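The pipeline in this section splits text into overlapping word windows before embedding. To see what the overlap does, the sketch below runs equivalent windowing logic on a ten-word string, with a small `size` and `overlap` chosen only to keep the output readable. `chunk_words` is a stand-in name used for this illustration.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap  # how far each window advances
    return [" ".join(words[i : i + size]) for i in range(0, len(words), step)]


sample = "one two three four five six seven eight nine ten"
chunks = chunk_words(sample, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts with the last word of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk. Note that the final window here contains only "ten", which the previous chunk already covered; a variant could stop as soon as a window reaches the end of the text.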
Here’s a full example that indexes a PDF and answers questions about it:
# Requires one extra dependency: pip install PyPDF2
import os
import uuid

import chromadb
import PyPDF2
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
db = chromadb.PersistentClient(path="./pdf_store")
col = db.get_or_create_collection("pdf_docs", metadata={"hnsw:space": "cosine"})

def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if overlap >= size:
        # Otherwise the loop below would never advance
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i : i + size]))
        i += size - overlap
    return chunks

def index_pdf(path: str) -> int:
    """Extract text from a PDF, embed it chunk by chunk, and store the chunks."""
    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        text = "\n".join(p.extract_text() or "" for p in reader.pages)
    chunks = chunk_text(text)
    if not chunks:
        return 0
    # One request for all chunks; very large PDFs may need several requests
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=chunks)
    vectors = [e.embedding for e in resp.data]
    col.add(
        ids=[str(uuid.uuid4()) for _ in chunks],
        embeddings=vectors,
        documents=chunks,
        metadatas=[{"source": os.path.basename(path)} for _ in chunks],
    )
    return len(chunks)

def answer(question: str) -> str:
    """Retrieve the most relevant chunks and ask the chat model to answer from them."""
    q_vec = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    results = col.query(query_embeddings=[q_vec], n_results=4)
    context = "\n\n".join(results["documents"][0])
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using ONLY the provided context. If not found, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# Usage
n = index_pdf("document.pdf")
print(f"Indexed {n} chunks")
print(answer("What are the main topics covered?"))

Scaling Beyond ChromaDB
ChromaDB is ideal for development and small datasets (< 1M vectors). For production at scale:
- Pinecone — fully managed, serverless, handles billions of vectors. See Pinecone Tutorial
- Weaviate — open-source, supports hybrid search (vector + BM25)
- pgvector — PostgreSQL extension; good if you’re already on Postgres
- Qdrant — high-performance Rust-based, supports filtering on payloads
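One way to keep a later migration cheap is to hide the database behind a small interface, so the rest of the code never calls ChromaDB directly. The sketch below is a hypothetical design: `VectorStore`, `InMemoryStore`, `top_text`, and their method names are assumptions made for illustration, not an API of any library listed above.

```python
from typing import Protocol


class VectorStore(Protocol):
    """Minimal surface the application depends on (hypothetical interface)."""

    def add(self, ids: list[str], vectors: list[list[float]], texts: list[str]) -> None: ...

    def search(self, vector: list[float], k: int) -> list[tuple[str, float]]: ...


class InMemoryStore:
    """Brute-force reference implementation, handy for unit tests."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], str]] = {}

    def add(self, ids: list[str], vectors: list[list[float]], texts: list[str]) -> None:
        for id_, vec, text in zip(ids, vectors, texts):
            self._rows[id_] = (vec, text)

    def search(self, vector: list[float], k: int) -> list[tuple[str, float]]:
        # Dot-product ranking; this matches cosine ranking when the
        # stored vectors are unit length, as in the toy data below.
        scored = [
            (text, sum(x * y for x, y in zip(vector, vec)))
            for vec, text in self._rows.values()
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def top_text(store: VectorStore, query_vector: list[float]) -> str:
    """Application code sees only the interface, not the backend."""
    return store.search(query_vector, k=1)[0][0]


store = InMemoryStore()
store.add(["a", "b"], [[1.0, 0.0], [0.0, 1.0]], ["about x", "about y"])
print(top_text(store, [0.9, 0.1]))
```

A ChromaDB-backed class would implement the same two methods by delegating to `collection.add` and `collection.query`, and a backend for another database could be swapped in without touching `top_text`.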
Summary
- Embed text with text-embedding-3-small → 1536-dim float vectors
- Store in ChromaDB with collection.add(ids, embeddings, documents)
- Query with collection.query(query_embeddings, n_results)
- Filter results with where metadata conditions
- For production scale, move to Pinecone, Weaviate, or pgvector