Vector Database Tutorial: Build Semantic Search from Scratch (2026)

A vector database stores data as high-dimensional vectors and retrieves it by semantic similarity — not keyword match. This tutorial builds a complete semantic search pipeline from scratch: generate embeddings, store them in a vector database, and query with natural language. You’ll use Python, OpenAI embeddings, and ChromaDB.

If you need a comparison of different vector databases, see Vector Databases Comparison: Pinecone vs Chroma vs Weaviate. For building a full RAG chatbot, see How to Build a RAG Chatbot with Python.


What Is a Vector Database?

When you embed text with a model like text-embedding-3-small, you get a list of 1,536 numbers: a point in high-dimensional space. Texts with similar meaning produce points that are geometrically close to each other.

A vector database stores those points and answers the question: “Which stored vectors are closest to this query vector?” That’s semantic search: finding results by meaning rather than by keyword.

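  • Exact search: compares the query against every stored vector (perfect recall, but the cost grows linearly with the size of the collection)
  • ANN (Approximate Nearest Neighbour): uses an index such as HNSW to search millions of vectors in milliseconds, at the cost of occasionally missing a true nearest neighbour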
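
To make the exact-search idea concrete, here is a minimal brute-force sketch in plain Python. The three-dimensional vectors and the names cosine_similarity and exact_search are made up for illustration; real embeddings have far more dimensions, and a vector database replaces this linear scan with an index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_search(query: list[float], stored: dict, k: int = 2) -> list[str]:
    """Score every stored vector against the query and return the top k ids."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in stored.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
stored = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.0],
    "taxes": [0.0, 0.1, 0.9],
}

print(exact_search([1.0, 0.2, 0.0], stored))  # ['cats', 'dogs']
```

Every query touches every stored vector, which is why this approach stops being practical as the collection grows.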

Prerequisites

  • Python 3.9+
  • OpenAI API key (OPENAI_API_KEY)
  • Basic Python knowledge

Step 1: Install Dependencies

pip install chromadb openai

chromadb is a local vector database that runs in-process — no server needed. openai provides the embedding model.


Step 2: Generate Embeddings

An embedding turns text into a vector. Here’s how to generate one:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


vector = embed("Python is a popular programming language")
print(f"Dimensions: {len(vector)}")  # 1536

Step 3: Set Up ChromaDB

Create a persistent collection to store documents and their vectors:

import chromadb

# Persistent storage — data survives process restarts
client_db = chromadb.PersistentClient(path="./chroma_store")

# Create (or load) a collection
collection = client_db.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},  # cosine similarity
)

print(f"Collection has {collection.count()} documents")
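
The hnsw:space setting chooses how distance is measured. As a rough sketch of the three options ChromaDB offers (l2, ip, cosine), the plain-Python functions below follow the formulas as described in the Chroma documentation at the time of writing. The function names are illustrative and this is not Chroma’s own implementation.

```python
import math


def l2_distance(a, b):
    """Squared Euclidean distance ("l2")."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    """Inner-product distance ("ip"): 1 minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance ("cosine"): 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


a = [1.0, 0.0]
b = [0.0, 1.0]  # orthogonal to a
print(l2_distance(a, b), ip_distance(a, b), cosine_distance(a, b))  # 2.0 1.0 1.0
```

For unit-length vectors, cosine distance and inner-product distance are the same number, so the choice mostly matters when vectors are not normalised.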

Step 4: Add Documents

Insert documents with their embeddings and metadata:

documents = [
    {"id": "1", "text": "Python is a high-level programming language known for its readability."},
    {"id": "2", "text": "JavaScript runs in the browser and is used for web development."},
    {"id": "3", "text": "Rust is a systems language focused on memory safety and performance."},
    {"id": "4", "text": "Machine learning models learn patterns from training data."},
    {"id": "5", "text": "Neural networks are inspired by the structure of the human brain."},
    {"id": "6", "text": "Docker containers package applications with their dependencies."},
    {"id": "7", "text": "PostgreSQL is a powerful open-source relational database."},
    {"id": "8", "text": "Vector databases store embeddings for semantic similarity search."},
]

# Generate embeddings in batch (more efficient)
texts = [d["text"] for d in documents]
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [e.embedding for e in response.data]

# Insert into ChromaDB
collection.add(
    ids=[d["id"] for d in documents],
    embeddings=vectors,
    documents=texts,
    metadatas=[{"source": "tutorial"} for _ in documents],
)

print(f"Inserted {collection.count()} documents")
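
A single embeddings request can only carry a limited number of inputs, so larger corpora need to be split into batches. The helper below is a minimal sketch; batched and the batch size of 100 are illustrative choices rather than values from the OpenAI documentation, so check the current API limits before relying on them.

```python
from typing import Iterator


def batched(items: list[str], batch_size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


texts = [f"document {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in batched(texts)]
print(batch_sizes)  # [100, 100, 50]
```

Each batch would then go through client.embeddings.create(...) and collection.add(...) in the same way as the eight documents above.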

Step 5: Semantic Search

Query the collection with a natural language question:

def search(query: str, n_results: int = 3) -> list[dict]:
    query_vector = embed(query)

    results = collection.query(
        query_embeddings=[query_vector],
        n_results=n_results,
        include=["documents", "distances", "metadatas"],
    )

    output = []
    for doc, dist in zip(results["documents"][0], results["distances"][0]):
        output.append({
            "text": doc,
            "score": round(1 - dist, 4),  # cosine similarity = 1 - cosine distance (1.0 = same direction)
        })
    return output


# Test queries
for query in [
    "How do I run apps in isolated environments?",
    "What language is best for systems programming?",
    "How do AI models learn?",
]:
    print(f"\nQuery: {query}")
    for r in search(query, n_results=2):
        print(f"  [{r['score']:.3f}] {r['text']}")

Example output (illustrative; your exact scores will differ):

Query: How do I run apps in isolated environments?
  [0.847] Docker containers package applications with their dependencies.
  [0.612] Python is a high-level programming language known for its readability.

Query: What language is best for systems programming?
  [0.821] Rust is a systems language focused on memory safety and performance.
  [0.614] Python is a high-level programming language known for its readability.

Query: How do AI models learn?
  [0.852] Machine learning models learn patterns from training data.
  [0.789] Neural networks are inspired by the structure of the human brain.
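
A query returns up to n_results nearest documents even when none of them is a good match, so one common option is a minimum-score cut-off. The sketch below works on a hand-written dictionary shaped like the query result used in search; the filter_by_score name, the 0.5 threshold, and the sample distances are assumptions for illustration.

```python
def filter_by_score(results: dict, min_score: float = 0.5) -> list[dict]:
    """Keep hits whose cosine similarity (1 - distance) reaches min_score."""
    hits = []
    for doc, dist in zip(results["documents"][0], results["distances"][0]):
        score = 1 - dist
        if score >= min_score:
            hits.append({"text": doc, "score": round(score, 4)})
    return hits


# Hand-written stand-in for a query result (not real model output)
fake_results = {
    "documents": [["Docker containers package applications.", "PostgreSQL is a database."]],
    "distances": [[0.2, 0.7]],
}

print(filter_by_score(fake_results))
```

A sensible threshold depends on the embedding model and the data, so it is worth tuning against a handful of real queries.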

Step 6: Filter by Metadata

Combine semantic search with metadata filters — useful for multi-tenant apps:

# Add documents with different categories
collection.add(
    ids=["doc_a", "doc_b", "doc_c"],
    embeddings=[embed(t) for t in ["Python web frameworks", "Python data science", "Java web frameworks"]],
    documents=["Python web frameworks", "Python data science", "Java web frameworks"],
    metadatas=[
        {"language": "python", "category": "web"},
        {"language": "python", "category": "data"},
        {"language": "java",   "category": "web"},
    ],
)

# Search only within Python documents
results = collection.query(
    query_embeddings=[embed("building web applications")],
    n_results=2,
    where={"language": {"$eq": "python"}},  # metadata filter
)

print(results["documents"])  # Only Python results
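
Filters can also combine several conditions. Chroma’s where syntax includes operators such as $eq, $in, and $and; the helper below is a small sketch that builds such a dictionary from keyword arguments. build_where is a hypothetical convenience function, not part of ChromaDB.

```python
from typing import Optional


def build_where(**conditions) -> Optional[dict]:
    """Build a Chroma-style `where` filter from field=value pairs.

    A list value becomes an `$in` test, anything else an `$eq` test.
    Several conditions are joined with `$and`.
    """
    clauses = []
    for field, value in conditions.items():
        operator = "$in" if isinstance(value, list) else "$eq"
        clauses.append({field: {operator: value}})
    if not clauses:
        return None  # no filter at all
    if len(clauses) == 1:
        return clauses[0]  # a lone condition needs no $and wrapper
    return {"$and": clauses}


print(build_where(language="python"))
print(build_where(language="python", category=["web", "data"]))
```

The result can be passed as the where argument of collection.query, for example where=build_where(language="python", category="web").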

Step 7: Update and Delete

# Update a document (re-embed the new text)
new_text = "Python 3.12 is a high-level language known for readability and speed."
collection.update(
    ids=["1"],
    embeddings=[embed(new_text)],
    documents=[new_text],
)

# Delete a document
collection.delete(ids=["7"])

print(f"Collection now has {collection.count()} documents")
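
Updating and deleting both depend on knowing which IDs belong to a document. One option, sketched below, is to derive IDs from the source name and the chunk position, so that re-indexing a changed file overwrites the same IDs and any leftover ones can be deleted. chunk_ids and stale_ids are hypothetical helpers, and the "name#0000" format is an arbitrary choice.

```python
def chunk_ids(source: str, n_chunks: int) -> list[str]:
    """Repeatable IDs such as 'report.pdf#0000', one per chunk position."""
    return [f"{source}#{i:04d}" for i in range(n_chunks)]


def stale_ids(source: str, old_count: int, new_count: int) -> list[str]:
    """IDs left over when a re-indexed source now has fewer chunks."""
    return chunk_ids(source, old_count)[new_count:]


print(chunk_ids("report.pdf", 3))
# ['report.pdf#0000', 'report.pdf#0001', 'report.pdf#0002']
print(stale_ids("report.pdf", old_count=5, new_count=3))
# ['report.pdf#0003', 'report.pdf#0004']
```

The fresh IDs could be written with collection.upsert(...), which inserts new IDs and overwrites existing ones, and the stale ones removed with collection.delete(ids=...).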

Complete Pipeline: Index a PDF

Here’s a full example that indexes a PDF and answers questions about it. It needs one extra package for PDF parsing, which Step 1 did not install (pip install PyPDF2):

import chromadb
from openai import OpenAI
import PyPDF2
import os
import uuid

openai_client = OpenAI()
db = chromadb.PersistentClient(path="./pdf_store")
col = db.get_or_create_collection("pdf_docs", metadata={"hnsw:space": "cosine"})


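def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    # Split into windows of `size` words; each shares `overlap` words with the previous one
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i : i + size]))
        if i + size >= len(words):
            break  # this window already reaches the end of the text
        i += size - overlap
    return chunks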


def index_pdf(path: str) -> int:
    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        text = "\n".join(p.extract_text() or "" for p in reader.pages)

    chunks = chunk_text(text)
    if not chunks:
        return 0

    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=chunks)
    vectors = [e.embedding for e in resp.data]

    col.add(
        ids=[str(uuid.uuid4()) for _ in chunks],
        embeddings=vectors,
        documents=chunks,
        metadatas=[{"source": os.path.basename(path)} for _ in chunks],
    )
    return len(chunks)


def answer(question: str) -> str:
    q_vec = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    results = col.query(query_embeddings=[q_vec], n_results=4)
    context = "\n\n".join(results["documents"][0])

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using ONLY the provided context. If not found, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content


# Usage
n = index_pdf("document.pdf")
print(f"Indexed {n} chunks")
print(answer("What are the main topics covered?"))
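
The overlap in chunk_text means each chunk starts size - overlap words after the previous one, so neighbouring chunks share some text and a sentence cut at a boundary is more likely to appear whole in one of them. The sketch below only computes the word ranges, which makes the arithmetic easy to see. chunk_spans is an illustrative helper, and it stops as soon as a window reaches the end of the text.

```python
def chunk_spans(n_words: int, size: int = 400, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) word indices for each sliding-window chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    spans = []
    start = 0
    while start < n_words:
        end = min(start + size, n_words)
        spans.append((start, end))
        if end == n_words:
            break  # this window already covers the tail of the text
        start += size - overlap
    return spans


print(chunk_spans(1000))  # [(0, 400), (350, 750), (700, 1000)]
```

With the defaults, a 1,000-word document yields three chunks, and words 350 to 399 and 700 to 749 each appear in two of them.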

Scaling Beyond ChromaDB

ChromaDB works well for development and for small to medium datasets (as a rough guide, under about a million vectors on a single machine). For production at larger scale, consider:

  • Pinecone — fully managed, serverless, handles billions of vectors. See Pinecone Tutorial
  • Weaviate — open-source, supports hybrid search (vector + BM25)
  • pgvector — PostgreSQL extension; good if you’re already on Postgres
  • Qdrant — high-performance Rust-based, supports filtering on payloads
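
A quick back-of-the-envelope calculation helps when judging whether a dataset still fits on one machine. The sketch below estimates the raw size of the vectors alone, assuming each dimension is stored as a 4-byte float. Index structures, documents, and metadata add more on top, so treat the result as a lower bound; raw_vector_bytes is an illustrative name.

```python
def raw_vector_bytes(n_vectors: int, dimensions: int = 1536, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors only (assumes 4-byte floats by default)."""
    return n_vectors * dimensions * bytes_per_value


size = raw_vector_bytes(1_000_000)
print(f"{size / 1e9:.2f} GB")  # 6.14 GB for one million 1536-dimensional vectors
```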

Summary

  • Embed text with text-embedding-3-small → 1536-dim float vectors
  • Store in ChromaDB with collection.add(ids, embeddings, documents)
  • Query with collection.query(query_embeddings, n_results)
  • Filter results with where metadata conditions
  • For production scale, move to Pinecone, Weaviate, or pgvector
