Build Ask Tom: A RAG Chatbot on One SQLite File

How I built Ask Tom: Pagefind chunks, text-embedding-3-small, sqlite-vec for vectors, gpt-4.1-mini for answers. One SQLite file, no Pinecone, no LangChain.

A diagram of a retrieval pipeline collapsing into a single SQLite file holding the vector index
A diagram of a retrieval pipeline collapsing into a single SQLite file holding the vector index
1 file the entire vector store (one SQLite database)
1536 embedding dimensions (text-embedding-3-small)
$0 managed vector-DB bill (no Pinecone)
6 msgs the honest limit: rolling conversation memory

Ask Tom is a retrieval-augmented chatbot that answers from my published writing, and its entire vector store is one SQLite file. The pipeline is small on purpose: the site's Pagefind search index supplies the fragments, I chunk them and embed each chunk with OpenAI's text-embedding-3-small (1536 dimensions), store the vectors in a sqlite-vec vec0 virtual table, and at query time retrieve the nearest chunks and hand them to gpt-4.1-mini for a cited answer. No Pinecone, no managed vector service, no LangChain.

Key Takeaways

  • The vector store is a file, not a service. — sqlite-vec registers a vec0 virtual table inside an ordinary SQLite database. The "vector database" is a file on disk you can copy, back up, and ship with scp. There is nothing else running.
  • The retrieval source was already built. — The site ships a Pagefind search index for on-page search. I reuse those fragments as the corpus instead of re-crawling, so the content that feeds the bot is the same content the reader can search.
  • No framework did the orchestration. — No LangChain, no agent framework. Chunk, embed, store, retrieve, prompt: each step is a few lines of plain code I can read end to end, which is the point of keeping it this small.
  • One database, two personas. — The same SQLite file and the same retrieval code power Ask Tom and Ask CTAIO. The persona is a config value that selects the system prompt and the content slice, not a second stack.

How little it actually took

There is a version of building a RAG chatbot that involves a vector-database account, a framework with its own opinions, a queue, and a diagram with seven boxes. I built the opposite. Ask Tom, the bot that answers questions from everything I have published, keeps its whole vector store in a single SQLite file. The same file, with a different config flag, also powers Ask CTAIO. There is no Pinecone bill, no managed service to page me at 3am, and no LangChain abstraction sitting between me and the four steps that matter.

I want to be precise about that "four steps," because the appeal here is not cleverness, it is the absence of it. You take content you already have, turn it into vectors, store the vectors in a file, and at question time find the closest ones and let a small model write the answer. Every part of that is something you can read in one sitting. The rest of this post is the actual stack, in the order the data flows through it.

What is the full stack?

Five pieces, and only two of them cost money to run. The retrieval source is the search index the site already ships; the storage is a file; the framework is no framework.

Stage What I use Why this and not the obvious alternative
Source corpus Pagefind fragments The site already builds them for on-page search; reusing them means the bot reads exactly what the reader can search.
Chunking Overlapping word windows Fixed windows with overlap beat a clever semantic chunker for prose, and there is nothing to debug.
Embeddings OpenAI text-embedding-3-small (1536-dim) Cheap, fast, and accurate enough for retrieving prose; a larger embedding model buys precision I do not need.
Vector store sqlite-vec (a vec0 virtual table) The vectors live in the same SQLite file as everything else. No separate vector database to run, bill, or back up.
Generation OpenAI gpt-4.1-mini The retrieved chunks do the work, so the model can be fast and cheap rather than frontier.

What is the build sequence, start to finish?

Six steps, run once to ingest and then continuously to answer. The first four build the index; the last two are the live request.

  1. Collect the corpus. Read the Pagefind fragments the site already generates, keeping each fragment's source URL.
  2. Chunk. Split each fragment into overlapping word windows so passages near a boundary survive intact.
  3. Embed. Send each chunk to text-embedding-3-small and get back a 1536-dimension vector.
  4. Store. Insert each vector, plus its text and source URL, into a sqlite-vec vec0 table in one SQLite file.
  5. Retrieve. At query time, embed the question and run a vector similarity search to pull the closest chunks.
  6. Generate. Hand the top chunks and the question to gpt-4.1-mini and return the answer with citations to the source URLs.

How does the chunking and embedding work?

The part people skip is overlap. If you cut content into hard, non-overlapping blocks, a sentence that explains the exact thing someone asked about gets split across two chunks and retrieved as neither. Overlapping windows fix that for almost no cost. The chunker is fixed-size windows with a small slide, and each chunk carries the URL it came from so the answer can cite back:

def chunk(text, size=220, overlap=40):
    words = text.split()
    step = size - overlap
    for start in range(0, len(words), step):
        window = words[start:start + size]
        if window:
            yield " ".join(window)

# embed each chunk with text-embedding-3-small (1536-dim)
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=list(chunk(fragment_text)),
)
vectors = [d.embedding for d in resp.data]

Nothing here is exotic. size and overlap are the only two knobs, and for prose a couple hundred words with a small overlap is a sane default you rarely have to touch.

How does sqlite-vec store and search the vectors?

This is the piece that replaces the managed vector database. You load the extension, declare a vec0 virtual table with the embedding dimension, and from then on it is just SQL. Storage is an INSERT; search is a SELECT ordered by distance:

import sqlite3, sqlite_vec

db = sqlite3.connect("ask-tom.db")
db.enable_load_extension(True)
sqlite_vec.load(db)

# the entire "vector database": one virtual table in one file
db.execute("""
  CREATE VIRTUAL TABLE chunks USING vec0(
    embedding float[1536],
    url TEXT,
    body TEXT
  )
""")

# query: nearest chunks to the embedded question
rows = db.execute("""
  SELECT url, body, distance
  FROM chunks
  WHERE embedding MATCH ?
  ORDER BY distance
  LIMIT 6
""", [question_vector]).fetchall()

That LIMIT 6 is the top-k handed to the model. The whole store is the ask-tom.db file: I build it on my machine, then scp it to the server and bounce the API. Deploying a new index is copying a file. There is no vector-database migration, because there is no vector database.

How does it turn retrieved chunks into a cited answer?

The retrieved chunks go into the prompt, the question goes after them, and gpt-4.1-mini writes the answer constrained to what it was given. Because the retrieval already narrowed the context to the relevant passages, the generation model does not have to be large; it has to be fast and obedient. The system prompt is where the persona lives, which is how one codebase serves two bots.

Ask Tom and Ask CTAIO share the retrieval code and, for the overlapping content, the same vectors. The persona is a config value: it selects the system prompt, the display name, and which slice of the corpus is in scope. Two products, one stack, one file. The backend is a FastAPI service at ask.tfw.bz; the front end mounts the chat component and passes a persona, which is all the wiring there is between "Tom answers" and "CTAIO answers."

What is the honest limitation?

Memory. Ask Tom keeps a rolling window of the last six messages. It answers a question well and handles a short follow-up, but it does not hold a long conversation the way an assistant with full history would. I call it the working-memory gap, and it is a deliberate trade, not an oversight: a short window keeps every prompt small, cheap, and predictable. The cost is multi-turn continuity. For a bot whose job is to answer questions from a body of writing, that is the right trade most of the time. If your use case is an extended dialogue, you widen the window or bolt on a summary memory, and you pay for it in tokens and in the complexity you were trying to avoid.

That trade is the through-line of the whole build. Every choice here, the file instead of the service, the fixed chunker instead of the clever one, the small model instead of the frontier one, the six-message window instead of full history, is the same discipline: spend complexity only where it changes the answer. For where this kind of retrieval layer actually earns its keep, I made the strategy argument in Build a Second Brain That Answers Back on ctaio.dev. If you are still deciding where your notes should live before you wire retrieval on top, the We The Flywheel Radar verdict on Obsidian vs Notion is the place to start. And if you would rather poke the finished thing than read the build, Ask Tom is live — ask it something and watch it cite its way to an answer.

FAQ

What is sqlite-vec, and why use it instead of Pinecone?

sqlite-vec is a SQLite extension that adds vector search to an ordinary SQLite database through a vec0 virtual table. You store embeddings as rows and query them with a normal SELECT using a distance function. The reason to reach for it over Pinecone or a managed vector service is honesty about scale: for a corpus of a few thousand chunks, a hosted vector database is a network hop, a bill, and an account to manage in exchange for nothing you can feel. sqlite-vec keeps the vectors in the same file as everything else, so the "database" is something you can copy with scp and back up with cp. If you ever genuinely outgrow it, you swap the storage layer and keep the rest. Most projects calling themselves RAG never get there.

How do you chunk the content before embedding it?

I start from the site's Pagefind fragments rather than re-crawling, because Pagefind already split each page into searchable units for on-page search. I split those into overlapping word windows, on the order of a couple hundred words with a small overlap so a sentence that straddles a boundary still lands intact in at least one chunk. The overlap is the part people skip and then wonder why retrieval misses the obvious passage. Each chunk keeps its source URL so the answer can cite back to the page it came from. There is no clever semantic chunker here, just fixed windows with overlap, and for prose that is plenty.

Which models does Ask Tom use, and what does it cost to run?

Two OpenAI models do the work. Embeddings come from text-embedding-3-small at 1536 dimensions, which is the cheap, fast embedding model and more than accurate enough for retrieving prose. Generation runs on gpt-4.1-mini, which is fast and inexpensive and, because the retrieved chunks do the heavy lifting, does not need to be a frontier model. The dominant cost is embedding the corpus once, plus one small embedding per question at query time; generation is a short prompt with a handful of retrieved chunks. For a personal-scale bot the monthly spend is rounding error, and there is no managed vector-database line item at all because the vectors live in the SQLite file.

Did you use LangChain or a RAG framework to build it?

No. The orchestration is plain code: a function to chunk, a call to the embeddings API, an INSERT into the vec0 table, a similarity SELECT at query time, and a prompt assembled from the top results. A framework would add abstractions over steps that are already one function each, and the cost of those abstractions is that you stop being able to read what your own system does. The retrieval layer is a FastAPI backend at ask.tfw.bz; the front end calls it with a persona. Keeping the framework out is not minimalism for its own sake, it is so the whole thing fits in your head when something breaks.

What is the biggest limitation of this setup?

Conversation memory. Ask Tom keeps a rolling window of the last six messages, so it answers a question well and handles a short follow-up, but it does not hold a long thread the way a chat assistant with full history would. This is the working-memory gap, and it is a deliberate trade: a short window keeps prompts small, cheap, and predictable, at the cost of long multi-turn continuity. For a bot whose job is to answer questions from a body of writing, that trade is usually right. If your use case is an extended back-and-forth, you would widen the window or add a summary memory, and you would pay for it in tokens and complexity.

Only 3 slots available this month

Ready to Transform Your AI Strategy?

Get personalized guidance from someone who's led AI initiatives at Adidas, Sweetgreen, and 50+ Fortune 500 projects.

Trusted by leaders at
Google · Amazon · Nike · Adidas · McDonald's