LlamaIndex RAG: Index Your Personal Knowledge Base

A builder's LlamaIndex RAG tutorial: load your notes, parse them into nodes, embed, build a VectorStoreIndex, and query it with citations. Five stages, plain Python.

A diagram of a personal knowledge base — a stack of notes — flowing through a LlamaIndex pipeline into a single queryable vector index
5 stages: load → parse → embed → index → query
1 index: a VectorStoreIndex over your whole knowledge base
~30 lines: the core pipeline, before you tune anything
1x embed: persist once, then load, with no re-embedding per run

LlamaIndex turns a folder of personal notes into a question-answering engine in five stages: SimpleDirectoryReader loads the notes as Documents, a node parser splits them into overlapping Nodes, an embedding model turns each Node into a vector, VectorStoreIndex stores those vectors, and a query engine embeds your question, retrieves the nearest Nodes, and asks an LLM for an answer, which comes back together with the source Nodes it was built from. The framework already contains the reader, the splitter, the retriever, and the response synthesizer, so the build is mostly assembly, not plumbing.

Key Takeaways

  • LlamaIndex names every stage for you. Reader, node parser, embedding model, VectorStoreIndex, query engine: the abstractions map one-to-one onto the steps a RAG pipeline needs. You assemble named parts instead of hand-wiring the plumbing.
  • Your notes are the right first corpus. A personal knowledge base is finite, it is yours, and you already know what good answers look like. That makes it the honest way to learn whether retrieval is actually working before you point it at anything bigger.
  • Embed once, persist, reload. The first run embeds the corpus and writes the index to disk. Every run after that loads the persisted index from disk in a fraction of a second, so you pay the embedding cost once, not on every startup.
  • The framework is the trade. LlamaIndex hands you a working pipeline fast, which is the point. The cost is learning its abstractions and accepting its defaults until you have a reason to override them. That is a real trade, not a free lunch.

Why use LlamaIndex for a personal knowledge base?

Because the framework is shaped like the job. Indexing your own notes and asking questions of them is the exact flow LlamaIndex was built around: its named parts — reader, node parser, embedding, index, query engine — map one-to-one onto the steps the task needs, so you spend your time choosing parts rather than wiring plumbing. I built the same pipeline the no-framework way for Ask Tom on one SQLite file, and that build is illuminating precisely because you do every step by hand. This one is the opposite lesson: how little is left to do once the framework carries the orchestration.

A personal knowledge base is also the right corpus to learn on. It is finite, it is yours, and you already know what a correct answer looks like, which means you can actually tell whether retrieval is working instead of trusting a benchmark. Get it right on your own notes first; scale is a later problem. I argued the strategy question, what a second brain is even for, separately in Build a Second Brain That Answers Back. This page assumes you have decided to build one and want the pipeline.

What are the five stages of the pipeline?

Five named parts, and each one is a LlamaIndex abstraction you configure rather than implement. The first four build the index; the last one is the live query.

Stage | LlamaIndex part | What it does
Load | SimpleDirectoryReader | Reads a folder of notes (Markdown, text, PDF) into Document objects, each carrying its source path.
Parse | Node parser / splitter | Splits each Document into Nodes: overlapping chunks small enough to embed and retrieve cleanly.
Embed | Embedding model (via Settings) | Turns each Node into a vector. Model-agnostic: set it once, swap it without touching the rest.
Index | VectorStoreIndex | Stores the Node vectors so the nearest ones to a query can be retrieved. Persist it to disk once.
Query | Query engine | Embeds the question, retrieves the top Nodes, asks the LLM for an answer, and returns it with the source Nodes it used.

What is the build sequence, start to finish?

Run the first four steps once to build and persist the index; the last step is the live request you run on every question.

  1. Load the notes. Point SimpleDirectoryReader at your knowledge-base folder and read every note into a Document.
  2. Parse into Nodes. Split the Documents into overlapping Nodes so a passage near a chunk boundary survives intact.
  3. Embed and index. Hand the Nodes to VectorStoreIndex, which embeds each one and stores the vectors.
  4. Persist. Write the index to disk so the next run loads it instead of re-embedding the corpus.
  5. Query. Build a query engine from the index, ask a question, and get an answer with citations back to the source notes.

How do you load and index your notes?

This is the part that surprises people coming from a hand-rolled pipeline: loading a folder and building a queryable index is a handful of lines, because the reader and the index do the work. You set the models once on the global Settings object, read the directory, and build the index in one call:

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Settings,
    StorageContext,
    load_index_from_storage,
)

# set the embedding model + LLM once; the rest of the code is model-agnostic
# Settings.embed_model = ...   # e.g. a small local or hosted embedding model
# Settings.llm = ...           # e.g. a small, fast generation model

# 1. load every note in the knowledge base into Documents
documents = SimpleDirectoryReader("knowledge-base/").load_data()

# 2-3. parse into Nodes, embed them, and store them in one index
index = VectorStoreIndex.from_documents(documents)

# 4. persist so the next run loads instead of re-embedding
index.storage_context.persist(persist_dir="kb-index/")

The next time the program starts, you skip the embedding cost entirely and load the persisted index straight from disk:

# on subsequent runs: load the persisted index in a fraction of a second
storage_context = StorageContext.from_defaults(persist_dir="kb-index/")
index = load_index_from_storage(storage_context)

That persist-and-reload split is the difference between a toy that re-reads everything on every launch and a tool that starts instantly. You embed each note once, not once per run.
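
The two snippets above can be folded into one load-or-build function so the same script works on the first run and on every run after it. This is a minimal sketch, not the only way to do it: has_persisted_index and load_or_build_index are hypothetical helper names, the "directory exists and contains a file" check is my own assumption about what counts as a persisted index, and the LlamaIndex calls are the same ones used above.

```python
from pathlib import Path


def has_persisted_index(persist_dir: str) -> bool:
    """Return True if persist_dir exists and contains at least one file."""
    path = Path(persist_dir)
    return path.is_dir() and any(p.is_file() for p in path.iterdir())


def load_or_build_index(
    notes_dir: str = "knowledge-base/",
    persist_dir: str = "kb-index/",
):
    """Load the persisted index if there is one; otherwise build and persist it."""
    # Imported lazily so the helper above stays usable without LlamaIndex installed.
    from llama_index.core import (
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if has_persisted_index(persist_dir):
        # Subsequent runs: read the stored vectors from disk, no embedding calls.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context)

    # First run: embed every note once, then write the index to disk.
    documents = SimpleDirectoryReader(notes_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```

Calling load_or_build_index() at startup gives you an index either way. Delete the persist directory when you want to force a full rebuild.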

How does LlamaIndex chunk notes into Nodes?

A Node is LlamaIndex’s name for a chunk: a slice of a Document small enough to embed and retrieve cleanly, carrying a reference back to its source. The default node parser already splits with overlap, which matters more than people expect — if a note is cut into hard, non-overlapping pieces, a sentence that answers the exact question can land split across two Nodes and get retrieved as neither. When you want control, you set the parser explicitly:

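from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# overlap is the knob people skip and then wonder why retrieval misses the obvious passage
# chunk_size and chunk_overlap are measured in tokens
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)

# rebuild from the Documents loaded earlier so the new splitter takes effect
index = VectorStoreIndex.from_documents(documents)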

chunk_size and chunk_overlap are the two knobs you actually touch. For prose notes the defaults are sane; you tune them only when retrieval is consistently missing or returning too much. Each Node keeps its source metadata, which is what lets the query engine cite the note an answer came from.
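
To see why overlap matters, here is a plain-Python sketch of a sliding-window splitter. It is not how SentenceSplitter works internally: the real splitter counts tokens and respects sentence boundaries, while this toy counts whitespace-separated words. chunk_words and the sample note are made up for illustration.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of chunk_size words, sharing chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


note = "alpha beta gamma delta epsilon zeta eta theta"

hard = chunk_words(note, chunk_size=4, chunk_overlap=0)
overlapped = chunk_words(note, chunk_size=4, chunk_overlap=2)

# With hard cuts, the phrase straddling the boundary is in neither chunk.
print(any("delta epsilon" in chunk for chunk in hard))        # False
# With overlap, one chunk contains the phrase intact.
print(any("delta epsilon" in chunk for chunk in overlapped))  # True
```

The price of overlap is more chunks for the same text (three instead of two here), which means more vectors to embed and store.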

How do you query the index and get cited answers?

Turn the index into a query engine and ask it a question. The engine embeds the question, runs the similarity search to pull the nearest Nodes, hands them to the LLM, and returns an answer plus the source nodes it used — so you can show your work:

# build a query engine; similarity_top_k is the number of Nodes retrieved
query_engine = index.as_query_engine(similarity_top_k=4)

response = query_engine.query(
    "What did I conclude about fractional CTO engagements?"
)

print(response)                    # the synthesized answer
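for node in response.source_nodes: # the notes it cited
    # score can be None for some retrievers, so guard before rounding
    score = round(node.score, 3) if node.score is not None else None
    print(node.metadata.get("file_name"), score)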

similarity_top_k is the number of Nodes retrieved and handed to the model, the same dial as the LIMIT in a hand-written vector query. Because retrieval has already narrowed the context to the relevant notes, the generation model does not need to be large. It needs to be fast and obedient, since the Nodes carry the substance. The source_nodes are how you keep the system honest: an answer you can trace back to the note it came from is one you can check.
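
For a feel of what that dial controls, here is a plain-Python sketch of top-k retrieval by cosine similarity. It is a toy stand-in, not LlamaIndex's retriever: the file names and the two-dimensional vectors are invented, and real embeddings have hundreds of dimensions or more.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float],
    node_vecs: dict[str, list[float]],
    k: int,
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented vectors: the first axis loosely means "consulting", the second "travel".
node_vecs = {
    "travel-plans.md": [0.0, 1.0],
    "cto-engagements.md": [1.0, 0.0],
    "pricing-notes.md": [0.9, 0.1],
}
query_vec = [1.0, 0.0]

for name, score in top_k(query_vec, node_vecs, k=2):
    print(name, round(score, 3))
```

Raising k hands the model more context and more noise; lowering it risks dropping the one Node that held the answer.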

What is the honest trade with LlamaIndex?

Abstractions. LlamaIndex hands you a working RAG pipeline fast because it already contains the reader, the splitter, the retriever, and the response synthesizer. That is the whole value, and it is real. The cost is that those same abstractions sit between you and the mechanics: when retrieval misbehaves, you are debugging the framework’s defaults and your configuration of them, not your own twenty lines. For a personal knowledge base, where the corpus is small and the goal is a useful tool rather than a deep understanding of vector search, that is usually the right trade.

When it is not the right trade, you already have the comparison built. The same pipeline wired by hand in Ask Tom on one SQLite file shows exactly what LlamaIndex is doing for you, step by visible step. The LangChain version of this build shows the same task through a broader framework, so you can feel where the two diverge. And the strategy question underneath all three — what a second brain is for and what belongs in it — is in Build a Second Brain That Answers Back. Pick the build whose trade matches what you are trying to learn; the corpus, your own notes, is the same in all of them.

FAQ

What is LlamaIndex, and how is it different from LangChain for RAG?

LlamaIndex is a data framework built specifically around retrieval-augmented generation: its core abstractions are documents, nodes, indexes, retrievers, and query engines, and they are arranged so that getting from "a folder of notes" to "a question-answering engine" is a short, opinionated path. LangChain is the broader of the two — a general orchestration framework for chains, agents, and tools, of which RAG is one use case. In practice, if your job is precisely to index a corpus and query it, LlamaIndex tends to need less assembly because the whole framework is shaped around that flow; if you are building a larger agentic system where retrieval is one component among many, LangChain's breadth earns its keep. I built the same personal-knowledge-base pipeline both ways on purpose, and the LlamaIndex version is the shorter one for this specific task.

What should I actually put in the personal knowledge base?

Start with text you wrote or curated and would trust as an answer: Markdown notes, exported Obsidian or Notion pages, meeting notes, saved articles with your own annotations. The discipline that matters is curation, not volume. A RAG pipeline retrieves whatever you feed it, so a knowledge base full of half-finished drafts and stale links returns confident answers built on half-finished drafts and stale links. I keep mine to material I would be comfortable having quoted back at me. The strategy side of that — what a second brain is actually for, and what belongs in it — is the argument I made separately; the corpus decision is upstream of every retrieval choice on this page.

Which embedding and LLM models does this tutorial assume?

The pipeline is model-agnostic by design: LlamaIndex lets you set an embedding model and an LLM in its Settings object, and the rest of the code does not change when you swap them. For a personal-scale knowledge base a small, fast embedding model is more than accurate enough for retrieving prose, and a small generation model works well because the retrieved nodes do the heavy lifting. You can run fully local with an Ollama model for both embedding and generation, or use a hosted provider — LlamaIndex supports both behind the same interface. The model names and exact defaults in LlamaIndex move quickly, so treat any specific version in example code as a starting point to confirm against the current docs, not a fixed requirement.
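
As one concrete starting point, a fully local setup with Ollama might look like the sketch below. Everything specific here is an assumption to check against the current docs: configure_local_models is a hypothetical helper, the two model names are examples rather than recommendations, and the integrations live in separate packages that have to be installed alongside llama-index.

```python
def configure_local_models(
    embed_model_name: str = "nomic-embed-text",  # example name, an assumption
    llm_model_name: str = "llama3.1",            # example name, an assumption
) -> None:
    """Point LlamaIndex's global Settings at models served by a local Ollama."""
    # Requires the optional integration packages:
    #   pip install llama-index-llms-ollama llama-index-embeddings-ollama
    from llama_index.core import Settings
    from llama_index.embeddings.ollama import OllamaEmbedding
    from llama_index.llms.ollama import Ollama

    Settings.embed_model = OllamaEmbedding(model_name=embed_model_name)
    # Local generation can be slow on first load, so allow a generous timeout.
    Settings.llm = Ollama(model=llm_model_name, request_timeout=120.0)
```

Call it once before building or loading the index. If you change the embedding model later, rebuild the index, because vectors produced by different models are not comparable.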

Do I have to re-embed my whole knowledge base every time I run it?

No, and you should not. The first run embeds every node and builds the VectorStoreIndex; after that you persist the index to disk with the storage context and load it back on startup in a fraction of a second. You only re-embed the documents that changed. This is the difference between a toy script that re-reads everything on every launch and a real tool that starts instantly — the embedding cost is paid once per document, not once per run. When you add new notes, you embed just those and insert them into the existing index rather than rebuilding from scratch.
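
Knowing which notes changed is the part you have to supply yourself. One way to do it, sketched here with only the standard library, is to keep a manifest of content hashes next to the index. find_changed_notes, save_manifest, and the manifest file are all hypothetical, and the sketch does not handle deleted notes.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, used as a cheap 'has this changed?' check."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changed_notes(
    notes_dir: str,
    manifest_path: str,
) -> tuple[list[str], dict[str, str]]:
    """Return (new or edited note paths, current manifest of path -> digest)."""
    manifest_file = Path(manifest_path)
    previous = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}

    # Keep the manifest outside notes_dir, or it will be hashed as a note itself.
    current = {
        p.relative_to(notes_dir).as_posix(): file_digest(p)
        for p in sorted(Path(notes_dir).rglob("*"))
        if p.is_file()
    }
    changed = [name for name, digest in current.items() if previous.get(name) != digest]
    return changed, current


def save_manifest(manifest_path: str, current: dict[str, str]) -> None:
    """Write the manifest only after the changed notes were indexed successfully."""
    Path(manifest_path).write_text(json.dumps(current, indent=2))
```

Each path in the changed list is a note to load and add to the existing index, for example with index.insert(document), followed by another persist. An edited note also needs its old Nodes removed first or it will be indexed twice, which this sketch leaves out.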

Is LlamaIndex overkill for a personal knowledge base?

It depends on what you want to learn. If your goal is a working question-answering layer over your notes with the least code, LlamaIndex is the opposite of overkill — it is the fastest honest path, because the framework already contains the reader, the splitter, the retriever, and the response synthesizer you would otherwise write yourself. If your goal is to understand RAG mechanically, down to how chunks are stored and how a similarity search runs, then a framework hides exactly the parts you want to see, and a no-framework build teaches more. I have done both: the framework version ships faster, the from-scratch version fits in your head. Neither is wrong; they answer different questions.
