NotebookLM Clone: Building a RAG Chat App with LangGraph, Supabase, and Next.js
Introduction
NotebookLM by Google lets you upload documents and chat with an AI that answers questions using only your uploaded content. This post documents a full-stack clone that replicates that experience: upload PDFs or paste text, then ask questions and get answers grounded in your documents, with no hallucination from external knowledge.
The project combines LangGraph for orchestration, Supabase for vector storage, Vertex AI (or Ollama) for embeddings and chat models, and Next.js for the frontend. The result is a production-ready RAG (Retrieval-Augmented Generation) application with per-chat document scoping, streaming responses, and a clean chat UI.
Project Overview
Key Features
- Document upload: PDF files (up to 2MB) and raw text snippets
- Per-chat document scoping: Each chat thread has its own document set; retrieval only uses documents from that thread
- Smart routing: An LLM decides whether to retrieve from documents or answer directly (e.g., general knowledge questions)
- Streaming responses: Server-Sent Events (SSE) for real-time token streaming
- Chat history: Multiple sessions with sidebar, edit/retry, and session persistence
- Dual model support: Vertex AI (Gemini) by default, or local Ollama for development
Architecture at a Glance
┌────────────────────────────────────────────────────────────────────────────┐
│ Next.js Frontend (React 19, Tailwind)                                      │
│   ├── /api/upload     → POST PDFs or text                                  │
│   ├── /api/chat       → POST message, stream SSE                           │
│   └── /api/documents  → GET list, DELETE remove                            │
└────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌────────────────────────────────────────────────────────────────────────────┐
│ LangGraph Server (LangGraph SDK)                                           │
│   ├── upload_graph    → PDF/text → chunk → embed → Supabase                │
│   └── retrieval_graph → route → retrieve/answer → stream                   │
└────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌────────────────────────────────────────────────────────────────────────────┐
│ Supabase (PostgreSQL + pgvector)                                           │
│   documents table: id, content, embedding, metadata (thread_id, filename)  │
└────────────────────────────────────────────────────────────────────────────┘
Tech Stack
| Layer | Technology |
|---|---|
| Frontend | Next.js 16, React 19, Tailwind CSS, React Markdown, KaTeX |
| Backend | LangGraph, LangGraph SDK |
| Embeddings | Vertex AI Gemini Embedding 001 (768d) or Ollama nomic-embed-text |
| Chat Models | Vertex AI Gemini 2.5 Flash or Ollama gemma3:4b |
| Vector Store | Supabase (pgvector) |
| PDF Parsing | pdf-parse, LangChain PDFLoader |
Project Structure
notebooklm-clone/
├── frontend/
│   ├── app/
│   │   ├── api/
│   │   │   ├── chat/route.ts       # SSE streaming chat
│   │   │   ├── upload/route.ts     # PDF/text upload
│   │   │   └── documents/route.ts  # list/delete docs
│   │   ├── layout.tsx
│   │   └── page.tsx
│   ├── components/
│   │   ├── Messages.tsx
│   │   ├── MessageInput.tsx
│   │   ├── Sidebar.tsx
│   │   ├── AddTextModal.tsx
│   │   └── file-preview.tsx
│   ├── hooks/
│   │   ├── useChat.ts
│   │   └── useChatHistory.ts
│   └── lib/
│       ├── langgraph.ts
│       └── utils/sse.ts
├── backend/
│   ├── src/
│   │   ├── retrieval_graph/
│   │   │   ├── graph.ts
│   │   │   ├── prompts.ts
│   │   │   └── utils.ts
│   │   ├── upload_graph/
│   │   │   └── graph.ts
│   │   └── shared/
│   │       ├── retrieval.ts
│   │       ├── pdf.ts
│   │       ├── state.ts
│   │       └── configuration.ts
│   └── langgraph.json
└── tests/
    └── load-test.js
LangGraph Architecture
The backend exposes two graphs via langgraph.json:
{
  "node_version": "22",
  "graphs": {
    "upload_graph": "./src/upload_graph/graph.ts:graph",
    "retrieval_graph": "./src/retrieval_graph/graph.ts:graph"
  },
  "dependencies": ["."]
}
Upload Graph
The upload graph handles document ingestion and management:
- checkOperation → Determines whether the request is upload, list, or delete
- uploadDocs → For PDFs: decode base64 → temp file → PDFLoader → chunk → embed → Supabase. For text: create a single document with source: user_text and text_id
- manageDocuments → For list: query Supabase by thread_id, return files and text sources. For delete: remove documents by filename or text_id
All documents are tagged with thread_id in metadata so retrieval is scoped per chat.
Chunking configuration:
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1500,
  chunkOverlap: 200,
});
const splitDocs = await splitter.splitDocuments(docs);
Rate limiting: The upload graph uses batched embedding (20 docs per batch) with retry and exponential backoff for Vertex AI rate limits.
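A minimal sketch of that batching pattern, assuming a LangChain vector store (the batch size matches the text above, but the retry budget and helper name are illustrative):
import type { Document } from "@langchain/core/documents";
import type { VectorStore } from "@langchain/core/vectorstores";

const BATCH_SIZE = 20;
const MAX_RETRIES = 5; // illustrative retry budget

async function addDocumentsInBatches(store: VectorStore, docs: Document[]): Promise<void> {
  for (let i = 0; i < docs.length; i += BATCH_SIZE) {
    const batch = docs.slice(i, i + BATCH_SIZE);
    for (let attempt = 0; ; attempt++) {
      try {
        // addDocuments embeds the batch and writes the rows to the store
        await store.addDocuments(batch);
        break;
      } catch (err) {
        if (attempt >= MAX_RETRIES) throw err;
        const delayMs = 1000 * 2 ** attempt; // exponential backoff: 1s, 2s, 4s, ...
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
}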
Retrieval Graph
The retrieval graph implements the RAG flow with intelligent routing:
START → checkQueryType → route
                           │
                           ├── retrieve → retrieveDocuments → generateResponse → END
                           │
                           └── direct → directAnswer → END
- checkQueryType → An LLM with structured output decides: retrieve (needs documents) or direct (general knowledge).
- routeQuery → Maps the decision to retrieveDocuments or directAnswer.
- retrieveDocuments → Invokes the Supabase retriever with the thread_id filter, returns top-k chunks.
- generateResponse → Formats chunks as context, invokes the chat model with a strict "answer only from context" prompt.
- directAnswer → For general questions, invokes the model with conversation history only (no retrieval).
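For reference, here is a sketch of how these nodes can be wired together with @langchain/langgraph. The node names match the list above; the state shape and the stub node bodies are illustrative rather than the project's exact graph.ts:
import { Annotation, StateGraph, START, END } from "@langchain/langgraph";
import type { BaseMessage } from "@langchain/core/messages";
import type { Document } from "@langchain/core/documents";

// Illustrative state: conversation history, the user query, the routing decision,
// and the retrieved chunks.
const StateAnnotation = Annotation.Root({
  messages: Annotation<BaseMessage[]>({ reducer: (a, b) => a.concat(b), default: () => [] }),
  query: Annotation<string>,
  route: Annotation<"retrieve" | "direct">,
  documents: Annotation<Document[]>,
});
type State = typeof StateAnnotation.State;

// Stub node bodies; the real logic lives in retrieval_graph/graph.ts.
const checkQueryType = async (_state: State) => ({ route: "retrieve" as const });
const retrieveDocuments = async (_state: State) => ({ documents: [] as Document[] });
const generateResponse = async (_state: State) => ({ messages: [] as BaseMessage[] });
const directAnswer = async (_state: State) => ({ messages: [] as BaseMessage[] });

export const graph = new StateGraph(StateAnnotation)
  .addNode("checkQueryType", checkQueryType)
  .addNode("retrieveDocuments", retrieveDocuments)
  .addNode("generateResponse", generateResponse)
  .addNode("directAnswer", directAnswer)
  .addEdge(START, "checkQueryType")
  // routeQuery: map the classifier's decision to the next node
  .addConditionalEdges("checkQueryType", (state) =>
    state.route === "retrieve" ? "retrieveDocuments" : "directAnswer"
  )
  .addEdge("retrieveDocuments", "generateResponse")
  .addEdge("generateResponse", END)
  .addEdge("directAnswer", END)
  .compile();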
Router prompt (excerpt):
Select 'retrieve' ONLY if:
- The question asks for specific details, facts, policies, procedures...
- It refers to internal company information, product specifications...
- The question mentions or implies reliance on uploaded files...
Select 'direct' if:
- The question involves general knowledge, reasoning, creative tasks...
- No specific document or proprietary information is required.
Response prompt:
Answer the user's question using ONLY the provided context below.
- Base the answer exclusively on the retrieved context.
- Never add or invent information not supported by the context.
- If the context lacks sufficient information, respond with "I don't have sufficient information..."
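The generation step itself is a small piece of glue: join the retrieved chunks into a context block and invoke the chat model with the grounding prompt. A sketch, with the prompt abbreviated from the excerpt above and the helper name and context format illustrative:
import { ChatPromptTemplate } from "@langchain/core/prompts";
import type { Document } from "@langchain/core/documents";
import type { BaseChatModel } from "@langchain/core/language_models/chat_models";

const RESPONSE_PROMPT = ChatPromptTemplate.fromMessages([
  ["system", "Answer the user's question using ONLY the provided context.\n\nContext:\n{context}"],
  ["human", "{question}"],
]);

async function answerFromContext(model: BaseChatModel, docs: Document[], question: string) {
  // Label each chunk with its source so the model can attribute statements.
  const context = docs
    .map((doc, i) => `[${i + 1}] ${doc.metadata.filename ?? "user text"}\n${doc.pageContent}`)
    .join("\n\n");
  return RESPONSE_PROMPT.pipe(model).invoke({ context, question });
}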
Configuration and Model Selection
The backend supports two modes via USE_OLLAMA:
- Cloud (default): Vertex AI Gemini 2.5 Flash for chat, Gemini Embedding 001 for embeddings
- Local: Ollama with gemma3:4b and nomic-embed-text:latest
const getDefaultQueryModel = (): string => {
  const useOllama = process.env.USE_OLLAMA === "true";
  if (useOllama) {
    return "ollama/gemma3:4b";
  } else {
    return "vertexai/gemini-2.5-flash";
  }
};
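Downstream, that provider/model string has to be resolved into a model instance. A sketch of that resolution (the helper name is illustrative; the constructors are the standard ChatVertexAI and ChatOllama classes, and project credentials are assumed to come from the environment):
import { ChatVertexAI } from "@langchain/google-vertexai";
import { ChatOllama } from "@langchain/ollama";

function loadChatModel(spec: string) {
  // spec looks like "vertexai/gemini-2.5-flash" or "ollama/gemma3:4b"
  const [provider, model] = spec.split("/");
  if (provider === "ollama") {
    return new ChatOllama({
      model,
      baseUrl: process.env.OLLAMA_BASE_URL ?? "http://localhost:11434",
    });
  }
  return new ChatVertexAI({
    model,
    location: process.env.GOOGLE_CLOUD_LOCATION,
  });
}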
Retrieval configuration is passed from the frontend:
export const retrievalAssistantConfig = {
  queryModel: "vertexai/gemini-2.5-flash",
  retrieverProvider: "supabase",
  filterKwargs: {},
  k: 5,
};
The chat API injects thread_id into filterKwargs so the retriever only fetches documents for the current chat.
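In the /api/chat route that injection is a one-line override of the assistant config before the run is started; a sketch, where threadId comes from the request body and configWithThread is the object the streaming code below passes to the graph:
const configWithThread = {
  ...retrievalAssistantConfig,
  // Scope retrieval to this chat: the retriever matches this against document metadata.
  filterKwargs: { thread_id: threadId },
};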
Frontend Implementation
Chat Flow
- Thread creation: On mount, useChat creates a thread via client.threads.create({ graphId: "retrieval_graph" }).
- Message send: User submits → POST /api/chat with message, threadId, and optional messagesBeforeEdit (for edit/retry).
- State sync: Before streaming, the API calls client.threads.updateState(threadId, { values: { messages } }) so the graph has conversation history.
- Streaming: client.runs.stream(threadId, assistantId, { input: { query }, streamMode: ["messages", "updates"] }) returns an async iterator; the API encodes chunks as SSE and streams to the client.
- Client parsing: readSSEStream and parseSSEMessageChunk extract content; the UI appends to the last assistant message for live streaming.
SSE Handling
The chat API returns text/event-stream with JSON payloads:
const eventStream = await client.runs.stream(threadId, targetAssistantId, {
  input: { query: message },
  streamMode: ["messages", "updates"],
  config: { configurable: configWithThread },
});
const encoder = new TextEncoder();
const sseStream = new ReadableStream({
  async start(controller) {
    for await (const chunk of eventStream) {
      // SSE frames require the "data: " prefix and a blank-line terminator.
      controller.enqueue(encoder.encode(`data: ${JSON.stringify(chunk)}\n\n`));
    }
    controller.close();
  },
});
The frontend filters out classifier output from checkQueryType so only the final answer is shown.
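A sketch of that filtering, assuming the parsed "messages" events expose the emitting node as metadata.langgraph_node (the exact payload shape depends on the stream mode, so treat the field and helper names as illustrative):
// Skip tokens produced by the router so only retrieval/answer nodes reach the UI.
function shouldDisplayChunk(parsed: {
  content?: string;
  metadata?: { langgraph_node?: string };
}): boolean {
  if (!parsed.content) return false;
  return parsed.metadata?.langgraph_node !== "checkQueryType";
}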
Upload Flow
- PDF: FormData with files → convert to base64 → client.runs.create(threadId, "upload_graph", { input: { pdfFile, threadId } }) → poll until the run completes.
- Text: JSON body { text, threadId?, textId? } → same upload_graph with textContent and textId.
If no threadId is provided, the API creates a new thread and returns it; the frontend switches to that thread and loads its document list.
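A sketch of the PDF branch of the upload route using the LangGraph SDK; the inner shape of pdfFile and the completion handling are illustrative:
import { Client } from "@langchain/langgraph-sdk";

const client = new Client({ apiUrl: process.env.NEXT_PUBLIC_LANGGRAPH_API_URL });

async function uploadPdf(file: File, threadId: string): Promise<void> {
  // Browsers can't pass File objects into the graph input, so send base64.
  const pdfFile = {
    name: file.name,
    data: Buffer.from(await file.arrayBuffer()).toString("base64"),
  };
  const run = await client.runs.create(threadId, "upload_graph", {
    input: { pdfFile, threadId },
  });
  // Wait until the ingestion run finishes before refreshing the document list.
  await client.runs.join(threadId, run.run_id);
}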
Document Management
- List: GET /api/documents?threadId=... → invokes the upload graph with operation: "list" → returns { files, textSources }.
- Delete: DELETE /api/documents?threadId=...&type=file&filename=... or type=text&textId=... → upload graph with operation: "delete".
Key Implementation Details
Per-Thread Document Scoping
Every document row in Supabase includes thread_id in metadata. The retriever receives filterKwargs: { thread_id: threadId }, so each chat only sees its own documents. This mirrors NotebookLM's notebook-scoped behavior.
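A sketch of the thread-scoped retriever, assuming the standard LangChain SupabaseVectorStore setup with the match_documents function described later; the embeddings instance is whichever provider is configured:
import { createClient } from "@supabase/supabase-js";
import { SupabaseVectorStore } from "@langchain/community/vectorstores/supabase";
import type { EmbeddingsInterface } from "@langchain/core/embeddings";

function makeRetriever(embeddings: EmbeddingsInterface, threadId: string) {
  const supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_SERVICE_ROLE_KEY!
  );
  const store = new SupabaseVectorStore(embeddings, {
    client: supabase,
    tableName: "documents",
    queryName: "match_documents",
  });
  // The filter is matched against the metadata column, so only this chat's chunks come back.
  return store.asRetriever({ k: 5, filter: { thread_id: threadId } });
}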
Edit and Retry
When the user edits a prior message, the frontend sends messagesBeforeEdit, the conversation truncated to just before the edited message. The API updates the thread state with those messages, then runs the graph with the new query. The model sees the corrected context and generates a new response.
Stop Generation
The frontend tracks run_id from SSE metadata. On stop, it calls client.runs.cancel(threadId, runId) and aborts the fetch. Partial content is preserved in the UI.
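A sketch of the stop handler (client is the LangGraph SDK client, and the AbortController is assumed to wrap the /api/chat fetch on the frontend):
import type { Client } from "@langchain/langgraph-sdk";

async function stopGeneration(
  client: Client,
  threadId: string,
  runId: string,
  controller: AbortController
): Promise<void> {
  try {
    // Stop token generation on the LangGraph server.
    await client.runs.cancel(threadId, runId);
  } finally {
    // Close the client-side SSE read; partial text already rendered stays in the UI.
    controller.abort();
  }
}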
PDF Processing
PDFs are sent as base64 from the browser. The backend decodes to a buffer, writes to a temp file, uses LangChain's PDFLoader, then deletes the temp file. Each chunk gets filename in metadata for source attribution.
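A sketch of that path; the temp-file naming and helper name are illustrative, while the chunking parameters and metadata keys match the configuration shown earlier:
import { writeFile, unlink } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

async function pdfToChunks(base64: string, filename: string, threadId: string) {
  const tempPath = join(tmpdir(), `${Date.now()}-${filename}`);
  await writeFile(tempPath, Buffer.from(base64, "base64"));
  try {
    const docs = await new PDFLoader(tempPath).load();
    const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1500, chunkOverlap: 200 });
    const chunks = await splitter.splitDocuments(docs);
    // Tag each chunk so retrieval can be scoped per thread and attributed to its file.
    for (const chunk of chunks) {
      chunk.metadata.thread_id = threadId;
      chunk.metadata.filename = filename;
    }
    return chunks;
  } finally {
    await unlink(tempPath); // always remove the temp file, even if parsing fails
  }
}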
Environment Variables
| Variable | Purpose |
|---|---|
| SUPABASE_URL | Supabase project URL |
| SUPABASE_SERVICE_ROLE_KEY | Service role key for server-side access |
| GOOGLE_CLOUD_PROJECT_ID | GCP project for Vertex AI |
| GOOGLE_CLOUD_LOCATION | Region (e.g., asia-south1) |
| USE_OLLAMA | Set to "true" for local Ollama |
| OLLAMA_BASE_URL | Ollama server URL (default localhost:11434) |
| NEXT_PUBLIC_LANGGRAPH_API_URL | LangGraph server URL (default localhost:2024) |
Supabase Schema
The documents table uses pgvector for similarity search:
- id: Primary key
- content: Chunk text
- embedding: Vector (768 dimensions for Vertex AI)
- metadata: JSON with thread_id, filename (for PDFs), source, text_id (for user text)
A match_documents function is used for the retriever query.
Takeaways
- LangGraph provides a clean way to model RAG as a graph: routing, retrieval, and generation as separate nodes with clear state flow.
- Per-thread document scoping via thread_id in metadata keeps chats isolated and avoids cross-contamination.
- Smart routing reduces unnecessary retrieval for general questions and improves latency.
- SSE streaming gives a responsive chat experience with real-time token display.
- Dual model support (Vertex AI vs Ollama) makes it easy to develop locally and deploy to the cloud.
The NotebookLM clone demonstrates a production-ready RAG stack: document upload, chunking, embedding, retrieval, and chat, all orchestrated by LangGraph and delivered through a modern Next.js frontend.