Retrieval Augmented Generation on Notion Docs via LangChain

February 27, 2024

RAG, or Retrieval Augmented Generation, is a prominent AI framework in the era of large language models (LLMs). It enhances the capabilities of these models by integrating external knowledge, ensuring more accurate and current responses. A standard RAG system includes an LLM, a vector database, and prompt templates that assemble the retrieved context into queries sent to the LLM.

[Diagram of a standard RAG architecture. Source: Zilliz]

Purpose and Goals:

  1. Learn How to Vectorise Notion Pages and Databases: We'll delve into the process of vectorising Notion pages and databases, enabling efficient storage and retrieval of information.
  2. Introduction to Retrieval Augmented Generation (RAG) using LangChain: We'll provide an overview of RAG and demonstrate its implementation using LangChain, a powerful tool for integrating external knowledge into AI models.

In this guide, we'll explore three important tools: Notion, Supabase, and OpenAI. Notion holds the external information we are going to ingest. Supabase stores this information as vector representations that serve as context for the LLM. First, we'll set up these tools step by step. Then, we'll learn how to take information from Notion and store it in a vector store. After that, we'll use LangChain to build a knowledge retrieval system that can find the right answers for us. Finally, we'll see how all these pieces work together in a real scenario.

If you're interested in learning more about prompt engineering and its importance in shaping the responses of AI models, check out my previous post: Prompt Engineering for OpenAI Chat Completions. In that beginner's guide, I delve into the significance of crafting well-designed prompts to enhance the accuracy and relevance of AI-generated responses. Understanding prompt engineering can greatly improve your experience with AI models like OpenAI's Chat Completions.

Before diving into the implementation, let's ensure we have all the necessary requirements:

  1. Notion Database: A table database in Notion with appropriate records.
  2. Notion Developers Access: Visit the Notion Developers page and log in with your Notion account.
  3. Notion Integration: Generate an integration token in Notion so the database can be accessed programmatically.
  4. Integration Connection: Connect the integration to the database you want to ingest.
  5. Supabase Setup: Set up a Supabase project and obtain the Supabase URL and service key.
  6. Environment Variables: Set environment variables for the OpenAI API key, Supabase URL, Supabase service key, Notion integration token, and Notion database ID:

import os

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["SUPABASE_URL"] = "your-supabase-url"
os.environ["SUPABASE_SERVICE_KEY"] = "your-supabase-service-key"
os.environ["NOTION_TOKEN"] = "your-notion-integration-token"
os.environ["DATABASE_ID"] = "your-notion-database-id"

Setting Up the Supabase Database

Use these steps to set up your Supabase database if you haven't already.

  1. Head over to https://database.new to provision your Supabase database.
  2. In the studio, jump to the SQL editor and run the following script to enable pgvector and set up your database as a vector store:
-- Enable the pgvector extension to work with embedding vectors
create extension if not exists vector;

-- Create a table to store your documents
create table
  documents (
    id uuid primary key,
    content text, -- corresponds to Document.pageContent
    metadata jsonb, -- corresponds to Document.metadata
    embedding vector (1536) -- 1536 works for OpenAI embeddings, change as needed
  );

-- Create a function to search for documents
create function match_documents (
  query_embedding vector (1536),
  filter jsonb default '{}'
) returns table (
  id uuid,
  content text,
  metadata jsonb,
  similarity float
) language plpgsql as $$
#variable_conflict use_column
begin
  return query
  select
    id,
    content,
    metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where metadata @> filter
  order by documents.embedding <=> query_embedding;
end;
$$;
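
LangChain calls this match_documents function through Supabase's RPC interface, and you can also invoke it directly, which is handy for debugging. A rough sketch, assuming you've already created the Supabase client (as we do later in this guide) and that query_embedding is a 1536-dimensional list of floats from your embedding model:

# Call match_documents directly via Supabase RPC (illustrative only;
# supabase and query_embedding are defined elsewhere)
response = supabase.rpc(
    "match_documents",
    {"query_embedding": query_embedding, "filter": {}},
).execute()

for row in response.data:
    print(row["similarity"], row["content"][:80])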

Loading Notion Documents with NotionDBLoader

Note: The following code only loads documents from a single Notion database. You can also export your entire Notion workspace and load the resulting files with NotionDirectoryLoader; see LangChain's ingesting-your-own-dataset guide for more details.
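
For reference, the directory-based alternative looks roughly like this (a minimal sketch; "Notion_DB" is a placeholder for the folder you get after unzipping a Notion export):

from langchain_community.document_loaders import NotionDirectoryLoader

# Point the loader at the unzipped Notion export folder (placeholder path)
loader = NotionDirectoryLoader("Notion_DB")
docs = loader.load()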

We'll start by loading documents from a Notion database using the NotionDBLoader class. This class retrieves pages from the database, reads their content, and returns a list of Document objects.

from langchain_community.document_loaders import NotionDBLoader

NOTION_TOKEN = os.environ["NOTION_TOKEN"]
DATABASE_ID = os.environ["DATABASE_ID"]

loader = NotionDBLoader(
    integration_token=NOTION_TOKEN,
    database_id=DATABASE_ID,
    request_timeout_sec=30  # Optional, defaults to 10
)

docs = loader.load()
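
Before moving on, it's worth a quick sanity check that the loader actually returned your pages:

# Each Document holds the page content, with Notion properties in metadata
print(f"Loaded {len(docs)} documents")
print(docs[0].metadata)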

Storing Documents with SupabaseVectorStore

Next, we'll store the retrieved documents in Supabase using the SupabaseVectorStore. This component enables efficient storage and retrieval of indexed documents in Supabase.

from langchain_community.vectorstores import SupabaseVectorStore
from langchain_openai import OpenAIEmbeddings
from supabase.client import Client, create_client

SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_SERVICE_KEY = os.environ["SUPABASE_SERVICE_KEY"]

supabase: Client = create_client(SUPABASE_URL, SUPABASE_SERVICE_KEY)
embeddings = OpenAIEmbeddings()

vector_store = SupabaseVectorStore.from_documents(
    docs,
    embeddings,
    client=supabase,
    table_name="documents",
    query_name="match_documents",
    chunk_size=500
)

In the code above, we've created a SupabaseVectorStore from the retrieved documents. The from_documents method takes the following parameters:

  • docs: A list of Document objects to be stored in the vector store.
  • embeddings: An instance of the OpenAIEmbeddings class, which provides methods for generating embeddings from text.
  • client: A Supabase client instance for interacting with the database.
  • table_name: The name of the table in the Supabase database where the documents will be stored.
  • query_name: The name of the function in the database that will be used for document retrieval.
  • chunk_size: The number of documents to be stored in each batch.

With the documents stored, we can turn the vector store into a retriever and run a quick similarity search:

retriever = vector_store.as_retriever()
retriever.get_relevant_documents("NotionFlow")

Output:

[Document(page_content='Project Overview:\n\nNotionFlow is a comprehensive automation tool designed to streamline workflows, enhance productivity, and optimize resource allocation using Notion\'s versatile database and collaboration features.')]

In this example, we've queried the vector store with the input "NotionFlow" and retrieved a relevant document containing information about the project overview.

Performing Retrieval with OpenAI

With the documents stored in Supabase, we can leverage OpenAI's powerful language models for advanced processing tasks. We'll use the ChatOpenAI class to interact with the OpenAI model.

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=1)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Let's break down the RAG pipeline:

  1. Retriever: The retriever component retrieves documents from the Supabase vector store based on the input query.
  2. Prompt: The prompt component provides a structured prompt for the OpenAI language model, guiding it to generate a response based on the retrieved documents.
  3. LLM: The llm component represents the large language model (LLM) from OpenAI, which processes the prompt and generates a response.
  4. Output Parser: The StrOutputParser component parses the output from the LLM and formats it as a string for further processing.

Let's query the RAG pipeline with a sample input and retrieve the generated response.

rag_chain.invoke("What are the main features of NotionFlow?")

Output:

NotionFlow features include automation triggers for project creation, task completion notifications, content idea generation, content calendar updates, goal progress tracking, cross-database relationships, bug report handling, and customer feedback integration. The tool is designed to streamline workflows, enhance productivity, and improve resource allocation through automation and integration with Notion's collaboration features. Its benefits include streamlining project management processes, enhancing collaboration and communication, and enabling data-driven decision-making.
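
Because the chain is a LangChain Runnable, you can also stream the answer token by token instead of waiting for the full response. A small sketch:

# StrOutputParser yields string chunks as the model generates them
for chunk in rag_chain.stream("What are the main features of NotionFlow?"):
    print(chunk, end="", flush=True)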

Conclusion

In this tutorial, we've demonstrated how to build a RAG pipeline using LangChain, OpenAI, and Supabase. By combining the capabilities of these tools, we can query a custom knowledge base and generate responses grounded in the retrieved documents.

Next Steps

The RAG pipeline we've built is a basic example of how to ingest and retrieve documents from a knowledge base. You can further enhance the pipeline by ingesting more documents, fine-tuning the chunk size (one chunking sketch follows below), and experimenting with different prompts to generate more accurate responses. Additionally, you can explore other components and integrations available in LangChain to build more advanced language processing pipelines.
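
As a concrete example of tuning chunking, long Notion pages can be split into smaller, overlapping pieces before they are embedded and stored (a minimal sketch using LangChain's RecursiveCharacterTextSplitter; the sizes are illustrative, not tuned):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split each document into ~1000-character chunks with 100 characters of overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = splitter.split_documents(docs)

# split_docs can then be passed to SupabaseVectorStore.from_documents as before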

Stay Connected

Join me on LinkedIn, where I share insights and updates on AI, Automation, Productivity, and more.

Additionally, if you're interested in learning more about how I'm leveraging AI for simple automations and productivity hacks, subscribe to my newsletter "Growth Journal". Be the first to receive exclusive content and stay up-to-date with the latest trends in AI and automation.

Until next time, happy prompting!