RAG, or Retrieval Augmented Generation, is a prominent AI framework in the era of large language models (LLMs). It enhances the capabilities of these models by integrating external knowledge, ensuring more accurate and current responses. A standard RAG system includes an LLM, a vector database, and a set of prompts used to send queries to the LLM.
Purpose and Goals:
- Learn How to Vectorise Notion Pages and Databases: We'll delve into the process of vectorising Notion pages and databases, enabling efficient storage and retrieval of information.
- Introduction to Retrieval Augmented Generation (RAG) using LangChain: We'll provide an overview of RAG and demonstrate its implementation using LangChain, a powerful framework for integrating external knowledge into AI models.
In this guide, we'll explore three important tools: Notion, Supabase, and OpenAI. Notion holds the external information we're going to ingest, and Supabase stores that information as vector representations to serve as context for the LLM. First, we'll set up these tools step by step. Then, we'll learn how to take information from Notion and store it in a vector store. After that, we'll use LangChain to build a knowledge retrieval system that can find the right answers for us. Finally, we'll see how all these pieces work together in a real scenario.
If you're interested in learning more about prompt engineering and its importance in shaping the responses of AI models, check out my previous post: Prompt Engineering for OpenAI Chat Completions. In this beginner's guide, I delve into the significance of crafting well-designed prompts to enhance the accuracy and relevance of AI-generated responses. Understanding prompt engineering can greatly improve your experience with AI models like OpenAI's Chat Completions.
Before diving into the implementation, let's ensure we have all the necessary requirements:
- Notion Database: A table database in Notion with appropriate records.
- Notion Integration: Visit the Notion Developers page, log in with your Notion account, and generate an integration token to access the database programmatically. Then connect the integration to the database.
- Supabase Setup: Set up a Supabase project and obtain the Supabase URL and service key.
- Environment Variables: Set environment variables for the OpenAI API key, the Supabase URL and service key, and the Notion integration token and database ID, as shown below.
import os

# Replace the placeholder values with your own credentials
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["SUPABASE_URL"] = "your-supabase-url"
os.environ["SUPABASE_SERVICE_KEY"] = "your-supabase-service-key"
os.environ["NOTION_TOKEN"] = "your-notion-integration-token"
os.environ["DATABASE_ID"] = "your-notion-database-id"
Set Up the Supabase Database
Use these steps to set up your Supabase database if you haven't already.
- Head over to https://database.new to provision your Supabase database.
- In the studio, jump to the SQL editor and run the following script to enable pgvector and set up your database as a vector store:
-- Enable the pgvector extension to work with embedding vectors
create extension if not exists vector;

-- Create a table to store your documents
create table documents (
  id uuid primary key,
  content text, -- corresponds to Document.pageContent
  metadata jsonb, -- corresponds to Document.metadata
  embedding vector(1536) -- 1536 works for OpenAI embeddings, change as needed
);

-- Create a function to search for documents
create function match_documents (
  query_embedding vector(1536),
  filter jsonb default '{}'
) returns table (
  id uuid,
  content text,
  metadata jsonb,
  similarity float
) language plpgsql as $$
#variable_conflict use_column
begin
  return query
  select
    id,
    content,
    metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where metadata @> filter
  order by documents.embedding <=> query_embedding;
end;
$$;
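If you later want to sanity-check this function from Python, you can call it directly through Supabase's RPC interface. Here's a minimal sketch, assuming the supabase client and embeddings objects we create further down in this guide:
# Minimal sanity check: call match_documents directly via RPC
# (assumes the `supabase` client and `embeddings` objects created later)
query_embedding = embeddings.embed_query("NotionFlow")
response = supabase.rpc(
    "match_documents",
    {"query_embedding": query_embedding, "filter": {}},
).execute()
print(response.data)  # list of matching rows with similarity values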
Loading Notion Documents with NotionDBLoader
Note: The following code only loads documents from a single Notion database. You can export your entire Notion workspace and load the documents using NotionDirectoryLoader; see the ingesting-your-own-dataset guide from LangChain for more details.
We'll start by loading documents from a Notion database using the NotionDBLoader class. This class retrieves pages from the database, reads their content, and returns a list of Document objects.
from langchain_community.document_loaders import NotionDBLoader
NOTION_TOKEN = os.environ["NOTION_TOKEN"]
DATABASE_ID = os.environ["DATABASE_ID"]
loader = NotionDBLoader(
integration_token=NOTION_TOKEN,
database_id=DATABASE_ID,
request_timeout_sec=30 # Optional, defaults to 10
)
docs = loader.load()
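As a quick, purely illustrative sanity check, you can inspect what came back before embedding anything:
# Quick look at what NotionDBLoader returned (illustrative only)
print(f"Loaded {len(docs)} documents")
for doc in docs[:3]:
    print(doc.metadata, "->", doc.page_content[:80])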
Storing Documents with SupabaseVectorStore
Next, we'll store the retrieved documents in Supabase using the SupabaseVectorStore. This component enables efficient storage and retrieval of indexed documents in Supabase.
from langchain_community.vectorstores import SupabaseVectorStore
from langchain_openai import OpenAIEmbeddings
from supabase.client import Client, create_client

SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_SERVICE_KEY = os.environ["SUPABASE_SERVICE_KEY"]

supabase: Client = create_client(SUPABASE_URL, SUPABASE_SERVICE_KEY)
embeddings = OpenAIEmbeddings()
vector_store = SupabaseVectorStore.from_documents(
docs,
embeddings,
client=supabase,
table_name="documents",
query_name="match_documents",
chunk_size=500
)
In the code above, we've created a SupabaseVectorStore from the retrieved documents. The from_documents method takes the following parameters:
- docs: A list of Document objects to be stored in the vector store.
- embeddings: An instance of the OpenAIEmbeddings class, which provides methods for generating embeddings from text.
- client: A Supabase client instance for interacting with the database.
- table_name: The name of the table in the Supabase database where the documents will be stored.
- query_name: The name of the function in the database that will be used for document retrieval.
- chunk_size: The number of documents to be stored in each batch.
We can now wrap the vector store in a retriever and test it with a sample query:
retriever = vector_store.as_retriever()
retriever.get_relevant_documents("NotionFlow")
Output:
[Document(page_content='Project Overview:\n\nNotionFlow is a comprehensive automation tool designed to streamline workflows, enhance productivity, and optimize resource allocation using Notion\'s versatile database and collaboration features.')]
In this example, we've queried the vector store with the input "NotionFlow" and retrieved a relevant document containing information about the project overview.
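If you'd also like to see how strongly each document matched, LangChain vector stores expose a scored variant of the search. A minimal sketch:
# Optional: retrieve documents together with relevance scores (higher is closer)
results = vector_store.similarity_search_with_relevance_scores("NotionFlow", k=3)
for doc, score in results:
    print(f"{score:.3f} {doc.page_content[:60]}")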
Performing Retrieval with OpenAI
With the documents stored in Supabase, we can leverage OpenAI's powerful language models for advanced processing tasks. We'll use the ChatOpenAI class to interact with the OpenAI model.
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain import hub
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=1)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
Let's break down the RAG pipeline:
- Retriever: The retriever component retrieves documents from the Supabase vector store based on the input query.
- Prompt: The prompt component provides a structured prompt for the OpenAI language model, guiding it to generate a response based on the retrieved documents.
- LLM: The llm component represents the large language model (LLM) from OpenAI, which processes the prompt and generates a response.
- Output Parser: The StrOutputParser component parses the output from the LLM and formats it as a string for further processing.
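If you're curious what the rlm/rag-prompt template looks like, you can print it; the exact wording may differ between hub versions:
# Peek at the prompt pulled from the LangChain Hub
# (structure and wording may vary by version)
print(prompt.messages[0].prompt.template)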
Let's query the RAG pipeline with a sample input and retrieve the generated response.
rag_chain.invoke("What are the main features of NotionFlow?")
Output:
NotionFlow features include automation triggers for project creation, task completion notifications, content idea generation, content calendar updates, goal progress tracking, cross-database relationships, bug report handling, and customer feedback integration. The tool is designed to streamline workflows, enhance productivity, and improve resource allocation through automation and integration with Notion's collaboration features. Its benefits include streamlining project management processes, enhancing collaboration and communication, and enabling data-driven decision-making.
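As an optional extension, adapted from LangChain's RAG examples rather than anything specific to this setup, you can restructure the chain to return the source documents alongside the answer, so you can see which Notion pages informed the response:
from langchain_core.runnables import RunnableParallel

# Run the prompt/LLM stage on documents that were already retrieved
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=lambda x: format_docs(x["context"]))
    | prompt
    | llm
    | StrOutputParser()
)

# Return both the generated answer and the documents that informed it
rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

result = rag_chain_with_source.invoke("What are the main features of NotionFlow?")
print(result["answer"])   # generated response
print(result["context"])  # the retrieved Document objects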
Conclusion
In this tutorial, we've demonstrated how to build a RAG pipeline using LangChain, OpenAI, and Supabase. By combining the capabilities of these tools, we can query a custom knowledge base and generate responses based on the retrieved documents.
Resources
- LangChain Documentation
- Creating a Supabase vector store
- self-query-supabase
- NotionDirectoryLoader
- NotionDBLoader
Next Steps
The RAG pipeline we've built is a basic example of how to ingest and retrieve documents from a knowledge base. You can further enhance the pipeline by ingesting more documents, fine-tuning the chunk size, and experimenting with different prompts to generate more accurate responses (see the sketch below). Additionally, you can explore other components and integrations available in LangChain to build more advanced language processing pipelines.
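For instance, if your Notion pages are long, you could split them into smaller chunks before embedding. Here's a minimal sketch using LangChain's RecursiveCharacterTextSplitter; the chunk sizes shown are illustrative, not tuned values:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split long Notion pages into smaller overlapping chunks before embedding
# (chunk_size and chunk_overlap here are illustrative, not tuned)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = splitter.split_documents(docs)

vector_store = SupabaseVectorStore.from_documents(
    splits,
    embeddings,
    client=supabase,
    table_name="documents",
    query_name="match_documents",
)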
Stay Connected
Join me on LinkedIn, where I share insights and updates on AI, Automation, Productivity, and more.
Additionally, if you're interested in learning more about how I'm leveraging AI for simple automations and productivity hacks, subscribe to my newsletter "Growth Journal". Be the first to receive exclusive content and stay up-to-date with the latest trends in AI and automation.
Until next time, happy prompting!