Retrieval Augmentation for GPT-4 using Pinecone
Fixing LLMs that Hallucinate
In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone, and pass these to a GPT-4 model to generate answers backed by real data sources.
GPT-4 is a big step up from previous OpenAI completion models. It is also available exclusively through the ChatCompletion endpoint, so we must use it slightly differently than usual. However, the power of the model makes the change worthwhile, particularly when augmented with an external knowledge base like the Pinecone vector database.
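As a quick illustration of the difference (a minimal sketch, assuming the pre-v1 openai Python client and a placeholder API key), the ChatCompletion endpoint takes a list of role-tagged messages rather than a single prompt string:
import openai

openai.api_key = "YOUR_API_KEY"  # hypothetical placeholder

# ChatCompletion expects a list of {"role", "content"} messages,
# not the single prompt string used by the older Completion endpoint
res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(res['choices'][0]['message']['content'])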
Required installs for this notebook are:
!pip install -qU bs4 tiktoken openai langchain "pinecone-client[grpc]"
Preparing the Data
In this example, we will download the LangChain docs from python.langchain.com. We get all .html files located on the site like so:
!wget -r -A.html -P rtdocs https://python.langchain.com/en/latest/
This downloads all HTML into the rtdocs directory. Now we can use LangChain itself to process these docs. We do this using the ReadTheDocsLoader like so:
from langchain.document_loaders import ReadTheDocsLoader  # loader for ReadTheDocs HTML files
loader = ReadTheDocsLoader('rtdocs')
docs = loader.load()
len(docs)
This leaves us with hundreds of processed doc pages. Let's take a look at the format each one contains:
docs[0]
We access the plaintext page content like so:
print(docs[0].page_content)
print(docs[5].page_content)
We can also find the source of each document:
docs[5].metadata['source'].replace('rtdocs/', 'https://')
We can use these to create our `data` list:
data = []

for doc in docs:
    data.append({
        'url': doc.metadata['source'].replace('rtdocs/', 'https://'),
        'text': doc.page_content
    })
data[3]
It’s pretty ugly, but it’s good enough for now. Let’s see how we can process all of these documents. We will chunk everything into ~400-token chunks, which we can do easily with langchain and tiktoken:
import tiktoken
tokenizer = tiktoken.get_encoding('p50k_base')
# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)
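With the length function defined, we can split each document into chunks. A minimal sketch, assuming LangChain's RecursiveCharacterTextSplitter, with the chunk size and overlap values chosen here for illustration:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split on paragraphs first, then lines, then words, falling back to characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,              # ~400 tokens per chunk, measured by tiktoken_len
    chunk_overlap=20,            # small overlap so context isn't lost at boundaries
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)

chunks = text_splitter.split_text(data[3]['text'])
len(chunks)
The overlap means neighboring chunks share a few tokens, which helps when a relevant passage straddles a chunk boundary.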