Retrieval Augmentation for GPT-4 using Pinecone
Fixing LLMs that Hallucinate
In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone, and pass these to a GPT-4 model to generate answers backed by real data sources.
GPT-4 is a big step up from previous OpenAI completion models. It is also available exclusively through the ChatCompletion endpoint, so we must use it slightly differently than usual. However, the power of the model makes the change worthwhile, particularly when augmented with an external knowledge base like the Pinecone vector database.
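As a quick illustration of the difference (a minimal sketch, assuming the pre-v1 openai Python client and a placeholder API key), the ChatCompletion endpoint takes a list of role-tagged messages rather than a single prompt string:
import openai

openai.api_key = "YOUR_API_KEY"  # hypothetical placeholder

# ChatCompletion expects a list of {"role", "content"} messages,
# not the single prompt string used by the older Completion endpoint
res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(res['choices'][0]['message']['content'])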
Required installs for this notebook are:
!pip install -qU bs4 tiktoken openai langchain "pinecone-client[grpc]"
Preparing the Data
In this example, we will download the LangChain docs from python.langchain.com. We get all .html files located on the site like so:
!wget -r -A.html -P rtdocs https://python.langchain.com/en/latest/
This downloads all HTML into the rtdocs directory. Now we can use LangChain itself to process these docs. We do this using the ReadTheDocsLoader like so:
from langchain.document_loaders import ReadTheDocsLoader  # loader for ReadTheDocs HTML files
loader = ReadTheDocsLoader('rtdocs')
docs = loader.load()
len(docs)
This leaves us with hundreds of processed doc pages. Let's take a look at the format each one contains:
docs[0]
We access the plaintext page content like so:
print(docs[0].page_content)
print(docs[5].page_content)
We can also find the source of each document:
docs[5].metadata['source'].replace('rtdocs/', 'https://')
We can use these to create our `data` list:
data = []

for doc in docs:
    data.append({
        'url': doc.metadata['source'].replace('rtdocs/', 'https://'),
        'text': doc.page_content
    })
data[3]
It’s pretty ugly, but it’s good enough for now. Let’s see how we can process all of these documents. We will chunk everything into ~400-token chunks, which we can do easily with langchain and tiktoken:
import tiktoken
tokenizer = tiktoken.get_encoding('p50k_base')
# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)
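With the length function defined, we can split each document into chunks. A minimal sketch, assuming LangChain's RecursiveCharacterTextSplitter, with the chunk size and overlap values chosen here for illustration:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split on paragraphs first, then lines, then words, falling back to characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,              # ~400 tokens per chunk, measured by tiktoken_len
    chunk_overlap=20,            # small overlap so context isn't lost at boundaries
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)

chunks = text_splitter.split_text(data[3]['text'])
len(chunks)
The overlap means neighboring chunks share a few tokens, which helps when a relevant passage straddles a chunk boundary.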