InternLM (书生) LLM: RAG in Practice

RAG in Practice

Lv1: LlamaIndex + InternLM2 RAG in Practice

1. Introduction to RAG

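RAG (Retrieval-Augmented Generation) combines information retrieval with text generation: it retrieves relevant passages from an external knowledge base and passes them to the generative model, so the model can answer from material it was not trained on.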
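To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is not how LlamaIndex works internally: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the sample documents, `retrieve`, and `build_prompt` are hypothetical names chosen for illustration. The final prompt is what would be handed to the LLM.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context):
    """Put the retrieved passages in front of the question."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


# Hypothetical two-document knowledge base.
documents = [
    "LlamaIndex builds a vector index over local documents.",
    "Streamlit renders a chat interface in the browser.",
]
question = "What does LlamaIndex build?"
context = retrieve(question, documents)
prompt = build_prompt(question, context)
print(prompt)
```

In the rest of this walkthrough, the Sentence Transformer model plays the role of `embed`, `VectorStoreIndex` plays the role of `retrieve`, and InternLM2 consumes the assembled prompt.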

1.1 RAG optimization methods

2. Setting up the environment

2.1 Base dependencies in the Python virtual environment

conda activate llamaindex
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia

pip install einops==0.7.0 protobuf==5.26.1

2.2 Install LlamaIndex and related packages

conda activate llamaindex
pip install llama-index==0.10.38 llama-index-llms-huggingface==0.2.0 "transformers[torch]==4.41.1" "huggingface_hub[inference]==0.23.1" huggingface_hub==0.23.1 sentence-transformers==2.7.0 sentencepiece==0.2.0
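Because the packages above are pinned to exact versions, it can help to confirm what actually ended up in the environment. The following is a small sketch using only the standard library; the `PINS` table simply repeats the versions from the install commands in 2.1 and 2.2, and the helper names are hypothetical. PyTorch is left out because conda builds may report a version string with a local suffix.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pip commands in sections 2.1 and 2.2.
PINS = {
    "llama-index": "0.10.38",
    "llama-index-llms-huggingface": "0.2.0",
    "transformers": "4.41.1",
    "huggingface_hub": "0.23.1",
    "sentence-transformers": "2.7.0",
    "sentencepiece": "0.2.0",
    "einops": "0.7.0",
    "protobuf": "5.26.1",
}


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def check_pins(pins):
    """Return {package: (expected, installed)} for every package that differs."""
    mismatches = {}
    for name, expected in pins.items():
        found = installed_version(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Print one line per mismatch; no output means every pin matches.
for name, (expected, found) in check_pins(PINS).items():
    print(f"{name}: expected {expected}, found {found}")
```

Run it inside the activated llamaindex environment; a `found None` entry means the package is not installed there.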

2.3 Download the Sentence Transformer model

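We use the open-source embedding model Sentence Transformer (any other open-source embedding model can be used for the embedding step instead). Run the following commands to create a new Python file: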

cd ~
mkdir llamaindex_demo
mkdir model
cd ~/llamaindex_demo
touch download_hf.py

Open download_hf.py and paste in the following code:

import os

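# Set the environment variable so the download goes through the HF mirror
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# Download the model into /root/model/sentence-transformer
os.system('huggingface-cli download --resume-download sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 --local-dir /root/model/sentence-transformer')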

Then run the script from the /root/llamaindex_demo directory to start the download automatically:

cd /root/llamaindex_demo
conda activate llamaindex
python download_hf.py

For more on using the mirror, see HF Mirror.
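One caveat with download_hf.py is that `os.system` only returns an exit status and does not raise, so a failed download can go unnoticed. The variant below is a sketch of the same command run through `subprocess.run` with `check=True`. It uses the same CLI flags and mirror URL as the script above; the function names are hypothetical, and nothing is downloaded until `download(...)` is called.

```python
import os
import subprocess


def build_download_command(repo_id, local_dir):
    """Build the same huggingface-cli invocation used in download_hf.py."""
    return [
        "huggingface-cli", "download", "--resume-download",
        repo_id, "--local-dir", local_dir,
    ]


def download(repo_id, local_dir, endpoint="https://hf-mirror.com"):
    """Run the download through the mirror; raises CalledProcessError on failure."""
    env = dict(os.environ, HF_ENDPOINT=endpoint)  # copy, so the parent env is untouched
    subprocess.run(build_download_command(repo_id, local_dir), env=env, check=True)


# Example call (commented out so that importing this file has no side effects):
# download(
#     "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
#     "/root/model/sentence-transformer",
# )
```

Passing the arguments as a list also avoids shell quoting issues if a path ever contains spaces.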

2.4 Download NLTK resources

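Building embeddings with an open-source embedding model requires some resources from the third-party library nltk. Normally nltk downloads them from the internet automatically, but the download may be interrupted by network problems. Instead, we can fetch the resources from a mirror repository hosted in China and save them on the server. The following commands download the nltk resources and unpack them on the server: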

cd /root
git clone https://gitee.com/yzy0612/nltk_data.git  --branch gh-pages
cd nltk_data
mv packages/*  ./
cd tokenizers
unzip punkt.zip
cd ../taggers
unzip averaged_perceptron_tagger.zip
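After the commands above, the two unpacked resources should sit under /root/nltk_data, which is among the locations nltk searches by default when running as root. The sketch below checks for the expected directories with `pathlib` only. It assumes each zip unpacks into a directory named after the resource, and `missing_nltk_resources` is a hypothetical helper name.

```python
from pathlib import Path

# Directories the unzip steps are expected to produce, relative to the data root.
EXPECTED = ("tokenizers/punkt", "taggers/averaged_perceptron_tagger")


def missing_nltk_resources(root="/root/nltk_data"):
    """Return the expected resource directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_dir()]


# An empty list means both resources are in place.
print(missing_nltk_resources())
```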

2.5 Install the embedding dependencies

conda activate llamaindex
pip install llama-index-embeddings-huggingface==0.2.0 llama-index-embeddings-instructor==0.1.3

2.6 Prepare the knowledge base

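The knowledge base is simply the set of files you want to retrieve from. The code in 2.7 reads them from /root/llamaindex_demo/data.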
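As a sketch of this step, the helper below creates the data directory if needed and copies your source files into it. The default directory matches the path read by the code in 2.7; the function name and the example file path are hypothetical.

```python
import shutil
from pathlib import Path


def prepare_knowledge_base(sources, data_dir="/root/llamaindex_demo/data"):
    """Copy each source file into data_dir and return the copied paths."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    copied = []
    for src in map(Path, sources):
        target = data_dir / src.name
        shutil.copy2(src, target)  # keeps timestamps; overwrites an existing copy
        copied.append(target)
    return copied


# Example call (hypothetical file path, commented out to avoid side effects):
# prepare_knowledge_base(["/root/docs/README_zh-CN.md"])
```

Files with the same name overwrite each other in the target directory, so rename them first if your sources come from several folders.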

2.7 Load the models and write the application code

For details, refer to

import streamlit as st
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# Set the page title and icon of the Streamlit app
st.set_page_config(page_title="llama_index_demo", page_icon="🦜🔗")

# Show the title on the page
st.title("llama_index_demo")

# Model initialization, cached by Streamlit so the models are not reloaded on every interaction
@st.cache_resource
def init_models():
    # Load the embedding model: the HuggingFace sentence-transformer downloaded in 2.3
    embed_model = HuggingFaceEmbedding(
        model_name="/root/model/sentence-transformer"  # path to the embedding model
    )
    Settings.embed_model = embed_model  # set the global embedding model
    
    # Load the language model (LLM) and its matching tokenizer from local paths
    llm = HuggingFaceLLM(
        model_name="/root/model/internlm2-chat-1_8b",  # path to the generation model
        tokenizer_name="/root/model/internlm2-chat-1_8b",  # path to the tokenizer
        model_kwargs={"trust_remote_code": True},  # trust the custom code shipped with the model
        tokenizer_kwargs={"trust_remote_code": True}  # same for the tokenizer
    )
    Settings.llm = llm  # set the global language model

    # Read every file in the data directory into a list of document objects
    documents = SimpleDirectoryReader("/root/llamaindex_demo/data").load_data()
    
    # Embed the documents and build the vector index used for retrieval
    index = VectorStoreIndex.from_documents(documents)
    
    # Wrap the index in a query engine for later queries
    query_engine = index.as_query_engine()

    # Return the query engine
    return query_engine

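# Initialize the query engine once per session if it is not in session_state yet
if 'query_engine' not in st.session_state:
    st.session_state['query_engine'] = init_models()

# Question-answering function: produce a reply for the given question
def greet2(question):
    # Ask the query engine and return the answer as plain text,
    # so the chat history stores strings rather than response objects
    response = st.session_state['query_engine'].query(question)
    return str(response)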

# Initialize the chat history with a welcome message on first load
if "messages" not in st.session_state.keys():
    st.session_state.messages = [{"role": "assistant", "content": "Hello, I'm your assistant. How can I help you?"}]    

# Render the chat history stored in session_state
for message in st.session_state.messages:
    # Show a chat bubble for each message; the role is "user" or "assistant"
    with st.chat_message(message["role"]):
        st.write(message["content"])  # show the message text

# Clear the chat history
def clear_chat_history():
    # Reset the history to the initial welcome message
    st.session_state.messages = [{"role": "assistant", "content": "Hello, I'm your assistant. How can I help you?"}]

# Sidebar button that clears the chat history when clicked
st.sidebar.button('Clear Chat History', on_click=clear_chat_history)

# Generate a reply by delegating to the question-answering function
def generate_llama_index_response(prompt_input):
    return greet2(prompt_input)  # answer generated from the user's input

# Check whether the user has entered a new question
if prompt := st.chat_input():
    # Append the user's input to the message list in session_state
    st.session_state.messages.append({"role": "user", "content": prompt})
    
    # Show the user's message
    with st.chat_message("user"):
        st.write(prompt)

# If the last message is not from the assistant, generate the assistant's reply
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        # Show a spinner while the assistant is "thinking"
        with st.spinner("Thinking..."):
            # Generate the answer from the user's input
            response = generate_llama_index_response(prompt)
            
            # Create a placeholder that will hold the generated reply
            placeholder = st.empty()
            
            # Render the reply on the page as Markdown
            placeholder.markdown(response)
    
    # Append the assistant's reply to the message list
    message = {"role": "assistant", "content": response}
    st.session_state.messages.append(message)
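The chat flow in the script (append the user message, then reply only if the last message is not from the assistant) can be hard to follow when mixed with UI calls. The sketch below isolates that logic in plain Python. `fake_query` is a stub standing in for the query engine, and `needs_reply` and `chat_turn` are hypothetical names that do not appear in the app itself.

```python
def needs_reply(messages):
    """True when the most recent message has not been answered yet."""
    return messages[-1]["role"] != "assistant"


def chat_turn(messages, prompt, query):
    """Record the user's prompt and, if needed, the assistant's answer."""
    messages.append({"role": "user", "content": prompt})
    if needs_reply(messages):
        response = query(prompt)
        # Store plain text so the history can be re-rendered on every rerun.
        messages.append({"role": "assistant", "content": str(response)})
    return messages


def fake_query(question):
    """Stub for query_engine.query; a real engine would retrieve and generate."""
    return f"(stub answer to: {question})"


# The history starts with the welcome message, as in the app.
history = [{"role": "assistant", "content": "Hello, how can I help?"}]
chat_turn(history, "What is RAG?", fake_query)
for message in history:
    print(f'{message["role"]}: {message["content"]}')
```

Because the history always ends with an assistant message after each turn, the reply branch in the app only fires right after a new user input, which is also why `prompt` is defined whenever that branch runs.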