
【NVIDIA NIM Demo】Progressive Document Distillation

Today's chat models are trained on enormous corpora. If a question falls outside that training data (for example, because the data has gone stale), retraining on an updated corpus is rarely practical. A common workaround is to supply the relevant material as context alongside the question. But what if that context is long? If the question concerns a full-length paper, feeding the entire text to the model in one shot is also impractical.

Instead, we can split the document into chunks and feed them to the model one at a time. For each chunk, the model extracts the new information and merges it into a running knowledge base. Once every chunk has been processed, the knowledge base captures the document's key content and can serve as context for later questions, while being far smaller than the original text. The loop looks like the sketch below.
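Here is a minimal sketch of that loop; extract_update is a hypothetical stand-in for the LLM-backed extraction that Step 2 actually builds.

# Minimal sketch of progressive distillation (extract_update is hypothetical;
# Step 2 implements it with an LLM plus a structured output parser)
knowledge_base = {}  # running summary, main ideas, open questions, ...
for chunk in chunks:
    knowledge_base = extract_update(knowledge_base, chunk)
# knowledge_base now condenses the entire document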

Step 0: Module Imports & Environment

Below are the basic environment setup and dependencies.

# pip install -qq langchain langchain-nvidia-ai-endpoints gradio
# pip install -qq arxiv pymupdf

# import os
# os.environ["NVIDIA_API_KEY"] = "nvapi-..."
from functools import partial
from rich.console import Console
from rich.style import Style
from rich.theme import Theme
from langchain_core.runnables import RunnableLambda

console = Console()
base_style = Style(color="#76B900", bold=True)
pprint = partial(console.print, style=base_style)

def RPrint(preface="State: "):
    def print_and_return(x, preface=""):
        print(f"{preface}{x}")
        return x
    return RunnableLambda(partial(print_and_return, preface=preface))

def PPrint(preface="State: "):
    def print_and_return(x, preface=""):
        pprint(preface, x)
        return x
    return RunnableLambda(partial(print_and_return, preface=preface))
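
# Example (not from the original post): splice a tap into a chain to watch
# intermediate values flow through it
debug_chain = RunnableLambda(lambda s: s.upper()) | RPrint("After upper: ")
debug_chain.invoke("hello")  # prints "After upper: HELLO" and returns it unchanged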
# Document loading and splitting
from langchain.document_loaders import UnstructuredFileLoader
from langchain.document_loaders import ArxivLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Runnable-related imports
from langchain_core.runnables import RunnableLambda
from langchain_core.runnables.passthrough import RunnableAssign
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.output_parsers import PydanticOutputParser

# Model
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Knowledge base schema
from langchain_core.pydantic_v1 import BaseModel, Field

# Miscellaneous
from typing import List
from IPython.display import clear_output

Step 1: Document Splitting

We use [2404.16130] From Local to Global: A Graph RAG Approach to Query-Focused Summarization (arxiv.org) as the running example.

Split the paper into chunks:

documents = ArxivLoader(query="2404.16130").load()  ## GraphRAG

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)

docs_split = text_splitter.split_documents(documents)
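
As a quick sanity check, you can count the chunks and preview one (page_content is the standard LangChain Document field):

print(f"Split into {len(docs_split)} chunks")
print(docs_split[0].page_content[:200])  # preview the first chunk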

Step 2: Knowledge Base Updates

Now we can update the knowledge base chunk by chunk; by the end, the document's content has been folded into it.

Knowledge base template

First we define a knowledge base template to hold the distilled essence of the paper, where:

  • running_summary: a running summary of the document content seen so far
  • main_ideas: the document's core ideas or main points
  • loose_ends: open questions the document has not yet resolved or made explicit

class DocumentSummaryBase(BaseModel):
    running_summary: str = Field("", description="Running description of the document. Do not override; only update!")
    main_ideas: List[str] = Field([], description="Most important information from the document (max 3)")
    loose_ends: List[str] = Field([], description="Open questions that would be good to incorporate into summary, but that are yet unknown (max 3)")
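
The {format_instructions} placeholder in the prompt below is generated from this schema; you can preview exactly what the parser will inject:

parser = PydanticOutputParser(pydantic_object=DocumentSummaryBase)
print(parser.get_format_instructions())  # JSON schema plus formatting rules sent to the LLM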

Knowledge base update prompt

The prompt template used to update the knowledge base:

summary_prompt = ChatPromptTemplate.from_template(
    "You are generating a running summary of the document. Make it readable by a technical user."
    " After this, the old knowledge base will be replaced by the new one. Make sure a reader can still understand everything."
    " Keep it short, but as dense and useful as possible! The information should flow from chunk to (loose ends or main ideas) to running_summary."
    " The updated knowledge base keep all of the information from running_summary here: {info_base}."
    "\n\n{format_instructions}. Follow the format precisely, including quotations and commas"
    "\n\nWithout losing any of the info, update the knowledge base with the following: {input}"
)

Extractor

This is the same extractor used in the earlier demo: it returns a Runnable that extracts information from new input, merges it into the knowledge base, and returns the updated knowledge base.

def RExtract(pydantic_class, llm, prompt):
    '''
    Runnable Extraction module
    Returns a knowledge dictionary populated by slot-filling extraction
    '''
    parser = PydanticOutputParser(pydantic_object=pydantic_class)
    instruct_merge = RunnableAssign({'format_instructions' : lambda x: parser.get_format_instructions()})
    def preparse(string):
        if '{' not in string: string = '{' + string
        if '}' not in string: string = string + '}'
        # Strip escape artifacts the LLM sometimes emits, so the JSON parses cleanly
        string = (string
            .replace("\\_", "_")
            .replace("\n", " ")
            .replace("\\]", "]")  # "\]" was an invalid escape sequence; match the literal pair
            .replace("\\[", "[")
        )
        # print(string)  ## Good for diagnostics
        return string
    return instruct_merge | prompt | llm | preparse | parser
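
To exercise RExtract on a single chunk in isolation (a sketch; it assumes the instruct_llm defined under "Model" below):

extractor = RExtract(DocumentSummaryBase, instruct_llm, summary_prompt)
state = {"info_base": DocumentSummaryBase(), "input": docs_split[0].page_content}
updated_kb = extractor.invoke(state)  # a DocumentSummaryBase parsed from the LLM output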

Looped updates

Next we update the knowledge base in a loop, carrying it between iterations in a state dict.

latest_summary = ""

def RSummarizer(knowledge, llm, prompt, verbose=False):
    '''
    Create a chain that summarizes
    '''
    def summarize_docs(docs):
        # Build the parse chain: RExtract rewrites state['info_base'] in place
        parse_chain = RunnableAssign({'info_base' : RExtract(knowledge.__class__, llm, prompt)})
        # Initialize the state with the empty knowledge base
        state = {"info_base": knowledge}

        global latest_summary  # If your loop crashes, you can check out the latest_summary

        for i, doc in enumerate(docs):
            # Put the current chunk's text into the state
            state['input'] = doc.page_content
            # Update the knowledge base
            state = parse_chain.invoke(state)

            assert 'info_base' in state

            # Keep the latest result even when not verbose, so it survives a crash
            latest_summary = state['info_base']

            # Print the knowledge base after each chunk
            if verbose:
                print(f"Considered {i+1} documents")
                pprint(state['info_base'])
                clear_output(wait=True)

        return state['info_base']
    return RunnableLambda(summarize_docs)

# Model
# instruct_model = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1").bind(max_tokens=4096)
instruct_model = ChatNVIDIA(model="mistralai/mixtral-8x22b-instruct-v0.1").bind(max_tokens=4096)
instruct_llm = instruct_model | StrOutputParser()
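
# Optional smoke test to confirm the endpoint responds before the full loop
print(instruct_llm.invoke("Reply with one short sentence."))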


summarizer = RSummarizer(DocumentSummaryBase(), instruct_llm, summary_prompt, verbose=True)
summary = summarizer.invoke(docs_split[:15])  # feed only the first 15 chunks
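
When the loop finishes, summary is a DocumentSummaryBase instance, so its fields can be read directly:

pprint(summary.running_summary)
pprint(summary.main_ideas)
pprint(summary.loose_ends)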

Original article: https://blog.csdn.net/m0_46296905/article/details/142795254
