
Example chunk (without added context):

For more information, please refer to
[the documentation of vllm](https://docs.vllm.ai/en/stable/).
Now, you can have fun with Qwen2.5 models.
This is a good example of a chunk that benefits from information elsewhere in the document. On its own, it carries relatively little information. Now compare it to the same chunk with added context:

Example chunk (with added context):
For more information, please refer to
[the documentation of vllm](https://docs.vllm.ai/en/stable/).
Now, you can have fun with Qwen2.5 models.
The chunk is situated at the end of the document, following the section on
deploying Qwen2.5 models with vLLM, and serves as a concluding remark
encouraging users to explore the capabilities of Qwen2.5 models.
You can imagine that when the model receives this chunk, it has a much better understanding of where the text fits and can give more accurate answers. Let's build the pipeline that creates these chunks.
Contextual Retrieval (introduced by Anthropic) addresses a common problem in traditional Retrieval Augmented Generation (RAG) systems: individual text chunks often lack enough context to be retrieved and understood accurately.

Contextual Retrieval augments each chunk with a short, chunk-specific explanatory context before it is embedded or indexed. This preserves the relationship between a chunk and its wider document, which significantly improves the system's ability to retrieve and use the most relevant information.

In Anthropic's experiments, contextual embeddings noticeably reduced retrieval failure rates, with further gains when combined with contextual BM25 and reranking. These improvements highlight the potential of Contextual Retrieval to make AI-powered question answering systems more accurate and context-aware.

We'll use two example documents to demonstrate how Contextual Retrieval improves a question answering system. Our system will split the documents into chunks, generate a short situating context for each chunk, embed and store everything in SQLite, and then retrieve the most relevant chunks to answer questions with an LLM.
First, let's install the necessary libraries:
pip install -Uqqq pip --progress-bar off
pip install -qqq fastembed==0.3.6 --progress-bar off
pip install -qqq sqlite-vec==0.1.2 --progress-bar off
pip install -qqq groq==0.11.0 --progress-bar off
pip install -qqq langchain-text-splitters==0.3.0 --progress-bar off
Now, let's import the required modules:
import sqlite3
from textwrap import dedent
from typing import List
import sqlite_vec
from fastembed import TextEmbedding
from google.colab import userdata
from groq import Groq
from groq.types.chat import ChatCompletionMessage
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sqlite_vec import serialize_float32
from tqdm import tqdm
We'll use Llama 3.1 through the Groq API. First, let's set up the client:
client = Groq(api_key=userdata.get("GROQ_API_KEY"))
MODEL = "llama-3.1-70b-versatile"
TEMPERATURE = 0
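The `google.colab.userdata` helper only exists inside Colab. If you're running this elsewhere, here is a minimal sketch of an alternative, assuming the key is exported as the `GROQ_API_KEY` environment variable:

```python
import os

from groq import Groq

# Outside Colab: read the key from an environment variable instead of google.colab.userdata.
client = Groq(api_key=os.environ["GROQ_API_KEY"])
```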
Next, we'll create a helper function to interact with the model. It takes a prompt and an optional message history:
def call_model(prompt: str, messages=None) -> str:
    # Start from a fresh list when no history is passed (avoids a mutable default argument).
    messages = list(messages or [])
    messages.append({
        "role": "user",
        "content": prompt,
    })
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=TEMPERATURE,
    )
    return response.choices[0].message.content
This function sends the prompt to the model and returns the model's response. You can also pass a message history to maintain the context of a conversation.
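As a quick sanity check, you can call the helper directly (a toy prompt of my own; the exact output will vary):

```python
# Toy prompt to verify the Groq client and the helper work end to end.
print(call_model("In one sentence, what is Retrieval Augmented Generation?"))
```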
We'll use SQLite with the sqlite-vec extension to store our documents and their embeddings. Here's how to set up the database:
db = sqlite3.connect("readmes.sqlite3")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)
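Optionally, you can confirm the extension actually loaded by calling sqlite-vec's `vec_version()` function:

```python
# Should return the sqlite-vec version string (e.g. "v0.1.2") if the extension loaded correctly.
(vec_version,) = db.execute("SELECT vec_version()").fetchone()
print(f"sqlite-vec version: {vec_version}")
```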
Once connected to the database, let's create the necessary tables:
db.execute("""
CREATE TABLE documents(
id INTEGER PRIMARY KEY AUTOINCREMENT,
text TEXT
);
""")
db.execute("""
CREATE TABLE chunks(
id INTEGER PRIMARY KEY AUTOINCREMENT,
document_id INTEGER,
text TEXT,
FOREIGN KEY(document_id) REFERENCES documents(id)
);
""")
db.execute(f"""
CREATE VIRTUAL TABLE chunk_embeddings USING vec0(
id INTEGER PRIMARY KEY,
embedding FLOAT[{document_embeddings[0].shape[0]}]
);
""")
Here's a breakdown of the tables:

- documents: stores the full text of each document.
- chunks: stores the smaller text chunks split from each document.
- chunk_embeddings: stores the embedding of each chunk so we can run similarity search.

This database setup lets us store chunks and their embeddings efficiently and makes similarity search easy later on.
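If you want to verify the tables were created, a standard `sqlite_master` query works; note that the vec0 virtual table also creates a few internal shadow tables, which is expected:

```python
# List table names; chunk_embeddings will appear alongside its vec0 shadow tables.
tables = db.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print([name for (name,) in tables])
```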
To break the documents into manageable chunks for contextual retrieval, we'll save the documents to the database, split them into chunks, generate a short situating context for each chunk, and store the contextualized chunks along with their embeddings.

The documents we'll use are the README files of the Qwen 2.5 models and the LangGraph project.

First, let's save the documents in the database:
# qwen_doc and langgraph_doc hold the README texts loaded earlier.
documents = [qwen_doc, langgraph_doc]

with db:
    for doc in documents:
        db.execute("INSERT INTO documents(text) VALUES(?)", [doc])
To split the documents into smaller chunks, we'll use the RecursiveCharacterTextSplitter from LangChain:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=128)
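To get a feel for how the splitter behaves, you can try it on a toy string first (this sample text is just an illustration, not one of the actual READMEs):

```python
# Toy example: see how chunk_size and chunk_overlap play out on repetitive text.
sample_text = "vLLM is a fast and easy-to-use library for LLM inference and serving. " * 60
sample_chunks = text_splitter.split_text(sample_text)
print(f"{len(sample_chunks)} chunks")
print(sample_chunks[0][:120])
```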
We can now create the chunks and store them in the database:
with db:
    document_rows = db.execute("SELECT id, text FROM documents").fetchall()
    for row in document_rows:
        doc_id, doc_text = row
        chunks = text_splitter.split_text(doc_text)
        contextual_chunks = create_contextual_chunks(chunks, doc_text)
        save_chunks(contextual_chunks, doc_id)
To give each chunk additional context, we'll generate a short summary using the following prompt:
CONTEXTUAL_EMBEDDING_PROMPT = """
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Here is the content of the whole document:
<document>
{document}
</document>
Please provide a short, succinct context to situate this chunk within the overall document to improve search retrieval. Respond only with the context.
"""
Here's the function that does this:
def create_contextual_chunks(chunks: List[str], document: str) -> List[str]:
    contextual_chunks = []
    # tqdm shows progress, since each chunk requires one model call.
    for chunk in tqdm(chunks):
        prompt = CONTEXTUAL_EMBEDDING_PROMPT.format(chunk=chunk, document=document)
        chunk_context = call_model(prompt)
        contextual_chunks.append(f"{chunk}\n{chunk_context}")
    return contextual_chunks
This function sends each chunk, together with the whole document, to the model, which generates a short context that improves retrieval accuracy. That context is then appended to the chunk text.

We'll use the fastembed library to create embeddings for the document chunks:
embedding_model = TextEmbedding()
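fastembed's `TextEmbedding` defaults to a small English embedding model. It's worth checking the dimension it produces, since it has to match the `FLOAT[...]` size declared for the `chunk_embeddings` virtual table above:

```python
# Embed a probe string to discover the embedding dimension used by the default model.
probe = list(embedding_model.embed(["dimension probe"]))[0]
print(probe.shape)  # e.g. (384,) for fastembed's default model
```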
Finally, let's save the chunks and their embeddings to the database:
def save_chunks(chunks: List[str], doc_id: int):
    chunk_embeddings = list(embedding_model.embed(chunks))
    for chunk, embedding in zip(chunks, chunk_embeddings):
        result = db.execute(
            "INSERT INTO chunks(document_id, text) VALUES(?, ?)", [doc_id, chunk]
        )
        chunk_id = result.lastrowid
        db.execute(
            "INSERT INTO chunk_embeddings(id, embedding) VALUES (?, ?)",
            [chunk_id, serialize_float32(embedding)],
        )
This function saves each chunk and its embedding into the chunks and chunk_embeddings tables. The serialize_float32 function from sqlite-vec stores the embedding in a format that can be queried efficiently later.

Once the chunks and their embeddings are in the database, we can retrieve the most relevant context for a given query. Here's the function that does that:
def retrieve_context(query: str, k: int = 3, embedding_model: TextEmbedding = embedding_model) -> str:
    query_embedding = list(embedding_model.embed([query]))[0]
    results = db.execute(
        """
        SELECT
            chunk_embeddings.id,
            distance,
            text
        FROM chunk_embeddings
        LEFT JOIN chunks ON chunks.id = chunk_embeddings.id
        WHERE embedding MATCH ? AND k = ?
        ORDER BY distance
        """,
        [serialize_float32(query_embedding), k],
    ).fetchall()
    return "\n-----\n".join([item[2] for item in results])
To generate an answer, we combine a system prompt with the retrieved context. This helps the model produce accurate, context-grounded responses.

The system prompt sets the tone and expectations for how the model should respond:
SYSTEM_PROMPT = """
You're an expert AI/ML engineer with a background in software development.
You're answering questions about technical topics and projects.
If you don't know the answer, simply state that you don't know.
Keep your answers brief and to the point. Be kind and respectful.
Use the provided context for your answers. The most relevant information is
at the top. Each piece of information is separated by -----.
"""
Here's the function that ties everything together:
def ask_question(query: str) -> tuple[str, str]:
    messages = [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
        },
    ]
    context = retrieve_context(query)
    prompt = dedent(
        f"""
        Use the following information:
        ```
        {context}
        ```
        to answer the question:
        {query}
        """
    )
    return call_model(prompt, messages), context
To answer a question, you call the function like this:
answer, context = ask_question("How does Contextual Retrieval improve RAG performance?")
print("Answer:", answer)
print("Context used:", context)
This gives you both the answer and the context the model used to generate it.

Now we can test the system with a few questions. Let's start with a simple one about the Qwen models:
query = "How many parameters does Qwen have?"
response, context = ask_question(query)
print(response)
Output:
Qwen2.5 models are available in various sizes, with the number of parameters
ranging from 0.5B to 72B. The specific model mentioned in the text has 32.5B
parameters, with 31.0B non-embedding parameters.
Great, it looks like the model gave accurate information based on the retrieved context. Let's try something more technical:
query = "How should one deploy Qwen model on a private server?"
response, context = ask_question(query)
print(response)
Output:
To deploy Qwen2.5 on a private server, you can use vLLM, a fast and easy-to-use
framework for LLM inference and serving. First, install vllm>=0.4.0 using pip.
Then, run the following command to build up a vLLM service:
```bash
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct
```
Alternatively, with vllm>=0.5.3, you can use:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct
```
This will start a service that you can interact with using the OpenAI API.
This is a good summary of the deployment section of the documentation. Let's try one more question:
query = "I have a RTX 4090 (24GB). Which version of the model can I run with good inference speed?"
response, context = ask_question(query)
print(response)
Output:
Based on the provided information, the model sizes available for Qwen2.5 are
0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B.
Considering your RTX 4090 has 24GB of memory, you can likely run the 7B or 14B
models with good inference speed. However, the 14B model might be pushing the
limits of your GPU's memory, so the 7B model would be a safer choice.
Keep in mind that the actual performance will also depend on other factors such
as your system's CPU, RAM, and the specific use case.
This information isn't stated directly in the documents, but the model gives a good answer based on the retrieved context and its own reasoning.

You've built a RAG system that uses contextual chunking, embeddings stored in SQLite with sqlite-vec, and Llama 3.1 (via the Groq API) to answer questions about technical documents.

As you continue to refine this system, you might enhance it with hybrid search (e.g. adding BM25) or a reranking step, both of which Anthropic found to further reduce retrieval failures.

Tell me what you plan to build with this system!
Original article: https://www.mlexpert.io/blog/rag-contextual-retrieval