
Gorilla LLM in Practice: Setup, Inference, and Fine-Tuning GPT-3.5 on the Gorilla API Dataset
Gorilla LLM gives developers a more efficient, accurate, and reliable way to generate API calls: it improves precision, reduces hallucination errors, saves time and effort, strengthens reliability, and handles constrained APIs.
With so many popular open-source models available, why did Gorilla choose LLaMA rather than another model, and were multiple models fine-tuned and tested? LLaMA was chosen as the starting point because it is regarded as the workhorse of open-source LLMs; many other models are derivatives of it tuned for specific applications. Gorilla was also benchmarked against GPT-4, GPT-3.5, and Claude-v1. With commercial use of open-source models in mind, two further Gorilla variants based on MPT-7B and Falcon-7B were released later. Gorilla models now ship under the Apache 2.0 license, which means Gorilla can be used commercially without restriction!
As for the hardware needed to train Gorilla, the team reports using a node of 8 A100 40GB GPUs to train and evaluate all models. The time required varies widely with the model and API dataset: the shortest run took roughly 10 GPU-hours in total, the longest about 120 GPU-hours. Training also used state-of-the-art compute techniques (efficient attention mechanisms) and memory optimizations (sharding, checkpointing, and mixed-precision training). LoRA was not used; all Gorilla models were fine-tuned end to end.
Gorilla LLM was trained on a massive dataset of API documentation and code, including API calls from a variety of platforms such as Google Cloud Platform, Amazon Web Services, and Microsoft Azure. Gorilla uses this dataset to learn the syntax and semantics of API calls. When you ask Gorilla to generate an API call, it first tries to find a matching call in its dataset; if it finds one, it simply returns that call. If it does not, it generates a new API call based on what it has learned about API syntax and semantics. Connecting Gorilla to an API involves several key steps:
Notably, Gorilla is highly adaptable and can run in both zero-shot and retrieval modes, which lets it track changes in API documentation and stay accurate over time. The first (and most popular) mode is zero-shot: Gorilla takes a user query in natural language and returns the correct API to call. In many real-world settings, however, APIs evolve over time: versions change, endpoints move, parameters get reshuffled, and some get deprecated. To make the system robust to this, there is a second mode of using Gorilla: retriever-aware. In this mode, the most relevant API is selected and appended to the user's prompt, which lets the model pick up API changes.
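In practice, retriever-aware mode boils down to concatenating the retrieved API documentation onto the user's query before it is sent to the model. A minimal sketch (the function name and prompt wording here are illustrative assumptions, not Gorilla's verbatim template):

def build_retriever_aware_prompt(user_query: str, retrieved_api_doc: str) -> str:
    # Append the most relevant API reference to the user's prompt so the
    # model can pick up API changes; the exact wording is an assumption.
    return f"{user_query}\nUse this API documentation for reference: {retrieved_api_doc}"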
To use Gorilla LLM, you must have Python 3.10 or later installed; earlier versions of Python will not work.
On a fresh server, you first need to install Conda. In a terminal, download the Miniconda installer script:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Run the installer:
bash Miniconda3-latest-Linux-x86_64.sh
Follow the installer prompts; you can choose options such as the install location and environment-variable setup. Once installation finishes, activate the conda environment with:
source ~/.bashrc
Check that conda installed successfully:
conda --version
If it did, you will see conda's version number; in my case it is conda 23.5.2. Next, create and activate a dedicated environment and install the project dependencies:
conda create -n gorilla python=3.10
conda activate gorilla
pip install -r requirements.txt
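Note: the requirements.txt referenced above lives in the Gorilla repository, so these steps assume you have cloned it and are running commands from its root directory:

git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla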
Obtain the base LLaMA weights (see the Transformers LLaMA documentation) and download the Gorilla delta weights from Hugging Face:
https://huggingface.co/docs/transformers/main/model_doc/llama
https://huggingface.co/gorilla-llm/gorilla-7b-hf-delta-v1
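One convenient way to fetch the delta weights is the huggingface_hub client (a sketch; cloning the model repo with git lfs works just as well):

from huggingface_hub import snapshot_download

# Download the Gorilla delta weights into the local Hugging Face cache;
# the returned directory can then be passed as --delta-path below.
delta_path = snapshot_download(repo_id="gorilla-llm/gorilla-7b-hf-delta-v1")
print(delta_path)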
Replace the placeholders in the following command with the correct file paths, then run it to apply the delta weights to your LLaMA model (the target path is where the merged model will be written):
python3 apply_delta.py \
  --base-model-path path/to/hf_llama/ \
  --target-model-path path/to/gorilla-7b-hf-v1 \
  --delta-path path/to/models--gorilla-llm--gorilla-7b-hf-delta-v1
python3 serve/gorilla_falcon_cli.py --model-path path/to/gorilla-falcon-7b-hf-v0
# If you are running on a Mac with Apple silicon (M1, M2, etc.), add "--device mps"
path/to/gorilla-7b-{hf,th,tf}-v0 should be replaced with the real path to your Gorilla model.

The repository's data and code are laid out as follows:
- Each file in the api subdirectory represents one API and is named {api_name}_api.jsonl.
- The apibench subfolder contains the LLM training and evaluation datasets, in the files {api_name}_train.jsonl and {api_name}_eval.jsonl.
- Community-contributed APIs can be found in the apizoo subdirectory.
- For evaluation, the README.md file contains instructions and data for the evaluation process; model responses are collected with the get_llm_responses.py script; and eval-scripts holds one evaluation script per API, e.g. ast_eval_{api_name}.py. Responses live in the responses subfolder as files named responses_{api_name}Gorilla_FT{eval_metric}.jsonl and responses_{api_name}Gorilla_RT{eval_metric}.jsonl. In the questions subdirectory, each API folder has files named questions_{api_name}_{eval_metric}.jsonl; the question files are organized by API name and evaluation metric.
- For inference, the README.md file most likely contains the instructions for running the inference code, and the serve subdirectory contains the Gorilla command-line interface (CLI) scripts and chat templates.
- The train folder is labeled "Coming soon!" and should eventually contain the Gorilla model training code, but it does not appear to be available yet.

You can refer to the README in each folder for more specific instructions on using the provided code and datasets.
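Each of these dataset files is plain JSONL (one JSON record per line), so it is easy to inspect; a minimal sketch, assuming a hypothetical huggingface_api.jsonl downloaded from the api subdirectory:

import json

# Load one record per line; the fields inside each record vary by API
with open("huggingface_api.jsonl") as f:
    apis = [json.loads(line) for line in f]

print(len(apis), "API entries")
print(apis[0])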
First, install OpenAI with pip:
pip install openai
Configure api_key and api_base like this:
import openai
openai.api_key = "EMPTY"  # the key can be ignored
openai.api_base = "http://34.132.127.197:8000/v1"  # or http://zanino.millennium.berkeley.edu:8000/v1
Create a function that uses the OpenAI library to fetch results from Gorilla:
def get_gorilla_response(prompt="I would like to translate from English to Chinese.", model="gorilla-falcon-7b-hf-v0"):
    completion = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return completion.choices[0].message.content
Call the function with your prompt and the model you want to use, in this case gorilla-falcon-7b-hf-v0:
prompt = "我想從英語翻譯成中文。"
print(get_gorilla_response(prompt, model="gorilla-falcon-7b-hf-v0" ))
That's it. You will get back the full response from the Hugging Face API, along with instructions on how to execute the request:
<<<domain>>>: Natural Language Processing Translation
<<<api_call>>>: pipeline('translation_en_to_zh', model='Helsinki-NLP/opus-mt-en-zh')
<<<api_provider>>>: Hugging Face Transformers
<<<explanation>>>:
1. Import the pipeline function from the transformers library provided by Hugging Face.
2. Use the pipeline function to create a translation model, specifying the model 'Helsinki-NLP/opus-mt-en-zh' to be loaded. This model is trained to translate text from English to Chinese.
3. Pass your English text to the translator object, and it will return the translated Chinese text.
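Following those three steps literally gives a short, runnable snippet (a sketch; this particular model may also require the sentencepiece and sacremoses packages to be installed):

from transformers import pipeline

# Build the English-to-Chinese translation pipeline named in Gorilla's response
translator = pipeline('translation_en_to_zh', model='Helsinki-NLP/opus-mt-en-zh')

result = translator("I love programming.")
print(result[0]['translation_text'])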
Gorilla HuggingFace Demo:https://huggingface.co/spaces/davila7/try-gorilla?source=post_page
# Object detection
prompt = "I want to build a robot that can detect objects in the image 'cat.jpeg'. Input: ['cat.jpeg']"
print(get_gorilla_response(prompt, model="gorilla-falcon-7b-hf-v0"))
Output:
<<<domain>>>: Computer Vision Object Detection
<<<api_call>>>: model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-101-dc5')
<<<api_provider>>>: Hugging Face Transformers
<<<explanation>>>: 1. Import the necessary components from the Hugging Face Transformers library, torch, and PIL (Python Imaging Library).
2. Open the image using PIL's Image.open() function with the provided image path.
3. Initialize the pretrained DETR (DEtection TRansformer) model and the image processor.
4. Generate inputs for the model using the image processor.
5. Pass the inputs to the model, which returns object detection results.
<<<code>>>:
from transformers import AutoFeatureExtractor, AutoModelForObjectDetection
from PIL import Image
import torch
def load_model():
    feature_extractor = AutoFeatureExtractor.from_pretrained('facebook/detr-resnet-101-dc5')
    model = AutoModelForObjectDetection.from_pretrained('facebook/detr-resnet-101-dc5')
    return feature_extractor, model

def process_data(image_path, feature_extractor, model):
    image = Image.open(image_path)
    inputs = feature_extractor(images=image, return_tensors='pt')
    outputs = model(**inputs)
    results = feature_extractor.post_process(outputs, threshold=0.6)[0]
    response = [model.config.id2label[label.item()] for label in results['labels']]
    return response

image_path = 'cat.jpeg'

# Load the model and feature extractor
feature_extractor, model = load_model()

# Process the data
response = process_data(image_path, feature_extractor, model)
print(response)
# Torch Hub translation
prompt = "I would like to translate from English to Chinese."
print(get_gorilla_response(prompt, model="gorilla-falcon-7b-hf-v0"))
Output:
{'domain': 'Machine Translation', 'api_call': \"model = torch.hub.load('pytorch/fairseq
Next, let's fine-tune ChatGPT-3.5 on the Gorilla API dataset to try to improve its performance. OpenAI's fine-tuning documentation for ChatGPT-3.5 is available here:
https://platform.openai.com/docs/guides/fine-tuning
Note: this fine-tuning script will train on about 7.2 million tokens through OpenAI, which costs money; please make sure you are willing to pay for it before continuing.
pip install openai tiktoken
import re
import os
import json
import openai
from pprint import pprint
openai_api_key = "OPENAI API KEY"
openai.api_key = openai_api_key
Download the Gorilla Hugging Face API training data. All of the Gorilla training data can be found here: https://github.com/ShishirPatil/gorilla/tree/main/data/apibench
wget https://raw.githubusercontent.com/ShishirPatil/gorilla/cab053ba7fdf4a3286c0e75aa2bf7abc4053812f/data/apibench/huggingface_train.json
data = []
with open("huggingface_train.json", "r") as file:
    # data = json.load(file)
    for line in file:
        item = json.loads(line.strip())
        data.append(item)

# This is the training-related data
data[0]["code"]
Parse the instructions and outputs from the training data:
def parse_instructions_and_outputs(code_section):
    sections = code_section.split('###')
    for section in sections:
        if "Instruction:" in section:
            instruction = section.split("Instruction:", 1)[1].strip()
            break

    domain = re.search(r'<<<domain>>>(.*?)\n', code_section, re.IGNORECASE).group(1).lstrip(': ')
    api_call = re.search(r'<<<api_call>>>(.*?)\n', code_section, re.IGNORECASE).group(1).lstrip(': ')
    api_provider = re.search(r'<<<api_provider>>>(.*?)\n', code_section, re.IGNORECASE).group(1).lstrip(': ')

    if "<<<explanation>>>" in code_section:
        explanation_pattern = r'<<<explanation>>>(.*?)(?:\n<<<code>>>|```|$)'
        explanation = re.search(explanation_pattern, code_section, re.DOTALL).group(1).lstrip(': ')
    else:
        explanation = None

    # Extract the code snippet, covering both delimiter styles
    code_pattern = r'(?:<<<code>>>|```) (.*)'  # Matches <<<code>>> or ```
    code_snippet_match = re.search(code_pattern, code_section, re.DOTALL)
    code_snippet = code_snippet_match.group(1).lstrip(': ') if code_snippet_match else None

    return instruction, domain, api_call, api_provider, explanation, code_snippet
def encode_train_sample(data, api_name):
    """Encode multiple prompt instructions into a single string."""
    code_section = data['code']
    if "<<<api_call>>>" in code_section:
        instruction, domain, api_call, api_provider, explanation, code = parse_instructions_and_outputs(code_section)
        prompts = []
        #prompt = instruction + "\nWrite a python program in 1 to 2 lines to call API in " + api_name + ".\n\nThe answer should follow the format: <<<domain>>> $DOMAIN, <<<api_call>>>: $API_CALL, <<<api_provider>>>: $API_PROVIDER, <<<explanation>>>: $EXPLANATION, <<<code>>>: $CODE}. Here are the requirements:\n" + domains + "\n2. The $API_CALL should have only 1 line of code that calls api.\n3. The $API_PROVIDER should be the programming framework used.\n4. $EXPLANATION should be a step-by-step explanation.\n5. The $CODE is the python code.\n6. Do not repeat the format in your answer."
        prompts.append({"role": "system", "content": "You are a helpful API writer who can write APIs based on requirements."})
        prompts.append({"role": "user", "content": instruction})
        prompts.append({"role": "assistant", "content": f"<<<domain>>> {domain},<<<api_call>>>: {api_call}, <<<api_provider>>>: {api_provider}, <<<explanation>>>: {explanation}, <<<code>>>: {code}"})
        return prompts
    else:
        return None
Format the training samples correctly, mirroring the format used in the Gorilla paper:
encoded_data = []
none_count = 0
for d in data:
    res = encode_train_sample(d, "huggingface")
    if res is not None:
        encoded_data.append({"messages": res})
    else:
        none_count += 1

print(f"{none_count} samples out of {len(data)} ignored")
Print a sample that will be passed to OpenAI for fine-tuning:
encoded_data[3]
Output:
{'messages': [{'role': 'system',
'content': 'You are a helpful API writer who can write APIs based on requirements.'},
{'role': 'user',
'content': 'I run an online art store and I want to classify the art pieces uploaded by the users into different categories like abstract, landscape, portrait etc.'},
{'role': 'assistant',
'content': "<<<domain>>> Computer Vision Image Classification,<<<api_call>>>: ViTModel.from_pretrained('facebook/dino-vits8'), <<<api_provider>>>: Hugging Face Transformers, <<<explanation>>>: 1. We first import the necessary classes from the transformers and PIL packages. This includes ViTModel for the image classification model and Image for processing image data.\n2. We then use the from_pretrained method of the ViTModel class to load the pre-trained model 'facebook/dino-vits8'. This model has been trained using the DINO method which is particularly useful for getting good features for image classification tasks.\n3. We load the image data from an uploaded image file by the user.\n4. This model can then be used to classify the image into different art categories like 'abstract', 'landscape', 'portrait' etc., <<<code>>>: None"}]}
# We start by importing the required packages
import json
import os
import tiktoken
import numpy as np
from collections import defaultdict

# Next, we specify the data path and open the JSONL file
data_path = encoded_file_path

# Load the dataset
with open(data_path) as f:
    dataset = [json.loads(line) for line in f]

# We can quickly inspect the data by checking the number of examples and the first item

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)
# Now that we have a sense of the data, we need to go through all the examples
# and check that the format is correct and matches the chat completion message structure

# Format error checks
format_errors = defaultdict(int)
for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        if any(k not in ("role", "content", "name") for k in message):
            format_errors["message_unrecognized_key"] += 1
        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1
        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
# Beyond the message structure, we also need to make sure lengths do not exceed the 4096-token limit

# Token counting helpers
encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")
# Finally, before creating the fine-tuning job, we can look at the results of the different formatting checks:

# Warnings and token counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []
for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may exceed the 4096 token limit and will be truncated during fine-tuning")
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
TARGET_EPOCHS = 3
MIN_EPOCHS = 1
MAX_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
print("See the pricing page to estimate total costs")
Upload the training file; the returned file ID is what the fine-tuning job below expects as training_file:
file_response = openai.File.create(
    file=open(encoded_file_path, "rb"),
    purpose='fine-tune'
)
print(file_response["id"])
Create the fine-tuning job:
openai.api_key = openai_api_key
openai.FineTuningJob.create(
    training_file="file-OrxAP7HcvoSUmu9MtAbWo5s4",
    model="gpt-3.5-turbo"
)

# List 10 fine-tuning jobs
openai.FineTuningJob.list(limit=10)

# Retrieve the state of a fine-tuning job
state = openai.FineTuningJob.retrieve("ftjob-qhg4yswil15TCqD4SNHn0V1D")
state["status"], state["trained_tokens"], state["finished_at"]

# List up to 10 events from a fine-tuning job
openai.FineTuningJob.list_events(id="ftjob-qhg4yswil15TCqD4SNHn0V1D", limit=10)
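Fine-tuning takes a while, so instead of re-running the retrieve call by hand you can poll until the job reaches a terminal state. A sketch using the same legacy openai-python 0.x interface as the rest of this walkthrough (the job ID is the one from above):

import time

job_id = "ftjob-qhg4yswil15TCqD4SNHn0V1D"
while True:
    state = openai.FineTuningJob.retrieve(job_id)
    if state["status"] in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)  # poll once per minute

# On success, this is the model name to pass to ChatCompletion.create below
print(state["status"], state.get("fine_tuned_model"))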
openai.api_key = openai_api_key
completion = openai.ChatCompletion.create(
    model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How can I load a NER model?"}
    ]
)
print(completion.choices[0].message)
print(completion.choices[0].message["content"])
Output:
('To load a Named Entity Recognition (NER) model in Python, you can use the '
"Hugging Face Transformers library. Here's a step-by-step guide to loading "
'and using a NER model:\n'
'\n'
'1. Install the required Hugging Face Transformers library using "pip install '
'transformers".\n'
'2. Import the AutoModelForTokenClassification class from the transformers '
'library.\n'
'3. Import the necessary tokenizer as well, which is AutoTokenizer in this '
'case.\n'
'4. Use the from_pretrained method to load the pre-trained model with its '
'respective model name or identifier.\n'
'5. Then, use the load_tokenizer method to load the tokenizer.\n'
'6. Encode your text using the loaded tokenizer, specifying the '
"'return_tensors' parameter as 'pt'.\n"
'7. Pass the input tensor to the model and it will return the predictions, '
'describing the Named Entities in the text.\n'
'\n'
'Please keep in mind that you should download the model first, replace '
"'YOUR_MODEL_NAME' with an appropriate model identifier, and make sure to "
'execute this code on a suitable device (e.g., CPU or GPU).\n'
'\n'
'Here is how the code looks:\n'
 '```python\n'
 'from transformers import AutoModelForTokenClassification, AutoTokenizer\n'
 'import torch\n'
 '\n'
 "model = AutoModelForTokenClassification.from_pretrained('YOUR_MODEL_NAME')\n"
 "tokenizer = AutoTokenizer.from_pretrained('YOUR_MODEL_NAME')\n"
 '\n'
 '# Encode your text using the loaded tokenizer\n'
 "inputs = tokenizer(text, return_tensors='pt')\n"
 '\n'
 '# Pass the input tensor to the model and obtain NER predictions\n'
 'predictions = model(**inputs)\n'
 '```\n'
'\n'
"Remember to replace 'YOUR_MODEL_NAME' with an appropriate BERT NER-trained "
"model such as 'dslim/bert-base-NER'.")
Gorilla LLM is a breakthrough LLM that can generate accurate API calls and adapt to real-time changes in documentation. It paves the way for future LLMs to become more reliable and versatile when interacting with tools and systems. Gorilla LLM is a powerful new tool for developers: it can save them time and effort and help them write more reliable code. If you are a developer, it is worth taking a look at. Future progress can focus on further reducing hallucination errors, improving adaptability to different APIs, and expanding the ability to handle complex tasks. Potential applications include serving as the primary interface to computing infrastructure, automating processes such as vacation booking, and enabling seamless communication between diverse web APIs.
https://shishirpatil.github.io/gorilla/
https://github.com/ShishirPatil/gorilla
https://arxiv.org/abs/2305.15334
Reposted from the WeChat public account @技術狂潮AI