One-stop LLM training, export, and deployment
The framework (LLaMA-Factory) covers the entire workflow, from annotation through fine-tuning, export, merging, and deployment, and training with it also consumes somewhat less GPU compute.
1. Installation (Linux is recommended for training; on Windows you can use WSL + Docker)
conda create -n llamafactory python=3.10 -y
conda activate llamafactory
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
# Pick the PyTorch build that matches your CUDA version
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia
# Make sure the GPU build of torch is installed first
pip install -e ".[torch,metrics]"
# If the torch download is too slow, switch to a proxy/mirror, or download it on Windows (e.g. with Thunder) and then upload it to the server
# If you run into package conflicts, use pip install --no-deps -e . instead
# Check that torch can see the GPU
Start python from the command line:
import torch
print(torch.cuda.is_available())  # True means torch can use the GPU
2. Training
1. Dataset preparation and configuration
Reference: https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README_zh.md
# I use the role-play dialogue dataset format (sharegpt style)
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "system": "system prompt (optional)"
  }
]
You also need to update the matching entry in dataset_info.json (when training starts, the framework uses this file to locate the JSON file that actually stores the data):
"yi_6b_chat": {
"file_name": "yi_6b_chat_520_24000.json",
"formatting": "sharegpt", # 表示数据使用的格式
"tags": { # 和数据集中的格式一一对应
"role_tag": "from",
"content_tag": "value",
"user_tag": "human",
"assistant_tag": "gpt"
}
},
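Before launching a run, it can save time to sanity-check that the dataset file actually matches the tags declared above. A minimal sketch (the file path is an assumption based on the dataset_dir and file_name used here):

from pathlib import Path
import json

role_tag, content_tag = "from", "value"   # must match dataset_info.json
valid_roles = {"human", "gpt"}            # user_tag and assistant_tag

samples = json.loads(Path("data/yi_6b_chat_520_24000.json").read_text(encoding="utf-8"))
for i, sample in enumerate(samples):
    for turn in sample["conversations"]:
        assert role_tag in turn and content_tag in turn, f"missing keys in sample {i}"
        assert turn[role_tag] in valid_roles, f"unknown role {turn[role_tag]!r} in sample {i}"
print(f"{len(samples)} samples match the sharegpt tags")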
2. Training: launch the web UI
CUDA_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui
Multi-GPU LoRA training (the framework only supports multi-GPU training from the command line):
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path /llm_train/model/Yi-1.5-34B-Chat-16K-GPTQ-Int4 \
--finetuning_type lora \
--quantization_bit 4 \
--template default \
--flash_attn auto \
--dataset_dir data \
--dataset yi_6b_chat \
--cutoff_len 1024 \
--learning_rate 5e-05 \
--num_train_epochs 30.0 \
--max_samples 1000 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--optim adamw_torch \
--packing False \
--report_to none \
--output_dir saves/Custom/lora/train_2024-07-09-09-11-33 \
--fp16 True \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target q_proj,v_proj \
--plot_loss True
Export (only non-quantized models can be exported and merged; a quantized model can instead be served by loading the base model weights and the LoRA weights together, as shown in the deployment section below)
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
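For reference, the merge that export performs can also be sketched directly with transformers + peft. This is only a minimal illustration of the same idea under assumed placeholder paths, not the LLaMA-Factory implementation itself:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/llm_train/model/Yi-1.5-9B-Chat"                 # non-quantized base model (placeholder)
adapter_path = "saves/Custom/lora/train_xxx/checkpoint-600"   # LoRA checkpoint dir (placeholder)
out_dir = "merged_model"

base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()    # fold the LoRA deltas into the base weights
merged.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(base_path).save_pretrained(out_dir)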
Evaluation
llamafactory-cli eval examples/train_lora/llama3_lora_eval.yaml
Transformers deployment
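A minimal sketch of inference with plain Transformers, loading the base weights and the un-merged LoRA adapter together (the approach mentioned above for quantized models). The paths are placeholders, and loading a GPTQ model this way assumes a GPTQ runtime (e.g. auto-gptq/optimum) is installed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/llm_train/model/Yi-1.5-34B-Chat-16K-GPTQ-Int4"   # quantized base model (placeholder)
adapter_path = "saves/Custom/lora/train_xxx/checkpoint-600"    # LoRA checkpoint dir (placeholder)

tokenizer = AutoTokenizer.from_pretrained(base_path)
model = AutoModelForCausalLM.from_pretrained(base_path, device_map="auto", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, adapter_path)   # attach the LoRA weights without merging
model.eval()

messages = [{"role": "user", "content": "Hello, please introduce yourself."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.9, top_p=0.95)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))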
vLLM deployment
1. Install vLLM
python : 3.10.14
cuda : 12.1
By default pip installs the vLLM build for CUDA 12.1; the BFSU mirror makes the download faster:
pip install vllm -i https://mirrors.bfsu.edu.cn/pypi/web/simple/
Installation for CUDA 11.8:
pip install https://github.com/vllm-project/vllm/releases/download/v0.6.0/vllm-0.6.0+cu118-cp310-cp310-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
2. Deployment
CUDA_VISIBLE_DEVICES=1 \
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 --port 10000 \
--gpu-memory-utilization 0.9 \
--max-model-len 4096 \
--served-model-name Qwen2.5-32B-Instruct-GPTQ-Int4 \
--model /home/zsh/vllm/model/Qwen2.5-32B-Instruct-GPTQ-Int4 \
--enable-lora \
--lora-modules my_lora=/home/zsh/vllm/lora_model/qwen2.5_32b_train_001/checkpoint-600 \
--max-num-seqs 32
--gpu-memory-utilization 0.9  # fraction of GPU memory vLLM pre-allocates (taking a 4090 as an example: even if the model weights only need about 12 GB, with this set to 0.9 vLLM still reserves 90% of the card's memory up front, presumably as part of its KV-cache/acceleration mechanism)
--max-num-seqs 32  # maximum number of concurrent sequences; tune as needed. In general, the lower the concurrency, the faster each request is handled. For a real-time voice project I set it to 10-30 (qwen2.5-32b-int4 just barely fits with the default value; once LoRA weights are added it no longer fits, and lowering max-num-seqs frees some memory. Serving un-merged LoRA weights costs a little performance, but in most cases it is negligible)
--enable-lora / --lora-modules  # these options enable serving LoRA weights
--lora-modules my_lora=lora_model_path  # specifies the path to the LoRA weights; any directory containing adapter_model.safetensors will do. Requests addressed to my_lora use the fine-tuned LoRA weights, while requests addressed to the --served-model-name (Qwen2.5-32B-Instruct-GPTQ-Int4) use the original base weights. Several LoRA modules can be loaded at once: since training can be configured to save a LoRA checkpoint every few epochs, you can serve them side by side and simply switch names to compare checkpoints and pick the best one.
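As a quick sanity check (a minimal sketch, assuming the server above is reachable on port 10000), you can ask the server which model names it exposes; both the --served-model-name and every --lora-modules name should be listed:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:10000/v1")
for m in client.models.list().data:
    print(m.id)   # expect Qwen2.5-32B-Instruct-GPTQ-Int4 and my_lora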
Calling vLLM: single-turn chat (client code)
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:9000/v1"  # address the vLLM server listens on (this uses the OpenAI-compatible API; vLLM also offers a native API that does not require the openai package)
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)
messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "user input"},
]
chat_response = client.chat.completions.create(
    model="lora_600",
    messages=messages,
    max_tokens=128,          # maximum number of tokens the model may output per call
    top_p=0.95,              # smaller values make the output more "stable", which is not always what you want; for a role-play model, raise top_p and temperature a little so the replies are more varied and less wooden
    temperature=0.9,         # higher values give more varied replies, but if set too high the model may ramble or repeat itself; this also depends on the training result, so tune after training (values below 1 are usually fine)
    frequency_penalty=0.7,   # similar goal, but controls repetition/diversity through a different mechanism; in practice all of these are tuned until the replies match your expectations
    presence_penalty=0.7,    # same as above
)
assistant_content = chat_response.choices[0].message.content  # the model's reply
completion_tokens = chat_response.usage.completion_tokens     # tokens generated in this call
prompt_tokens = chat_response.usage.prompt_tokens             # tokens consumed by the prompt
total_tokens = chat_response.usage.total_tokens               # total tokens used (for a 4k-context model, output is truncated once total_tokens exceeds 4k, so watch this value to avoid truncation)
Calling vLLM: single-user multi-turn chat (client code). Adapt the code to your needs; the settings are the same as in the single-turn case, with an added history list that stores the conversation so far.
system: the system prompt
user: the user's input
assistant: the model's reply
from openai import OpenAI
import time
openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:9000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
def vllm_chat_run(data, index):
user_input, messages, history = data['userInput'], data['messages'], data['history']
current_session = [user_input, None]
history.append(current_session)
for val in history:
if val[0]:
messages.append({"role": "user", "content": val[0]})
if val[1]:
messages.append({"role": "assistant", "content": val[1]})
# print(messages)
if index<=20:
chat_response = client.chat.completions.create(
model="lora_600",
messages=messages,
max_tokens=128,
top_p=0.95,
temperature=0.9,
frequency_penalty=0.7,
presence_penalty=0.7,
)
else:
chat_response = client.chat.completions.create(
model="lora_600",
messages=messages,
max_tokens=128,
top_p=0.95,
temperature=0.9,
)
return chat_response
def run(history, system_prompt):
    """main loop"""
    index = 0
    while True:
        index += 1
        print(f"---- turn {index} ----")
        userInput = input("[user]: ")
data = {'userInput': userInput,
'history': history,
'messages': [{'role': 'system', 'content': system_prompt}],
}
start_time = time.time()
chat_response = vllm_chat_run(data, index)
end_time = time.time()
print(f"消耗时间:{end_time-start_time}")
content = chat_response.choices[0].message.content
print("《模型》:",end='')
print(content)
completion_tokens = chat_response.usage.completion_tokens
prompt_tokens = chat_response.usage.prompt_tokens
total_tokens = chat_response.usage.total_tokens
print(f"本轮消耗token:{completion_tokens}")
print(f"当前提示词消耗token:{prompt_tokens}")
print(f"总共消耗token:{total_tokens}")
history[-1][-1]=content
if __name__ == '__main__':
with open('ai_prompt.txt', 'r', encoding='utf-8') as file:
ai_prompt = file.read()
with open('system_prompt.txt', 'r', encoding='utf-8') as file:
system_prompt = file.read()
# history = [["你在干嘛?😡", ai_prompt]]
history = []
run(history, system_prompt)
Code for quickly putting a Gradio front end on top of the vLLM service:
import gradio as gr
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://39.105.36.188:9000/v1"
openai_api_base2 = "http://39.105.36.188:9001/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
client2 = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base2,
)
# model names; edit as needed
models_9b = [
'lora7','lora9','lora1', 'lora2',
'lora3','lora4', 'lora5',
'lora6', 'lora8',
]
models_34b = [
'lora4','lora9',
'lora1', 'lora2', 'lora3',
'lora5', 'lora6',
'lora7', 'lora8',
]
block_css = """.importantButton {
background: linear-gradient(45deg, #7e0570,#5d1c99, #6e00ff) !important;
border: none !important;
}
.importantButton:hover {
background: linear-gradient(45deg, #ff00e0,#8500ff, #6e00ff) !important;
border: none !important;
}"""
custom_css = """
#clearLabel34b {
    font-family: 'Arial', sans-serif; /* font family */
    font-size: 10px; /* font size */
    color: #333; /* font color */
}
"""
default_theme_args = dict(
font=["Source Sans Pro", 'ui-sans-serif', 'system-ui', 'sans-serif'],
font_mono=['IBM Plex Mono', 'ui-monospace', 'Consolas', 'monospace'],
)
init_message = "Welcome to the ChatGPT Gradio UI!"
def vllm_chat_run_9b(data):
user_input, messages, history, model, temperature = data['userInput'], data['messages'], data['history'], data['model'], data['temperature']
current_session = [user_input, None]
history.append(current_session)
for val in history:
if val[0]:
messages.append({"role": "user", "content": val[0]})
if val[1]:
messages.append({"role": "assistant", "content": val[1]})
# print(messages)
chat_response = client2.chat.completions.create(
model=model, # lora:1-9
messages=messages,
max_tokens=64,
top_p=0.7,
temperature=temperature
# temperature=1.2
# stream=True,
)
return chat_response
def vllm_chat_run_34b(data):
user_input, messages, history, model, temperature = data['userInput'], data['messages'], data['history'], data['model'], data['temperature']
current_session = [user_input, None]
history.append(current_session)
for val in history:
if val[0]:
messages.append({"role": "user", "content": val[0]})
if val[1]:
messages.append({"role": "assistant", "content": val[1]})
# print(messages)
chat_response = client.chat.completions.create(
model=model, # lora:1-9
messages=messages,
max_tokens=64,
top_p=0.7,
temperature=temperature
# temperature=1.2
# stream=True,
)
return chat_response
# # Use gr.State to manage the chat history
# chat_history9b = gr.State([])
# chat_history34b = gr.State([])
def respond_9b(query, model, temperature, system_prompt, ai_prompt_input_9b):
global chat_history9b
if not chat_history9b.value:
chat_history9b.value = [[None, ai_prompt_input_9b]]
if not system_prompt:
        system_prompt = 'Please remind the user to set a system_prompt'
data = {'userInput': query,
'history': chat_history9b.value,
'messages': [{'role': 'system', 'content': system_prompt}],
'model': model,
'temperature': temperature
}
chat_response = vllm_chat_run_9b(data=data)
content = chat_response.choices[0].message.content
chat_history9b.value[-1][-1]=content
return "", chat_history9b.value
def respond_34b(query, model, temperature, system_prompt, ai_prompt_input_34b):
global chat_history34b
if not chat_history34b.value:
chat_history34b.value = [[None, ai_prompt_input_34b]]
if not system_prompt:
        system_prompt = 'Please remind the user to set a system_prompt'
data = {'userInput': query,
'history': chat_history34b.value,
'messages': [{'role': 'system', 'content': system_prompt}],
'model': model,
'temperature': temperature
}
chat_response = vllm_chat_run_34b(data=data)
content = chat_response.choices[0].message.content
chat_history34b.value[-1][-1]=content
return "", chat_history34b.value
def clear9b():
    global chat_history9b
    print("clear9b called")
    chat_history9b.value = []
    return "Conversation history cleared"
def clear34b():
    global chat_history34b
    print("clear34b called")
    chat_history34b.value = []
    return "Conversation history cleared"
def create_state():
return gr.State([])
def setting_change(model, temperature, system_prompt):
    return f"Settings updated:\n model: {model} \n temperature: {temperature} \n system prompt: {system_prompt if system_prompt else 'none'}"
with gr.Blocks(css=block_css, theme=gr.themes.Default(**default_theme_args)) as demo:
gr.Markdown('ChatGPT Gradio')
    # create an independent chat-history state for each tab
chat_history9b = create_state()
chat_history34b = create_state()
with gr.Tab("Love-9B-Chat Models"):
with gr.Row():
with gr.Column(scale=10):
                chatbot_9b = gr.Chatbot(label="Chat history")
                query_9b = gr.Textbox(label="Your question", placeholder="Type your question and press Enter to submit")
                clear_button_9b = gr.Button("Restart conversation")
            with gr.Column(scale=5):
                model_9b = gr.Radio(models_9b, label="Choose a 9B LoRA model", value=models_9b[0])
                temperature_9b = gr.Slider(0, 2, value=1.15, step=0.05, label="Temperature")
                system_prompt_input_9b = gr.Textbox(label="System prompt", placeholder="Enter a system prompt (optional)")
                ai_prompt_input_9b = gr.Textbox(label="Model's opening line", placeholder="Write it to fit the system-prompt story (a suitable opening can improve quality)")
                settings_button_9b = gr.Button("Update settings")
            with gr.Column(scale=5, elem_id="clearLabel34b"):
                clear_label_9b = gr.Textbox(label="Status")
settings_button_9b.click(fn=setting_change, inputs=[model_9b, temperature_9b, system_prompt_input_9b], outputs=[clear_label_9b])
clear_button_9b.click(fn=clear9b, inputs=[], outputs=[clear_label_9b])
query_9b.submit(respond_9b, [query_9b, model_9b, temperature_9b, system_prompt_input_9b, ai_prompt_input_9b], [query_9b, chatbot_9b])
with gr.Tab("Love-34B-Chat Models"):
with gr.Row():
with gr.Column(scale=10):
                chatbot_34b = gr.Chatbot(label="Chat history")
                query_34b = gr.Textbox(label="Your question", placeholder="Type your question and press Enter to submit")
                clear_button_34b = gr.Button("Restart conversation")
            with gr.Column(scale=5):
                model_34b = gr.Radio(models_34b, label="Choose a 34B LoRA model", value=models_34b[0])
                temperature_34b = gr.Slider(0, 2, value=1.15, step=0.05, label="Temperature")
                system_prompt_input_34b = gr.Textbox(label="System prompt", placeholder="Enter a system prompt (optional)")
                ai_prompt_input_34b = gr.Textbox(label="Model's opening line", placeholder="Write it to fit the system-prompt story (a suitable opening can improve quality)")
                settings_button_34b = gr.Button("Update settings")
            with gr.Column(scale=5, elem_id="clearLabel34b"):
                clear_label_34b = gr.Textbox(label="Status")
settings_button_34b.click(fn=setting_change, inputs=[model_34b, temperature_34b, system_prompt_input_34b], outputs=[clear_label_34b])
clear_button_34b.click(fn=clear34b, inputs=[], outputs=[clear_label_34b])
query_34b.submit(respond_34b, [query_34b, model_34b, temperature_34b, system_prompt_input_34b, ai_prompt_input_34b], [query_34b, chatbot_34b])
user_info = [
("admin", "password"),
("guest", "password")
]
demo.launch(
server_name='0.0.0.0',
server_port=9090,
share=False,
debug=False,
auth=user_info,
    auth_message='Welcome to the LLM demo platform!'
)
Code for loading the model directly with vLLM (offline inference):
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
"""
vllm 部署demo
"""
def run(prompt, model_path, max_model_len, temperature, top_p, max_tokens):
# Create a sampling params object.
# stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(
temperature=temperature,
top_p=top_p,
max_tokens=max_tokens,
)
# tokenizer = None
    # Load the tokenizer and pass it to the vLLM LLM object; this is optional.
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
# Create an LLM.
llm = LLM(
model=model_path,
tokenizer=tokenizer,
max_model_len=max_model_len,
trust_remote_code=True
)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompt, sampling_params)
    # Print the outputs. A single prompt was passed, so take the first RequestOutput.
    generated_text = outputs[0].outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
if __name__ == "__main__":
# Sample prompts.
prompt = "Hello, my name is"
model_path = '/path/model'
max_model_len=2048
temperature=0.8
top_p=0.95
max_tokens=512
    run(prompt, model_path, max_model_len, temperature, top_p, max_tokens)
Counting model parameters:
from transformers import AutoModel
model = AutoModel.from_pretrained(r'G:\zsh\LLaMA-Factory\model\01-ai\Yi-1.5-9B-Chat')
# model = AutoModel.from_pretrained(r'G:\zsh\LLaMA-Factory\output\Yi-1.5-9B-Chat-Pro')
# total number of trainable parameters
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Number of parameters: {num_params}")
output:
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 4/4 [00:22<00:00, 5.61s/it]
Number of parameters: 8567263232
Script for measuring dataset token lengths:
Purpose: compute the average token length per sample in the dataset and find the longest and shortest samples.
from transformers import AutoTokenizer
import json
# initialize the tokenizer
model_path = r'G:\zsh\LLaMA-Factory\model\01-ai\Yi-1.5-9B-Chat'
tokenizer = AutoTokenizer.from_pretrained(model_path)
# load a JSON file
def load_json_file(file_path):
with open(file_path, 'r', encoding='utf-8') as file:
data = json.load(file)
return data
# load the ShareGPT-format dataset
file_path = "./all_chat.json"
data = load_json_file(file_path)
# token length of each sample
token_lengths = []
for item in data:
conversations = item['conversations']
total_length = 0
for conversation in conversations:
total_length += len(tokenizer.encode(conversation['value'], add_special_tokens=False))
token_lengths.append(total_length)
# compute the max, min, and average lengths
max_length = max(token_lengths)
min_length = min(token_lengths)
average_length = sum(token_lengths) / len(token_lengths)
max_index = token_lengths.index(max_length)
min_index = token_lengths.index(min_length)
# print the results
print(f"max token length: {max_length}, at sample {max_index+1}")
print(f"min token length: {min_length}, at sample {min_index+1}")
print(f"average token length: {average_length:.2f}")
output:
max token length: 2950, at sample 2921
min token length: 6, at sample 1010
average token length: 282.68
I am still adding more data-processing and training/deployment scripts; this post also doubles as a record of my own work.
Original article: https://blog.csdn.net/qq_45437316/article/details/143896351