
End-to-end LLM training, export, and deployment


This framework covers the whole pipeline (annotation, fine-tuning, export, merging, and deployment), and training with it also consumes relatively little GPU compute.

1. Installation (training on Linux is recommended; on Windows you can use WSL + Docker)

conda create -n llamafactory python=3.10 -y
conda activate llamafactory

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory

# Choose the PyTorch version that matches your CUDA version

conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia

# Install the GPU build of torch beforehand
pip install -e .[torch,metrics]

If the torch download is too slow, switch to a proxy, or download it on a Windows machine (e.g. with Thunder) and then copy it to the server.
# If you run into package conflicts, use pip install --no-deps -e .
# Check whether torch can use the GPU
Start python from the command line:

import torch
print(torch.cuda.is_available())  # True means torch can use the GPU

2. Training
2.1 Dataset preparation and configuration
Reference: https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README_zh.md

# I use the role-play dialogue dataset format

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "system": "system prompt (optional)"
  }
]

You also need to add a matching entry to dataset_info.json (when training starts, this file is used to locate the JSON file that holds the dataset you defined).

 "yi_6b_chat": {
    "file_name": "yi_6b_chat_520_24000.json",
    "formatting": "sharegpt", # 表示数据使用的格式
    "tags": { # 和数据集中的格式一一对应 
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "human",
      "assistant_tag": "gpt"
    }
  },

2.2 Training: launch the web UI

CUDA_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui

Multi-GPU LoRA training (the framework only supports multi-GPU training via the command line):

NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /llm_train/model/Yi-1.5-34B-Chat-16K-GPTQ-Int4 \
    --finetuning_type lora \
    --quantization_bit 4 \
    --template default \
    --flash_attn auto \
    --dataset_dir data \
    --dataset yi_6b_chat \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 30.0 \
    --max_samples 1000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves/Custom/lora/train_2024-07-09-09-11-33 \
    --fp16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target q_proj,v_proj \
    --plot_loss True

Export (only non-quantized models can be exported and merged; for a quantized model you can instead load the base weights and the LoRA weights together, see the deployment section below)

llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
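
The llama3_lora_sft.yaml referenced above is one of the example configs shipped in the LLaMA-Factory repo. As a rough sketch, a merge config looks something like the following (field names follow the repo's merge_lora examples; verify them against the version you installed and substitute your own paths, e.g. the LoRA output directory from the training command above):

### model
model_name_or_path: /llm_train/model/your_base_model
adapter_name_or_path: saves/Custom/lora/train_2024-07-09-09-11-33
template: default
finetuning_type: lora

### export
export_dir: models/merged_model
export_size: 2
export_device: cpu
export_legacy_format: false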

Evaluation

llamafactory-cli eval examples/train_lora/llama3_lora_eval.yaml

Transformers deployment
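
The original post leaves this section empty, so here is only a minimal sketch of the approach mentioned above: loading the base weights and the LoRA weights together with transformers + peft. The paths reuse ones that appear elsewhere in this post; treat them, and the generation settings, as placeholders.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_path = "/llm_train/model/Yi-1.5-34B-Chat-16K-GPTQ-Int4"  # base (quantized) weights
lora_path = "saves/Custom/lora/train_2024-07-09-09-11-33"           # directory containing adapter_model.safetensors

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="auto",
    torch_dtype=torch.float16,
)
# Attach the LoRA adapter on top of the base weights (no merge needed).
model = PeftModel.from_pretrained(model, lora_path)
model.eval()

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hello!"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.9, top_p=0.95)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))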


vLLM deployment

1. Install vLLM
python : 3.10.14
cuda : 12.1

vLLM installs against CUDA 12.1 by default; the BFSU mirror is faster for downloads in China
pip install vllm -i https://mirrors.bfsu.edu.cn/pypi/web/simple/

Installation for CUDA 11.8:
pip install https://github.com/vllm-project/vllm/releases/download/v0.6.0/vllm-0.6.0+cu118-cp310-cp310-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118



Deployment
CUDA_VISIBLE_DEVICES=1 \
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 10000 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --served-model-name Qwen2.5-32B-Instruct-GPTQ-Int4 \
    --model /home/zsh/vllm/model/Qwen2.5-32B-Instruct-GPTQ-Int4 \
    --enable-lora \
    --lora-modules my_lora=/home/zsh/vllm/lora_model/qwen2.5_32b_train_001/checkpoint-600 \
    --max-num-seqs 32


--gpu-memory-utilization 0.9 # the fraction of GPU memory vLLM pre-allocates (on a 4090, even if the model weights only need 12 GB, setting this to 0.9 makes vLLM reserve 90% of the card's memory up front; this is part of vLLM's acceleration mechanism)
--max-num-seqs 32 # tune as needed; this is the maximum number of concurrent sequences. Normally, the fewer concurrent sequences, the faster each request is handled. For a real-time voice project I set it between 10 and 30. (Qwen2.5-32B-Int4 just barely runs with the default max_num_seqs; once LoRA weights are added it no longer fits, and lowering max_num_seqs frees some memory. Serving the LoRA without merging it costs a little performance, which is negligible in most cases.)

--lora-modules enables LoRA adapters
--lora-modules my_lora=lora_model_path specifies the path to the LoRA weights; any directory containing adapter_model.safetensors will do. You can load several adapters at once (during training you can configure how often, in epochs, a LoRA checkpoint is saved), and at test time you simply switch the name to compare the adapters and pick the best one. Requests addressed to my_lora use the fine-tuned LoRA weights; requests addressed to the --served-model-name Qwen2.5-32B-Instruct-GPTQ-Int4 use the original base weights.
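
Once the server is up, you can list both names and call either one through the OpenAI-compatible API. A minimal sketch (host and port taken from the deployment command above; adjust if yours differ):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:10000/v1")

# Both the base model name and every --lora-modules name are listed here.
for m in client.models.list().data:
    print(m.id)  # e.g. Qwen2.5-32B-Instruct-GPTQ-Int4, my_lora

# model="my_lora" routes the request to the LoRA weights;
# model="Qwen2.5-32B-Instruct-GPTQ-Int4" routes it to the base weights.
resp = client.chat.completions.create(
    model="my_lora",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)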



vLLM call, single-turn conversation (client code)

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:9000/v1" # address the vLLM server is listening on (this uses the OpenAI client; vLLM also provides its own client API if you prefer not to use openai)

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "user input"},
]

chat_response = client.chat.completions.create(
model="lora_600",
messages=messages,
max_tokens=128, # maximum number of tokens the model may output per call
top_p=0.95, # the smaller this is, the more "stable" the output, which is not always what you want; for a role-play model you usually want to raise top_p and temperature a bit so the replies are more varied and do not feel wooden
temperature=0.9, # temperature: the larger it is, the more varied the replies; if it is too large the model may ramble or repeat itself, which also depends on how well it was trained, so tune it after training (values below 1 are usually fine)
frequency_penalty=0.7, # same goal as above, just a different mechanism for balancing variety against repetition; in practice it is another knob for getting replies that match your expectations
presence_penalty=0.7, # same as above
)



assistant_content = chat_response.choices[0].message.content # the model's reply
completion_tokens = chat_response.usage.completion_tokens # tokens generated in this call
prompt_tokens = chat_response.usage.prompt_tokens # tokens used by the prompt in this call
total_tokens = chat_response.usage.total_tokens # total tokens used (with a 4k-context model, the output gets truncated once total_tokens > 4k, so watch total_tokens to avoid truncation)


vLLM call, single-user multi-turn conversation (client code). Adapt it to your own needs; the configuration is the same as the single-turn case, with a history list added to store the conversation so far.
system: the system prompt
user: the user's input
assistant: the model's reply

from openai import OpenAI
import time

openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:9000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)


def vllm_chat_run(data, index):
    user_input, messages, history = data['userInput'], data['messages'], data['history']
    current_session = [user_input, None]
    history.append(current_session)

    for val in history:
        if val[0]:
            messages.append({"role": "user", "content": val[0]})
        if val[1]:
            messages.append({"role": "assistant", "content": val[1]})
    
    # print(messages)

    if index<=20:
        chat_response = client.chat.completions.create(
            model="lora_600",
            messages=messages,
            max_tokens=128,
            top_p=0.95,
            temperature=0.9,
            frequency_penalty=0.7,
            presence_penalty=0.7,
        )
    else:
        chat_response = client.chat.completions.create(
            model="lora_600",
            messages=messages,
            max_tokens=128,
            top_p=0.95,
            temperature=0.9,
        )

    return chat_response


def run(history, system_prompt):
    """主程序 start"""

    index = 0
    while True:

        index+=1
        print(f"---- 对话轮数 {index} ----")
        userInput = input("《用户》:")

        data = {'userInput': userInput, 
            'history': history,
            'messages': [{'role': 'system', 'content': system_prompt}],
        }

        start_time = time.time()
        chat_response = vllm_chat_run(data, index)
        end_time = time.time()
        print(f"消耗时间:{end_time-start_time}")
        content = chat_response.choices[0].message.content
        print("《模型》:",end='')
        print(content)

        completion_tokens = chat_response.usage.completion_tokens
        prompt_tokens = chat_response.usage.prompt_tokens
        total_tokens = chat_response.usage.total_tokens

        print(f"本轮消耗token:{completion_tokens}")
        print(f"当前提示词消耗token:{prompt_tokens}")
        print(f"总共消耗token:{total_tokens}")

        history[-1][-1]=content
        
if __name__ == '__main__':
    with open('ai_prompt.txt', 'r', encoding='utf-8') as file:
        ai_prompt = file.read()  
    with open('system_prompt.txt', 'r', encoding='utf-8') as file:
        system_prompt = file.read()

    # history = [["你在干嘛?😡", ai_prompt]]
    history = []
    run(history, system_prompt)

Quick vLLM front end using Gradio:

import gradio as gr

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://39.105.36.188:9000/v1"
openai_api_base2 = "http://39.105.36.188:9001/v1"


client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

client2 = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base2,
)

# model (LoRA adapter) names; change as needed
models_9b = [
    'lora7','lora9','lora1', 'lora2', 
    'lora3','lora4', 'lora5', 
    'lora6', 'lora8',
]
models_34b = [
    'lora4','lora9',
    'lora1', 'lora2', 'lora3',
    'lora5', 'lora6',
    'lora7', 'lora8',
]



block_css = """.importantButton {
    background: linear-gradient(45deg, #7e0570,#5d1c99, #6e00ff) !important;
    border: none !important;
}
.importantButton:hover {
    background: linear-gradient(45deg, #ff00e0,#8500ff, #6e00ff) !important;
    border: none !important;
}"""


custom_css = """
    #clearLabel34b {
        font-family: 'Arial', sans-serif; /* font family */
        font-size: 10px; /* font size */
        color: #333; /* font color */
    }
    """

default_theme_args = dict(
    font=["Source Sans Pro", 'ui-sans-serif', 'system-ui', 'sans-serif'],
    font_mono=['IBM Plex Mono', 'ui-monospace', 'Consolas', 'monospace'],
)

init_message = "Welcome to the ChatGPT Gradio UI!"


def vllm_chat_run_9b(data):
    user_input, messages, history, model, temperature = data['userInput'], data['messages'], data['history'], data['model'], data['temperature']
    current_session = [user_input, None]
    history.append(current_session)

    for val in history:
        if val[0]:
            messages.append({"role": "user", "content": val[0]})
        if val[1]:
            messages.append({"role": "assistant", "content": val[1]})
    
    # print(messages)
    chat_response = client2.chat.completions.create(
        model=model, # lora:1-9
        messages=messages,
        max_tokens=64,
        top_p=0.7,
        temperature=temperature
        # temperature=1.2
        # stream=True,
    )
    return chat_response



def vllm_chat_run_34b(data):
    user_input, messages, history, model, temperature = data['userInput'], data['messages'], data['history'], data['model'], data['temperature']
    current_session = [user_input, None]
    history.append(current_session)

    for val in history:
        if val[0]:
            messages.append({"role": "user", "content": val[0]})
        if val[1]:
            messages.append({"role": "assistant", "content": val[1]})
    
    # print(messages)
    chat_response = client.chat.completions.create(
        model=model, # lora:1-9
        messages=messages,
        max_tokens=64,
        top_p=0.7,
        temperature=temperature
        # temperature=1.2
        # stream=True,
    )
    return chat_response



# # manage the chat history with gr.State
# chat_history9b = gr.State([])
# chat_history34b = gr.State([])


def respond_9b(query, model, temperature, system_prompt, ai_prompt_input_9b):
    global chat_history9b
    if not chat_history9b.value:
        chat_history9b.value = [[None, ai_prompt_input_9b]]

    if not system_prompt:
        system_prompt = 'Please remind the user to set a system_prompt'

    data = {'userInput': query, 
            'history': chat_history9b.value,
            'messages': [{'role': 'system', 'content': system_prompt}],
            'model': model,
            'temperature': temperature
    }

    chat_response = vllm_chat_run_9b(data=data)

    content = chat_response.choices[0].message.content
    chat_history9b.value[-1][-1]=content
    return "", chat_history9b.value


def respond_34b(query, model, temperature, system_prompt, ai_prompt_input_34b):
    global chat_history34b
    if not chat_history34b.value:
        chat_history34b.value = [[None, ai_prompt_input_34b]]

    if not system_prompt:
        system_prompt = 'Please remind the user to set a system_prompt'

    data = {'userInput': query, 
            'history': chat_history34b.value,
            'messages': [{'role': 'system', 'content': system_prompt}],
            'model': model,
            'temperature': temperature
    }

    chat_response = vllm_chat_run_34b(data=data)

    content = chat_response.choices[0].message.content
    chat_history34b.value[-1][-1]=content
    return "", chat_history34b.value


def clear9b():
    global chat_history9b
    print("entering clear")
    chat_history9b.value = []
    return "Conversation history cleared"

def clear34b():
    global chat_history34b
    print("entering clear")
    chat_history34b.value = []
    return "Conversation history cleared"

def create_state():
    return gr.State([])


def setting_change(model, temperature, system_prompt):
    return f"Settings updated:\n model: {model} \n temperature: {temperature} \n system prompt: {system_prompt if system_prompt else 'none'}"




with gr.Blocks(css=block_css, theme=gr.themes.Default(**default_theme_args)) as demo:
    gr.Markdown('ChatGPT Gradio')

    # create an independent chat-history state for each Tab
    chat_history9b = create_state()
    chat_history34b = create_state()

    with gr.Tab("Love-9B-Chat Models"):
        with gr.Row():
            with gr.Column(scale=10):
                chatbot_9b = gr.Chatbot(label="Chat history")
                query_9b = gr.Textbox(label="Your question", placeholder="Type your question and press Enter to submit")
                clear_button_9b = gr.Button("New conversation")

            with gr.Column(scale=5):
                model_9b = gr.Radio(models_9b, label="Select a 9B LoRA model", value=models_9b[0])
                temperature_9b = gr.Slider(0, 2, value=1.15, step=0.05, label="Temperature")
                system_prompt_input_9b = gr.Textbox(label="System prompt", placeholder="Enter a system prompt (optional)")
                ai_prompt_input_9b = gr.Textbox(label="Model's opening line", placeholder="Write it to match the system story (a suitable opening improves performance)")
                settings_button_9b = gr.Button("Update settings")

            with gr.Column(scale=5,elem_id="clearLabel34b"):
                clear_label_9b = gr.Textbox(label="Status")
                settings_button_9b.click(fn=setting_change, inputs=[model_9b, temperature_9b, system_prompt_input_9b], outputs=[clear_label_9b])
                clear_button_9b.click(fn=clear9b, inputs=[], outputs=[clear_label_9b])

        
        query_9b.submit(respond_9b, [query_9b, model_9b, temperature_9b, system_prompt_input_9b, ai_prompt_input_9b], [query_9b, chatbot_9b])
        
    
    with gr.Tab("Love-34B-Chat Models"):
        with gr.Row():
            with gr.Column(scale=10):
                chatbot_34b = gr.Chatbot(label="Chat history")
                query_34b = gr.Textbox(label="Your question", placeholder="Type your question and press Enter to submit")
                clear_button_34b = gr.Button("New conversation")

            with gr.Column(scale=5):
                model_34b = gr.Radio(models_34b, label="Select a 34B LoRA model", value=models_34b[0])
                temperature_34b = gr.Slider(0, 2, value=1.15, step=0.05, label="Temperature")
                system_prompt_input_34b = gr.Textbox(label="System prompt", placeholder="Enter a system prompt (optional)")
                ai_prompt_input_34b = gr.Textbox(label="Model's opening line", placeholder="Write it to match the system story (a suitable opening improves performance)")
                settings_button_34b = gr.Button("Update settings")

            with gr.Column(scale=5,elem_id="clearLabel34b"):
                clear_label_34b = gr.Textbox(label="Status")
                settings_button_34b.click(fn=setting_change, inputs=[model_34b, temperature_34b, system_prompt_input_34b], outputs=[clear_label_34b])
                clear_button_34b.click(fn=clear34b, inputs=[], outputs=[clear_label_34b])

        query_34b.submit(respond_34b, [query_34b, model_34b, temperature_34b, system_prompt_input_34b, ai_prompt_input_34b], [query_34b, chatbot_34b])
        
        
user_info = [
    ("admin", "password"),
    ("guest", "password")
]

demo.launch(
        server_name='0.0.0.0',
        server_port=9090,
        share=False,
        debug=False,
        auth=user_info,
        auth_message='Welcome to the LLM demo platform!'
)

Calling a model loaded directly with vLLM:

from vllm import LLM, SamplingParams


"""
vLLM deployment demo
"""

def run(prompt ,model_path, max_model_len, temperature,top_p, max_tokens):
    # Create a sampling params object.
    # stop_token_ids = [151329, 151336, 151338]
    sampling_params = SamplingParams(
                temperature=temperature, 
                top_p=top_p,
                max_tokens=max_tokens,

            )

    # The tokenizer argument of LLM expects a tokenizer name or path (a string);
    # it is optional and defaults to the model path.
    tokenizer = model_path

    # Create an LLM.
    llm = LLM(
            model=model_path, 
            tokenizer=tokenizer, 
            max_model_len=max_model_len,
            trust_remote_code=True
        )

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompt, sampling_params)
    # Print the outputs.
    generated_text = outputs[0].outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    # Sample prompts.
    prompt = "Hello, my name is"

    model_path = '/path/model'
    max_model_len=2048
    temperature=0.8
    top_p=0.95
    max_tokens=512

    run(prompt ,model_path, max_model_len, temperature,top_p, max_tokens)

Getting the model's parameter count:

from transformers import AutoModel

model = AutoModel.from_pretrained(r'G:\zsh\LLaMA-Factory\model\01-ai\Yi-1.5-9B-Chat')
# model = AutoModel.from_pretrained(r'G:\zsh\LLaMA-Factory\output\Yi-1.5-9B-Chat-Pro')

# number of trainable parameters
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Number of parameters: {num_params}")

output:
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 4/4 [00:22<00:00, 5.61s/it]
Number of parameters: 8567263232

Script for computing dataset token lengths:
Purpose: work out the average token length per sample in the dataset and find the samples with the longest and shortest token counts.

from transformers import AutoTokenizer
import json

# initialize the tokenizer
model_path = r'G:\zsh\LLaMA-Factory\model\01-ai\Yi-1.5-9B-Chat'
tokenizer = AutoTokenizer.from_pretrained(model_path)

# load a JSON file
def load_json_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data


# load the ShareGPT-format dataset
file_path = "./all_chat.json"
data = load_json_file(file_path)

# token length of each sample
token_lengths = []

for item in data:
    conversations = item['conversations']
    total_length = 0
    for conversation in conversations:
        total_length += len(tokenizer.encode(conversation['value'], add_special_tokens=False))
    token_lengths.append(total_length)

# compute the max, min, and average lengths
max_length = max(token_lengths)
min_length = min(token_lengths)
average_length = sum(token_lengths) / len(token_lengths)

max_index = token_lengths.index(max_length)
min_index = token_lengths.index(min_length)

# print the results
print(f"max token length: {max_length}, at sample {max_index+1}")
print(f"min token length: {min_length}, at sample {min_index+1}")
print(f"average token length: {average_length:.2f}")


output:
max token length: 2950, at sample 2921
min token length: 6, at sample 1010
average token length: 282.68

I am still adding more data-processing and training/deployment scripts; this post also serves as a record of my own day-to-day work.


Original article: https://blog.csdn.net/qq_45437316/article/details/143896351
