
Mistral AI Launches Ministral 8B, a Powerful Edge AI Model Supporting a 128,000-Token Context

Recently, the French AI startup Mistral AI announced its next generation of language models: Ministral 3B and Ministral 8B.

The two new models are part of the "Ministraux" family, designed for edge devices and edge-computing scenarios, and support context lengths of up to 128,000 tokens. This means the models are not only capable of heavy processing workloads, but can also run in settings where data privacy and local processing matter most.

Mistral says the Ministraux models are well suited to a range of applications such as on-device translation, offline smart assistants, data analytics, and autonomous robotics. To squeeze out further efficiency, the Ministraux models can also be paired with larger language models (such as Mistral Large), acting as efficient intermediaries in multi-step workflows.

On performance, benchmarks published by Mistral show that Ministral 3B and 8B beat many comparable models across several categories, including Google's Gemma 2 2B and Meta's Llama 3.1 8B. Notably, despite having fewer parameters, Ministral 3B outperforms its predecessor Mistral 7B on some tests.

Ministral 8B, for its part, performs strongly across all of the tests, especially in knowledge, commonsense, function calling, and multilingual capability.

On pricing, both new models are already available through Mistral AI's API. Ministral 8B costs $0.10 per million tokens, while Ministral 3B costs $0.04. Mistral also provides the model weights of Ministral 8B Instruct for research use. Notably, both models will soon be available through cloud partners such as Google Vertex AI and AWS.
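For reference, the hosted API follows the familiar chat-completions schema. The snippet below is a minimal sketch of calling it over HTTP; the "ministral-8b-latest" model identifier and the MISTRAL_API_KEY environment variable are assumptions for illustration, so check Mistral's API documentation for the exact names available on your account.

import os
import requests

# Assumptions for illustration: the "ministral-8b-latest" model id and a
# MISTRAL_API_KEY environment variable holding your API key.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "ministral-8b-latest",
        "messages": [{"role": "user", "content": "Summarize edge AI in one sentence."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])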

mistralai/Ministral-8B-Instruct-2410

We introduce two new state-of-the-art models for local intelligence, on-device computing, and edge use cases. We call them the Ministraux: Ministral 3B and Ministral 8B. The Ministral-8B-Instruct-2410 language model is an instruct fine-tuned model released under the Mistral Research License, and it significantly outperforms existing models of a similar size. If you are interested in using Ministral-3B or Ministral-8B commercially (both outperform Mistral-7B), please contact us. For more details about the Ministraux, please refer to our release blog post.

Ministral 8B Key Features

  • Released under the Mistral Research License; contact us for a commercial license
  • Trained with a 128k context window and interleaved sliding-window attention
  • Trained on a large proportion of multilingual and code data
  • Supports function calling
  • 131k vocabulary, using the V3-Tekken tokenizer

Basic Instruct Template (V3-Tekken)

<s>[INST]user message[/INST]assistant response</s>[INST]new user message[/INST]
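As a quick illustration (not the official tokenizer), the sketch below shows how a list of chat turns maps onto the template above. In practice, rely on mistral_common's MistralTokenizer (used in the examples further down), which handles special tokens and tokenization correctly.

# Illustrative only: build the V3-Tekken instruct string shown above from chat turns.
def format_v3_tekken(turns):
    # turns: list of (user_message, assistant_response or None for the pending turn)
    text = "<s>"
    for user_msg, assistant_msg in turns:
        text += f"[INST]{user_msg}[/INST]"
        if assistant_msg is not None:
            text += f"{assistant_msg}</s>"
    return text

print(format_v3_tekken([("Hello!", "Hi, how can I help?"), ("What is 1 + 1?", None)]))
# <s>[INST]Hello![/INST]Hi, how can I help?</s>[INST]What is 1 + 1?[/INST]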

Ministral 8B Architecture

| Feature           | Value                        |
|-------------------|------------------------------|
| Architecture      | Dense Transformer            |
| Parameters        | 8,019,808,256                |
| Layers            | 36                           |
| Heads             | 32                           |
| Dim               | 4096                         |
| KV Heads (GQA)    | 8                            |
| Hidden Dim        | 12288                        |
| Head Dim          | 128                          |
| Vocab Size        | 131,072                      |
| Context Length    | 128k                         |
| Attention Pattern | Ragged (128k, 32k, 32k, 32k) |
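To make the "Ragged (128k, 32k, 32k, 32k)" row concrete, here is a toy sketch of interleaved sliding-window attention masks. It assumes the pattern means the attention window cycles across layers (one full-context layer followed by three sliding-window layers), with the sizes scaled down so the masks are readable.

import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    # True where query position i may attend to key position j:
    # causal (j <= i) and within the last `window` positions.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

seq_len = 8
layer_windows = [8, 4, 4, 4]  # toy stand-ins for the (128k, 32k, 32k, 32k) cycle

for layer, window in enumerate(layer_windows):
    print(f"layer {layer} (window={window}):")
    print(sliding_window_causal_mask(seq_len, window).astype(int))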

Benchmarks

Base Model

Knowledge & Commonsense

| Model             | MMLU | AGIEval | Winogrande | Arc-c | TriviaQA |
|-------------------|------|---------|------------|-------|----------|
| Mistral 7B Base   | 62.5 | 42.5    | 74.2       | 67.9  | 62.5     |
| Llama 3.1 8B Base | 64.7 | 44.4    | 74.6       | 46.0  | 60.2     |
| Ministral 8B Base | 65.0 | 48.3    | 75.3       | 71.9  | 65.5     |
| Gemma 2 2B Base   | 52.4 | 33.8    | 68.7       | 42.6  | 47.8     |
| Llama 3.2 3B Base | 56.2 | 37.4    | 59.6       | 43.1  | 50.7     |
| Ministral 3B Base | 60.9 | 42.1    | 72.7       | 64.2  | 56.7     |

Code & Math

| Model             | HumanEval pass@1 | GSM8K maj@8 |
|-------------------|------------------|-------------|
| Mistral 7B Base   | 26.8             | 32.0        |
| Llama 3.1 8B Base | 37.8             | 42.2        |
| Ministral 8B Base | 34.8             | 64.5        |
| Gemma 2 2B        | 20.1             | 35.5        |
| Llama 3.2 3B      | 14.6             | 33.5        |
| Ministral 3B      | 34.2             | 50.9        |

Multilingual

| Model             | French MMLU | German MMLU | Spanish MMLU |
|-------------------|-------------|-------------|--------------|
| Mistral 7B Base   | 50.6        | 49.6        | 51.4         |
| Llama 3.1 8B Base | 50.8        | 52.8        | 54.6         |
| Ministral 8B Base | 57.5        | 57.4        | 59.6         |
| Gemma 2 2B Base   | 41.0        | 40.1        | 41.7         |
| Llama 3.2 3B Base | 42.3        | 42.2        | 43.1         |
| Ministral 3B Base | 49.1        | 48.3        | 49.5         |

Instruct Models

| Model                    | MTBench | Arena Hard | Wild bench |
|--------------------------|---------|------------|------------|
| Mistral 7B Instruct v0.3 | 6.7     | 44.3       | 33.1       |
| Llama 3.1 8B Instruct    | 7.5     | 62.4       | 37.0       |
| Gemma 2 9B Instruct      | 7.6     | 68.7       | 43.8       |
| Ministral 8B Instruct    | 8.3     | 70.9       | 41.3       |
| Gemma 2 2B Instruct      | 7.5     | 51.7       | 32.5       |
| Llama 3.2 3B Instruct    | 7.2     | 46.0       | 27.2       |
| Ministral 3B Instruct    | 8.1     | 64.3       | 36.3       |

Code & Math

| Model                    | MBPP pass@1 | HumanEval pass@1 | Math maj@1 |
|--------------------------|-------------|------------------|------------|
| Mistral 7B Instruct v0.3 | 50.2        | 38.4             | 13.2       |
| Gemma 2 9B Instruct      | 68.5        | 67.7             | 47.4       |
| Llama 3.1 8B Instruct    | 69.7        | 67.1             | 49.3       |
| Ministral 8B Instruct    | 70.0        | 76.8             | 54.5       |
| Gemma 2 2B Instruct      | 54.5        | 42.7             | 22.8       |
| Llama 3.2 3B Instruct    | 64.6        | 61.0             | 38.4       |
| Ministral 3B Instruct    | 67.7        | 77.4             | 51.7       |

Demo

vLLM

pip install --upgrade vllm
pip install --upgrade mistral_common

from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Ministral-8B-Instruct-2410"

sampling_params = SamplingParams(max_tokens=8192)

# note that running Ministral 8B on a single GPU requires 24 GB of GPU RAM
# If you want to divide the GPU requirement over multiple devices, add e.g. `tensor_parallel_size=2`
llm = LLM(model=model_name, tokenizer_mode="mistral", config_format="mistral", load_format="mistral")

prompt = "Do we need to think for 10 seconds to find the answer of 1 + 1?"

messages = [
    {
        "role": "user",
        "content": prompt
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
# You don't need to think for 10 seconds to find the answer to 1 + 1. The answer is 2,
# and you can easily add these two numbers in your mind very quickly without any delay.

Server

vllm serve mistralai/Ministral-8B-Instruct-2410 --tokenizer_mode mistral --config_format mistral --load_format mistral

Note: running Ministral-8B on a single GPU requires 24 GB of GPU memory.

If you want to split the GPU requirement across multiple devices, add e.g. `--tensor-parallel-size 2`.

Client

curl --location 'http://<your-node-url>:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
    "model": "mistralai/Ministral-8B-Instruct-2410",
    "messages": [
      {
        "role": "user",
        "content": "Do we need to think for 10 seconds to find the answer of 1 + 1?"
      }
    ]
}'
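
Since the vLLM server exposes an OpenAI-compatible API, the same request can also be made from Python with the openai client. Replace <your-node-url> with your server address; the api_key only needs to match whatever the server expects (a dummy "token" here, as in the curl example).

from openai import OpenAI

# Point the client at the vLLM server started above.
client = OpenAI(base_url="http://<your-node-url>:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="mistralai/Ministral-8B-Instruct-2410",
    messages=[
        {"role": "user", "content": "Do we need to think for 10 seconds to find the answer of 1 + 1?"}
    ],
)
print(response.choices[0].message.content)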

Mistral-inference

pip install mistral_inference --upgrade

Download

from huggingface_hub import snapshot_download
from pathlib import Path

mistral_models_path = Path.home().joinpath('mistral_models', '8B-Instruct')
mistral_models_path.mkdir(parents=True, exist_ok=True)

snapshot_download(repo_id="mistralai/Ministral-8B-Instruct-2410", allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"], local_dir=mistral_models_path)

Chat

mistral-chat $HOME/mistral_models/8B-Instruct --instruct --max_tokens 256

Passkey detection

In this example, the passkey message contains more than 100k tokens, and mistral-inference does not have a chunked-prefill mechanism. As a result, running the example below requires a lot of GPU memory (80 GB). For a more memory-efficient solution, we recommend using vLLM.

from mistral_inference.transformer import Transformer
from pathlib import Path
import json
from mistral_inference.generate import generate
from huggingface_hub import hf_hub_download

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

def load_passkey_request() -> ChatCompletionRequest:
    passkey_file = hf_hub_download(repo_id="mistralai/Ministral-8B-Instruct-2410", filename="passkey_example.json")

    with open(passkey_file, "r") as f:
        data = json.load(f)

    message_content = data["messages"][0]["content"]
    return ChatCompletionRequest(messages=[UserMessage(content=message_content)])

# mistral_models_path is the directory populated in the Download step above
mistral_models_path = Path.home().joinpath('mistral_models', '8B-Instruct')

tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path, softmax_fp32=False)

completion_request = load_passkey_request()

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)  # The pass key is 13005.
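
As noted above, vLLM is the more memory-efficient route for this long-context example. The sketch below combines the earlier vLLM snippet with the passkey file download; max_model_len and the assumption that passkey_example.json stores plain role/content messages are ours, so adjust to your setup.

import json

from huggingface_hub import hf_hub_download
from vllm import LLM
from vllm.sampling_params import SamplingParams

# Reuse the passkey example shipped with the model repository.
passkey_file = hf_hub_download(
    repo_id="mistralai/Ministral-8B-Instruct-2410", filename="passkey_example.json"
)
with open(passkey_file, "r") as f:
    messages = json.load(f)["messages"]  # assumed to be a list of role/content dicts

llm = LLM(
    model="mistralai/Ministral-8B-Instruct-2410",
    tokenizer_mode="mistral",
    config_format="mistral",
    load_format="mistral",
    max_model_len=131072,  # assumption: allow the full 128k context
)

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=64, temperature=0.0))
print(outputs[0].outputs[0].text)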

Instruct following

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest


tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path)

completion_request = ChatCompletionRequest(messages=[UserMessage(content="How often does the letter r occur in Mistral?")])

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)

Function calling

from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.tekken import SpecialTokenPolicy


tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
tekken = tokenizer.instruct_tokenizer.tokenizer
tekken.special_token_policy = SpecialTokenPolicy.IGNORE

model = Transformer.from_folder(mistral_models_path)

completion_request = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the users location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris?"),
        ],
)

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)
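
To actually act on the tool call, the decoded result can be parsed and dispatched to a local function. This sketch assumes the decoded output is a JSON list of calls with "name" and "arguments" fields; get_current_weather here is a hypothetical stand-in for a real weather service.

import json

def get_current_weather(location: str, format: str) -> str:
    # Hypothetical local implementation standing in for a real weather service.
    return f"22 degrees {format} in {location}"

# `result` comes from the generation above; with special tokens ignored it is
# assumed to decode to JSON like:
# [{"name": "get_current_weather", "arguments": {"location": "Paris, FR", "format": "celsius"}}]
for call in json.loads(result):
    if call["name"] == "get_current_weather":
        print(get_current_weather(**call["arguments"]))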


Original article: https://blog.csdn.net/weixin_41446370/article/details/143016703
