The Transformers Template Dilemma: How apply_chat_template Changed Across Versions, and How to Fix It

Published 2025-01-16 03:09

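The Problem

When using a tokenizer from the Transformers library to prepare model input, we often need to format a conversation into the layout the model expects. How this works differs significantly between versions of the library.

The easy way in older versions (4.43 and below)

In Transformers 4.43 and earlier, loading a chat model such as Command-R+ and formatting a conversation takes only the following code: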

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model_name")
chat = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"}
]
formatted_chat = tokenizer.apply_chat_template(chat, tokenize=False)

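This approach is simple and direct: the tokenizer automatically formats the conversation with its preset template.

The error in newer versions

After upgrading to a newer version of Transformers, however, the same code fails with the following error: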

ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating

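The error states it plainly: in newer versions the tokenizer no longer carries a default chat template, so we must either pass a template explicitly or set tokenizer.chat_template.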
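One way to tell in advance whether a tokenizer will hit this error is to inspect its chat_template attribute before calling apply_chat_template. The helper below is a minimal sketch: needs_explicit_template is a hypothetical name, and it relies only on the chat_template attribute named in the error message. Stand-in objects are used in place of real tokenizers so the example runs without downloading a model.

```python
from types import SimpleNamespace


def needs_explicit_template(tokenizer) -> bool:
    """Return True when the tokenizer carries no chat template of its own.

    Hypothetical helper: it only looks at the `chat_template` attribute
    named in the error message, so it works on any tokenizer-like object.
    """
    return getattr(tokenizer, "chat_template", None) is None


# Stand-ins for real tokenizers (assumption: only the attribute matters here).
without_template = SimpleNamespace(chat_template=None)
with_template = SimpleNamespace(chat_template="{{ messages[0]['content'] }}")

print(needs_explicit_template(without_template))  # True
print(needs_explicit_template(with_template))     # False
```

When the check returns True, a template has to be supplied through the chat_template argument or assigned to the tokenizer.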

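Root Cause

The root of the problem is a change in how the Transformers source code handles chat templates. Let's look at the source to understand what changed.

How the old version works

In the old version of transformers/tokenization_utils_base.py, apply_chat_template obtains its template with this call: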

chat_template = self.get_chat_template(chat_template, tools)

The relevant logic is as follows:

def get_chat_template(self, chat_template: Optional[str] = None, tools: Optional[List[Dict]] = None) -> str:
    if isinstance(self.chat_template, dict) or (
        self.chat_template is None and isinstance(self.default_chat_template, dict)
    ):
        if self.chat_template is not None:
            template_dict = self.chat_template
            using_default_dict = False
        else:
            template_dict = self.default_chat_template
            using_default_dict = True
            
        # Pick the template depending on the situation
        if chat_template is not None and chat_template in template_dict:
            chat_template = template_dict[chat_template]
            if using_default_dict:
                using_default_template = True
        elif chat_template is None:
            if tools is not None and "tool_use" in template_dict:
                chat_template = template_dict["tool_use"]
            elif "default" in template_dict:
                chat_template = template_dict["default"]

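The key points of this code are:

- The condition covers two cases:
  - self.chat_template is a dict
  - or self.chat_template is None but self.default_chat_template is a dict
- When self.chat_template is None, the lookup falls back to self.default_chat_template.
- In earlier versions, the model's tokenizer file (for example tokenization_cohere_fast.py for command-r+) predefines default_chat_template, so formatting works even when no template is specified.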
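The fallback described above can be modeled as a small standalone function. This is a simplified sketch, not the library's code: resolve_template_old and its parameters are hypothetical names, and the handling of plain-string templates is an assumption, since the excerpt above only shows the dict branch.

```python
from typing import Dict, List, Optional, Union

Template = Union[str, Dict[str, str], None]


def resolve_template_old(
    chat_template_attr: Template,
    default_attr: Template,
    requested: Optional[str] = None,
    tools: Optional[List[dict]] = None,
) -> Optional[str]:
    """Simplified model of the old lookup, including the class-level fallback."""
    # Prefer the tokenizer's own template; otherwise fall back to the default.
    source = chat_template_attr if chat_template_attr is not None else default_attr
    if isinstance(source, dict):
        if requested is not None and requested in source:
            return source[requested]       # a template selected by name
        if requested is None:
            if tools is not None and "tool_use" in source:
                return source["tool_use"]
            return source.get("default")
        return requested                   # a full template string was passed
    # Non-dict case (assumption): an explicit argument wins over the attribute.
    return requested if requested is not None else source


defaults = {"default": "DEFAULT", "tool_use": "TOOLS"}
print(resolve_template_old(None, defaults))              # DEFAULT
print(resolve_template_old(None, defaults, tools=[{}]))  # TOOLS
print(resolve_template_old("OWN", defaults))             # OWN
```

The first call is the case that matters here: with no template on the tokenizer, the class-level default still supplies one.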

What changed in the new version

In the new version, the code is simplified to:

if isinstance(self.chat_template, dict):
    template_dict = self.chat_template
    if chat_template is not None and chat_template in template_dict:
        # The user can pass the name of a template to the chat template argument instead of an entire template
        chat_template = template_dict[chat_template]
    elif chat_template is None:
        if tools is not None and "tool_use" in template_dict:
            chat_template = template_dict["tool_use"]
        elif "default" in template_dict:
            chat_template = template_dict["default"]
        else:
            raise ValueError(
                "This model has multiple chat templates with no default specified! Please either pass a chat "
                "template or the name of the template you wish to use to the `chat_template` argument. Available "
                f"template names are {sorted(template_dict.keys())}."
            )

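This change has several important consequences:

- Support for self.default_chat_template is removed.
- Only self.chat_template is checked for being a dict.
- Default templates defined in earlier model files no longer take effect.
- When self.chat_template is None and no template is passed, a ValueError is raised immediately.

This is why, in the new version, we still get the "Cannot use chat template functions because tokenizer.chat_template is not set" error even when the model file defines default_chat_template. The change appears to have been made to simplify the code structure, but it also breaks backward compatibility.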
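The new behavior can be modeled the same way. This is a simplified sketch with hypothetical names (resolve_template_new and its parameters); the dict branch follows the excerpt above, while the non-dict branch is an assumption inferred from the wording of the error message rather than from quoted source.

```python
from typing import Dict, List, Optional, Union


def resolve_template_new(
    chat_template_attr: Union[str, Dict[str, str], None],
    requested: Optional[str] = None,
    tools: Optional[List[dict]] = None,
) -> str:
    """Simplified model of the new lookup: no class-level default exists."""
    if isinstance(chat_template_attr, dict):
        if requested is not None and requested in chat_template_attr:
            return chat_template_attr[requested]
        if requested is None:
            if tools is not None and "tool_use" in chat_template_attr:
                return chat_template_attr["tool_use"]
            if "default" in chat_template_attr:
                return chat_template_attr["default"]
            raise ValueError("multiple chat templates with no default specified")
        return requested
    # Assumed non-dict branch: explicit argument first, then the attribute.
    if requested is not None:
        return requested
    if chat_template_attr is not None:
        return chat_template_attr
    raise ValueError(
        "tokenizer.chat_template is not set and no template argument was passed"
    )


try:
    resolve_template_new(None)
except ValueError as error:
    message = str(error)
    print(message)

# Passing a template explicitly avoids the error.
print(resolve_template_new(None, requested="{{ custom }}"))
```

With nothing on the tokenizer and nothing passed in, there is no remaining source for a template, which is exactly the situation the error reports.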

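Solution

Manually locate the chat template code from the old version of the model's tokenizer. For command-r+, for example, open tokenization_cohere_fast.py; the default_chat_template property of the CohereTokenizerFast class reads: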

    @property
    def default_chat_template(self):
        """
        Cohere Tokenizer uses <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> to indicate each turn in a chat.
        Additionally, to indicate the source of the message, <|USER_TOKEN|>, <|CHATBOT_TOKEN|> and <|SYSTEM_TOKEN|>
        for user, assistant and system messages respectively.

        The output should look something like:
        <|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ preamble }}<|END_OF_TURN_TOKEN|><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>{{ How are you? }}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{{ I am doing well! }}<|END_OF_TURN_TOKEN|>

        Use add_generation_prompt to add a prompt for the model to generate a response:
        >>> from transformers import AutoTokenizer
        >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
        >>> messages = [{"role": "user", "content": "Hello, how are you?"}]
        >>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'

        """
        default_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% elif USE_DEFAULT_PROMPT == true %}"
            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% if system_message != false %}"  # Start with system message
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
            "{% endif %}"
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_template = default_template.replace(
            "USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false"
        )
        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
        default_template = default_template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

        tool_use_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% endif %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ '# Safety Preamble' }}"
            "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
            "{{ '\n\n# System Preamble' }}"
            "{{ '\n## Basic Rules' }}"
            "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
            "{{ '\n\n# User Preamble' }}"
            "{{ '\n' + system_message }}"
            "{{'\n\n## Available Tools\nHere is a list of tools that you have available to you:\n\n'}}"
            "{% for tool in tools %}"
            "{% if loop.index0 != 0 %}"
            "{{ '\n\n'}}"
            "{% endif %}"
            "{{'```python\ndef ' + tool.name + '('}}"
            "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
            "{% if loop.index0 != 0 %}"
            "{{ ', '}}"
            "{% endif %}"
            "{{param_name}}: "
            "{% if not param_fields.required %}"
            "{{'Optional[' + param_fields.type + '] = None'}}"
            "{% else %}"
            "{{ param_fields.type }}"
            "{% endif %}"
            "{% endfor %}"
            '{{ \') -> List[Dict]:\n    """\'}}'
            "{{ tool.description }}"
            "{% if tool.parameter_definitions|length != 0 %}"
            "{{ '\n\n    Args:\n        '}}"
            "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
            "{% if loop.index0 != 0 %}"
            "{{ '\n        ' }}"
            "{% endif %}"
            "{{ param_name + ' ('}}"
            "{% if not param_fields.required %}"
            "{{'Optional[' + param_fields.type + ']'}}"
            "{% else %}"
            "{{ param_fields.type }}"
            "{% endif %}"
            "{{ '): ' + param_fields.description }}"
            "{% endfor %}"
            "{% endif %}"
            '{{ \'\n    """\n    pass\n```\' }}'
            "{% endfor %}"
            "{{ '<|END_OF_TURN_TOKEN|>'}}"
            "{% for message in loop_messages %}"
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{{'<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write \\'Action:\\' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user\\'s last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:\n```json\n[\n    {\n        \"tool_name\": title of the tool in the specification,\n        \"parameters\": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters\n    }\n]```<|END_OF_TURN_TOKEN|>'}}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_tool_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
        tool_use_template = tool_use_template.replace("DEFAULT_SYSTEM_MESSAGE", default_tool_message)

        rag_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% endif %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ '# Safety Preamble' }}"
            "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
            "{{ '\n\n# System Preamble' }}"
            "{{ '\n## Basic Rules' }}"
            "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
            "{{ '\n\n# User Preamble' }}"
            "{{ '\n' + system_message }}"
            "{{ '<|END_OF_TURN_TOKEN|>'}}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>'}}"
            "{{ '<results>' }}"
            "{% for document in documents %}"  # Loop over all non-system messages
            "{{ '\nDocument: ' }}"
            "{{ loop.index0 }}\n"
            "{% for key, value in document.items() %}"
            "{{ key }}: {{value}}\n"
            "{% endfor %}"
            "{% endfor %}"
            "{{ '</results>'}}"
            "{{ '<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ 'Carefully perform the following instructions, in order, starting each with a new line.\n' }}"
            "{{ 'Firstly, Decide which of the retrieved documents are relevant to the user\\'s last input by writing \\'Relevant Documents:\\' followed by comma-separated list of document numbers. If none are relevant, you should instead write \\'None\\'.\n' }}"
            "{{ 'Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user\\'s last input by writing \\'Cited Documents:\\' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write \\'None\\'.\n' }}"
            "{% if citation_mode=='accurate' %}"
            "{{ 'Thirdly, Write \\'Answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.\n' }}"
            "{% endif %}"
            "{{ 'Finally, Write \\'Grounded answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.' }}"
            "{{ '<|END_OF_TURN_TOKEN|>' }}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_rag_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
        rag_template = rag_template.replace("DEFAULT_SYSTEM_MESSAGE", default_rag_message)

        return {"default": default_template, "tool_use": tool_use_template, "rag": rag_template}

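This code relies on a few module-level globals. Copy them out together with the property and turn it into a standalone function, for example: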

PRETRAINED_VOCAB_FILES_MAP = {
    "tokenizer_file": {
        "Cohere/Command-nightly": "https://huggingface.co/Cohere/Command-nightly/blob/main/tokenizer.json",
    },
}

# fmt: off
DEFAULT_SYSTEM_PROMPT = "You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere."
DEFAULT_RAG_PREAMBLE = """## Task and Context
You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

## Style Guide
Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling."""
# fmt: on

def default_chat_template(self):
    """
    Cohere Tokenizer uses <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> to indicate each turn in a chat.
    Additionally, to indicate the source of the message, <|USER_TOKEN|>, <|CHATBOT_TOKEN|> and <|SYSTEM_TOKEN|>
    for user, assistant and system messages respectively.

    The output should look something like:
    <|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ preamble }}<|END_OF_TURN_TOKEN|><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>{{ How are you? }}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{{ I am doing well! }}<|END_OF_TURN_TOKEN|>

    Use add_generation_prompt to add a prompt for the model to generate a response:
    >>> from transformers import AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
    >>> messages = [{"role": "user", "content": "Hello, how are you?"}]
    >>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'

    """
    default_template = (
        "{{ bos_token }}"
        "{% if messages[0]['role'] == 'system' %}"
        "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
        "{% set system_message = messages[0]['content'] %}"
        "{% elif USE_DEFAULT_PROMPT == true %}"
        "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
        "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
        "{% else %}"
        "{% set loop_messages = messages %}"
        "{% set system_message = false %}"
        "{% endif %}"
        "{% if system_message != false %}"  # Start with system message
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}"
        "{% endif %}"
        "{% for message in loop_messages %}"  # Loop over all non-system messages
        "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
        "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
        "{% endif %}"
        "{% set content = message['content'] %}"
        "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
        "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% elif message['role'] == 'assistant' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% endif %}"
        "{% endfor %}"
        "{% if add_generation_prompt %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
        "{% endif %}"
    )
    default_template = default_template.replace(
        "USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false"
    )
    default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
    default_template = default_template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

    tool_use_template = (
        "{{ bos_token }}"
        "{% if messages[0]['role'] == 'system' %}"
        "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
        "{% set system_message = messages[0]['content'] %}"
        "{% else %}"
        "{% set loop_messages = messages %}"
        "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
        "{% endif %}"
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
        "{{ '# Safety Preamble' }}"
        "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
        "{{ '\n\n# System Preamble' }}"
        "{{ '\n## Basic Rules' }}"
        "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
        "{{ '\n\n# User Preamble' }}"
        "{{ '\n' + system_message }}"
        "{{'\n\n## Available Tools\nHere is a list of tools that you have available to you:\n\n'}}"
        "{% for tool in tools %}"
        "{% if loop.index0 != 0 %}"
        "{{ '\n\n'}}"
        "{% endif %}"
        "{{'```python\ndef ' + tool.name + '('}}"
        "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
        "{% if loop.index0 != 0 %}"
        "{{ ', '}}"
        "{% endif %}"
        "{{param_name}}: "
        "{% if not param_fields.required %}"
        "{{'Optional[' + param_fields.type + '] = None'}}"
        "{% else %}"
        "{{ param_fields.type }}"
        "{% endif %}"
        "{% endfor %}"
        '{{ \') -> List[Dict]:\n    """\'}}'
        "{{ tool.description }}"
        "{% if tool.parameter_definitions|length != 0 %}"
        "{{ '\n\n    Args:\n        '}}"
        "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
        "{% if loop.index0 != 0 %}"
        "{{ '\n        ' }}"
        "{% endif %}"
        "{{ param_name + ' ('}}"
        "{% if not param_fields.required %}"
        "{{'Optional[' + param_fields.type + ']'}}"
        "{% else %}"
        "{{ param_fields.type }}"
        "{% endif %}"
        "{{ '): ' + param_fields.description }}"
        "{% endfor %}"
        "{% endif %}"
        '{{ \'\n    """\n    pass\n```\' }}'
        "{% endfor %}"
        "{{ '<|END_OF_TURN_TOKEN|>'}}"
        "{% for message in loop_messages %}"
        "{% set content = message['content'] %}"
        "{% if message['role'] == 'user' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% elif message['role'] == 'system' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% elif message['role'] == 'assistant' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% endif %}"
        "{% endfor %}"
        "{{'<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write \\'Action:\\' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user\\'s last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:\n```json\n[\n    {\n        \"tool_name\": title of the tool in the specification,\n        \"parameters\": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters\n    }\n]```<|END_OF_TURN_TOKEN|>'}}"
        "{% if add_generation_prompt %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
        "{% endif %}"
    )
    default_tool_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
    tool_use_template = tool_use_template.replace("DEFAULT_SYSTEM_MESSAGE", default_tool_message)

    rag_template = (
        "{{ bos_token }}"
        "{% if messages[0]['role'] == 'system' %}"
        "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
        "{% set system_message = messages[0]['content'] %}"
        "{% else %}"
        "{% set loop_messages = messages %}"
        "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
        "{% endif %}"
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
        "{{ '# Safety Preamble' }}"
        "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
        "{{ '\n\n# System Preamble' }}"
        "{{ '\n## Basic Rules' }}"
        "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
        "{{ '\n\n# User Preamble' }}"
        "{{ '\n' + system_message }}"
        "{{ '<|END_OF_TURN_TOKEN|>'}}"
        "{% for message in loop_messages %}"  # Loop over all non-system messages
        "{% set content = message['content'] %}"
        "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
        "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% elif message['role'] == 'system' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% elif message['role'] == 'assistant' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% endif %}"
        "{% endfor %}"
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>'}}"
        "{{ '<results>' }}"
        "{% for document in documents %}"  # Loop over all non-system messages
        "{{ '\nDocument: ' }}"
        "{{ loop.index0 }}\n"
        "{% for key, value in document.items() %}"
        "{{ key }}: {{value}}\n"
        "{% endfor %}"
        "{% endfor %}"
        "{{ '</results>'}}"
        "{{ '<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
        "{{ 'Carefully perform the following instructions, in order, starting each with a new line.\n' }}"
        "{{ 'Firstly, Decide which of the retrieved documents are relevant to the user\\'s last input by writing \\'Relevant Documents:\\' followed by comma-separated list of document numbers. If none are relevant, you should instead write \\'None\\'.\n' }}"
        "{{ 'Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user\\'s last input by writing \\'Cited Documents:\\' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write \\'None\\'.\n' }}"
        "{% if citation_mode=='accurate' %}"
        "{{ 'Thirdly, Write \\'Answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.\n' }}"
        "{% endif %}"
        "{{ 'Finally, Write \\'Grounded answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.' }}"
        "{{ '<|END_OF_TURN_TOKEN|>' }}"
        "{% if add_generation_prompt %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
        "{% endif %}"
    )
    default_rag_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
    rag_template = rag_template.replace("DEFAULT_SYSTEM_MESSAGE", default_rag_message)

    return {"default": default_template, "tool_use": tool_use_template, "rag": rag_template}
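The two chained .replace calls applied to the default prompts in the function above deserve a note: the prompt text is pasted into a single-quoted string literal inside the Jinja template, so real newlines and single quotes have to be escaped first. The following is a minimal sketch of that step in isolation; embed_in_jinja_literal and the sample preamble are hypothetical, chosen only for illustration.

```python
def embed_in_jinja_literal(template: str, placeholder: str, text: str) -> str:
    """Substitute `text` for `placeholder` inside a single-quoted Jinja literal."""
    # Turn real newlines into the two characters backslash + n and escape
    # single quotes, so the text cannot end the surrounding '...' literal early.
    escaped = text.replace("\n", "\\n").replace("'", "\\'")
    return template.replace(placeholder, escaped)


template = "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
preamble = "## Style Guide\nAnswer the user's question."

result = embed_in_jinja_literal(template, "DEFAULT_SYSTEM_MESSAGE", preamble)
print(result)
```

Without the escaping, the apostrophe in "user's" would close the literal and the template would no longer parse.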

With this in place, passing the result as chat_template when calling apply_chat_template produces the correctly formatted prompt:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("c4ai-command-r-plus")

chat = [
    {"role": "system", "content": "You are an AI assistant"},
    {"role": "user", "content": "Give me a riddle"},
]
tokenizer.apply_chat_template(chat, tokenize=False, chat_template=default_chat_template(tokenizer)['default'])

'''
Output:
'<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are an AI assistant<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Give me a riddle<|END_OF_TURN_TOKEN|>'
'''
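The error message also mentions a second route: setting tokenizer.chat_template once, so that later calls need no chat_template argument. The sketch below illustrates the idea under stated assumptions: install_chat_templates is a hypothetical helper, a stand-in object replaces the real tokenizer so the example runs without a model download, and the placeholder strings stand for the dict returned by default_chat_template(tokenizer).

```python
from types import SimpleNamespace


def install_chat_templates(tokenizer, templates: dict) -> bool:
    """Attach `templates` to the tokenizer unless it already has a chat template.

    Hypothetical helper. Returns True when the templates were installed.
    """
    if getattr(tokenizer, "chat_template", None) is not None:
        return False  # keep whatever the model repository shipped
    tokenizer.chat_template = templates
    return True


# Stand-in for a real tokenizer; placeholder strings stand for real templates.
tokenizer = SimpleNamespace(chat_template=None)
templates = {"default": "<default template>", "tool_use": "<tool template>"}

installed = install_chat_templates(tokenizer, templates)
print(installed, sorted(tokenizer.chat_template))
```

Because the new lookup shown earlier accepts a dict in self.chat_template, assigning the whole dict should keep the "default" and "tool_use" selection working; this has not been verified against every Transformers release.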

Original article: https://blog.csdn.net/qq_41496421/article/details/145143863
