自学内容网 自学内容网

Transformers库的模板困境:apply_chat_template的版本变迁与解决方案

问题现状

在使用Transformers库中的tokenizer处理模型输入时,我们经常需要将输入文本格式化为模型可以理解的格式。这个过程在不同版本的Transformers库中有着显著的差异。

低版本(4.43及以下)的简便方式

Transformers 4.43及更低版本中,当我们需要加载类似Command-R+等聊天模型时,可以直接使用以下代码:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model_name")
chat = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"}
]
formatted_chat = tokenizer.apply_chat_template(chat, tokenize=False)

这种方式简单直接,tokenizer会自动使用预设的模板来格式化对话。

高版本的报错问题

然而,当我们升级到更高版本的Transformers库后,相同的代码会遇到如下错误:

ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating

这个错误明确指出,在新版本中tokenizer不再包含默认的聊天模板,需要我们显式指定模板或设置tokenizer.chat_template

原因分析

问题的根源在于Transformers库源码中对chat template处理逻辑的变更。让我们深入分析源码来理解这个变化:

旧版本的实现逻辑

在旧版本的transformers/tokenization_utils_base.py文件中,apply_chat_template函数在此获取模板

chat_template = self.get_chat_template(chat_template, tools)

具体的逻辑如下:

def get_chat_template(self, chat_template: Optional[str] = None, tools: Optional[List[Dict]] = None) -> str:
    if isinstance(self.chat_template, dict) or (
        self.chat_template is None and isinstance(self.default_chat_template, dict)
    ):
        if self.chat_template is not None:
            template_dict = self.chat_template
            using_default_dict = False
        else:
            template_dict = self.default_chat_template
            using_default_dict = True
            
        # 根据不同情况选择模板
        if chat_template is not None and chat_template in template_dict:
            chat_template = template_dict[chat_template]
            if using_default_dict:
                using_default_template = True
        elif chat_template is None:
            if tools is not None and "tool_use" in template_dict:
                chat_template = template_dict["tool_use"]
            elif "default" in template_dict:
                chat_template = template_dict["default"]

这段代码的关键在于:

  1. 判断条件包含了两种情况:
    • self.chat_template是字典类型
    • 或者self.chat_template为空但self.default_chat_template是字典类型
  2. self.chat_template为空时,会回退使用self.default_chat_template
  3. 在早期版本中,模型文件(如command-r+模型中的tokenization_cohere_fast.py)会预先定义default_chat_template,确保即使没有指定模板也能正常工作

新版本的变化

新版本中,代码被简化为:

if isinstance(self.chat_template, dict):
template_dict = self.chat_template
if chat_template is not None and chat_template in template_dict:
# The user can pass the name of a template to the chat template argument instead of an entire template
chat_template = template_dict[chat_template]
elif chat_template is None:
if tools is not None and "tool_use" in template_dict:
chat_template = template_dict["tool_use"]
elif "default" in template_dict:
chat_template = template_dict["default"]
else:
raise ValueError(
"This model has multiple chat templates with no default specified! Please either pass a chat "
"template or the name of the template you wish to use to the `chat_template` argument. Available "
f"template names are {sorted(template_dict.keys())}."
)

这个改动带来了几个重要的影响:

  1. 移除了对self.default_chat_template的支持
  2. 只检查self.chat_template是否为字典类型
  3. 导致早期模型文件中定义的默认模板不再生效
  4. self.chat_template为空时,直接抛出ValueError异常

这就是为什么在新版本中,即使模型文件中定义了default_chat_template,我们依然会遇到"Cannot use chat template functions because tokenizer.chat_template is not set"的错误。这个改动似乎是为了简化代码结构,但同时也破坏了向后兼容性。

解决办法

手动找到旧版模型的chat模板代码,例如找到command-r+模型的tokenization_cohere_fast.py文件,其中CohereTokenizerFast类中的default_chat_template函数代码为:

    @property
    def default_chat_template(self):
        """
        Cohere Tokenizer uses <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> to indicate each turn in a chat.
        Additioanlly, to indicate the source of the message, <|USER_TOKEN|>, <|CHATBOT_TOKEN|> and <|SYSTEM_TOKEN|>
        for user, assitant and system messages respectively.

        The output should look something like:
        <|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ preamble }}<|END_OF_TURN_TOKEN|><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>{{ How are you? }}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{{ I am doing well! }}<|END_OF_TURN_TOKEN|>

        Use add_generation_prompt to add a prompt for the model to generate a response:
        >>> from transformers import AutoTokenizer
        >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
        >>> messages = [{"role": "user", "content": "Hello, how are you?"}]
        >>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'

        """
        default_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% elif USE_DEFAULT_PROMPT == true %}"
            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = false %}"
            "{% endif %}"
            "{% if system_message != false %}"  # Start with system message
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
            "{% endif %}"
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_template = default_template.replace(
            "USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false"
        )
        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
        default_template = default_template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

        tool_use_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% endif %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ '# Safety Preamble' }}"
            "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
            "{{ '\n\n# System Preamble' }}"
            "{{ '\n## Basic Rules' }}"
            "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
            "{{ '\n\n# User Preamble' }}"
            "{{ '\n' + system_message }}"
            "{{'\n\n## Available Tools\nHere is a list of tools that you have available to you:\n\n'}}"
            "{% for tool in tools %}"
            "{% if loop.index0 != 0 %}"
            "{{ '\n\n'}}"
            "{% endif %}"
            "{{'```python\ndef ' + tool.name + '('}}"
            "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
            "{% if loop.index0 != 0 %}"
            "{{ ', '}}"
            "{% endif %}"
            "{{param_name}}: "
            "{% if not param_fields.required %}"
            "{{'Optional[' + param_fields.type + '] = None'}}"
            "{% else %}"
            "{{ param_fields.type }}"
            "{% endif %}"
            "{% endfor %}"
            '{{ \') -> List[Dict]:\n    """\'}}'
            "{{ tool.description }}"
            "{% if tool.parameter_definitions|length != 0 %}"
            "{{ '\n\n    Args:\n        '}}"
            "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
            "{% if loop.index0 != 0 %}"
            "{{ '\n        ' }}"
            "{% endif %}"
            "{{ param_name + ' ('}}"
            "{% if not param_fields.required %}"
            "{{'Optional[' + param_fields.type + ']'}}"
            "{% else %}"
            "{{ param_fields.type }}"
            "{% endif %}"
            "{{ '): ' + param_fields.description }}"
            "{% endfor %}"
            "{% endif %}"
            '{{ \'\n    """\n    pass\n```\' }}'
            "{% endfor %}"
            "{{ '<|END_OF_TURN_TOKEN|>'}}"
            "{% for message in loop_messages %}"
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{{'<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write \\'Action:\\' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user\\'s last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:\n```json\n[\n    {\n        \"tool_name\": title of the tool in the specification,\n        \"parameters\": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters\n    }\n]```<|END_OF_TURN_TOKEN|>'}}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_tool_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
        tool_use_template = tool_use_template.replace("DEFAULT_SYSTEM_MESSAGE", default_tool_message)

        rag_template = (
            "{{ bos_token }}"
            "{% if messages[0]['role'] == 'system' %}"
            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
            "{% set system_message = messages[0]['content'] %}"
            "{% else %}"
            "{% set loop_messages = messages %}"
            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
            "{% endif %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ '# Safety Preamble' }}"
            "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
            "{{ '\n\n# System Preamble' }}"
            "{{ '\n## Basic Rules' }}"
            "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
            "{{ '\n\n# User Preamble' }}"
            "{{ '\n' + system_message }}"
            "{{ '<|END_OF_TURN_TOKEN|>'}}"
            "{% for message in loop_messages %}"  # Loop over all non-system messages
            "{% set content = message['content'] %}"
            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'system' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% elif message['role'] == 'assistant' %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
            "{% endif %}"
            "{% endfor %}"
            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>'}}"
            "{{ '<results>' }}"
            "{% for document in documents %}"  # Loop over all non-system messages
            "{{ '\nDocument: ' }}"
            "{{ loop.index0 }}\n"
            "{% for key, value in document.items() %}"
            "{{ key }}: {{value}}\n"
            "{% endfor %}"
            "{% endfor %}"
            "{{ '</results>'}}"
            "{{ '<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
            "{{ 'Carefully perform the following instructions, in order, starting each with a new line.\n' }}"
            "{{ 'Firstly, Decide which of the retrieved documents are relevant to the user\\'s last input by writing \\'Relevant Documents:\\' followed by comma-separated list of document numbers. If none are relevant, you should instead write \\'None\\'.\n' }}"
            "{{ 'Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user\\'s last input by writing \\'Cited Documents:\\' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write \\'None\\'.\n' }}"
            "{% if citation_mode=='accurate' %}"
            "{{ 'Thirdly, Write \\'Answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.\n' }}"
            "{% endif %}"
            "{{ 'Finally, Write \\'Grounded answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.' }}"
            "{{ '<|END_OF_TURN_TOKEN|>' }}"
            "{% if add_generation_prompt %}"
            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
            "{% endif %}"
        )
        default_rag_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
        rag_template = rag_template.replace("DEFAULT_SYSTEM_MESSAGE", default_rag_message)

        return {"default": default_template, "tool_use": tool_use_template, "rag": rag_template}

其中使用了一些全局变量,将其复制出来做成一个函数,例如整理成如下形式:

PRETRAINED_VOCAB_FILES_MAP = {
    "tokenizer_file": {
        "Cohere/Command-nightly": "https://huggingface.co/Cohere/Command-nightly/blob/main/tokenizer.json",
    },
}

# fmt: off
DEFAULT_SYSTEM_PROMPT = "You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere."
DEFAULT_RAG_PREAMBLE = """## Task and Context
You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

## Style Guide
Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling."""
# fmt: on

def default_chat_template(self):
    """
    Cohere Tokenizer uses <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> to indicate each turn in a chat.
    Additioanlly, to indicate the source of the message, <|USER_TOKEN|>, <|CHATBOT_TOKEN|> and <|SYSTEM_TOKEN|>
    for user, assitant and system messages respectively.

    The output should look something like:
    <|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ preamble }}<|END_OF_TURN_TOKEN|><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>{{ How are you? }}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{{ I am doing well! }}<|END_OF_TURN_TOKEN|>

    Use add_generation_prompt to add a prompt for the model to generate a response:
    >>> from transformers import AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
    >>> messages = [{"role": "user", "content": "Hello, how are you?"}]
    >>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'

    """
    default_template = (
        "{{ bos_token }}"
        "{% if messages[0]['role'] == 'system' %}"
        "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
        "{% set system_message = messages[0]['content'] %}"
        "{% elif USE_DEFAULT_PROMPT == true %}"
        "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
        "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
        "{% else %}"
        "{% set loop_messages = messages %}"
        "{% set system_message = false %}"
        "{% endif %}"
        "{% if system_message != false %}"  # Start with system message
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}"
        "{% endif %}"
        "{% for message in loop_messages %}"  # Loop over all non-system messages
        "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
        "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
        "{% endif %}"
        "{% set content = message['content'] %}"
        "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
        "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% elif message['role'] == 'assistant' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% endif %}"
        "{% endfor %}"
        "{% if add_generation_prompt %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
        "{% endif %}"
    )
    default_template = default_template.replace(
        "USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false"
    )
    default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
    default_template = default_template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)

    tool_use_template = (
        "{{ bos_token }}"
        "{% if messages[0]['role'] == 'system' %}"
        "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
        "{% set system_message = messages[0]['content'] %}"
        "{% else %}"
        "{% set loop_messages = messages %}"
        "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
        "{% endif %}"
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
        "{{ '# Safety Preamble' }}"
        "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
        "{{ '\n\n# System Preamble' }}"
        "{{ '\n## Basic Rules' }}"
        "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
        "{{ '\n\n# User Preamble' }}"
        "{{ '\n' + system_message }}"
        "{{'\n\n## Available Tools\nHere is a list of tools that you have available to you:\n\n'}}"
        "{% for tool in tools %}"
        "{% if loop.index0 != 0 %}"
        "{{ '\n\n'}}"
        "{% endif %}"
        "{{'```python\ndef ' + tool.name + '('}}"
        "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
        "{% if loop.index0 != 0 %}"
        "{{ ', '}}"
        "{% endif %}"
        "{{param_name}}: "
        "{% if not param_fields.required %}"
        "{{'Optional[' + param_fields.type + '] = None'}}"
        "{% else %}"
        "{{ param_fields.type }}"
        "{% endif %}"
        "{% endfor %}"
        '{{ \') -> List[Dict]:\n    """\'}}'
        "{{ tool.description }}"
        "{% if tool.parameter_definitions|length != 0 %}"
        "{{ '\n\n    Args:\n        '}}"
        "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
        "{% if loop.index0 != 0 %}"
        "{{ '\n        ' }}"
        "{% endif %}"
        "{{ param_name + ' ('}}"
        "{% if not param_fields.required %}"
        "{{'Optional[' + param_fields.type + ']'}}"
        "{% else %}"
        "{{ param_fields.type }}"
        "{% endif %}"
        "{{ '): ' + param_fields.description }}"
        "{% endfor %}"
        "{% endif %}"
        '{{ \'\n    """\n    pass\n```\' }}'
        "{% endfor %}"
        "{{ '<|END_OF_TURN_TOKEN|>'}}"
        "{% for message in loop_messages %}"
        "{% set content = message['content'] %}"
        "{% if message['role'] == 'user' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% elif message['role'] == 'system' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% elif message['role'] == 'assistant' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% endif %}"
        "{% endfor %}"
        "{{'<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write \\'Action:\\' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user\\'s last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:\n```json\n[\n    {\n        \"tool_name\": title of the tool in the specification,\n        \"parameters\": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters\n    }\n]```<|END_OF_TURN_TOKEN|>'}}"
        "{% if add_generation_prompt %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
        "{% endif %}"
    )
    default_tool_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
    tool_use_template = tool_use_template.replace("DEFAULT_SYSTEM_MESSAGE", default_tool_message)

    rag_template = (
        "{{ bos_token }}"
        "{% if messages[0]['role'] == 'system' %}"
        "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
        "{% set system_message = messages[0]['content'] %}"
        "{% else %}"
        "{% set loop_messages = messages %}"
        "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
        "{% endif %}"
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
        "{{ '# Safety Preamble' }}"
        "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
        "{{ '\n\n# System Preamble' }}"
        "{{ '\n## Basic Rules' }}"
        "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
        "{{ '\n\n# User Preamble' }}"
        "{{ '\n' + system_message }}"
        "{{ '<|END_OF_TURN_TOKEN|>'}}"
        "{% for message in loop_messages %}"  # Loop over all non-system messages
        "{% set content = message['content'] %}"
        "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
        "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% elif message['role'] == 'system' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% elif message['role'] == 'assistant' %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
        "{% endif %}"
        "{% endfor %}"
        "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>'}}"
        "{{ '<results>' }}"
        "{% for document in documents %}"  # Loop over all non-system messages
        "{{ '\nDocument: ' }}"
        "{{ loop.index0 }}\n"
        "{% for key, value in document.items() %}"
        "{{ key }}: {{value}}\n"
        "{% endfor %}"
        "{% endfor %}"
        "{{ '</results>'}}"
        "{{ '<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
        "{{ 'Carefully perform the following instructions, in order, starting each with a new line.\n' }}"
        "{{ 'Firstly, Decide which of the retrieved documents are relevant to the user\\'s last input by writing \\'Relevant Documents:\\' followed by comma-separated list of document numbers. If none are relevant, you should instead write \\'None\\'.\n' }}"
        "{{ 'Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user\\'s last input by writing \\'Cited Documents:\\' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write \\'None\\'.\n' }}"
        "{% if citation_mode=='accurate' %}"
        "{{ 'Thirdly, Write \\'Answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.\n' }}"
        "{% endif %}"
        "{{ 'Finally, Write \\'Grounded answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.' }}"
        "{{ '<|END_OF_TURN_TOKEN|>' }}"
        "{% if add_generation_prompt %}"
        "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
        "{% endif %}"
    )
    default_rag_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
    rag_template = rag_template.replace("DEFAULT_SYSTEM_MESSAGE", default_rag_message)

    return {"default": default_template, "tool_use": tool_use_template, "rag": rag_template}

这样在调用apply_chat_template函数时传入chat_template即可正确生成模板

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("c4ai-command-r-plus")

chat = [
    {"role": "system", "content": "你是一个人工智能助手"},
    {"role": "user", "content": "出一个谜语"},
]
tokenizer.apply_chat_template(chat, tokenize=False, chat_template=default_chat_template(tokenizer)['default'])

'''
生成结果
'<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>你是一个人工智能助手<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>出一个谜语<|END_OF_TURN_TOKEN|>'
'''

原文地址:https://blog.csdn.net/qq_41496421/article/details/145143863

免责声明:本站文章内容转载自网络资源,如本站内容侵犯了原著者的合法权益,可联系本站删除。更多内容请关注自学内容网(zxcms.com)!