
[Transformers Basics, Part 2] Basic Components: Pipeline


This article is a set of study notes for the video series at https://space.bilibili.com/21060026/channel/collectiondetail?sid=1357748.

Project repository: https://github.com/zyds/transformers-code


1. What is a Pipeline

  • A Pipeline chains three stages into one flow: data preprocessing, the model call, and result post-processing, as in the flow below.
  • It lets us feed in raw text and get the final answer directly, without having to care about the intermediate details.
Raw text → Tokenizer → Input IDs → Model → Logits → PostProcessing → Predictions

Example: "我觉得不太行" → [101, 2769, 6230, 2533, 679, 1922, 6121, 8013, 102] → [0.9736, 0.0264] → Positive: 0.9736

2. Checking the Task Types Supported by Pipeline

from transformers.pipelines import SUPPORTED_TASKS

# Iterate over every supported task and print its name and implementation details
for k, v in SUPPORTED_TASKS.items():
    print(k, v)

This prints every task type that Pipeline currently supports, together with the implementation that backs it.
Example output:

audio-classification {'impl': <class 'transformers.pipelines.audio_classification.AudioClassificationPipeline'>, 'tf': (), 'pt': (<class 'transformers.models.auto.modeling_auto.AutoModelForAudioClassification'>,), 'default': {'model': {'pt': ('superb/wav2vec2-base-superb-ks', '372e048')}}, 'type': 'audio'}
  • key: the task name, e.g. audio classification
  • value: the task's implementation details: the concrete Pipeline class, whether TF and/or PyTorch models are available, and which default checkpoint is used (see the sketch below for pulling these fields out)
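As a quick way to see which default checkpoint each task falls back to, the dictionary can be walked directly. This is only a sketch based on the structure shown in the example output above; not every task defines a "default" entry, hence the defensive .get calls.

from transformers.pipelines import SUPPORTED_TASKS

# Print each task, its modality, and its default PyTorch checkpoint (if one is defined)
for task, info in SUPPORTED_TASKS.items():
    default_model = info.get("default", {}).get("model", {}).get("pt")
    print(f"{task}: type={info.get('type')}, default_pt_model={default_model}")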

3. Creating and Using a Pipeline

3.1 Create a Pipeline directly from the task type (the default model is an English one)

from transformers import pipeline
pipe = pipeline("text-classification")  # create a pipeline directly from the task name
pipe("very good")  # test a sentence and print the result

3.2 Specify the task type and a model, creating a Pipeline based on that model

Note: here the model has already been downloaded locally.

# https://huggingface.co/models
pipe = pipeline("text-classification", model="./models/roberta-base-finetuned-dianping-chinese")

3.3 Load the model in advance, then create the Pipeline

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# With this approach, both the model and the tokenizer must be passed in explicitly
model = AutoModelForSequenceClassification.from_pretrained("./models_roberta-base-finetuned-dianping-chinese")
tokenizer = AutoTokenizer.from_pretrained("./models_roberta-base-finetuned-dianping-chinese")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

3.4 Running inference on the GPU

pipe = pipeline("text-classification", model="./models_roberta-base-finetuned-dianping-chinese", device=0)

3.5 Checking the device

pipe.model.device

3.6 Measuring the latency

import torch
import time

times = []
for i in range(100):
    torch.cuda.synchronize()  # make sure pending GPU work is done before starting the timer
    start = time.time()
    pipe("我觉得不太行!")
    torch.cuda.synchronize()  # wait for the asynchronous GPU kernels to finish
    end = time.time()
    times.append(end - start)
print(sum(times) / 100)  # average latency over 100 runs, in seconds
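One small refinement: a warm-up call before the timing loop keeps one-off costs (CUDA context creation, the first kernel launches) out of the average. A sketch:

# Warm-up run so that initialization overhead is not counted in the measurement
pipe("我觉得不太行!")
torch.cuda.synchronize()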

3.7 Determining a Pipeline's parameters

# First create a question-answering pipeline
qa_pipe = pipeline("question-answering", model="../../models/models")
qa_pipe

Output:
(image omitted: the printed object is a QuestionAnsweringPipeline)

Looking at the class definition tells us how this pipeline is meant to be used:

class QuestionAnsweringPipeline(ChunkPipeline):
    """
    Question Answering pipeline using any `ModelForQuestionAnswering`. See the [question answering
    examples](../task_summary#question-answering) for more information.

    Example:

    ```python
    >>> from transformers import pipeline

    >>> oracle = pipeline(model="deepset/roberta-base-squad2")
    >>> oracle(question="Where do I live?", context="My name is Wolfgang and I live in Berlin")
    {'score': 0.9191, 'start': 34, 'end': 40, 'answer': 'Berlin'}
    ```

    Learn more about the basics of using a pipeline in the [pipeline tutorial](../pipeline_tutorial)

    This question answering pipeline can currently be loaded from [`pipeline`] using the following task identifier:
    `"question-answering"`.

    The models that this pipeline can use are models that have been fine-tuned on a question answering task. See the
    up-to-date list of available models on
    [huggingface.co/models](https://huggingface.co/models?filter=question-answering).
    """

Stepping into the pipeline's __call__ method lists the additional parameters it supports:

    def __call__(self, *args, **kwargs):
        """
        Answer the question(s) given as inputs by using the context(s).

        Args:
            args ([`SquadExample`] or a list of [`SquadExample`]):
                One or several [`SquadExample`] containing the question and context.
            X ([`SquadExample`] or a list of [`SquadExample`], *optional*):
                One or several [`SquadExample`] containing the question and context (will be treated the same way as if
                passed as the first positional argument).
            data ([`SquadExample`] or a list of [`SquadExample`], *optional*):
                One or several [`SquadExample`] containing the question and context (will be treated the same way as if
                passed as the first positional argument).
            question (`str` or `List[str]`):
                One or several question(s) (must be used in conjunction with the `context` argument).
            context (`str` or `List[str]`):
                One or several context(s) associated with the question(s) (must be used in conjunction with the
                `question` argument).
            topk (`int`, *optional*, defaults to 1):
                The number of answers to return (will be chosen by order of likelihood). Note that we return less than
                topk answers if there are not enough options available within the context.
            doc_stride (`int`, *optional*, defaults to 128):
                If the context is too long to fit with the question for the model, it will be split in several chunks
                with some overlap. This argument controls the size of that overlap.
            max_answer_len (`int`, *optional*, defaults to 15):
                The maximum length of predicted answers (e.g., only answers with a shorter length are considered).
            max_seq_len (`int`, *optional*, defaults to 384):
                The maximum length of the total sentence (context + question) in tokens of each chunk passed to the
                model. The context will be split in several chunks (using `doc_stride` as overlap) if needed.
            max_question_len (`int`, *optional*, defaults to 64):
                The maximum length of the question after tokenization. It will be truncated if needed.
            handle_impossible_answer (`bool`, *optional*, defaults to `False`):
                Whether or not we accept impossible as an answer.
            align_to_words (`bool`, *optional*, defaults to `True`):
                Attempts to align the answer to real words. Improves quality on space separated langages. Might hurt on
                non-space-separated languages (like Japanese or Chinese)

        Return:
            A `dict` or a list of `dict`: Each result comes as a dictionary with the following keys:

            - **score** (`float`) -- The probability associated to the answer.
            - **start** (`int`) -- The character start index of the answer (in the tokenized version of the input).
            - **end** (`int`) -- The character end index of the answer (in the tokenized version of the input).
            - **answer** (`str`) -- The answer to the question.
        """

For example:

We ask the question "中国的首都是哪里?" (What is the capital of China?) and give the context "中国的首都是北京" (The capital of China is Beijing):

qa_pipe(question="中国的首都是哪里?", context="中国的首都是北京")

(image omitted: the pipeline returns the answer span "北京" together with its score and character offsets)

If max_answer_len is used to cap the length of the output, the answer is hard-truncated:

qa_pipe(question="中国的首都是哪里?", context="中国的首都是北京", max_answer_len=1)

(image omitted: the returned answer is truncated to a single character)
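The other parameters listed in the __call__ docstring can be combined in the same way; for instance, topk asks for several candidate answers ranked by score. A sketch using the same pipeline (the docstring above spells the argument topk; newer Transformers releases also accept top_k):

# Return the two most likely answer spans instead of only the best one
qa_pipe(question="中国的首都是哪里?", context="中国的首都是北京", topk=2)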

4. What Happens Behind a Pipeline

  • Step 1: initialize the components: the tokenizer and the model
# Step 1: initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("../../models/models_roberta-base-finetuned-dianping-chinese")
model = AutoModelForSequenceClassification.from_pretrained("../../models/models_roberta-base-finetuned-dianping-chinese")
  • Step 2: preprocessing
# Preprocess: returns a dict of PyTorch tensors
input_text = "我觉得不太行!"
inputs = tokenizer(input_text, return_tensors="pt")
inputs

(image omitted: a dict containing input_ids, token_type_ids and attention_mask tensors)

  • Step 3: model prediction
res = model(**inputs)
res

The prediction output contains quite a few fields, such as loss and logits.

  • Step 4: post-process the result
logits = res.logits
logits = torch.softmax(logits, dim=-1)    # turn the logits into probabilities
pred = torch.argmax(logits).item()        # index of the most likely class
result = model.config.id2label.get(pred)  # map the class index to its label string
result

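Putting the four steps together, a minimal hand-rolled text-classification "pipeline" might look like the sketch below. It only mirrors what the steps above do; the local model path is the one assumed in the earlier cells.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "../../models/models_roberta-base-finetuned-dianping-chinese"

def classify(text, model, tokenizer):
    # Step 2: preprocess the raw text into PyTorch tensors
    inputs = tokenizer(text, return_tensors="pt")
    # Step 3: forward pass through the model
    logits = model(**inputs).logits
    # Step 4: post-process the logits into a label and a probability
    probs = torch.softmax(logits, dim=-1)
    pred = torch.argmax(probs).item()
    return model.config.id2label.get(pred), probs[0][pred].item()

# Step 1: initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
print(classify("我觉得不太行!", model, tokenizer))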


Original post: https://blog.csdn.net/hjxu2016/article/details/142455682
