Generating the upper-triangular -inf Causal Mask matrix in XLA

Published 2024-11-11 09:25. Tags: matrix, linear algebra, XLA, pytorch, transformers

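How transformers generates the upper-triangular "-inf" matrix used as the causal attention mask (the fill value is actually torch.finfo(dtype).min, the most negative finite value of the dtype, rather than a true -inf):
Based on the transformers source code.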

import torch
import torch_xla
import torch_xla.core.xla_model as xm
import os

os.environ['PJRT_DEVICE']='IPU'
# os.environ['PJRT_DEVICE']='GPU'
# os.environ['XLA_FLAGS']='--xla_dump_to=gen_AttnFwd-XLA_GPU'

tgt_len = 10
dtype=torch.float32
device = xm.xla_device()

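# src/transformers/modeling_attn_mask_utils.py#AttentionMaskConverter::_make_causal_mask
# Start with every entry set to the most negative finite value of dtype.
mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
# Column indices 0..tgt_len-1, shape [tgt_len].
mask_cond = torch.arange(mask.size(-1), device=device)
# Broadcast compare: entry [i][j] is True where j < i + 1 (on and below the diagonal);
# those entries are reset to 0, leaving only the strict upper triangle at finfo.min.
mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
# No-op here, since torch.full already produced the default float32.
mask = mask.to(dtype)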
print(mask)
# print(mask.size())
# print(mask[3][3])

"""
2024-11-07 07:16:18.824506: F tensorflow/compiler/xla/service/hlo_computation.cc:70] Check failed: nullptr != root (nullptr vs. 0)
Aborted (core dumped)
"""

'''
module @SyncTensorsGraph.25 {
  func.func @main() -> tuple<tensor<10x10xf32>> {
    %0 = mhlo.constant dense<[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]> : tensor<10xi64>
    %1 = "mhlo.broadcast_in_dim"(%0) {broadcast_dimensions = dense<1> : tensor<1xi64>} : (tensor<10xi64>) -> tensor<10x10xi64>
    %2 = mhlo.constant dense<[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]> : tensor<10xi64>
    %3 = "mhlo.broadcast_in_dim"(%2) {broadcast_dimensions = dense<0> : tensor<1xi64>} : (tensor<10xi64>) -> tensor<10x10xi64>
    %4 = mhlo.compare  LT, %1, %3 : (tensor<10x10xi64>, tensor<10x10xi64>) -> tensor<10x10xi1>
    %5 = mhlo.constant dense<false> : tensor<i1>
    %6 = "mhlo.broadcast_in_dim"(%5) {broadcast_dimensions = dense<> : tensor<0xi64>} : (tensor<i1>) -> tensor<10x10xi1>
    %7 = mhlo.compare  NE, %4, %6 : (tensor<10x10xi1>, tensor<10x10xi1>) -> tensor<10x10xi1>
    %8 = mhlo.constant dense<0.000000e+00> : tensor<f32>
    %9 = "mhlo.broadcast_in_dim"(%8) {broadcast_dimensions = dense<> : tensor<0xi64>} : (tensor<f32>) -> tensor<10x10xf32>
    %10 = mhlo.constant dense<-3.40282347E+38> : tensor<f32>
    %11 = "mhlo.broadcast_in_dim"(%10) {broadcast_dimensions = dense<> : tensor<0xi64>} : (tensor<f32>) -> tensor<10x10xf32>
    %12 = "mhlo.select"(%7, %9, %11) : (tensor<10x10xi1>, tensor<10x10xf32>, tensor<10x10xf32>) -> tensor<10x10xf32>
    %13 = "mhlo.tuple"(%12) {xla_shape = "(f32[10,10]{1,0})"} : (tensor<10x10xf32>) -> tuple<tensor<10x10xf32>>
    return %13 : tuple<tensor<10x10xf32>>
  }
}
'''

'''
XLA_GPU even produces a complete implementation (the file below is the HLO dump after GPU optimizations):
gen_AttnFwd-XLA_GPU/module_0000.SyncTensorsGraph.25.sm_8.0_gpu_after_optimizations.txt

HloModule SyncTensorsGraph.25, entry_computation_layout={(f32[])->(f32[10,10]{1,0})}

fused_computation {
  iota.3 = s64[10,10]{1,0} iota(), iota_dimension=1
  iota.2 = s64[10]{0} iota(), iota_dimension=0
  constant_5 = s64[] constant(1)
  broadcast.7 = s64[10]{0} broadcast(constant_5), dimensions={}
  add.0 = s64[10]{0} add(iota.2, broadcast.7)
  broadcast.6 = s64[10,10]{1,0} broadcast(add.0), dimensions={0}
  compare.1 = pred[10,10]{1,0} compare(iota.3, broadcast.6), direction=LT
  constant_3 = pred[] constant(false)
  broadcast.4 = pred[10,10]{1,0} broadcast(constant_3), dimensions={}
  compare.0 = pred[10,10]{1,0} compare(compare.1, broadcast.4), direction=NE
  constant_0 = f32[] constant(0)
  broadcast.3 = f32[10,10]{1,0} broadcast(constant_0), dimensions={}
  param_0.1 = f32[] parameter(0)
  broadcast.2 = f32[10,10]{1,0} broadcast(param_0.1), dimensions={}
  ROOT select.0 = f32[10,10]{1,0} select(compare.0, broadcast.3, broadcast.2)
}

ENTRY SyncTensorsGraph.25 {
  p0.13 = f32[] parameter(0)
  fusion = f32[10,10]{1,0} fusion(p0.13), kind=kLoop, calls=fused_computation
  ROOT tuple.24 = (f32[10,10]{1,0}) tuple(fusion)
}

-----
INFO:torch_xla:Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2024-11-07 11:50:41.174644: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x905c190 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-11-07 11:50:41.174714: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): NVIDIA A100-SXM4-80GB, Compute Capability 8.0
2024-11-07 11:50:41.175641: I tensorflow/compiler/xla/pjrt/gpu/se_gpu_pjrt_client.cc:194] Using BFC allocator.
2024-11-07 11:50:41.175713: I tensorflow/compiler/xla/pjrt/gpu/gpu_helpers.cc:105] XLA backend allocating 75175958937 bytes on device 0 for BFCAllocator.
2024-11-07 11:50:42.013482: I tensorflow/compiler/xla/service/dump.cc:485] HloModule dump enabled with path prefix: , suffix: before_optimizations
2024-11-07 11:50:42.037845: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
tensor([[ 0.0000e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38,
         -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38],
        [ 0.0000e+00,  0.0000e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38,
         -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00, -3.4028e+38, -3.4028e+38,
         -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -3.4028e+38,
         -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00, -3.4028e+38, -3.4028e+38],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -3.4028e+38],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00]],
       device='xla:0')

'''
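
To make the optimized HLO above easier to follow, here is a minimal pure-Python sketch that walks through the same steps: an iota over the column dimension, an iota plus one broadcast along the row dimension, an LT compare, and a select. It is only a reading aid, not something XLA or transformers emits. The helper name causal_mask_reference and the constant F32_MIN are hypothetical names introduced for this illustration, and the fill value is rebuilt from the float32 bit pattern 0xFF7FFFFF on the assumption that it corresponds to the -3.40282347E+38 constant in the dump.

```python
import struct

# Most negative finite float32 (bit pattern 0xFF7FFFFF, little-endian bytes).
# Assumed to match the -3.40282347E+38 constant shown in the mhlo dump.
F32_MIN = struct.unpack("<f", bytes.fromhex("ffff7fff"))[0]


def causal_mask_reference(tgt_len, fill=F32_MIN):
    """Hypothetical helper: pure-Python walk through the ops in the HLO dump."""
    # iota.3 (iota_dimension=1): column index j at every position [i][j]
    cols = [[j for j in range(tgt_len)] for _ in range(tgt_len)]
    # iota.2 + 1, broadcast with dimensions={0}: value i + 1 at every [i][j]
    rows_plus_one = [[i + 1 for _ in range(tgt_len)] for i in range(tgt_len)]
    # compare, direction=LT: True where j < i + 1 (on and below the diagonal)
    keep = [[cols[i][j] < rows_plus_one[i][j] for j in range(tgt_len)]
            for i in range(tgt_len)]
    # select: 0.0 where keep is True, the fill value everywhere else
    return [[0.0 if keep[i][j] else fill for j in range(tgt_len)]
            for i in range(tgt_len)]


mask_ref = causal_mask_reference(10)
```

In this sketch mask_ref[3][3] is 0.0 and mask_ref[3][4] is the fill value, which lines up with the row for index 3 in the tensor printed by the GPU run.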

Original article: https://blog.csdn.net/liuzonrze/article/details/143606460
