
LLM Inference Acceleration: ALISA

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

ISCA’24

Abstract

Algorithm and system co-design
Algorithm: Sparse Window Attention (SWA)
System: three-phase token-level dynamic scheduling that optimizes the trade-off between caching and recomputation

1 Introduction

KV caching: reuse intermediate states (the K and V tensors of previously generated tokens)
the quadratic-complexity attention computation is reduced to linear-complexity computation and memory accesses

Challenge:
LLM inference with KV caching is predominantly bottlenecked by memory
Reasons:

  1. MM (matrix multiplication) and Softmax are themselves memory-bound
  2. Weights and activations alone already approach the memory capacity
  3. KV caching further increases the memory capacity demand (growing linearly with sequence length and batch size)

Figure: OPT-6.7B inference on one NVIDIA Tesla V100 GPU with 32 GB memory under different workloads; b, s, and n denote the batch size and the input and output sequence lengths.

To relieve the memory pressure -> offload the KV cache to CPU memory or even secondary storage
Problem: offloading and reloading (data transfer overhead) become the new bottleneck

Proposal:
an algorithm-system co-design solution to accelerate LLM inference via sparsity-aware KV caching for single GPU-CPU systems

key observation:
during the autoregressive inference process, the attention weight matrix is highly sparse, and larger LLMs exhibit higher attention weight sparsity

Sparse Window Attention (SWA) algorithm:

  1. identifies which tokens are important
  2. combines globally dynamic and locally static sparse patterns

Accelerating LLMs is not only a compute problem but also a memory problem, because of the huge memory footprint.
Challenges:

  1. The sparse KV cache will still eventually exceed the memory limit, and the long-latency GPU-CPU memory accesses seen in dense LLMs reappear
  2. The sparse nature of KV tensors leads to unpredictable memory accesses, which longer sequences make worse
  3. High-precision (FP16 in this work) KV tensors still have a large memory footprint and therefore high memory access latency

Solutions:

  1. dynamically schedule the KV tensors at the token level
  2. balance caching and recomputation for the best performance gain
  3. quantize the KV cache to INT8

2 Background

LLM


$$
\begin{aligned}
AW(Q,K) &= \sigma\left(\frac{QK^{T}}{\sqrt{d}}\right)\\
Attn(Q,K,V) &= AW(Q,K)\cdot V
\end{aligned}
$$

KV caching converts the matrix-matrix multiplications, which have quadratic complexity, into vector-matrix multiplications and memory accesses with linear complexity, which significantly improves performance.

Figure 2(c): without KV caching, the execution time increases rapidly; with KV caching, only the attention weights and scores of the newly generated token are computed, as a vector-matrix multiplication, so the execution time stays nearly constant across steps. This runtime reduction comes at the cost of GPU memory usage, which grows over time as the KV tensors get larger.
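
A minimal NumPy sketch (not the paper's code) of the attention formula above and of how KV caching turns the per-step work into a vector-matrix product; the toy dimensions and random inputs are assumptions for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 64                      # head dimension (toy value)

def attention(Q, K, V):
    """AW(Q,K) = softmax(Q K^T / sqrt(d));  Attn = AW · V."""
    AW = softmax(Q @ K.T / np.sqrt(d))
    return AW @ V

# Autoregressive decoding with a KV cache: at each step only the new token's
# q/k/v are produced; K and V of past tokens are reused, so the per-step cost
# is a vector-matrix product (linear in sequence length), while the cache
# itself keeps growing in memory.
K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))
for step in range(8):                       # toy generation loop
    q = np.random.randn(1, d)               # query of the new token
    k = np.random.randn(1, d)               # its key
    v = np.random.randn(1, d)               # its value
    K_cache = np.concatenate([K_cache, k])  # cache grows by one token per step
    V_cache = np.concatenate([V_cache, v])
    out = attention(q, K_cache, V_cache)    # (1, d) output for this step
```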

Comparison with prior methods: (comparison table figure omitted)

ALISA summary:

  1. ALISA co-designs the algorithm and the system to fully exploit sparse attention for higher throughput
  2. KV caching is performed at the granularity of a single token, allowing flexible KV tensor allocation, which is essential for the sparsity-driven co-design
  3. ALISA employs a dynamic scheduler to decide between caching and recomputation

3 Challenges and Opportunities

A Challenges

The memory demand depends on the batch size, sequence length, and model configuration.

orchestrates when, how, and what to offload and reload in resource-constrained systems, so that the overall execution time is minimized

B Opportunities

profiling the sparsity in attention weights


two key observations:

  1. the attention weights in LLMs are highly sparse
  2. larger LLMs exhibit higher sparsity

motivates and validates our solution to create sparse KV tensors by skipping unimportant tokens in LLM inference
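
A small sketch of how such sparsity can be measured, under the assumption that "sparsity" means the fraction of attention weights below a small threshold (the paper's exact metric may differ):

```python
import numpy as np

def attention_sparsity(attn_weights, threshold=1e-3):
    """Fraction of attention weights that are (near) zero.

    attn_weights: array of shape (num_heads, seq_len, seq_len), already
    softmax-normalized. The threshold value is an assumption.
    """
    return float((attn_weights < threshold).mean())
```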

C Objective
Identifying Important Tokens

Due to the nondeterministic nature of language, the attention weights for each token vary from step to step

Need: a low-cost mechanism to distinguish important tokens without significantly hurting accuracy in LLM inference

Caching KV Tensors

store partial KV tensors in CPU memory for future reuse
naive implementation: Belady’s algorithm -> requires future knowledge and significant resources

low-cost caching policy to allocate sparse KV tensors and ensure a relatively low miss rate

Caching vs. Recomputation

Problem: As the sequence length grows, the benefit of KV caching diminishes at a certain threshold since the time for accessing CPU memory might outweigh that for recomputing partial KV tensors.

sequence length threshold varies across batch sizes and model configurations

dynamic scheduling strategy that balances KV caching and recomputation at the token level
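
A toy comparison (placeholder constants, not the paper's model) of the two terms being traded off: the time to reload offloaded KV tensors over the GPU-CPU link versus the time to recompute them on the GPU. ALISA formulates the actual decision as an optimization problem in Section 5; this only illustrates the shape of the comparison.

```python
def reload_time(num_tokens, bytes_per_token=4 * 1 * 32 * 4096, pcie_bw=16e9):
    """Time to fetch offloaded KV tensors from CPU memory (seconds, toy constants)."""
    return num_tokens * bytes_per_token / pcie_bw

def recompute_time(num_tokens, flops_per_token=4 * 32 * 4096 ** 2, gpu_flops=100e12):
    """Time to recompute the same tokens' KV tensors on the GPU (seconds, toy constants)."""
    return num_tokens * flops_per_token / gpu_flops

# Recomputation wins whenever it is faster than waiting on the GPU-CPU transfer;
# where the crossover lies depends on hardware bandwidth, batch size, and model size.
num_tokens = 2048
print(reload_time(num_tokens), recompute_time(num_tokens))
```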

4 ALISA Algorithm Design

A. Attention Analysis

Prior methods:
fixed-size sliding window attention
strided attention

Why do these previous attention methods fail on long sequences?
attention weights with larger values do not exhibit a specific pattern


Only using the most recent tokens cannot accurately represent the distribution of the entire attention weights, since the tokens with large attention weights (therefore more important) are often far from the current token


dense attention scores follow a near power-law distribution

the attention score distributions generated by local and strided attention show close-to-zero correlation with that of dense attention
correlation metric: Spearman correlation score
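
This kind of correlation between a sparse variant's attention-score distribution and the dense one can be measured with SciPy's Spearman rank correlation; a usage sketch with toy data (not the paper's evaluation code):

```python
import numpy as np
from scipy.stats import spearmanr

# attention scores over prior tokens at one decoding step (toy random data)
dense_scores  = np.random.rand(512)          # from full (dense) attention
sparse_scores = np.random.rand(512)          # from a local/strided/SWA variant

rho, pvalue = spearmanr(dense_scores, sparse_scores)
print(f"Spearman correlation: {rho:.3f}")    # near 0 => the distributions disagree
```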

B. Sparse Window Attention (SWA)

locally static and globally dynamic sparse patterns


The importance of the prior tokens for future token generation is determined by the sum of the local attention weights


First, the algorithm entails a caching ratio to determine how many tokens to keep at each step for KV sparsity and apply the sparse masks at the token level. While irregular sparsity could exist across tokens, each token is still a dense tensor.
Second, we use gather operations to pack sparse KV tensors into a dense one and perform dense matrix operations. Therefore, despite the multi-step attention calculation in SWA, both the computation and memory access for SWA remain regular, if we target a proper granularity.
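
A minimal sketch of the SWA selection step as described above (toy shapes, not the paper's implementation): always keep the most recent tokens (locally static), score the remaining tokens by the sum of their local attention weights, keep the top ones up to the caching ratio (globally dynamic), and gather the selected KV tensors into a dense tensor so the matmuls stay regular.

```python
import numpy as np

def swa_select(attn_weights, K, V, caching_ratio=0.2, local_window=16):
    """Select important tokens for sparse window attention (sketch).

    attn_weights: (local_window, seq_len) attention weights of the most recent
                  `local_window` queries over all prior tokens.
    K, V:         (seq_len, d) cached key/value tensors.
    Returns dense-packed K, V containing only the selected tokens, plus indices.
    """
    seq_len = K.shape[0]
    keep = int(round(caching_ratio * seq_len))          # token budget

    # Locally static: always keep the most recent tokens.
    local_idx = np.arange(max(0, seq_len - local_window), seq_len)

    # Globally dynamic: rank older tokens by their summed local attention weights.
    importance = attn_weights.sum(axis=0)               # (seq_len,)
    importance[local_idx] = -np.inf                     # already kept
    budget = max(0, keep - len(local_idx))
    global_idx = np.argsort(importance)[::-1][:budget]

    # Gather the sparse selection into dense tensors for regular matmuls.
    idx = np.sort(np.concatenate([global_idx, local_idx]))
    return K[idx], V[idx], idx
```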

5 ALISA System Design

Dynamic Scheduling and KV Compression

SWA identifies important KV tensors and generates sparse patterns
Dynamic scheduling utilizes important tokens and user-specified caching ratio to balance sparsity-aware caching and recomputation at the token level during LLM inference
KV compression further reduces the overall memory footprint of KV tensors by quantizing them into INT8 format

A. Dynamic Scheduling
Three-Phase Scheduling
  • Phase I: GPU Caching
  • Phase II: GPU-CPU Caching
  • Phase III: Recomputation-Caching


keep the KV tensors for the locally static tokens in the GPU, and store the preceding ones in the CPU
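
A simplified sketch (assumed interfaces, not ALISA's scheduler) of this GPU-CPU allocation policy: the KV tensors of the most recent, locally static tokens stay on the GPU, while older tokens are offloaded to CPU memory as new tokens arrive and can be reloaded on demand.

```python
import torch

class TwoLevelKVCache:
    """Keep the last `gpu_window` tokens' KV on the GPU, older ones on the CPU (sketch)."""

    def __init__(self, gpu_window, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.gpu_window = gpu_window
        self.device = device
        self.gpu_kv = []      # list of (k, v) tensors resident on the GPU
        self.cpu_kv = []      # list of (k, v) tensors offloaded to CPU memory

    def append(self, k, v):
        """Add the newly generated token's KV; offload the oldest GPU entry if full."""
        self.gpu_kv.append((k.to(self.device), v.to(self.device)))
        if len(self.gpu_kv) > self.gpu_window:
            old_k, old_v = self.gpu_kv.pop(0)
            self.cpu_kv.append((old_k.cpu(), old_v.cpu()))   # GPU -> CPU offload

    def fetch(self, cpu_indices):
        """Reload selected offloaded tokens (e.g. SWA-important ones) back to the GPU."""
        return [(self.cpu_kv[i][0].to(self.device), self.cpu_kv[i][1].to(self.device))
                for i in cpu_indices]
```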

Sparsity-Aware Caching

how to determine the phase-switch steps and the offload and recompute ratios of the KV tensors

optimization Problem to minimize the total execution time

the size of the KV tensors for each token is $4 \cdot b \cdot l \cdot h$ bytes
the number of tokens moved from GPU to CPU: $\theta_{j}^{c}(\alpha)=\alpha(j+s)$
the number of tokens moved from CPU to GPU: $\theta_{j}^{g}$
$$
T_{j}^{m}(\alpha)=\frac{4\cdot b\cdot l\cdot h\cdot (\theta_{j}^{c}+\theta_{j}^{g})}{B}, \quad p_{1}\le j< n,\ 0\le \theta_{j}^{g}\le\lfloor(s+j)r\rceil
$$
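
A direct transcription of the transfer-time term above into Python; the variable names mirror the symbols, and the interpretation of b, l, h (batch size and model dimensions) is an assumption about the notes' notation:

```python
def theta_c(alpha, j, s):
    """theta_j^c: number of tokens moved from GPU to CPU at step j."""
    return alpha * (j + s)

def transfer_time(alpha, theta_g, j, s, b, l, h, B):
    """T_j^m: time spent moving KV tensors between GPU and CPU at step j.

    b: batch size; l, h: model dimensions as in the formula above;
    B: GPU-CPU (PCIe) bandwidth in bytes/s; each token's KV is 4*b*l*h bytes.
    """
    bytes_per_token = 4 * b * l * h
    return bytes_per_token * (theta_c(alpha, j, s) + theta_g) / B
```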


B. KV Compression

Quantize the KV cache to 8 bits

$$
x_{quant}=\text{round}\left(\frac{1}{\lambda}x+z\right),\qquad x=\lambda(x_{quant}-z)
$$
$$
\lambda=\frac{max-min}{2^{b}-1},\qquad z=\text{round}\left(\frac{-2^{b}}{max-min}\right)
$$
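
A sketch of 8-bit asymmetric quantization following the round/scale form above; note the zero point below uses the common choice $z=\text{round}(-min/\lambda)$, which may differ from the expression given in the notes.

```python
import numpy as np

def quantize_int8(x, num_bits=8):
    """Per-tensor asymmetric quantization to `num_bits` (sketch)."""
    lam = (x.max() - x.min()) / (2 ** num_bits - 1)      # scale lambda
    z = np.round(-x.min() / lam)                         # zero point (common choice)
    x_quant = np.clip(np.round(x / lam + z), 0, 2 ** num_bits - 1).astype(np.uint8)
    return x_quant, lam, z

def dequantize(x_quant, lam, z):
    """x ≈ lambda * (x_quant - z)."""
    return lam * (x_quant.astype(np.float32) - z)
```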

6 Evaluation

A Experimental Setup
Models and datasets.

Original post: https://blog.csdn.net/qq_42047140/article/details/143507571
