
LLM Inference Acceleration: ALISA

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

ISCA’24

Abstract

Algorithm and system co-design
Algorithm: Sparse Window Attention (SWA)
System: three-phase token-level dynamic scheduling that optimizes the trade-off between caching and recomputation

1 Introduction

KV caching: reuse intermediate states (the K and V tensors of previously generated tokens)
the quadratic-complexity attention computation is reduced to linear-complexity computation and memory accesses

Challenge:
LLM inference with KV caching is predominantly bottlenecked by memory
Reasons:

  1. MM (matrix multiplication) and Softmax are themselves memory-bound
  2. Weights and activations alone already approach the memory capacity
  3. KV caching further increases the memory capacity demand (growing linearly with sequence length and batch size)

Figure: OPT-6.7B inference on one NVIDIA Tesla V100 GPU with 32 GB memory under different workloads; b, s, and n denote the batch size and the input and output sequence lengths.

To relieve the memory pressure -> offload the KV cache to CPU memory or even secondary storage
Problem: offloading and reloading (data transfer overhead) become the new bottleneck

Proposal:
an algorithm-system co-design solution to accelerate LLM inference via sparsity-aware KV caching for single GPU-CPU systems

key observation:
during the autoregressive inference process, the attention weight matrix is highly sparse, and larger LLMs exhibit higher attention weight sparsity

Sparse Window Attention (SWA) algorithm:

  1. identifies which tokens are important
  2. combines globally dynamic and locally static sparse patterns

Accelerating LLMs is not only a compute problem but also a memory problem, because of the huge memory footprint.
Challenges:

  1. The sparse KV cache will still eventually exceed the memory limit, and the long-latency GPU-CPU memory accesses seen in dense LLMs reappear
  2. The sparse nature of KV tensors leads to unpredictable memory accesses, which longer sequences make worse
  3. High-precision (FP16 in this work) KV tensors still have a large memory footprint and therefore high memory access latency

Solutions:

  1. dynamically schedule the KV tensors at the token level
  2. balance caching and recomputation for the best performance gain
  3. quantize the KV cache to INT8

2 Background

LLM


$$
\begin{aligned}
AW(Q,K) &= \sigma\left(\frac{QK^{T}}{\sqrt{d}}\right)\\
Attn(Q,K,V) &= AW(Q,K)\cdot V
\end{aligned}
$$

KV caching converts the matrix-matrix multiplications, which have quadratic complexity, into vector-matrix multiplications and memory accesses with linear complexity, which significantly improves performance.

Figure 2(c): without KV caching, the execution time increases rapidly; with KV caching, only the attention weights and scores of the newly generated token are computed, as a vector-matrix multiplication, so the execution time stays nearly constant across steps. This runtime reduction comes at the cost of GPU memory usage, which grows over time as the KV tensors get larger.
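
A minimal NumPy sketch (not the paper's code) of the attention formula above and of how KV caching turns the per-step work into a vector-matrix product; the toy dimensions and random inputs are assumptions for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 64                      # head dimension (toy value)

def attention(Q, K, V):
    """AW(Q,K) = softmax(Q K^T / sqrt(d));  Attn = AW · V."""
    AW = softmax(Q @ K.T / np.sqrt(d))
    return AW @ V

# Autoregressive decoding with a KV cache: at each step only the new token's
# q/k/v are produced; K and V of past tokens are reused, so the per-step cost
# is a vector-matrix product (linear in sequence length), while the cache
# itself keeps growing in memory.
K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))
for step in range(8):                       # toy generation loop
    q = np.random.randn(1, d)               # query of the new token
    k = np.random.randn(1, d)               # its key
    v = np.random.randn(1, d)               # its value
    K_cache = np.concatenate([K_cache, k])  # cache grows by one token per step
    V_cache = np.concatenate([V_cache, v])
    out = attention(q, K_cache, V_cache)    # (1, d) output for this step
```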

Comparison with prior methods: (comparison table figure omitted)

ALISA summary:

  1. ALISA co-designs the algorithm and the system to fully exploit sparse attention for higher throughput
  2. KV caching is performed at the granularity of a single token, allowing flexible KV tensor allocation, which is essential for the sparsity-driven co-design
  3. ALISA employs a dynamic scheduler to decide between caching and recomputation

3 Challenges and Opportunities

A Challenges

The memory demand depends on the batch size, sequence length, and model configuration.

orchestrates when, how, and what to offload and reload in resource-constrained systems, so that the overall execution time is minimized

B Opportunities

profiling the sparsity in attention weights


two key observations:

  1. the attention weights in LLMs are highly sparse
  2. larger LLMs exhibit higher sparsity

motivates and validates our solution to create sparse KV tensors by skipping unimportant tokens in LLM inference
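
A small sketch of how such sparsity can be measured, under the assumption that "sparsity" means the fraction of attention weights below a small threshold (the paper's exact metric may differ):

```python
import numpy as np

def attention_sparsity(attn_weights, threshold=1e-3):
    """Fraction of attention weights that are (near) zero.

    attn_weights: array of shape (num_heads, seq_len, seq_len), already
    softmax-normalized. The threshold value is an assumption.
    """
    return float((attn_weights < threshold).mean())
```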

C Objective
Identifying Important Tokens

Due to the nondeterministic nature of language, the attention weights for each token vary from step to step

Need: a low-cost mechanism to distinguish important tokens without significantly hurting accuracy in LLM inference

Caching KV Tensors

store partial KV tensors in CPU memory for future reuse
naive implementation: Belady’s algorithm -> requires future knowledge and significant resources

low-cost caching policy to allocate sparse KV tensors and ensure a relatively low miss rate

Caching vs. Recomputation

Problem: As the sequence length grows, the benefit of KV caching diminishes at a certain threshold since the time for accessing CPU memory might outweigh that for recomputing partial KV tensors.

sequence length threshold varies across batch sizes and model configurations

dynamic scheduling strategy that balances KV caching and recomputation at the token level
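
A toy comparison (placeholder constants, not the paper's model) of the two terms being traded off: the time to reload offloaded KV tensors over the GPU-CPU link versus the time to recompute them on the GPU. ALISA formulates the actual decision as an optimization problem in Section 5; this only illustrates the shape of the comparison.

```python
def reload_time(num_tokens, bytes_per_token=4 * 1 * 32 * 4096, pcie_bw=16e9):
    """Time to fetch offloaded KV tensors from CPU memory (seconds, toy constants)."""
    return num_tokens * bytes_per_token / pcie_bw

def recompute_time(num_tokens, flops_per_token=4 * 32 * 4096 ** 2, gpu_flops=100e12):
    """Time to recompute the same tokens' KV tensors on the GPU (seconds, toy constants)."""
    return num_tokens * flops_per_token / gpu_flops

# Recomputation wins whenever it is faster than waiting on the GPU-CPU transfer;
# where the crossover lies depends on hardware bandwidth, batch size, and model size.
num_tokens = 2048
print(reload_time(num_tokens), recompute_time(num_tokens))
```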

4 ALISA Algorithm Design

A. Attention Analysis

Prior methods:
fixed-size sliding window attention
strided attention

Why do these previous attention methods fail on long sequences?
attention weights with larger values do not exhibit a specific pattern


Only using the most recent tokens cannot accurately represent the distribution of the entire attention weights, since the tokens with large attention weights (therefore more important) are often far from the current token


dense attention scores follow a near power-law distribution

the attention score distributions generated by local and strided attention show close-to-zero correlation with that of dense attention
correlation metric: Spearman correlation score
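
This kind of correlation between a sparse variant's attention-score distribution and the dense one can be measured with SciPy's Spearman rank correlation; a usage sketch with toy data (not the paper's evaluation code):

```python
import numpy as np
from scipy.stats import spearmanr

# attention scores over prior tokens at one decoding step (toy random data)
dense_scores  = np.random.rand(512)          # from full (dense) attention
sparse_scores = np.random.rand(512)          # from a local/strided/SWA variant

rho, pvalue = spearmanr(dense_scores, sparse_scores)
print(f"Spearman correlation: {rho:.3f}")    # near 0 => the distributions disagree
```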

B. Sparse Window Attention (SWA)

locally static and globally dynamic sparse patterns


The importance of the prior tokens for future token generation is determined by the sum of the local attention weights


First, the algorithm entails a caching ratio to determine how many tokens to keep at each step for KV sparsity and apply the sparse masks at the token level. While irregular sparsity could exist across tokens, each token is still a dense tensor.
Second, we use gather operations to pack sparse KV tensors into a dense one and perform dense matrix operations. Therefore, despite the multi-step attention calculation in SWA, both the computation and memory access for SWA remain regular, if we target a proper granularity.
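
A minimal sketch of the SWA selection step as described above (toy shapes, not the paper's implementation): always keep the most recent tokens (locally static), score the remaining tokens by the sum of their local attention weights, keep the top ones up to the caching ratio (globally dynamic), and gather the selected KV tensors into a dense tensor so the matmuls stay regular.

```python
import numpy as np

def swa_select(attn_weights, K, V, caching_ratio=0.2, local_window=16):
    """Select important tokens for sparse window attention (sketch).

    attn_weights: (local_window, seq_len) attention weights of the most recent
                  `local_window` queries over all prior tokens.
    K, V:         (seq_len, d) cached key/value tensors.
    Returns dense-packed K, V containing only the selected tokens, plus indices.
    """
    seq_len = K.shape[0]
    keep = int(round(caching_ratio * seq_len))          # token budget

    # Locally static: always keep the most recent tokens.
    local_idx = np.arange(max(0, seq_len - local_window), seq_len)

    # Globally dynamic: rank older tokens by their summed local attention weights.
    importance = attn_weights.sum(axis=0)               # (seq_len,)
    importance[local_idx] = -np.inf                     # already kept
    budget = max(0, keep - len(local_idx))
    global_idx = np.argsort(importance)[::-1][:budget]

    # Gather the sparse selection into dense tensors for regular matmuls.
    idx = np.sort(np.concatenate([global_idx, local_idx]))
    return K[idx], V[idx], idx
```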

5 ALISA System Design

Dynamic Scheduling and KV Compression

SWA identifies important KV tensors and generates sparse patterns
Dynamic scheduling utilizes important tokens and user-specified caching ratio to balance sparsity-aware caching and recomputation at the token level during LLM inference
KV compression further reduces the overall memory footprint of KV tensors by quantizing them into INT8 format

A. Dynamic Scheduling
Three-Phase Scheduling
  • Phase I: GPU Caching
  • Phase II: GPU-CPU Caching
  • Phase III: Recomputation-Caching


keep the KV tensors for the locally static tokens in the GPU, and store the preceding ones in the CPU
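
A simplified sketch (assumed interfaces, not ALISA's scheduler) of this GPU-CPU allocation policy: the KV tensors of the most recent, locally static tokens stay on the GPU, while older tokens are offloaded to CPU memory as new tokens arrive and can be reloaded on demand.

```python
import torch

class TwoLevelKVCache:
    """Keep the last `gpu_window` tokens' KV on the GPU, older ones on the CPU (sketch)."""

    def __init__(self, gpu_window, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.gpu_window = gpu_window
        self.device = device
        self.gpu_kv = []      # list of (k, v) tensors resident on the GPU
        self.cpu_kv = []      # list of (k, v) tensors offloaded to CPU memory

    def append(self, k, v):
        """Add the newly generated token's KV; offload the oldest GPU entry if full."""
        self.gpu_kv.append((k.to(self.device), v.to(self.device)))
        if len(self.gpu_kv) > self.gpu_window:
            old_k, old_v = self.gpu_kv.pop(0)
            self.cpu_kv.append((old_k.cpu(), old_v.cpu()))   # GPU -> CPU offload

    def fetch(self, cpu_indices):
        """Reload selected offloaded tokens (e.g. SWA-important ones) back to the GPU."""
        return [(self.cpu_kv[i][0].to(self.device), self.cpu_kv[i][1].to(self.device))
                for i in cpu_indices]
```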

Sparsity-Aware Caching

how to determine the phase-switch steps and the offload and recompute ratios of the KV tensors

optimization Problem to minimize the total execution time

the size of the KV tensors for each token is $4 \cdot b \cdot l \cdot h$ bytes
the number of tokens moved from GPU to CPU: $\theta_{j}^{c}(\alpha)=\alpha(j+s)$
the number of tokens moved from CPU to GPU: $\theta_{j}^{g}$
$$
T_{j}^{m}(\alpha)=\frac{4\cdot b\cdot l\cdot h\cdot (\theta_{j}^{c}+\theta_{j}^{g})}{B}, \quad p_{1}\le j< n,\ 0\le \theta_{j}^{g}\le\lfloor(s+j)r\rceil
$$
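
A direct transcription of the transfer-time term above into Python; the variable names mirror the symbols, and the interpretation of b, l, h (batch size and model dimensions) is an assumption about the notes' notation:

```python
def theta_c(alpha, j, s):
    """theta_j^c: number of tokens moved from GPU to CPU at step j."""
    return alpha * (j + s)

def transfer_time(alpha, theta_g, j, s, b, l, h, B):
    """T_j^m: time spent moving KV tensors between GPU and CPU at step j.

    b: batch size; l, h: model dimensions as in the formula above;
    B: GPU-CPU (PCIe) bandwidth in bytes/s; each token's KV is 4*b*l*h bytes.
    """
    bytes_per_token = 4 * b * l * h
    return bytes_per_token * (theta_c(alpha, j, s) + theta_g) / B
```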


B. KV Compression

Quantize the KV cache to 8 bits

$$
x_{quant}=\text{round}\left(\frac{1}{\lambda}x+z\right),\qquad x=\lambda(x_{quant}-z)
$$
$$
\lambda=\frac{max-min}{2^{b}-1},\qquad z=\text{round}\left(\frac{-2^{b}}{max-min}\right)
$$
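
A sketch of 8-bit asymmetric quantization following the round/scale form above; note the zero point below uses the common choice $z=\text{round}(-min/\lambda)$, which may differ from the expression given in the notes.

```python
import numpy as np

def quantize_int8(x, num_bits=8):
    """Per-tensor asymmetric quantization to `num_bits` (sketch)."""
    lam = (x.max() - x.min()) / (2 ** num_bits - 1)      # scale lambda
    z = np.round(-x.min() / lam)                         # zero point (common choice)
    x_quant = np.clip(np.round(x / lam + z), 0, 2 ** num_bits - 1).astype(np.uint8)
    return x_quant, lam, z

def dequantize(x_quant, lam, z):
    """x ≈ lambda * (x_quant - z)."""
    return lam * (x_quant.astype(np.float32) - z)
```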

6 Evaluation

A Experimental Setup
Models and datasets.

Original post: https://blog.csdn.net/qq_42047140/article/details/143507571
