
The Mathematics Behind Large AI Model Architectures: Which Formulas Underpin the Transformer?

Hello everyone, I am 微学AI. Today I will walk through the mathematics behind large-model architectures. Most large models are built on the Transformer architecture, and the underlying mathematics draws on linear algebra, probability theory, optimization theory, and more. Below are the key principles and formulas, each with a detailed explanation and a worked example.

The Mathematical Principles Hidden Behind Large Models

1. Linear Transformation

One of the core operations in large models is the linear transformation:
$$\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$$

  • $\mathbf{x}$ is the input vector (dimension $d_{\text{in}}$).
  • $\mathbf{W}$ is the weight matrix (dimension $d_{\text{out}} \times d_{\text{in}}$).
  • $\mathbf{b}$ is the bias vector (dimension $d_{\text{out}}$).
  • $\mathbf{y}$ is the output vector (dimension $d_{\text{out}}$).

Example
Suppose the input vector is $\mathbf{x} = [1, 2, 3]^\top$, the weight matrix is $\mathbf{W} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$, and the bias vector is $\mathbf{b} = [0.5, -0.5]^\top$. Then:
$$\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 4 \\ 2 \end{bmatrix} + \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 4.5 \\ 1.5 \end{bmatrix}$$
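A minimal NumPy sketch that reproduces this example (the matrix and vector values are taken directly from above):

```python
import numpy as np

# Worked example of y = Wx + b with the values from above.
W = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])   # shape (d_out, d_in) = (2, 3)
x = np.array([1.0, 2.0, 3.0])     # shape (3,)
b = np.array([0.5, -0.5])         # shape (2,)

y = W @ x + b
print(y)  # [4.5 1.5]
```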


2. Positional Encoding

The Transformer injects sequence-order information through positional encodings:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)$$

  • $pos$ is the position index.
  • $i$ is the dimension index.
  • $d$ is the embedding dimension.

Example
Suppose $pos = 1$ and $d = 4$. Then:
$$PE_{(1, 0)} = \sin\left(\frac{1}{10000^{0/4}}\right) = \sin(1), \quad PE_{(1, 1)} = \cos\left(\frac{1}{10000^{0/4}}\right) = \cos(1)$$
$$PE_{(1, 2)} = \sin\left(\frac{1}{10000^{2/4}}\right) = \sin\left(\frac{1}{100}\right), \quad PE_{(1, 3)} = \cos\left(\frac{1}{10000^{2/4}}\right) = \cos\left(\frac{1}{100}\right)$$
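A short NumPy sketch of this encoding for a single position, reproducing the numbers for $pos = 1$, $d = 4$:

```python
import numpy as np

def positional_encoding(pos, d):
    """Sinusoidal positional encoding for one position, as a length-d vector."""
    pe = np.zeros(d)
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe[2 * i] = np.sin(angle)      # even dimensions use sin
        pe[2 * i + 1] = np.cos(angle)  # odd dimensions use cos
    return pe

print(positional_encoding(pos=1, d=4))
# ≈ [sin(1), cos(1), sin(0.01), cos(0.01)] ≈ [0.8415 0.5403 0.0100 0.9999]
```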


3. Attention Mechanism

The core of the attention mechanism is to compute similarity scores between queries (Query) and keys (Key) and use them to weight the values (Value):
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$

  • $\mathbf{Q}$ is the query matrix (dimension $n \times d_k$).
  • $\mathbf{K}$ is the key matrix (dimension $m \times d_k$).
  • $\mathbf{V}$ is the value matrix (dimension $m \times d_v$).
  • $d_k$ is the key dimension.
  • $\text{softmax}$ is the normalization function, applied row-wise.

Example
Suppose $\mathbf{Q} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\mathbf{K} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, $\mathbf{V} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, and $d_k = 2$. Then:
$$\mathbf{Q}\mathbf{K}^\top = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$
$$\text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{2}}\right) = \text{softmax}\left(\begin{bmatrix} 0 & 0.707 \\ 0.707 & 0 \end{bmatrix}\right) \approx \begin{bmatrix} 0.33 & 0.67 \\ 0.67 & 0.33 \end{bmatrix}$$
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \approx \begin{bmatrix} 0.33 & 0.67 \\ 0.67 & 0.33 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 2.34 & 3.34 \\ 1.66 & 2.66 \end{bmatrix}$$
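A NumPy sketch of scaled dot-product attention that reproduces these numbers:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, m) scaled similarity scores
    weights = softmax(scores, axis=-1)  # row-wise normalization
    return weights @ V

Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[0.0, 1.0], [1.0, 0.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
print(attention(Q, K, V))
# ≈ [[2.34 3.34]
#    [1.66 2.66]]
```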


4. Multi-Head Attention

Multi-head attention runs several attention heads in parallel to capture different features:
$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)\mathbf{W}^O$$
where each head is computed as:
$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$

  • $\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V$ are the per-head projection matrices.
  • $\mathbf{W}^O$ is the output projection matrix.
  • $h$ is the number of attention heads.

Example
Suppose $h = 2$, $\mathbf{Q} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\mathbf{K} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, $\mathbf{V} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, and the projection matrices are:
$$\mathbf{W}_1^Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{W}_1^K = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{W}_1^V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
$$\mathbf{W}_2^Q = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \mathbf{W}_2^K = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \mathbf{W}_2^V = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$
Then:
$$\text{head}_1 = \text{Attention}(\mathbf{Q}\mathbf{W}_1^Q, \mathbf{K}\mathbf{W}_1^K, \mathbf{V}\mathbf{W}_1^V) = \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$$
$$\text{head}_2 = \text{Attention}(\mathbf{Q}\mathbf{W}_2^Q, \mathbf{K}\mathbf{W}_2^K, \mathbf{V}\mathbf{W}_2^V) = \text{Attention}\left(\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \begin{bmatrix} 2 & 1 \\ 4 & 3 \end{bmatrix}\right)$$
The two head outputs are then concatenated and multiplied by $\mathbf{W}^O$ to form the final output, as shown in the sketch below.
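A minimal NumPy sketch of two-head attention with these projection matrices. Since the example above does not specify $\mathbf{W}^O$, it is taken here as the identity purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable row-wise softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Wq, Wk, Wv are lists of per-head projection matrices; Wo is the output projection."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

I2 = np.eye(2)                              # head 1: identity projections
P = np.array([[0.0, 1.0], [1.0, 0.0]])      # head 2: swap the two dimensions
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[0.0, 1.0], [1.0, 0.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
Wo = np.eye(4)                              # assumed identity output projection (2 heads x d_v = 4)

out = multi_head_attention(Q, K, V, [I2, P], [I2, P], [I2, P], Wo)
print(out)   # columns 0-1 are head_1, columns 2-3 are head_2
```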


5. Residual Connection

Residual connections mitigate the vanishing-gradient problem:
$$\mathbf{y} = \text{Layer}(\mathbf{x}) + \mathbf{x}$$

  • $\mathbf{x}$ is the input.
  • $\text{Layer}(\mathbf{x})$ is the output of a layer.

Example
Suppose $\mathbf{x} = [1, 2]^\top$ and the layer output is $\text{Layer}(\mathbf{x}) = [0.5, -0.5]^\top$. Then:
$$\mathbf{y} = [0.5, -0.5]^\top + [1, 2]^\top = [1.5, 1.5]^\top$$
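A minimal sketch of wrapping an arbitrary layer with a residual connection; the `layer` function here is a stand-in that returns the values from the example above:

```python
import numpy as np

def with_residual(layer, x):
    """Residual connection: add the layer's input back to its output."""
    return layer(x) + x

x = np.array([1.0, 2.0])
layer = lambda v: np.array([0.5, -0.5])   # stand-in for Layer(x) in the example
print(with_residual(layer, x))            # [1.5 1.5]
```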


6. Layer Normalization

Layer normalization stabilizes the training process:
$$\text{LayerNorm}(\mathbf{x}) = \gamma \cdot \frac{\mathbf{x} - \mu}{\sigma} + \beta$$

  • $\mathbf{x}$ is the input vector.
  • $\mu$ is its mean and $\sigma$ is its standard deviation.
  • $\gamma$ and $\beta$ are learnable parameters.

Example
Suppose $\mathbf{x} = [1, 2, 3]^\top$, so $\mu = 2$ and $\sigma = \sqrt{\frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3}} = \sqrt{\frac{2}{3}}$, and let $\gamma = 1$, $\beta = 0$. Then:
$$\text{LayerNorm}(\mathbf{x}) = 1 \cdot \frac{[1, 2, 3] - 2}{\sqrt{\frac{2}{3}}} + 0 \approx [-1.225, 0, 1.225]$$
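A NumPy sketch reproducing this normalization (a small `eps` is added to the denominator, as frameworks commonly do; it barely changes the result):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean()
    sigma = x.std()                        # population standard deviation, as in the formula
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([1.0, 2.0, 3.0])
print(layer_norm(x))  # ≈ [-1.2247  0.  1.2247]
```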


7. GELU Activation Function

GELU (Gaussian Error Linear Unit) is a commonly used activation function:
$$\text{GELU}(x) = x \cdot \Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution, commonly approximated as:
$$\text{GELU}(x) \approx 0.5x \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right)$$

Example
Suppose $x = 1$. Then:
$$\text{GELU}(1) \approx 0.5 \cdot 1 \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(1 + 0.044715 \cdot 1^3\right)\right)\right) \approx 0.841$$
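The tanh approximation as a one-line NumPy function:

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU, matching the formula above."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

print(gelu(1.0))  # ≈ 0.8412
```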

8. Softmax Function

The softmax function converts a vector into a probability distribution:
$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}$$

  • $\mathbf{z}$ is the input vector.
  • $z_i$ is its $i$-th element.

Example
Suppose $\mathbf{z} = [1, 2, 3]^\top$. Then:
$$\text{softmax}(\mathbf{z}) = \left[\frac{e^1}{e^1 + e^2 + e^3}, \frac{e^2}{e^1 + e^2 + e^3}, \frac{e^3}{e^1 + e^2 + e^3}\right] \approx [0.090, 0.245, 0.665]$$
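A numerically stable NumPy version (subtracting the maximum does not change the result but avoids overflow for large logits):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by the max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # ≈ [0.090 0.245 0.665]
```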


9. Loss Function

Large models are typically trained with the cross-entropy loss:
$$\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i=1}^n y_i \log(\hat{y}_i)$$

  • $\mathbf{y}$ is the true label (one-hot encoded).
  • $\hat{\mathbf{y}}$ is the predicted probability distribution.

Example
Suppose the true label is $\mathbf{y} = [0, 1, 0]^\top$ and the model predicts $\hat{\mathbf{y}} = [0.1, 0.7, 0.2]^\top$. Then:
$$\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = - \left(0 \cdot \log(0.1) + 1 \cdot \log(0.7) + 0 \cdot \log(0.2)\right) = -\log(0.7) \approx 0.357$$
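A NumPy sketch of this loss (a small `eps` guards against $\log(0)$, which the formula above leaves implicit):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted probability vector."""
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.7, 0.2])
print(cross_entropy(y_true, y_pred))  # ≈ 0.357
```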


10. Dropout

Dropout is a regularization method that randomly drops some neurons during training:
$$\mathbf{y} = \mathbf{m} \odot \mathbf{x}$$

  • $\mathbf{m}$ is a mask vector whose elements are 0 with probability $p$ (the dropout rate) and 1 otherwise.
  • $\odot$ denotes element-wise multiplication.

Example
Suppose $\mathbf{x} = [1, 2, 3]^\top$, $p = 0.5$, and the sampled mask is $\mathbf{m} = [1, 0, 1]^\top$. Then:
$$\mathbf{y} = [1, 0, 1]^\top \odot [1, 2, 3]^\top = [1, 0, 3]^\top$$
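A minimal sketch of training-time dropout. Passing an explicit mask reproduces the example above; in practice the mask is sampled, and frameworks additionally rescale the kept values by $1/(1-p)$ (inverted dropout), which is omitted here to match the formula:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, mask=None):
    """Zero each element with probability p; pass a fixed mask to reproduce the example."""
    if mask is None:
        mask = (rng.random(x.shape) >= p).astype(x.dtype)  # 1 with probability 1-p
    return mask * x

x = np.array([1.0, 2.0, 3.0])
print(dropout(x, p=0.5, mask=np.array([1.0, 0.0, 1.0])))  # [1. 0. 3.]
```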


11. Backpropagation

Backpropagation computes gradients via the chain rule:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{W}}$$

  • $\mathcal{L}$ is the loss function.
  • $\mathbf{y}$ is the model output.

Example
Suppose $\mathbf{y} = \mathbf{W}\mathbf{x}$ and $\mathcal{L} = \frac{1}{2}\|\mathbf{y} - \mathbf{t}\|^2$. Then:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = (\mathbf{y} - \mathbf{t}) \, \mathbf{x}^\top$$
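A quick NumPy check of this gradient against a finite difference; the values of $\mathbf{W}$, $\mathbf{x}$, and $\mathbf{t}$ below are arbitrary illustration values, not taken from the text:

```python
import numpy as np

# y = Wx, L = 0.5 * ||y - t||^2, so dL/dW = (y - t) x^T.
W = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, -1.0])
t = np.array([0.0, 0.0])

grad_W = np.outer(W @ x - t, x)     # analytic gradient from the chain rule

# finite-difference check of the (0, 0) entry
eps = 1e-6
W_pert = W.copy(); W_pert[0, 0] += eps
loss = lambda M: 0.5 * np.sum((M @ x - t) ** 2)
numeric = (loss(W_pert) - loss(W)) / eps
print(grad_W[0, 0], numeric)        # the two values should agree closely (≈ -1.0)
```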


12. Gradient Descent

Gradient descent optimizes the model parameters with the update rule:
$$\mathbf{\theta} \leftarrow \mathbf{\theta} - \eta \nabla_\theta \mathcal{L}$$

  • $\mathbf{\theta}$ are the model parameters.
  • $\eta$ is the learning rate.
  • $\nabla_\theta \mathcal{L}$ is the gradient of the loss with respect to the parameters.

Example
Suppose the loss is $\mathcal{L}(\theta) = \theta^2$, the initial parameter is $\theta = 3$, and the learning rate is $\eta = 0.1$. Then:
$$\nabla_\theta \mathcal{L} = 2\theta = 6$$
$$\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L} = 3 - 0.1 \times 6 = 2.4$$
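Iterating the same update a few times in Python shows $\theta$ shrinking toward the minimum at 0:

```python
# Gradient descent on L(theta) = theta^2, continuing the example above.
theta, eta = 3.0, 0.1
for step in range(1, 4):
    grad = 2 * theta              # dL/dtheta
    theta -= eta * grad
    print(step, round(theta, 4))  # 2.4, 1.92, 1.536
```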


13. The Adam Optimizer

The Adam optimizer combines momentum with adaptive learning rates; its update rules are:
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}$$
$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) (\nabla_\theta \mathcal{L})^2$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$
$$\mathbf{\theta}_t = \mathbf{\theta}_{t-1} - \eta \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$$

  • $\mathbf{m}_t$ and $\mathbf{v}_t$ are the first- and second-moment estimates.
  • $\beta_1, \beta_2$ are the decay rates.
  • $\eta$ is the learning rate.
  • $\epsilon$ is a smoothing term.

Example
Suppose $\nabla_\theta \mathcal{L} = [0.1, -0.2]^\top$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 0.001$, $\epsilon = 10^{-8}$, and initially $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$. Then:
$$\mathbf{m}_1 = 0.9 \cdot \mathbf{0} + 0.1 \cdot [0.1, -0.2]^\top = [0.01, -0.02]^\top$$
$$\mathbf{v}_1 = 0.999 \cdot \mathbf{0} + 0.001 \cdot [0.1^2, (-0.2)^2]^\top = [10^{-5}, 4 \times 10^{-5}]^\top$$
$$\hat{\mathbf{m}}_1 = \frac{[0.01, -0.02]^\top}{1 - 0.9^1} = [0.1, -0.2]^\top$$
$$\hat{\mathbf{v}}_1 = \frac{[10^{-5}, 4 \times 10^{-5}]^\top}{1 - 0.999^1} = [0.01, 0.04]^\top$$
$$\mathbf{\theta}_1 = \mathbf{\theta}_0 - 0.001 \cdot \frac{[0.1, -0.2]^\top}{\sqrt{[0.01, 0.04]^\top} + 10^{-8}} \approx \mathbf{\theta}_0 - [0.001, -0.001]^\top$$
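A NumPy sketch of one Adam step that reproduces these numbers; the initial parameters are set to zero purely for illustration:

```python
import numpy as np

theta = np.zeros(2)                       # hypothetical initial parameters
grad = np.array([0.1, -0.2])              # gradient from the example
beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8
m, v, t = np.zeros(2), np.zeros(2), 1

m = beta1 * m + (1 - beta1) * grad        # [0.01, -0.02]
v = beta2 * v + (1 - beta2) * grad ** 2   # [1e-5, 4e-5]
m_hat = m / (1 - beta1 ** t)              # [0.1, -0.2]
v_hat = v / (1 - beta2 ** t)              # [0.01, 0.04]
theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)                              # ≈ [-0.001  0.001]
```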


The formulas above are the core mathematics behind large-model architectures. They form the foundation of deep learning models and are implemented in practice through efficient numerical computing libraries such as PyTorch, TensorFlow, and Paddle.


Original article: https://blog.csdn.net/weixin_42878111/article/details/145211512
