昇思25天学习打卡营第16天|LLM-MindNLP ChatGLM-6B StreamChat

🕗 发布于 2024-07-23 00:23 华为云 gpt 自然语言处理学习 人工智能

打卡

任务说明

任务说明

加载智谱清言的chatglm模型权重文件（目前有4个版本），本次主要尝试了chatglm-6b。

chatglm 6b 在提供的本次实验npu卡中可以正常加载，加载2和3版本时加载tokenizer出错了。但是可以看到model的打印结果，看到chatglm2 和 chatglm3 的模型结构相比1版本，词表扩充了2w+。而我们知道chatglm3-6b 还具有了 functional calling 功能。

环境配置

欧拉操作系统2.0（SP8）；
python3.9；
华为 NPU 910A；
mindnlp, mindspore, mindvision 安装部署。

其余安装：

!pip install mdtex2html
!pip install -i https://pypi.mirrors.ustc.edu.cn/simple mindspore==2.2.14

变量设置：HF_ENDPOINT=https://hf-mirror.com

部署方式

from mindnlp.transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import gradio as gr
import mdtex2html


### 下载 && 加载模型权重 
model = AutoModelForSeq2SeqLM.from_pretrained(
            'ZhipuAI/ChatGLM-6B', 
            mirror="modelscope").half()
model.set_train(False)

tokenizer = AutoTokenizer.from_pretrained(
            'ZhipuAI/ChatGLM-6B', 
             mirror="modelscope")

### 修改参数和prompt体验模型
prompt = '你好'
history = []
response, _ = model.chat(tokenizer, prompt, history=history, max_length=20)
response

ChatGLM-6B 体验截图示例

如上图，ChatGLM-6B tokenzier 的词表大小是 13w+，有5种特殊的 token。

如下为模型打印结果。

ChatGLM-6B 模型结构解析如下

1. 1级主类：ChatGLMForConditionalGeneration 生成式Transformer模型，用于条件文本生成。

2. 2级类：ChatGLMModel 层，是transformer 结构，是模型的核心部分。

3. 2级类：lm_head 结构的 Dense 全连接层。dim[in, out]=[4096, 130528]

4. ChatGLMModel 结构下的3级类组件分三层：

>> word_embeddings 嵌入层：dim[in, out]=[130528, 4096] ，即使用了 130528 个词汇，每个词汇映射到一个4096维的向量。

>> layers 网络结构层：Transformer模型的主体，包含 28 个GLMBlock。

>> final_layernorm 最后的层归一化。

5. GLMBlock 的结构：

》》1 input_layernorm层，层归一化，用于在注意力机制之前对输入进行归一化。

》》2 SelfAttention层，自注意力机制，用于计算输入序列中不同位置的注意力权重。共包括3层：RotaryEmbedding 旋转嵌入，用于增强模型对位置信息的处理能力； Dense（query_key_value）用于生成查询（Q）、键（K）和值（V）向量；Dense（Dense）用于自注意力机制的输出。。

》》3 post_attention_layernorm层，用于自注意力之后的归一化。

》》4 mlp 层，多层感知机，用于对自注意力层的输出进行进一步的非线性变换。这里的MLP使用的是GLU（门控线性单元）激活函数。dense_h_to_4h 线性变换将输入维度扩大4倍。dense_4h_to_h 将扩大后的维度还原。

如下图，chatglm-6b model 的打印结果。

$ print(model)

ChatGLMForConditionalGeneration<
  (transformer): ChatGLMModel<
    (word_embeddings): Embedding<vocab_size=130528, embedding_size=4096, use_one_hot=False, weight=Parameter (Tensor(shape=[130528, 4096], dtype=Float16, value=[...], name=transformer.word_embeddings.weight), requires_grad=True), dtype=Float16, padding_idx=None>
    (layers): CellList<
      (0): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.0.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.0.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.0.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.0.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (1): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.1.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.1.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.1.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.1.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (2): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.2.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.2.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.2.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.2.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (3): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.3.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.3.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.3.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.3.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (4): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.4.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.4.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.4.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.4.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (5): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.5.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.5.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.5.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.5.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (6): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.6.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.6.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.6.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.6.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (7): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.7.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.7.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.7.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.7.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (8): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.8.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.8.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.8.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.8.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (9): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.9.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.9.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.9.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.9.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (10): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.10.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.10.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.10.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.10.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (11): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.11.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.11.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.11.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.11.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (12): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.12.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.12.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.12.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.12.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (13): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.13.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.13.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.13.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.13.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (14): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.14.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.14.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.14.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.14.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (15): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.15.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.15.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.15.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.15.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (16): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.16.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.16.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.16.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.16.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (17): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.17.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.17.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.17.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.17.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (18): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.18.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.18.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.18.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.18.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (19): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.19.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.19.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.19.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.19.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (20): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.20.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.20.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.20.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.20.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (21): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.21.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.21.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.21.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.21.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (22): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.22.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.22.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.22.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.22.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (23): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.23.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.23.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.23.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.23.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (24): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.24.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.24.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.24.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.24.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (25): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.25.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.25.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.25.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.25.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (26): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.26.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.26.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.26.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.26.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (27): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.27.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.27.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.27.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.27.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      >
    (final_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.final_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.final_layernorm.bias), requires_grad=True)>
    >
  (lm_head): Dense<input_channels=4096, output_channels=130528>
  >

ChatGLM2-6B 模型结构解析如下

如下图，chatglm2-6b model 的打印结果。相比于1版本，模型结构没有变化，只是vocab_size词表扩充成了15w+。

$ print(model)


ChatGLMForConditionalGeneration<
  (transformer): ChatGLMModel<
    (word_embeddings): Embedding<vocab_size=150528, embedding_size=4096, use_one_hot=False, weight=Parameter (Tensor(shape=[150528, 4096], dtype=Float16, value=[...], name=transformer.word_embeddings.weight), requires_grad=True), dtype=Float16, padding_idx=None>
    (layers): CellList<
      (0): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.0.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.0.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.0.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.0.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (1): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.1.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.1.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.1.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.1.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (2): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.2.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.2.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.2.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.2.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (3): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.3.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.3.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.3.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.3.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (4): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.4.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.4.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.4.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.4.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (5): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.5.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.5.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.5.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.5.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (6): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.6.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.6.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.6.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.6.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (7): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.7.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.7.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.7.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.7.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (8): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.8.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.8.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.8.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.8.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (9): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.9.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.9.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.9.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.9.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (10): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.10.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.10.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.10.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.10.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (11): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.11.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.11.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.11.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.11.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (12): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.12.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.12.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.12.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.12.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (13): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.13.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.13.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.13.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.13.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (14): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.14.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.14.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.14.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.14.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (15): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.15.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.15.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.15.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.15.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (16): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.16.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.16.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.16.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.16.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (17): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.17.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.17.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.17.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.17.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (18): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.18.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.18.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.18.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.18.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (19): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.19.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.19.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.19.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.19.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (20): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.20.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.20.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.20.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.20.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (21): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.21.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.21.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.21.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.21.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (22): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.22.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.22.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.22.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.22.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (23): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.23.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.23.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.23.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.23.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (24): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.24.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.24.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.24.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.24.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (25): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.25.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.25.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.25.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.25.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (26): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.26.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.26.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.26.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.26.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      (27): GLMBlock<
        (input_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.27.input_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.27.input_layernorm.bias), requires_grad=True)>
        (attention): SelfAttention<
          (rotary_emb): RotaryEmbedding<>
          (query_key_value): Dense<input_channels=4096, output_channels=12288, has_bias=True>
          (dense): Dense<input_channels=4096, output_channels=4096, has_bias=True>
          >
        (post_attention_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.27.post_attention_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.layers.27.post_attention_layernorm.bias), requires_grad=True)>
        (mlp): GLU<
          (dense_h_to_4h): Dense<input_channels=4096, output_channels=16384, has_bias=True>
          (dense_4h_to_h): Dense<input_channels=16384, output_channels=4096, has_bias=True>
          >
        >
      >
    (final_layernorm): LayerNorm<normalized_shape=[4096], begin_norm_axis=-1, begin_params_axis=-1, weight=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.final_layernorm.weight), requires_grad=True), bias=Parameter (Tensor(shape=[4096], dtype=Float16, value=[...], name=transformer.final_layernorm.bias), requires_grad=True)>
    >
  (lm_head): Dense<input_channels=4096, output_channels=150528>
  >

原文地址：https://blog.csdn.net/wwt72/article/details/140506997

免责声明：本站文章内容转载自网络资源，如本站内容侵犯了原著者的合法权益，可联系本站删除。更多内容请关注自学内容网（zxcms.com）！

上一篇：基于 Apache 的 httpd 文件服务器
下一篇：组合数求法

【学习日记】notebook添加JAVA支持
作者是个大学生这个专栏主要收集课时常用的软件以及女朋友上课用的软件的教程。需提前配置好java环境本篇仅对添加支持进行说明。新开了gitcode 用于上传安装包。解压进入解压后目录复制文件地
阅读更多2024-11-15
Docker与Podman全面比较
Docker和Podman作为两大容器引擎，各自拥有独特的特点和优势。本文将从溯源、特点、技术优势、应用实例和技术前景等方面对Docker和Podman进行全面比较。
阅读更多2024-11-15
算法学习blog：day2 继续记日记
4. 明日计划：至少五道题，并且要学会并实现今天的三道题，看这五道题的思路解法，下一天进行实现优化。除此之外pdf粗略看到了20页，明天继续看，后面才是重点。1.做了三道PAT 76，77，78，差一
阅读更多2024-11-15
基于Python的网上银行综合管理系统
【2025最新】基于python+django+vue+MySQL的网上银行综合管理系统，前后端分离。
阅读更多2024-11-15
自定义注解+拦截器+jwtFilter实现权限控制
GetterSUPER_ADMIN(1, "超级管理员"),SYSTEM_ADMIN(2, "系统管理员"),DOMESTIC_CONSUMER(3, &quo
阅读更多2024-11-15
前端面试题整理-vue指令开发
在 bind 钩子中，我为绑定的元素添加了一个点击事件监听器，当元素被点击时，执行复制操作。我当时在开发点击复制文本的功能，我有很多个元素都想有这个功能，但是我又不想每个元素都绑定一个 onClick
阅读更多2024-11-15
在使用ipc通信时，在渲染进程的Vue + TypeScript 开发过程，给window对象添加属性并赋值时，发生报错解决方法
在使用ipc通信时，在渲染进程的Vue + TypeScript 开发过程，给window对象添加属性并赋值时，发生报错解决方法
阅读更多2024-11-15
GESP4级考试语法知识（贪心算法（四））
GESP4级考试语法知识（贪心算法（四））
阅读更多2024-11-15
20241114在飞凌的OK3588-C的核心板上跑Linux R4时通过iperf3测试以太网卡的实际网速
创建一个eth0配置文件，配置文件的路径为：/etc/network/interfaces.d/eth0,设置动态ip的配置文件。虽然飞凌的OK3588-C的核心板使用的是千兆网卡RTL8211
阅读更多2024-11-15
【EmbeddedGUI】脏矩阵设计说明
脏矩阵设计说明
阅读更多2024-11-15

昇思25天学习打卡营第16天|LLM-MindNLP ChatGLM-6B StreamChat

打卡

任务说明

环境配置

部署方式

ChatGLM-6B 体验截图示例

ChatGLM-6B 模型结构解析如下

ChatGLM2-6B 模型结构解析如下

相关文章