
BERT's Core Components Explained, with Implementations

Abstract

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model based on the Transformer architecture that has achieved remarkable results in natural language processing. This article examines several key components of the BERT model and their implementations, including activation functions, variable initialization, embedding lookup, and layer normalization. A solid understanding of these components helps readers grasp how BERT works and optimize or adapt it in practice.

1. Introduction

The BERT model was proposed by Google researchers in 2018. Through large-scale unsupervised pre-training followed by task-specific fine-tuning, it significantly improved performance on many natural language processing tasks. This article focuses on several core components of BERT, including activation functions, variable initialization, embedding lookup, and layer normalization, and provides the corresponding code.

2. Activation Functions
2.1 Gaussian Error Linear Unit (GELU)

GELU is a smooth variant of ReLU, defined as:

GELU(x) = x · Φ(x)

where Φ(x) is the cumulative distribution function (CDF) of the standard normal distribution. In TensorFlow, GELU can be implemented with the following tanh approximation:

import numpy as np
import tensorflow as tf


def gelu(x):
  """Gaussian Error Linear Unit.

  This is a smoother version of the ReLU.
  Original paper: https://arxiv.org/abs/1606.08415
  Args:
    x: float Tensor to perform activation.

  Returns:
    `x` with the GELU activation applied.
  """
  cdf = 0.5 * (1.0 + tf.tanh(
      (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
  return x * cdf
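
To see how close the tanh approximation above is to the exact definition x·Φ(x), here is a small framework-free sketch. The names gelu_exact and gelu_tanh are ours for illustration, not part of the BERT codebase:

```python
import math

import numpy as np


def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi computed via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU, mirroring the snippet above."""
    cdf = 0.5 * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
    return x * cdf


# The approximation stays very close to the exact form over the range
# where activations typically live.
xs = np.linspace(-4.0, 4.0, 81)
max_err = max(abs(gelu_exact(float(x)) - gelu_tanh(float(x))) for x in xs)
```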
2.2 Activation Function Mapping

To make it easy to switch between activation functions, a mapping function get_activation returns the corresponding activation for a given string:

def get_activation(activation_string):
  """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.

  Args:
    activation_string: String name of the activation function.

  Returns:
    A Python function corresponding to the activation function. If
    `activation_string` is None, empty, or "linear", this will return None.
    If `activation_string` is not a string, it will return `activation_string`.

  Raises:
    ValueError: The `activation_string` does not correspond to a known
      activation.
  """
  if not isinstance(activation_string, six.string_types):
    return activation_string

  if not activation_string:
    return None

  act = activation_string.lower()
  if act == "linear":
    return None
  elif act == "relu":
    return tf.nn.relu
  elif act == "gelu":
    return gelu
  elif act == "tanh":
    return tf.tanh
  else:
    raise ValueError("Unsupported activation: %s" % act)
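
The same name-to-function dispatch can be sketched without TensorFlow; this framework-free version (get_activation_np is our name, and the table is deliberately small) shows the control flow of the function above:

```python
import math


def get_activation_np(name):
    """Maps an activation name to a plain-Python function.

    Mirrors the logic above: None/empty/"linear" yield None, unknown
    names raise ValueError. Only a couple of activations are included
    here for illustration.
    """
    if not name or name.lower() == "linear":
        return None
    table = {
        "relu": lambda x: max(x, 0.0),
        "tanh": math.tanh,
    }
    act = table.get(name.lower())
    if act is None:
        raise ValueError("Unsupported activation: %s" % name)
    return act
```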
3. Variable Initialization

In deep learning, sensible variable initialization is crucial for both convergence speed and final model quality. BERT uses a truncated normal initializer (truncated_normal_initializer) with a standard deviation of 0.02:

def create_initializer(initializer_range=0.02):
  """Creates a `truncated_normal_initializer` with the given range."""
  return tf.truncated_normal_initializer(stddev=initializer_range)
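
A truncated normal differs from an ordinary normal in that samples falling more than two standard deviations from the mean are redrawn. A NumPy sketch of that behavior (the function truncated_normal is ours, written to mirror what tf.truncated_normal_initializer does):

```python
import numpy as np


def truncated_normal(shape, stddev=0.02, rng=None):
    """Sample a truncated normal: values beyond 2 standard deviations
    are redrawn until all samples fall inside [-2*stddev, 2*stddev]."""
    rng = np.random.default_rng(0) if rng is None else rng
    out = rng.normal(0.0, stddev, size=shape)
    mask = np.abs(out) > 2.0 * stddev
    while mask.any():
        out[mask] = rng.normal(0.0, stddev, size=int(mask.sum()))
        mask = np.abs(out) > 2.0 * stddev
    return out


# Example: initialize a 768x768 weight matrix as BERT-base would.
weights = truncated_normal((768, 768), stddev=0.02)
```

Truncation keeps all initial weights small and bounded, which avoids the occasional large outlier that a plain normal draw can produce.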
4. Embedding Lookup

Embedding lookup converts input token ids into vector representations. BERT supports two methods: tf.gather() and one-hot encoding followed by a matrix multiplication:

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up word embeddings for an id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.gather()`.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)
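
The two lookup paths produce identical results; the one-hot route simply expresses the row selection as a matrix multiplication. A small NumPy sketch demonstrating the equivalence (all shapes and values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 10, 4
embedding_table = rng.normal(size=(vocab_size, embedding_size))

input_ids = np.array([[1, 5, 2], [7, 0, 3]])  # [batch_size, seq_length]
flat_ids = input_ids.reshape(-1)

# Path 1: gather, i.e. direct row indexing into the table.
gathered = embedding_table[flat_ids]

# Path 2: one-hot encode the ids, then multiply by the table.
one_hot = np.eye(vocab_size)[flat_ids]
via_one_hot = one_hot @ embedding_table

# Reshape back to [batch_size, seq_length, embedding_size].
output = gathered.reshape(input_ids.shape + (embedding_size,))
```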
5. Layer Normalization

Layer normalization is a widely used normalization technique that speeds up training and improves generalization. BERT uses tf.contrib.layers.layer_norm (a TensorFlow 1.x API) for layer normalization:

def layer_norm(input_tensor, name=None):
  """Run layer normalization on the last dimension of the tensor."""
  return tf.contrib.layers.layer_norm(
      inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)
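
What layer_norm computes over the last dimension can be written out directly: subtract the per-position mean, divide by the standard deviation, then scale and shift with learned parameters. A NumPy sketch (layer_norm_np is our name; gamma and beta stand in for the learned scale and bias, and the epsilon value is an assumption):

```python
import numpy as np


def layer_norm_np(x, gamma=1.0, beta=0.0, eps=1e-12):
    """Normalize over the last axis to zero mean and unit variance,
    then apply a scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


x = np.random.default_rng(0).normal(size=(2, 3, 8))
y = layer_norm_np(x)
```

After normalization, each [batch, position] slice of the last axis has mean approximately 0 and variance approximately 1, regardless of the input's scale.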

For convenience, a combined function layer_norm_and_dropout applies layer normalization first and then dropout:

def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
  """Runs layer normalization followed by dropout."""
  output_tensor = layer_norm(input_tensor, name)
  output_tensor = dropout(output_tensor, dropout_prob)
  return output_tensor
6. Dropout

Dropout is a common regularization technique used to prevent overfitting. In BERT, the dropout probability is set through a configuration parameter:

def dropout(input_tensor, dropout_prob):
  """Perform dropout.

  Args:
    input_tensor: float Tensor.
    dropout_prob: Python float. The probability of dropping out a value (NOT of
      *keeping* a dimension as in `tf.nn.dropout`).

  Returns:
    A version of `input_tensor` with dropout applied.
  """
  if dropout_prob is None or dropout_prob == 0.0:
    return input_tensor

  output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
  return output
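
Note that tf.nn.dropout rescales the surviving values by 1/keep_prob ("inverted dropout"), so the expected value of the output matches the input and nothing needs to change at inference time. A NumPy sketch of that behavior (dropout_np is our name for illustration):

```python
import numpy as np


def dropout_np(x, dropout_prob, rng):
    """Inverted dropout: zero each element with probability
    `dropout_prob` and scale survivors by 1/(1 - dropout_prob)."""
    if not dropout_prob:
        return x
    keep_prob = 1.0 - dropout_prob
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)


x = np.ones((1000, 100))
y = dropout_np(x, 0.1, np.random.default_rng(0))
```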
7. Loading Variables from a Checkpoint

During fine-tuning, variables usually need to be loaded from a pre-trained checkpoint. The get_assignment_map_from_checkpoint function computes the mapping between the current graph's variables and the checkpoint's variables:

def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
  """Compute the union of the current variables and checkpoint variables."""
  initialized_variable_names = {}

  name_to_variable = collections.OrderedDict()
  for var in tvars:
    name = var.name
    m = re.match("^(.*):\\d+$", name)
    if m is not None:
      name = m.group(1)
    name_to_variable[name] = var

  init_vars = tf.train.list_variables(init_checkpoint)

  assignment_map = collections.OrderedDict()
  for x in init_vars:
    (name, var) = (x[0], x[1])
    if name not in name_to_variable:
      continue
    assignment_map[name] = name
    initialized_variable_names[name] = 1
    initialized_variable_names[name + ":0"] = 1

  return (assignment_map, initialized_variable_names)
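
The regular expression in the loop exists because TF1 variable names carry an output-index suffix such as ":0", while checkpoint entries are stored without it. A standalone sketch of that normalization step (strip_device_suffix is our name; the variable name is a typical BERT example):

```python
import re


def strip_device_suffix(name):
    """Strip a trailing ':<index>' from a TF1 variable name so it can
    be matched against the names stored in a checkpoint."""
    m = re.match(r"^(.*):\d+$", name)
    return m.group(1) if m else name


ckpt_name = strip_device_suffix("bert/embeddings/word_embeddings:0")
```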
8. Conclusion

This article has examined several core components of the BERT model, including activation functions, variable initialization, embedding lookup, and layer normalization. A solid understanding of these components helps readers grasp how BERT works and optimize or adapt it in practice. We hope it serves as a useful reference for research and development in natural language processing.


Original article: https://blog.csdn.net/m0_73697499/article/details/143808418
