SentenceTransformers (SBERT)

Published 2024-07-24 11:36. Tags: Sentence Transformers, SBERT, Cross Encoder, reranker

1. About SBERT

Official documentation: https://www.sbert.net/
GitHub: https://github.com/UKPLab/sentence-transformers
Paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
https://arxiv.org/abs/1908.10084
Model library: https://huggingface.co/models?library=sentence-transformers

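Features

Computes a fixed-size vector representation (embedding) of a given text or image.
Embedding computation is usually efficient, and embedding similarity computation is very fast.
Applicable to a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, and paraphrase mining.
Often used as the first step of a two-step retrieval process, in which a Cross Encoder (a.k.a. reranker) model re-ranks the top-k results returned by the bi-encoder.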

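Pretrained models

A large list of pretrained models covering more than 100 languages is available at: https://huggingface.co/models?library=sentence-transformers

Some models are general-purpose, while others produce embeddings for specific use cases. To load a pretrained model, pass its name to the constructor, e.g. SentenceTransformer('model_name').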

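Application examples

You can use this framework for the tasks mentioned earlier (semantic textual similarity, semantic search, clustering, classification, paraphrase mining) and for many more use cases.

For all examples, see examples/applications.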

2. Installation

Python 3.8+ and PyTorch 1.11.0+ are recommended.

You can install sentence-transformers with pip:

pip install -U sentence-transformers

Install with conda

conda install -c conda-forge sentence-transformers

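Install from source

Alternatively, clone the latest version from the repository (https://github.com/UKPLab/sentence-transformers) and install it directly from source by running the following from the repository root: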

pip install -e .

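PyTorch with CUDA

To use a GPU with CUDA, you must install a PyTorch build that matches your CUDA version. See the PyTorch Get Started guide for details on installing PyTorch.

Development setup

After cloning the repo (or a fork) to your machine, run the following inside a virtual environment: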

python -m pip install -e ".[dev]"

pre-commit install

To test your changes, run:

pytest

3. Getting Started

Using a Sentence Transformers model is straightforward:

from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model 
model = SentenceTransformer("all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])

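SentenceTransformer("all-MiniLM-L6-v2") selects which Sentence Transformers model to load. In this example we load all-MiniLM-L6-v2, a MiniLM model fine-tuned on a large dataset of over 1 billion training pairs.

SentenceTransformer.similarity() computes the similarity between all pairs of sentences. As expected, the similarity between the first two sentences (0.6660) is higher than the similarity between the first and third sentences (0.1046) or between the second and third sentences (0.1411).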
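
To make the pairwise similarity matrix concrete, here is a minimal sketch in plain Python. It assumes the similarity function is cosine similarity (the diagonal of 1.0000 in the output above is consistent with that), and it uses made-up 3-dimensional vectors rather than real model embeddings. The helper names cosine_similarity and similarity_matrix are illustrative and are not part of the library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Score every vector against every other vector (including itself)."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy "embeddings" (assumption: hand-picked values, not produced by a model).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],  # points in nearly the same direction as the first vector
    [0.0, 0.0, 1.0],  # orthogonal to both of the others
]

matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(value, 4) for value in row])
```

As in the real example, the matrix is symmetric and its diagonal is 1, because every vector is maximally similar to itself.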

Fine-tuning Sentence Transformers models is simple and takes only a few lines of code. For details, see the Training Overview section.

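4. Training

The framework lets you fine-tune your own sentence embedding models so that you get task-specific sentence embeddings. A range of options is available for tailoring the embeddings to your task.

For an introduction to training your own embedding models, see the Training Overview. Various examples show how to train models on a range of datasets.

Some highlights:

Support for various Transformer networks, including BERT, RoBERTa, XLM-R, DistilBERT, ELECTRA, BART, ...
Multilingual and multi-task learning
Evaluation during training to find the best model
More than 20 loss functions, including triplet loss and contrastive loss, for tuning models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, and more.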
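
As an illustration of one of the loss families named above, the following is a minimal sketch of the textbook triplet loss, max(0, d(anchor, positive) - d(anchor, negative) + margin). It assumes Euclidean distance and a margin of 1.0, both chosen only for this example; the library's own loss classes may use different defaults, and the function names here are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1.0 from the anchor
far_negative = [3.0, 4.0]    # distance 5.0 from the anchor
near_negative = [0.0, 1.5]   # distance 1.5 from the anchor

easy_loss = triplet_loss(anchor, positive, far_negative)
hard_loss = triplet_loss(anchor, positive, near_negative)
print(easy_loss, hard_loss)
```

The far negative already satisfies the margin and contributes no loss, while the near negative is penalized, which is what pushes the embeddings of dissimilar texts apart during training.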

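5. Cross Encoder

Characteristics of Cross Encoder (a.k.a. reranker) models:

Computes a similarity score for a given pair of texts.
Generally provides better accuracy than a Sentence Transformers (a.k.a. bi-encoder) model.
Usually slower than a Sentence Transformers model, because it requires a computation for each pair of texts rather than for each text.
Because of these two characteristics, Cross Encoders are often used to re-rank the top-k results from a Sentence Transformers model.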
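
The retrieve-then-rerank flow described above can be sketched without any model, as follows. Everything in this block is a stand-in: the 2-dimensional vectors play the role of bi-encoder embeddings, and toy_pair_scores is a hypothetical lookup table playing the role of a cross encoder's per-pair scores. It only illustrates the control flow, not real model behavior.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve_then_rerank(query_vec, corpus_vecs, pair_scorer, top_k):
    # Stage 1: cheap retrieval, one vector comparison per corpus entry.
    by_similarity = sorted(
        range(len(corpus_vecs)),
        key=lambda i: cosine(query_vec, corpus_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:top_k]
    # Stage 2: expensive scoring, applied only to the surviving candidates.
    reranked = sorted(candidates, key=pair_scorer, reverse=True)
    return candidates, reranked

# Stand-in embeddings (assumption: hand-picked, not produced by a model).
query_vec = [1.0, 0.1]
corpus_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]

# Stand-in reranker scores per corpus index (hypothetical values).
toy_pair_scores = {0: 0.40, 1: 0.90, 2: 0.95, 3: 0.10}

candidates, reranked = retrieve_then_rerank(
    query_vec, corpus_vecs, toy_pair_scores.get, top_k=3
)
print("Stage 1 candidates:", candidates)
print("Stage 2 order:", reranked)
```

Note that corpus entry 2 has the highest stand-in reranker score but is never reranked, because it did not survive stage 1. This is the trade-off of the two-step design: top_k bounds the number of expensive pair computations, at the risk of dropping a good result early.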

Using a Cross Encoder model is similar to using Sentence Transformers:

from sentence_transformers.cross_encoder import CrossEncoder

# 1. Load a pretrained CrossEncoder model
model = CrossEncoder("cross-encoder/stsb-distilroberta-base")

# We want to compute the similarity between the query sentence...
query = "A man is eating pasta."

# ... and all sentences in the corpus
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

# 2. We rank all sentences in the corpus for the query
ranks = model.rank(query, corpus)

# Print the scores
print("Query: ", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}")
"""
Query:  A man is eating pasta.
0.67    A man is eating food.
0.34    A man is eating a piece of bread.
0.08    A man is riding a horse.
0.07    A man is riding a white horse on an enclosed ground.
0.01    The girl is carrying a baby.
0.01    Two men pushed carts through the woods.
0.01    A monkey is playing drums.
0.01    A woman is playing violin.
0.01    A cheetah is running behind its prey.
"""

# 3. Alternatively, you can also manually compute the score between two sentences
import numpy as np

sentence_combinations = [[query, sentence] for sentence in corpus]
scores = model.predict(sentence_combinations)

# Sort the scores in decreasing order to get the corpus indices
ranked_indices = np.argsort(scores)[::-1]
print("Scores:", scores)
print("Indices:", ranked_indices)
"""
Scores: [0.6732372, 0.34102544, 0.00542465, 0.07569341, 0.00525378, 0.00536814, 0.06676237, 0.00534825, 0.00516717]
Indices: [0 1 3 6 2 5 7 4 8]
"""

CrossEncoder("cross-encoder/stsb-distilroberta-base") selects which CrossEncoder model to load.

In this example we load cross-encoder/stsb-distilroberta-base, a DistilRoBERTa model fine-tuned on the STS Benchmark dataset.
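
To show how the two outputs above relate, this small sketch takes the scores printed by the manual predict() step (copied verbatim from the example output) and sorts them in plain Python, without NumPy. The result layout with corpus_id and score keys mirrors the fields read in the ranking loop above; building it by hand like this is only an illustration, not the library's implementation.

```python
# Scores copied from the example output above (one score per corpus sentence).
scores = [0.6732372, 0.34102544, 0.00542465, 0.07569341, 0.00525378,
          0.00536814, 0.06676237, 0.00534825, 0.00516717]

# Corpus indices ordered from highest to lowest score.
ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranked_indices)

# The same ranking as a list of dicts with corpus_id and score keys.
ranks = [{"corpus_id": i, "score": scores[i]} for i in ranked_indices]
for rank in ranks[:3]:
    print(f"{rank['score']:.2f}\t{rank['corpus_id']}")
```

The resulting index order matches the Indices line printed by the NumPy version in the example.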