【Elasticsearch】inference ingest pipeline

🕗 Published 2025-01-22 10:16 · elasticsearch

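Elasticsearch's ingest pipelines let you preprocess documents before they are indexed. A pipeline can apply a range of transformations and enrichments, including inference with a machine learning model, which is useful for tasks such as generating word embeddings and sentiment analysis.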

### Using an Inference Ingest Pipeline

The steps below show how to deploy a pretrained machine learning model in Elasticsearch and run inference with it from an ingest pipeline.

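### Step 1: Prepare the machine learning model

First, you need a pretrained model deployed in Elasticsearch's machine learning module. Elasticsearch cannot load arbitrary model files: natural language processing models run in PyTorch's TorchScript format and are usually imported with the Eland client, so a model built with TensorFlow or exported to ONNX has to be converted first.

#### Example: registering a model

1. **Download or train a model**: make sure you have the model in a format Elasticsearch can run (for an NLP model, a TorchScript file together with its vocabulary).

2. **Register the model**: use the machine learning API to create the trained model configuration. The request below creates the configuration only. The model definition and vocabulary are uploaded separately (Eland handles this for you), and the deployment must be started with `POST _ml/trained_models/my_word_embedding_model/deployment/_start` before a pipeline can call the model.

```json
PUT _ml/trained_models/my_word_embedding_model
{
  "model_type": "pytorch",
  "input": {
    "field_names": ["text"]
  },
  "inference_config": {
    "text_embedding": {}
  }
}
```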
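The Console-style requests in this article can also be sent from a script. The sketch below uses only the Python standard library to turn a method, path, and JSON body into an HTTP request, then prepares a lookup of the model registered above. The cluster address `http://localhost:9200`, the lack of authentication, and the helper names `build_request` and `send` are assumptions made for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local development cluster without security


def build_request(method, path, body=None, base_url=ES_URL):
    """Turn a Console-style request (method, path, JSON body) into a urllib Request."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{path.lstrip('/')}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send the request and decode the JSON response (needs a reachable cluster)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Prepare a lookup of the trained model configuration; nothing is sent yet.
req = build_request("GET", "_ml/trained_models/my_word_embedding_model")
print(req.get_method(), req.full_url)
```

Calling `send(req)` against a running cluster would return the stored configuration, or an HTTP error if the model ID is unknown.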

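### Step 2: Create the ingest pipeline

Create an ingest pipeline whose `inference` processor calls the model from step 1. The model expects its input in a field named `text`, while the documents in this example carry a `word` field, so `field_map` maps one onto the other.

```json
PUT _ingest/pipeline/word_embedding_pipeline
{
  "description": "Pipeline to add word embeddings using a trained model",
  "processors": [
    {
      "inference": {
        "model_id": "my_word_embedding_model",
        "target_field": "embedding",
        "field_map": {
          "word": "text"
        }
      }
    }
  ]
}
```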
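Before indexing real data, a pipeline can be tried out with the simulate API (`POST _ingest/pipeline/<pipeline>/_simulate`), which runs the processors on sample documents without storing anything. The sketch below only builds the request body; the helper name `simulate_body` and the sample words are illustrative.

```python
import json


def simulate_body(sources):
    """Wrap plain documents in the shape the simulate API expects."""
    return {"docs": [{"_source": dict(source)} for source in sources]}


path = "_ingest/pipeline/word_embedding_pipeline/_simulate"
body = simulate_body([{"word": "example"}, {"word": "pipeline"}])

print("POST", path)
print(json.dumps(body, indent=2))
```

The response lists each document as it would look after the pipeline ran, which makes a missing input field or an undeployed model visible before any data is written.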

### Step 3: Index data through the ingest pipeline

When you index a document, name the pipeline in the `pipeline` query parameter.

```json
POST word_embeddings/_doc?pipeline=word_embedding_pipeline
{
  "word": "example"
}
```
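The same `pipeline` parameter is accepted by the bulk API, which is the usual route for more than a handful of documents. As a minimal sketch, the function below builds the newline-delimited JSON payload for `POST word_embeddings/_bulk?pipeline=word_embedding_pipeline`; the helper name `bulk_body` and the sample words are illustrative, and the payload would be sent with the `application/x-ndjson` content type.

```python
import json


def bulk_body(sources):
    """Build an NDJSON bulk payload: one action line plus one source line per document."""
    lines = []
    for source in sources:
        lines.append(json.dumps({"index": {}}))  # target index comes from the URL
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


path = "word_embeddings/_bulk?pipeline=word_embedding_pipeline"
payload = bulk_body([{"word": "example"}, {"word": "search"}])

print("POST", path)
print(payload, end="")
```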

### Example: complete flow

The complete flow from scratch takes four requests. The first two, registering the model and creating the ingest pipeline, are the same requests shown in steps 1 and 2 above; the remaining two create the index and index a document.

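#### 3. Create the index

The `inference` processor writes its output as an object under `target_field`, with the vector itself in `predicted_value` by default, so the `dense_vector` mapping belongs on `embedding.predicted_value`.

```json
PUT word_embeddings
{
  "mappings": {
    "properties": {
      "word": {
        "type": "keyword"
      },
      "embedding": {
        "properties": {
          "predicted_value": {
            "type": "dense_vector",
            "dims": 100 // set this to the dimensionality of your embedding model
          }
        }
      }
    }
  }
}
```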
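The `dims` value has to match the length of the vectors the model produces, otherwise indexing fails. The sketch below builds the index body for a given dimensionality and checks a vector against it. It assumes the vector is stored at `embedding.predicted_value` (the inference processor's default sub-field under `target_field`), written here as a dotted field name; the function names are illustrative.

```python
def embedding_mapping(dims, vector_field="embedding.predicted_value"):
    """Return an index body mapping `word` as keyword and the vector as dense_vector."""
    if not isinstance(dims, int) or dims <= 0:
        raise ValueError("dims must be a positive integer")
    return {
        "mappings": {
            "properties": {
                "word": {"type": "keyword"},
                vector_field: {"type": "dense_vector", "dims": dims},
            }
        }
    }


def matches_mapping(vector, dims):
    """True when a vector has exactly the number of dimensions the mapping declares."""
    return len(vector) == dims


index_body = embedding_mapping(100)
print(index_body["mappings"]["properties"]["embedding.predicted_value"])
print(matches_mapping([0.0] * 100, 100), matches_mapping([0.0] * 99, 100))
```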

#### 4. Index a document

```json
POST word_embeddings/_doc?pipeline=word_embedding_pipeline
{
  "word": "example"
}
```

### Verifying the result

Query the index to confirm that the document was indexed and that the embedding vector was added.

```json
GET word_embeddings/_search
{
  "query": {
    "match": {
      "word": "example"
    }
  }
}
```
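Checking the response by eye works for one document; for more, a small script can flag hits that came back without an embedding. This is a minimal sketch operating on an already decoded search response. The function name and the hand-written sample response are illustrative, not output from a real cluster.

```python
def hits_missing_field(response, field="embedding"):
    """Return the _id of every hit whose _source lacks a non-empty `field`."""
    return [
        hit["_id"]
        for hit in response["hits"]["hits"]
        if not hit.get("_source", {}).get(field)
    ]


# Hand-written stand-in for a search response; the values are illustrative only.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"word": "example", "embedding": {"predicted_value": [0.1, 0.2]}}},
            {"_id": "2", "_source": {"word": "example"}},
        ]
    }
}

print(hits_missing_field(sample_response))  # → ['2']
```

An empty list means every returned document went through the pipeline successfully.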

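### Notes

1. **Model availability**: make sure the model has been uploaded completely and its deployment started; the `inference` processor fails if the model it references is not available.

2. **Model format**: Elasticsearch runs only certain model formats, so confirm that yours is one it supports.

3. **Performance**: inference adds work to every indexed document and can slow indexing noticeably at high volume. Run performance tests before relying on it in production.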
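One common way to keep the overhead manageable is to send documents in fixed-size batches instead of one request per document. The sketch below splits any iterable into batches; the batch size of 2 is purely for illustration, and a suitable size for a real cluster is something to find through the performance tests mentioned above.

```python
from itertools import islice


def batches(documents, size):
    """Yield lists of at most `size` documents from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


words = [{"word": w} for w in ["alpha", "beta", "gamma", "delta", "epsilon"]]
for batch in batches(words, 2):
    print(len(batch), [doc["word"] for doc in batch])
```

Each batch can then be turned into one bulk request that names the pipeline.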

With these steps, an inference ingest pipeline preprocesses your data in Elasticsearch so that word embeddings are computed and stored automatically at index time.

Running the verification search above returns every document whose `word` field matches "example". Assuming you created the index and indexed data as described, each hit carries metadata such as `_id` alongside its `_source`.

### Sample response

Assuming you have indexed some documents, the response might look like this:

```json
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "word_embeddings",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "word": "example",
          "embedding": {
            "predicted_value": [0.1, 0.2, ..., 0.100],
            "model_id": "my_word_embedding_model"
          }
        }
      }
    ]
  }
}
```

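### Explanation

- **`took`**: time the search took, in milliseconds.
- **`timed_out`**: whether the search timed out.
- **`_shards`**: shard statistics: total, successful, skipped, and failed shards.
- **`hits`**: the search results.
  - **`total`**: the number of matching documents.
  - **`max_score`**: the highest relevance score among the matches.
  - **`hits`**: the list of matching documents, each with the following fields:
    - **`_index`**: the index the document belongs to.
    - **`_type`**: the document type. It is always `_doc` in Elasticsearch 7.x; from 8.0 on the field is no longer returned.
    - **`_id`**: the document's unique identifier.
    - **`_score`**: the document's relevance score.
    - **`_source`**: the original document content, with all its fields and values.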
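To make the structure concrete, the sketch below pulls the fields described above out of a decoded response. The sample dictionary mirrors the response shown in this article, with the embedding left out for brevity; the function name `summarize` is illustrative.

```python
def summarize(response):
    """Collect the headline fields of a search response into a flat dictionary."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "failed_shards": response["_shards"]["failed"],
        "total": hits["total"]["value"],
        "max_score": hits["max_score"],
        "ids": [hit["_id"] for hit in hits["hits"]],
    }


# Mirrors the sample response from the article (embedding omitted for brevity).
sample = {
    "took": 1,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 0.2876821,
        "hits": [
            {"_index": "word_embeddings", "_id": "1", "_score": 0.2876821,
             "_source": {"word": "example"}},
        ],
    },
}

print(summarize(sample))
```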

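### Walking through the example

Suppose you indexed one document whose `word` field is "example" and whose `embedding` field holds the vector `[0.1, 0.2, ..., 0.100]`. The search returns that document with the following values:

- **`_index`**: the index the document belongs to, here `word_embeddings`.
- **`_type`**: the document type, here `_doc`.
- **`_id`**: the document's unique identifier, assumed to be `1`.
- **`_score`**: the document's relevance score, here `0.2876821`.
- **`_source`**: the original document content, containing the `word` and `embedding` fields.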

### Going further

To process or filter the results further, add more clauses to the request. For example, `size` limits the number of documents returned and `sort` orders them by a field of your choice.

```json
GET word_embeddings/_search
{
  "query": {
    "match": {
      "word": "example"
    }
  },
  "size": 10,
  "sort": [
    { "_score": { "order": "desc" } }
  ]
}
```
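When such requests are generated by a program, it helps to build the body in one place. The sketch below assembles the same search body with a configurable `size` and `sort`; the function name `search_body` and the secondary sort on `word` are illustrative choices.

```python
def search_body(word, size=10, sort=None):
    """Build a match query on `word` with a size limit and an optional sort clause."""
    if size < 0:
        raise ValueError("size must not be negative")
    body = {"query": {"match": {"word": word}}, "size": size}
    if sort:
        body["sort"] = sort
    return body


# Highest score first, ties broken alphabetically by the keyword field.
body = search_body(
    "example",
    size=5,
    sort=[{"_score": {"order": "desc"}}, {"word": {"order": "asc"}}],
)
print(body)
```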

Hopefully these explanations and examples make Elasticsearch search responses easier to read.

What does the `PUT _ingest/pipeline/word_embedding_pipeline` request do?

It creates an ingest pipeline in Elasticsearch that runs a pretrained machine learning model on each document before it is indexed and adds the inference result (the word embedding vector) to the document. In detail:

### 1. Create the ingest pipeline

```json
PUT _ingest/pipeline/word_embedding_pipeline
{
  "description": "Pipeline to add word embeddings using a trained model",
  "processors": [
    {
      "inference": {
        "model_id": "my_word_embedding_model",
        "target_field": "embedding",
        "field_map": {
          "word": "text"
        }
      }
    }
  ]
}
```

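### Detailed explanation

- **`PUT _ingest/pipeline/word_embedding_pipeline`**: an HTTP PUT request that creates the ingest pipeline, or replaces an existing one with the same name. `word_embedding_pipeline` is the pipeline's name; choose whatever name suits you.
- **`description`**: an optional field describing the pipeline's purpose. Here it reads "Pipeline to add word embeddings using a trained model".
- **`processors`**: an array of one or more processors. Each processor defines one processing step, and the steps run in the order listed.
- **`inference`**: the processor that runs a machine learning model. Its main parameters are:
  - **`model_id`**: the ID of the pretrained model to use, here `my_word_embedding_model`.
  - **`target_field`**: the field that receives the inference result, here `embedding`. The result is written as an object under that field, with the prediction itself in `embedding.predicted_value` by default.
  - **`field_map`**: optional; maps document field names to the input field names the model expects, for example the document's `word` field to the model's `text` input.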
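The effect of these parameters can be mimicked in a few lines of plain Python. This is a conceptual sketch only, not how Elasticsearch implements the processor: `toy_model` is a stand-in that returns a made-up two-number "embedding", and the default sub-field name `predicted_value` is taken from the processor's documented default.

```python
def run_inference_processor(doc, model, target_field, field_map=None,
                            results_field="predicted_value"):
    """Mimic how an inference processor enriches a single document."""
    field_map = field_map or {}
    # Rename document fields to the input names the model expects.
    model_input = {field_map.get(name, name): value for name, value in doc.items()}
    enriched = dict(doc)  # leave the caller's document untouched
    enriched[target_field] = {results_field: model(model_input)}
    return enriched


def toy_model(fields):
    """Stand-in model: derives two numbers from the text instead of a real embedding."""
    text = fields["text"]
    return [float(len(text)), float(len(set(text)))]


result = run_inference_processor(
    {"word": "example"}, toy_model, "embedding", field_map={"word": "text"}
)
print(result)
```

Without the `field_map` argument, `toy_model` would raise a `KeyError` because the document has no `text` field, which is the same kind of mismatch the mapping prevents in a real pipeline.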

### 2. Index data through the ingest pipeline

Once the pipeline exists, name it when you index a document. Elasticsearch then runs inference before the document is indexed and adds the result to it.

```json
POST word_embeddings/_doc?pipeline=word_embedding_pipeline
{
  "word": "example"
}
```

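### Example flow

1. **Upload the model**:

   - First, upload the pretrained machine learning model to Elasticsearch. The following steps assume a model named `my_word_embedding_model` is already in place.

2. **Create the ingest pipeline**:

   - Use the request above to create a pipeline that runs `my_word_embedding_model` and stores the result in the `embedding` field.

3. **Index data**:

   - Name the pipeline when indexing. Elasticsearch runs inference before the document is indexed and adds the result to it.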
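The order of the flow above can be captured in a small driver that issues the requests through a caller-supplied `send(method, path, body)` function. This is a sketch under stated assumptions: the model upload is taken as already done, `run_flow` and `record` are hypothetical names, and the `field_map` entry is an addition that feeds the document's `word` field to a model input named `text`.

```python
def run_flow(send):
    """Issue the requests from the flow above, in order, through `send`."""
    pipeline_id = "word_embedding_pipeline"
    steps = [
        # Create the ingest pipeline (the model upload is assumed to be done).
        ("PUT", f"_ingest/pipeline/{pipeline_id}", {
            "description": "Pipeline to add word embeddings using a trained model",
            "processors": [{
                "inference": {
                    "model_id": "my_word_embedding_model",
                    "target_field": "embedding",
                    # Assumption: the model reads its input from a field named `text`.
                    "field_map": {"word": "text"},
                }
            }],
        }),
        # Index a document through the pipeline.
        ("POST", f"word_embeddings/_doc?pipeline={pipeline_id}", {"word": "example"}),
    ]
    return [send(method, path, body) for method, path, body in steps]


# Stand-in sender that records calls instead of contacting a cluster.
calls = []


def record(method, path, body):
    calls.append((method, path))
    return {"acknowledged": True}


run_flow(record)
print(calls)
```

Passing the sender in as a parameter keeps the sequence testable without a cluster; a real sender would perform the HTTP call and raise on errors so that later steps do not run after a failure.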


With these steps, an ingest pipeline processes your data automatically in Elasticsearch and adds the word embedding vector to each document.

Original article: https://blog.csdn.net/risc123456/article/details/145286590
