Elasticsearch Ingest Pipelines

🕗 发布于 2024-10-16 15:27 elasticsearch 大数据

1. 前言

在将第三方数据源的数据导入到Elasticsearch中时，原始数据长什么样，索引后的文档就是什么样。文档数据结构不统一，导致后续数据分析时变得麻烦，以往需要额外写一个中间程序来读取原始数据，转换加工后再写入到Elasticsearch，比较麻烦，于是官方推出了Ingest pipeline。

Ingest pipeline 允许文档在被索引之前对数据进行预处理，将数据加工处理成我们需要的格式，例如删除或增加一些字段等。有了Ingest pipeline，就不需要我们再额外开发中间程序对数据进行加工转换等操作了，全权交给Elasticsearch即可。

2. 创建Ingest pipeline

Pipeline 由一系列处理器 Processor 组成，每个Processor按照顺序执行，对传入的文档进行加工处理，Elasticsearch再将加工后的文档添加到索引中。

默认情况下，每个节点都具备ingest能力，即文档预处理的能力。如果要预处理大量的文档，建议开启专门的ingest节点，配置如下：

node.roles: [ ingest ]

可以通过Elasticsearch提供的「_ingest/pipeline」端点来创建Ingest pipeline，如下示例，创建了一个名为“test-pipeline”的pipeline，它由三个Processor组成：lowercase负责将字段值改为小写、remove负责删除给定字段、rename负责重命名给定字段。

PUT _ingest/pipeline/test-pipeline
{
  "description": "测试pipeline",
  "processors": [
    {
      "lowercase": {
        "field": "title"
      },
      "remove": {
        "field": "extended_data"
      },
      "rename": {
        "field": "field_a",
        "target_field": "field_b"
      }
    }
  ]
}

然后，我们就可以通过Elasticsearch提供的模拟API，来测试我们的Pipelline，如下示例：

POST _ingest/pipeline/test-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "我是TITLE",
        "extended_data": "我是扩展数据",
        "field_a": "我是要rename的field"
      }
    }
  ]
}

Pipeline返回的结果符合我们的预期，字段title的值英文部分改为了等效的小写、extended_data字段被删除了、field_a被重命名为field_b。

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "title": "我是title",
          "field_b": "我是要rename的field"
        },
        "_ingest": {
          "timestamp": "2024-04-15T07:21:20.521194Z"
        }
      }
    }
  ]
}

3. 内置的Processor

截止Elasticsearch8.13版本，官方内置了40多个的Processor供我们使用，如下是部分Processor示例：

这里介绍几个常用的，其它参考官方文档即可。

append Processor：如果字段已经存在并且是一个数组，则向现有数组追加一个或多个值。

如下示例，如果文档存在tags字段并且是数组，则会自动追加给定的两个tag

PUT _ingest/pipeline/test-pipeline
{
  "processors": [
    {
      "append": {
        "field": "tags",
        "value": ["年度十佳","Top100"]
      }
    }
  ]
}

测试结果

POST _ingest/pipeline/test-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "tags":["热门"]
      }
    }
  ]
}

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "tags": [
            "热门",
            "年度十佳",
            "Top100"
          ]
        },
        "_ingest": {
          "timestamp": "2024-04-15T07:31:23.089093Z"
        }
      }
    }
  ]
}

bytes Processor：将人类可读的字节值(例如1KB)转换为整型字节值(例如1024)，如果是数组则转换所有成员。

如下示例，字段memory将会由人类可读的字符串转换为整型字节值

PUT _ingest/pipeline/test-pipeline
{
  "processors": [
    {
      "bytes": {
        "field": "memory"
      }
    }
  ]
}

测试结果

POST _ingest/pipeline/test-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "memory":"132MB"
      }
    }
  ]
}

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "memory": 138412032
        },
        "_ingest": {
          "timestamp": "2024-04-15T07:35:07.233976Z"
        }
      }
    }
  ]
}

convert Processor：将文档中的某个字段转换为给定的数据类型，例如将字符串转换为布尔类型。数据类型转换不能随意配置，否则会报错，支持的类型有：integer、long、float、double、string、boolean、ip、auto。

如下示例，将字符串转换为布尔类型

PUT _ingest/pipeline/test-pipeline
{
  "processors": [
    {
      "convert": {
        "field": "deleted",
        "type": "boolean"
      }
    }
  ]
}

测试结果

POST _ingest/pipeline/test-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "deleted":"false"
      }
    }
  ]
}

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "deleted": false
        },
        "_ingest": {
          "timestamp": "2024-04-15T07:40:42.066717Z"
        }
      }
    }
  ]
}

date Processor：从文档中的某个字段中解析时间，并将其作为文档的时间戳，默认时间戳的字段名为”@timestamp“。

如下示例，我们将publish_time作为文档的时间戳：

PUT _ingest/pipeline/test-pipeline
{
  "processors": [
    {
      "date": {
        "field": "publish_time",
        "formats": ["yyyy-MM-dd HH:mm:ss"]
      }
    }
  ]
}

测试结果：

POST _ingest/pipeline/test-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "publish_time":"2024-01-01 00:00:00"
      }
    }
  ]
}

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "publish_time": "2024-01-01 00:00:00",
          "@timestamp": "2024-01-01T00:00:00.000Z"
        },
        "_ingest": {
          "timestamp": "2024-04-15T07:44:28.383089Z"
        }
      }
    }
  ]
}

最后介绍一个功能强大的Processor：script，通过内联的自定义脚本来操作文档，使用起来相当灵活。

如下示例，Processor将从content字段中提取品牌、商品标题和价格，并移除掉content字段。

PUT _ingest/pipeline/test-pipeline
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
        String[] splits = ctx['content'].splitOnToken(',');
        ctx['brand'] = splits[0];
        ctx['title'] = splits[1];
        ctx['price'] = Integer.parseInt(splits[2]);
        ctx.remove('content');
        """
      }
    }
  ]
}

测试结果

POST _ingest/pipeline/test-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "content":"小米,小米汽车SU7,214900"
      }
    }
  ]
}

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "title": "小米汽车SU7",
          "brand": "小米",
          "price": 214900
        },
        "_ingest": {
          "timestamp": "2024-04-15T07:58:44.134495Z"
        }
      }
    }
  ]
}

4. 应用Ingest pipeline

Ingest pipeline创建好以后，主要有四类应用场景。

先创建一个名为”set-last-update-time-pipeline“的pipeline，它的目的是给文档设置一个最后修改的时间戳。

PUT _ingest/pipeline/set-last-update-time-pipeline
{
  "processors": [
    {
      "set": {
        "field": "last_update_time",
        "value": "{{{_ingest.timestamp}}}"
      }
    }
  ]
}

4.1 索引时指定pipeline

可以在索引文档时指定Pipeline，文档写入前会先经过管道的预处理。

下面是没指定Pipeline时写入文档和索引结果

POST test-index/_doc/1
{
  "title":"new doc"
}

{
  "_index": "test-index",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "title": "new doc"
  }
}

下面是指定Pipeline时索引文档和索引结果，发现文档自动添加了last_update_time字段。

POST test-index/_doc/1?pipeline=set-last-update-time-pipeline
{
  "title":"new doc"
}

{
  "_index": "test-index",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "title": "new doc",
    "last_update_time": "2024-04-15T08:11:50.448923Z"
  }
}

4.2 更新时指定pipeline

在更新文档时，也可以指定Pipeline对要更新的文档执行预处理操作。

如下示例，先索引两个文档

POST test-index/_doc/1
{
  "title":"new doc1"
}
POST test-index/_doc/2
{
  "title":"new doc2"
}

然后利用”_update_by_query“端点来批量更新文档，同时指定pipeline，即可给所有文档添加last_update_time

POST test-index/_update_by_query?pipeline=set-last-update-time-pipeline
{
  "query": {
    "match_all": {}
  }
}

下面是更新后的文档结果

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test-index",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "new doc1",
          "last_update_time": "2024-04-15T08:17:24.527237Z"
        }
      },
      {
        "_index": "test-index",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "new doc2",
          "last_update_time": "2024-04-15T08:17:24.528187Z"
        }
      }
    ]
  }
}

4.3 索引映射指定pipeline

如果不想每次写入和更新时都指定Pipeline，也可以在创建索引时指定Pipeline，如下示例：

PUT test-index
{
  "settings": {
    "index":{
      "default_pipeline":"set-last-update-time-pipeline"
    }
  }, 
  "mappings": {
    "properties": {
      "title":{
        "type": "keyword"
      }
    }
  }
}

索引文档或更新文档时，会自动触发对应的Pipeline对文档进行预处理。如下示例，写入两篇文档不指定Pipeline

POST test-index/_doc/1
{
  "title":"new doc1"
}
POST test-index/_doc/2
{
  "title":"new doc2"
}

检索文档发现均已添加last_update_time字段

[
  {
    "_index": "test-index",
    "_id": "1",
    "_score": 1,
    "_source": {
      "title": "new doc1",
      "last_update_time": "2024-04-15T08:21:49.591599Z"
    }
  },
  {
    "_index": "test-index",
    "_id": "2",
    "_score": 1,
    "_source": {
      "title": "new doc2",
      "last_update_time": "2024-04-15T08:21:49.592733Z"
    }
  }
]

4.4 reindex指定pipeline

Elasticsearch提供了reindex API用于将一个索引的文档复制到另一个索引，reindex过程也可以指定Pipeline在写入目标索引前进行预处理。

如下示例，先定义源索引以及写入两篇文档

PUT source-index
{
  "mappings": {
    "properties": {
      "title":{
        "type": "keyword"
      }
    }
  }
}

POST source-index/_doc
{
  "title":"doc1"
}
POST source-index/_doc
{
  "title":"doc2"
}

接着定义目标索引，映射新增last_update_time字段

PUT dest-index
{
  "mappings": {
    "properties": {
      "title":{
        "type": "keyword"
      },
      "last_update_time":{
        "type": "date"
      }
    }
  }
}

然后调用reindex API完成文档迁移

POST _reindex
{
  "source": {
    "index": "source-index"
  },
  "dest": {
    "index": "dest-index",
    "pipeline": "set-last-update-time-pipeline"
  }
}

最后查看dest-index文档数据，发现均添加last_update_time字段

[
  {
    "_index": "dest-index",
    "_id": "glfh4I4BXAgLe9UUkOwp",
    "_score": 1,
    "_source": {
      "title": "doc1",
      "last_update_time": "2024-04-15T08:43:18.905182Z"
    }
  },
  {
    "_index": "dest-index",
    "_id": "g1fh4I4BXAgLe9UUkuy1",
    "_score": 1,
    "_source": {
      "title": "doc2",
      "last_update_time": "2024-04-15T08:43:18.906068Z"
    }
  }
]

5. 尾巴

Ingest Pipelines 是Elasticsearch一个非常实用的功能，它类似于大数据中的ETL，在真正索引数据前，先经过Pipeline完成数据的清洗和加工，把原始数据转换成我们想要的格式再索引，利于后续的数据分析。默认情况下，所有节点都具备Ingest能力，如果要预处理大量文档，建议部署专门的ingest节点，避免影响到主数据节点。

原文地址：https://blog.csdn.net/qq_32099833/article/details/142958068

免责声明：本站文章内容转载自网络资源，如本站内容侵犯了原著者的合法权益，可联系本站删除。更多内容请关注自学内容网（zxcms.com）！

上一篇：Mac 在vscode 中的常用快捷键
下一篇：PetaLinux工程的常用命令——petalinux-config

FPGA实现PCIE与电脑端QT上位机双向数据传输，基于XDMA中断模式，提供3套工程源码和技术支持
FPGA实现PCIE与电脑端QT上位机双向数据传输，基于XDMA中断模式，提供9套工程源码和技术支持
阅读更多2024-10-18
SpringBoot智能物流管理系统开发指南
在概念模式的设计中，E-R模型法是最常见的设计方法。B/S架构的运行方式是在远程的服务器上把开发的软件系统部署在远程的服务器上，在部署好软件系统之后就可以实现在任何接入互联网的电脑上访问部署好的软件系
阅读更多2024-10-18
docker删除容器要怎么操作？如何保障容器安全！
攻击者使用越来越多的高级技术入侵容器，而供应链攻击是云原生安全面临的较大问题，恶意镜像比较容易构造，绕过杀软的静态检测。可以清晰地盘点工作负载本身的相关信息，此外，还能够实现不同工作负载之间的关系可视
阅读更多2024-10-18
html5的简单问题
语义化更好的内容元素；cookies是网站为了标示用户身份而存储在用户本地终端上的数据，cookies数据始终在同源的http请求中携带，在浏览器和服务器间来回传递。导致回流现象的操作：添加或删除可见
阅读更多2024-10-18
uniapp-vue3-vite 搭建小程序、H5 项目模板
搭建一个使用和的小程序与 H5 项目模板相对简单，下面将详细介绍步骤，包括环境配置、项目创建、代码结构和基本示例。
阅读更多2024-10-18
python2与python3之间的区别
Python 2 和 Python 3 之间存在许多重要的区别。
阅读更多2024-10-18
EtherCAT总线轴实际速度和实际转矩滤波(一阶低通滤波器和TD跟踪微分器滤波对比测试)
.....
阅读更多2024-10-18
宝塔面板+HYBBS搭建私有BBS结合内网穿透一键发布论坛至公网访问
本文主要为大家介绍，如何使用HYBBS在Ubuntu系统上搭建一个好看、轻巧且扩展功能强大的论坛网站，并结合内网穿透工具发布公网随时远程访问。在国内互联网刚刚起步时，网民上网大多数时候只能浏览门户网站
阅读更多2024-10-18
自动驾驶系列—CI在自动驾驶开发中的应用场景与实现详解
随着自动驾驶技术的迅猛发展，现代车辆变得越来越智能化。越来越多的车辆功能和特性不再依赖于硬件，而是通过软件来实现。这使得自动驾驶系统的开发变得更加复杂，尤其是需要频繁更新的自动驾驶算法和控制逻辑。在如
阅读更多2024-10-18
Android打开Activity时不自动弹出键盘
在一个页面有2个EditText输入框，刚进入activity的时候系统默认第一个EditText获得焦点且键盘弹出。只需要到配置文件增加以下配置即可。这里分2种情况处理。
阅读更多2024-10-18

Elasticsearch Ingest Pipelines

1. 前言

2. 创建Ingest pipeline

3. 内置的Processor

4. 应用Ingest pipeline

4.1 索引时指定pipeline

4.2 更新时指定pipeline

4.3 索引映射指定pipeline

4.4 reindex指定pipeline

5. 尾巴

相关文章