ES, Season 3, Lesson 14: Full-Text Tokenized Search

🕗 Published 2024-12-09 11:23 · elasticsearch · big data

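#### 1. Elasticsearch is a database, not an ordinary Java application. It needs the same hardware resources a traditional database needs, and upgrading hardware is the most effective way to improve performance.
#### 2. Elasticsearch is a document database, not a relational one. It does not provide strict ACID transactions, so any project that tries to use it as a drop-in replacement in strictly transactional scenarios will fail!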

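##### Index fields and their attributes are static settings. If they are changed later, existing data must be reindexed before the change takes effect.
##### The change does not apply to existing data!
##### Always rebuild the index!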

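#### Full-text concepts
### Overview
## 1. Text is split into terms (tokenization) during analysis
## 2. Once tokenized, the text can be searched by term
## 3. There are many tokenization algorithms; it is a deep field
## 4. Search is based on the inverted index
## 5. Relevance scoring for term search moved from TF/IDF to BM25
## 6. Only fields of type text are analyzed this way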

## Full-text search is a deep topic; a working knowledge is enough to start

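# The _analyze API tests an analyzer. The default analyzer, standard, splits text on word boundaries such as spaces and commas
# Key idea: text is tokenized and indexed when it is written, before any query runs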

POST _analyze
{
  "text": [
    "hello every body, 我是DavidSoCool, 我正在学习es"
  ],
  "analyzer":"standard"
}
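
To make the inverted index idea concrete, here is a minimal Python sketch. It is only an illustration, not how Lucene stores data: `tokenize` is a hypothetical stand-in for an analyzer that lowercases and splits on non-word characters, and the three sample documents are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Hypothetical stand-in for an analyzer: lowercase, split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Made-up sample documents keyed by id.
docs = {
    1: "Cape Town International Airport",
    2: "Sydney Kingsford Smith International Airport",
    3: "Cape Town",
}
index = build_inverted_index(docs)
print(index["cape"])           # ids of documents containing "cape"
print(index["international"])  # ids of documents containing "international"
```

A term lookup is then a dictionary access rather than a scan over every document, which is why the work is done at index time.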

### Full-text search
# match_all: match every document
# match: standard analyzed match

# Prepare the data

DELETE kibana_sample_data_flights_fulltext
POST _reindex
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "kibana_sample_data_flights_fulltext"
  }
}

## match_all: match everything
# 1. match_all has no conditions; it is equivalent to a search with no query
# 2. boost: adjusts the score given to every hit

GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match_all": {
      "boost": 10
    }
  }
}

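## match_none: the inverse of match_all, it matches no documents. Useful for checking that an index responds without paying the cost of fetching data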

GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match_none": {}
  }
}

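## match: analyzed text matching, the most commonly used query
# Results are sorted by _score by default; the more terms match, the higher the score, which is enough for a simple recommendation feature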

GET kibana_sample_data_flights_fulltext/_mapping
# First check the analyzer output: the text is split into 4 terms
POST _analyze
{
  "text": [
    "Cape Town International Airport"
  ],
  "analyzer":"standard"
}
# A document matches if any one of the terms matches
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":"Cape Town International Airport"
    }
  }
}
# Skip the first 1000 hits and look at _score and how many terms the Origin field matches
GET kibana_sample_data_flights_fulltext/_search
{
  "from":1000,
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":"Cape Town International Airport"
    }
  }
}
# Skip the first 9000 hits and look at _score and how many terms the Origin field matches
GET kibana_sample_data_flights_fulltext/_search
{
  "from":9000,
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":"Cape Town International Airport"
    }
  }
}

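## Request parameters
# query: the text to search for
# analyzer: the analyzer applied to the query text
# operator: how the terms are combined; the default is or
# minimum_should_match: the minimum number of terms that must match

# This request is equivalent to the next one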
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":{
        "query": "Cape Town International Airport",
        "analyzer": "standard",
        "operator": "or"
      }
    }
  }
}
# This request is equivalent to the previous one
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":"Cape Town International Airport"
    }
  }
}
# operator=and requires every term to match; compare total
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":{
        "query": "Cape Town International Airport",
        "analyzer": "standard",
        "operator": "and"
      }
    }
  }
}
# Drop the first two words and skip the first 100 hits
GET kibana_sample_data_flights_fulltext/_search
{
  "from": 100,
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":{
        "query": "International Airport",
        "analyzer": "standard",
        "operator": "and"
      }
    }
  }
}
# minimum_should_match controls how many terms must match; it accepts a number or a percentage
# Use it with operator or; with and the query returns no data
GET kibana_sample_data_flights_fulltext/_search
{
  "from": 100,
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":{
        "query": "Cape Town International Airport",
        "analyzer": "standard",
        "operator": "or",
        "minimum_should_match": 2
      }
    }
  }
}
# Skip ahead: there are 111 full matches in total, so skip 110
# Hit 111 still matches all terms; from hit 112 on, only 2 terms match
GET kibana_sample_data_flights_fulltext/_search
{
  "from": 110,
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":{
        "query": "Cape Town International Airport",
        "analyzer": "standard",
        "operator": "or",
        "minimum_should_match": 2
      }
    }
  }
}
# With minimum_should_match=4 (all four terms) the query behaves like operator=and
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":{
        "query": "Cape Town International Airport",
        "analyzer": "standard",
        "operator": "or",
        "minimum_should_match": 4
      }
    }
  }
}
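# minimum_should_match as a percentage: the required count is not simply the ratio of terms, since the computed number is rounded down; see the documentation
# A plain number is usually clearer; a percentage is handy when the query has many terms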
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":{
        "query": "Cape Town International Airport",
        "analyzer": "standard",
        "operator": "or",
        "minimum_should_match": "50%"
      }
    }
  }
}
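# minimum_should_match can also be negative: -1 means at most one term may be missing, so 3 of the 4 terms must match here. It is harder to read, so it is not recommended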
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":{
        "query": "Cape Town International Airport",
        "analyzer": "standard",
        "operator": "or",
        "minimum_should_match": -1
      }
    }
  }
}
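
The number, percentage, and negative forms of minimum_should_match can be hard to keep apart, so here is a small Python sketch of how each form resolves to a required term count. It follows the rules described in the Elasticsearch documentation (percentages are rounded down; negative values say how many terms may be missing), but `required_matches` is a hypothetical helper written for this article, and it deliberately ignores the combination syntax such as `3<90%`.

```python
def required_matches(optional_clauses, spec):
    """Resolve a minimum_should_match value to a required clause count.

    Covers plain integers and percentages only, not the "3<90%" combinations.
    """
    spec = str(spec).strip()
    if spec.endswith("%"):
        percent = int(spec[:-1])
        # Round the magnitude down before applying the sign.
        magnitude = (optional_clauses * abs(percent)) // 100
        # A negative percentage says how many clauses may be missing.
        result = optional_clauses - magnitude if percent < 0 else magnitude
    else:
        value = int(spec)
        # A negative number says how many clauses may be missing.
        result = optional_clauses + value if value < 0 else value
    return max(result, 0)

terms = 4  # "Cape Town International Airport" is analyzed into 4 terms
for spec in (2, 4, "50%", "75%", -1, "-25%"):
    print(spec, "->", required_matches(terms, spec))
```

With four terms, `-1` and `"-25%"` both resolve to three required terms, which is why the negative form can be surprising to read.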
# fuzziness: typo-tolerant search that corrects misspelled input; see the documentation for details
# Here Cape is mistyped as Capa
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":{
        "query": "Capa",
        "analyzer": "standard",
        "operator": "or",
        "fuzziness": 1
      }
    }
  }
}
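
Fuzziness is expressed as an edit distance, so `"fuzziness": 1` accepts terms that are one edit away from the input. The Python sketch below computes the classic Levenshtein distance to show why Capa still finds cape. It is a simplification: by default Elasticsearch also counts swapping two adjacent characters as a single edit, which this plain version does not do.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]

print(edit_distance("capa", "cape"))  # within fuzziness 1
print(edit_distance("capa", "town"))  # far outside fuzziness 1
```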

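## match_bool_prefix: prefix matching
# Combines match and bool
# Airport is removed and the final l of International is dropped: the first two words are matched as whole terms, and the last one, Internationa, is matched as a prefix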

GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match_bool_prefix": {
      "Origin":"Cape Town Internationa"
    }
  }
}
# The original query, for comparison
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match": {
      "Origin":"Cape Town International Airport"
    }
  }
}

## match_phrase: phrase search. Terms must appear in the order given, whereas the previous queries match each term independently

GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match_phrase": {
      "Origin":"Cape Town International Airport"
    }
  }
}
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match_phrase": {
      "Origin":"Cape Town"
    }
  }
}
# Skipping the middle word Town returns nothing, because no such phrase exists
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match_phrase": {
      "Origin":"Cape International Airport"
    }
  }
}
# slop: the number of positions the phrase terms may be off by; with slop 1 the query matches even though Town is skipped
# slop costs extra computation
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match_phrase": {
      "Origin":{
        "query": "Cape International Airport",
        "slop": 1
      }
    }
  }
}
# slop with two words skipped
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match_phrase": {
      "Origin":{
        "query": "Cape Airport",
        "slop": 2
      }
    }
  }
}
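
One way to reason about the slop values used above is to compare where each term sits in the document with where an exact phrase would put it. The Python sketch below is a simplified model written for this article, not Lucene's actual scorer: the hypothetical `min_slop` helper assumes every phrase term occurs exactly once in the field.

```python
def min_slop(doc_tokens, phrase_tokens):
    """Smallest slop that lets the phrase match, or None if a term is missing.

    Simplification: assumes every phrase term occurs exactly once in the document.
    """
    shifts = []
    for offset, term in enumerate(phrase_tokens):
        if term not in doc_tokens:
            return None
        # How far the term sits from where an exact phrase would put it.
        shifts.append(doc_tokens.index(term) - offset)
    return max(shifts) - min(shifts)

doc = ["cape", "town", "international", "airport"]
print(min_slop(doc, ["cape", "town"]))                      # exact phrase
print(min_slop(doc, ["cape", "international", "airport"]))  # skips "town"
print(min_slop(doc, ["cape", "airport"]))                   # skips two terms
```

Under this model the three phrases need slop 0, 1, and 2, which lines up with the slop values in the requests above.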

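## match_phrase_prefix
# Phrase prefix query: phrase matching plus prefix matching
# All terms except the last are matched as a phrase
# The last term is matched as a prefix

# Drop the final t of Airport; this is somewhat more efficient than slop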
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query":{
    "match_phrase_prefix": {
      "Origin":"Cape Town International Airpor"
    }
  }
}

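## multi_match: multiple fields
# Many applications need to search several fields with the same text, for example product title and product description in e-commerce
# multi_match addresses exactly this case; with a single field it is equivalent to match

## type: match type
# best_fields: use the score of the best-matching field; the default type
# most_fields: add up the scores of all matching fields
# cross_fields: for multi-field queries where some terms are in one field and the rest in another
# phrase: phrase matching, equivalent to match_phrase
# phrase_prefix: phrase prefix matching, equivalent to match_phrase_prefix
# bool_prefix: boolean prefix matching, equivalent to match_bool_prefix
# tie_breaker (a parameter, not a type): controls how the scores of the other fields are combined; 0 keeps only the highest score, 1 adds all field scores together
# Switch between types (best_fields/most_fields) and compare scores and hit counts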

GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query": {
    "multi_match": {
      "query": "Cape Town International Airport",
      "type": "best_fields",
      "fields": [
        "Origin",
        "Dest"
      ]
    }
  }
}
# Field names can also be given as wildcard patterns
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query": {
    "multi_match": {
      "query": "Cape Town International Airport",
      "fields": "*rigin"
    }
  }
}
# With multiple fields, append ^ and a number to a field name to raise that field's weight, similar to setting boost separately
GET kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query": {
    "multi_match": {
      "query": "Cape Town International Airport",
      "type": "best_fields",
      "fields": [
        "Origin",
        "Dest^2"
      ]
    }
  }
}
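
The difference between best_fields, most_fields, and tie_breaker comes down to how per-field scores are combined. The Python sketch below shows that arithmetic only; the per-field scores are hypothetical numbers, and the two helper functions are written for this article rather than taken from Elasticsearch.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Best field's score plus tie_breaker times the scores of the other fields."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others

def most_fields_score(field_scores):
    """Sum of the scores from every matching field."""
    return sum(field_scores)

scores = [2.0, 1.0]  # hypothetical per-field scores for Origin and Dest
print(best_fields_score(scores))                   # only the best field counts
print(best_fields_score(scores, tie_breaker=0.5))  # other fields contribute partially
print(best_fields_score(scores, tie_breaker=1.0))  # same as adding all fields
print(most_fields_score(scores))
```

A field weight such as `Dest^2` would scale that field's score before it enters this combination.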

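## intervals: ordered term intervals. This is fairly complex and rarely needed; it takes deeper study
# The intervals query is an advanced full-text capability that controls the gaps between the query terms as they appear in the content. Several interval rule types are supported.
# Rules are applied in sequence: the first condition is evaluated, then later conditions run on its results, similar to if/then logic

## intervals match: interval matching
# match: the rule that analyzes the query text, like the match query above
# query: the text to search for
# max_gaps: the maximum number of positions allowed between matching terms; the default -1 means no limit
# ordered: whether the terms must appear in the given order; true/false, default false
# analyzer: the analyzer applied to the query text
# filter: a secondary interval filter; several filter types are supported
# use_field: match intervals from this field instead of the top-level field

## filter parameters: the secondary interval filter and its types
# Type descriptions
# after: intervals that come after an interval from the filter rule
# before: intervals that come before an interval from the filter rule
# contained_by: intervals contained by an interval from the filter rule
# containing: intervals that contain an interval from the filter rule
# not_contained_by: intervals not contained by an interval from the filter rule
# not_containing: intervals that do not contain an interval from the filter rule
# not_overlapping: intervals that do not overlap an interval from the filter rule
# overlapping: intervals that overlap an interval from the filter rule
# script: restrict intervals with a Painless script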

POST kibana_sample_data_flights_fulltext/_search
{
  "track_total_hits": true,
  "query": {
    "intervals": {
      "Dest": {
        "match": {
          "ordered": true,
          "query": "Sydney Smith Airport",
          "analyzer": "standard",
          "max_gaps": 2,
          "filter": {
            "containing": {
              "match": {
                "query": "International"
              }
            }
          }
        }
      }
    }
  }
}

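## query_string: query by expression string
# The query DSL is verbose, so ES also offers a compact SQL-like expression syntax. It adds no capability beyond the DSL; it is only a convenience
# Pros: simple and direct
# Cons: hard to read and limited in expressiveness; avoid it where possible

# Query Dest with OR (boolean operators must be uppercase)
POST kibana_sample_data_flights_fulltext/_search
{
  "query":{
    "query_string": {
      "query": "Dest:(Phoenix OR Ministro)"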
    }
  }
}
# Query a numeric range
POST kibana_sample_data_flights_fulltext/_search
{
  "query":{
    "query_string": {
      "query": "FlightDelayMin:[10 TO 100]"
    }
  }
}

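## URI search
# The query expression is passed in the URL
## Pros and cons
# Pros: short and direct
# Cons: limited expressiveness; rarely used in practice, prefer the DSL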

POST kibana_sample_data_flights_fulltext/_search?q=(Dest:Phoenix) AND (Origin:Chubu)

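### Query performance analysis
## Profile
# 1. Produces a performance report based on the query tree
# 2. Comparable to an execution plan in a relational database
# 3. Kibana can visualize it, though reading it takes some experience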

POST kibana_sample_data_flights_fulltext/_search
{
  "profile":true,
  "query":{
    "query_string": {
      "query": "Dest:(Phoenix OR Ministro)"
    }
  }
}

The profile query result is as follows [screenshot]

The Search Profiler can also be used, as follows [screenshot]

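## Explain: score calculation; worth digging into if you are interested
# 1. Explains the logic and rules behind a score
# 2. Helps in understanding how full-text scores are computed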

POST kibana_sample_data_flights_fulltext/_explain/74TR0Y8BbWz2Sn6EhZCn
{
  "query":{
    "match": {
      "Dest": "Ministro Pistarini International Airport"
    }
  }
}

The _explain result is as follows; this is the score calculation for the term ministro in the Dest field [screenshot]
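
For readers who want to see what kind of arithmetic sits behind such a per-term score, here is a textbook-style BM25 sketch in Python. It is an illustration under stated assumptions, not a reproduction of the _explain output: the statistics are hypothetical, `bm25_term_score` is a helper written for this article, and the way Elasticsearch groups the constants in its explanation may differ. The defaults `k1=1.2` and `b=0.75` are the usual BM25 defaults.

```python
import math

def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one field of one document."""
    # Rarer terms get a higher inverse document frequency.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency saturates, and longer fields are penalised through b.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    tf = freq * (k1 + 1) / (freq + norm)
    return idf * tf

# Hypothetical statistics, not taken from the sample index.
rare = bm25_term_score(freq=1, doc_len=4, avg_doc_len=4,
                       doc_count=1000, docs_with_term=10)
common = bm25_term_score(freq=1, doc_len=4, avg_doc_len=4,
                         doc_count=1000, docs_with_term=800)
print(rare > common)  # a rare term contributes more than a common one
```

This is why a distinctive term such as ministro can dominate the score, while terms like international and airport, which appear in many documents, add little.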

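## Full-text query advice
# Full-text queries are inexact by nature (some parameters can make them behave as exact queries)
# Relevance depends on the analyzer (learn how it works; an unexpected result is not necessarily an ES bug)
# Precision is approximate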

elasticsearch text field type, official reference: https://www.elastic.co/guide/en/elasticsearch/reference/8.6/text.html

elasticsearch analysis-analyzers, built-in analyzers, official reference: https://www.elastic.co/guide/en/elasticsearch/reference/8.6/analysis-analyzers.html

elasticsearch full-text-queries, official reference: https://www.elastic.co/guide/en/elasticsearch/reference/8.6/full-text-queries.html

elasticsearch query-dsl-intervals-query, intervals query, official reference: https://www.elastic.co/guide/en/elasticsearch/reference/8.6/query-dsl-intervals-query.html

elasticsearch index-modules-similarity, official reference: https://www.elastic.co/guide/en/elasticsearch/reference/8.6/index-modules-similarity.html

elasticsearch similarity, similarity algorithms, official reference: https://www.elastic.co/guide/en/elasticsearch/reference/8.6/similarity.html

Original article: https://blog.csdn.net/DavidSoCool/article/details/144333197
