ES: Analyzers

Published 2024-12-11 21:55 · elasticsearch, c#, big data

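Default analyzer: the standard analyzer

English text: split into tokens at word boundaries (roughly, whitespace and punctuation), and every token is lowercased
Chinese text: one token per character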
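
The behavior described above can be approximated with a short Python sketch. This is only a rough imitation built on a regular expression, not the Unicode text segmentation that Elasticsearch actually uses; the function name `approx_standard_analyze` is made up for illustration.

```python
import re

# Either a single CJK ideograph, or a run of word characters that are not
# CJK ideographs. This is a rough stand-in for real word-boundary rules.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")


def approx_standard_analyze(text: str) -> list[str]:
    """Split the text into tokens, then lowercase each token."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


print(approx_standard_analyze("I AM WUNAIIEQ"))  # ['i', 'am', 'wunaiieq']
print(approx_standard_analyze("你好啊"))          # ['你', '好', '啊']
```

The second call shows why the standard analyzer is a poor fit for Chinese: every character becomes its own token, so word-level meaning is lost.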

Example
All uppercase English letters are converted to lowercase

GET /_analyze
{
  "text": ["I AM WUNAIIEQ"],
  "analyzer": "standard"
}

Output
The response contains three lowercase tokens: i, am, wunaiieq

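IK analyzer

Description: a Chinese analyzer written in Java
Segmentation modes
ik_smart: coarsest-grained segmentation (fewest tokens)
ik_max_word: finest-grained segmentation (as many dictionary words as possible)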
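
To make the difference between the two modes concrete, here is a toy dictionary-based sketch in Python. It is not IK's actual algorithm: the dictionary is a hypothetical four-word sample, the coarse mode is plain greedy longest match, and real IK output depends on its own dictionary and disambiguation rules.

```python
# Hypothetical mini dictionary; the real IK plugin ships a much larger one.
DICTIONARY = {"很高", "很高兴", "高兴", "见到", "到你"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def fine_grained(text: str) -> list[str]:
    """Emit every dictionary word found at every position (overlaps allowed),
    plus any single character that no dictionary word covers."""
    matches = []
    covered = [False] * len(text)
    for start in range(len(text)):
        longest_end = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 2, longest_end + 1):
            word = text[start:end]
            if word in DICTIONARY:
                matches.append((start, word))
                for i in range(start, end):
                    covered[i] = True
    singles = [(i, ch) for i, ch in enumerate(text) if not covered[i]]
    return [word for _, word in sorted(matches + singles)]


def coarse_grained(text: str) -> list[str]:
    """Greedy longest match from the left, producing non-overlapping tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
            word = text[pos:pos + size]
            if size == 1 or word in DICTIONARY:
                tokens.append(word)
                pos += size
                break
    return tokens


print(fine_grained("很高兴见到你"))    # ['很高', '很高兴', '高兴', '见到', '到你']
print(coarse_grained("很高兴见到你"))  # ['很高兴', '见到', '你']
```

The fine-grained mode yields more (overlapping) terms, which favors recall at index time; the coarse mode yields fewer terms, which is often preferred for analyzing query strings.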
Example
ik_smart: coarsest-grained segmentation

GET /_analyze
{
  "text": ["你好啊"],
  "analyzer": "ik_smart"
}

ik_max_word: finest-grained segmentation

GET /_analyze
{
  "text": ["你好啊"],
  "analyzer": "ik_max_word"
}


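Pinyin analyzer

Description: returns the pinyin of each character, plus the first-letter abbreviation of the whole phrase. By default the Chinese characters are not kept: after conversion, only pinyin remains
Purpose: usually combined with the IK analyzer to build a custom analyzer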
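
A minimal Python sketch of the idea described above: each character becomes its pinyin, and the first letters are joined into one abbreviation. The lookup table `PINYIN` is a hypothetical three-character sample (the real plugin uses a full pinyin dictionary), and the exact tokens the plugin emits depend on its settings.

```python
# Hypothetical lookup table covering only the characters used below.
PINYIN = {"你": "ni", "好": "hao", "啊": "a"}


def to_pinyin_tokens(text: str) -> list[str]:
    """Return the pinyin of each known character, followed by the
    abbreviation formed from the first letter of every syllable."""
    syllables = [PINYIN[ch] for ch in text if ch in PINYIN]
    first_letters = "".join(syllable[0] for syllable in syllables)
    return syllables + ([first_letters] if first_letters else [])


print(to_pinyin_tokens("你好啊"))  # ['ni', 'hao', 'a', 'nha']
```

Note that no Chinese characters appear in the result, which is why the pinyin analyzer is normally paired with a Chinese tokenizer such as IK.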

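Custom analyzer

Description: a combination of several analysis components. You can write your own, or reuse the IK and pinyin components described above
Example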

PUT /student3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin": {
          "tokenizer": "ik_max_word", // tokenize the text with IK in ik_max_word mode
          "filter": ["pinyin_filter"] // post-process the IK tokens with the pinyin_filter defined below
        }
      },
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_separate_first_letter": false, // whether to emit each character's first letter as a separate token
          "keep_full_pinyin": true, // whether to keep the full pinyin of each character
          "keep_original": true, // whether to keep the original text
          "remove_duplicated_term": true // whether to remove duplicate terms
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "store": true,
        "index": true,
        "analyzer": "ik_pinyin"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}

GET /student3/_analyze
{
  "text": ["你好，我叫wunaiieq，很高兴见到你！"],
  "analyzer": "ik_pinyin"
}

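Output
Each word is emitted as the original Chinese, its full pinyin, and its first-letter abbreviation, so a document containing this text can be found by searching for any of these forms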

{
  "tokens" : [
    {
      "token" : "ni",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "你好",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "nh",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "hao",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "wo",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "我",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "w",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "jiao",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "叫",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "j",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "wu",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 4
    },
    {
      "token" : "nai",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "i",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 6
    },
    {
      "token" : "e",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 7
    },
    {
      "token" : "q",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 8
    },
    {
      "token" : "wunaiieq",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 8
    },
    {
      "token" : "hen",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "gao",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "很高",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "hg",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "gao",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "xing",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "高兴",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "gx",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "jian",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "dao",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "见到",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "jd",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "dao",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 15
    },
    {
      "token" : "ni",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 16
    },
    {
      "token" : "到你",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 16
    },
    {
      "token" : "dn",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 16
    }
  ]
}
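
To try searching by several of the forms listed in the output above, a match query against the `name` field can be built as in the Python sketch below. It assumes a document containing the sample sentence has already been indexed into `student3`; the helper `build_match_query` is made up for illustration and only prints console-style requests, it does not contact Elasticsearch. Whether a given spelling matches also depends on how the query string itself is analyzed.

```python
import json


def build_match_query(field: str, text: str) -> dict:
    """Build the body of a basic match query on one field."""
    return {"query": {"match": {field: text}}}


# Three forms taken from the token list above: Chinese word,
# first-letter abbreviation, and the Latin-script name.
for term in ["你好", "nh", "wunaiieq"]:
    print("GET /student3/_search")
    print(json.dumps(build_match_query("name", term), ensure_ascii=False, indent=2))
```

Each printed request can be pasted into the same console used for the examples in this post.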

Original article: https://blog.csdn.net/wusuoweiieq/article/details/144373321
