
Elasticsearch (ES): Analyzers


Analyzer: the component that splits text into a series of words (also called terms or tokens)


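Default analyzer: the standard analyzer

On English text: splits on whitespace and punctuation, and lowercases every token
On Chinese text: one token per character

Example
All uppercase English letters are converted to lowercase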

GET /_analyze
{
  "text": ["I AM WUNAIIEQ"],
  "analyzer": "standard"
}

Output
Three tokens, all in lowercase: i, am, wunaiieq
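
As a rough illustration of the behaviour described above, the Python sketch below approximates the standard analyzer with a regular expression. It is not how Elasticsearch implements it (the real standard tokenizer follows the Unicode Text Segmentation rules, so cases such as apostrophes behave differently). The function name approx_standard_analyze and the CJK character range are assumptions made for this example.

```python
import re

# Either a single CJK ideograph (one character per token), or a run of
# word characters that are not CJK ideographs (one word per token).
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")


def approx_standard_analyze(text):
    """Split on whitespace/punctuation, lowercase, one token per CJK character."""
    return [match.group(0).lower() for match in _TOKEN.finditer(text)]


print(approx_standard_analyze("I AM WUNAIIEQ"))
print(approx_standard_analyze("你好啊"))
```

The first call mirrors the request above; the second shows why the standard analyzer is a poor fit for Chinese text, since every character becomes its own token.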

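IK analyzer

Description: a Java-based Chinese analyzer
Segmentation modes
ik_smart: coarsest segmentation (fewest tokens)
ik_max_word: finest-grained segmentation (as many tokens as possible)
Example
ik_smart, coarsest segmentation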

GET /_analyze
{
  "text": ["你好啊"],
  "analyzer": "ik_smart"
}


GET /_analyze
{
  "text": ["你好啊"],
  "analyzer": "ik_max_word"
}

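
To make the difference between the two modes more concrete, here is a toy Python sketch of dictionary-based segmentation. It is only an illustration of "coarse" versus "fine-grained" output, not IK's actual algorithm: the greedy longest-match strategy, the function names, and the tiny dictionary are all assumptions made for this example.

```python
def max_word_segment(text, dictionary):
    """Fine-grained: emit every dictionary word found at any position."""
    tokens = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def smart_segment(text, dictionary):
    """Coarse: greedy longest match, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), i, -1):
            if text[i:end] in dictionary or end == i + 1:
                tokens.append(text[i:end])
                i = end
                break
    return tokens


# Hypothetical mini dictionary, for illustration only.
toy_dictionary = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
print(smart_segment("南京市长江大桥", toy_dictionary))
print(max_word_segment("南京市长江大桥", toy_dictionary))
```

The coarse variant returns a few long, non-overlapping tokens, while the fine-grained variant returns overlapping words of every length, which increases recall at the cost of a larger index.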

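Pinyin analyzer

Description: returns the pinyin of each character, plus the pinyin initials of the whole phrase. By default the original Chinese text is not kept (keep_original defaults to false), so only pinyin remains after conversion
Use: usually combined with the IK analyzer to build a custom analyzer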
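
The kind of output described above can be sketched in a few lines of Python. This is a toy illustration, not the plugin's code: TOY_PINYIN is a hypothetical lookup table covering only a handful of characters, and toy_pinyin_tokens is a made-up helper. Its keep_original argument is modelled on the plugin option of the same name.

```python
# Hypothetical lookup table; a real converter covers all Chinese characters.
TOY_PINYIN = {"你": "ni", "好": "hao", "我": "wo", "叫": "jiao"}


def toy_pinyin_tokens(term, keep_original=False):
    """Return the full pinyin of each character plus the joined initials."""
    syllables = [TOY_PINYIN[ch] for ch in term if ch in TOY_PINYIN]
    tokens = list(syllables)
    if keep_original:
        # Optionally keep the Chinese term itself as a token.
        tokens.append(term)
    if syllables:
        # Initials of the whole term, e.g. "nh" for 你好.
        tokens.append("".join(s[0] for s in syllables))
    return tokens


print(toy_pinyin_tokens("你好"))
print(toy_pinyin_tokens("你好", keep_original=True))
```

With keep_original left at False the result contains no Chinese at all, which is why the pinyin filter is usually paired with a Chinese tokenizer and keep_original set to true.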

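Custom analyzer

Description: a combination of several analysis components. You can write your own or reuse existing ones such as the IK and pinyin plugins described above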
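
Conceptually, a custom analyzer is a tokenizer followed by zero or more token filters applied in order (real Elasticsearch analyzers can also run character filters before the tokenizer). The Python sketch below shows only that composition idea; every name in it is hypothetical and none of it is an Elasticsearch API.

```python
def whitespace_tokenizer(text):
    """Split the text into tokens on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Lowercase every token."""
    return [token.lower() for token in tokens]


def dedupe_filter(tokens):
    """Drop repeated tokens, keeping the first occurrence."""
    seen = set()
    unique = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            unique.append(token)
    return unique


def make_analyzer(tokenizer, filters):
    """Chain one tokenizer with a list of token filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


toy_analyzer = make_analyzer(whitespace_tokenizer, [lowercase_filter, dedupe_filter])
print(toy_analyzer("Hello hello World"))
```

Swapping the tokenizer or reordering the filters changes the output, which is the same freedom the analysis settings of an index give you.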
Example

PUT /student3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin": {
          "tokenizer": "ik_max_word", // tokenize the text with the IK plugin's ik_max_word mode
          "filter": ["pinyin_filter"] // then run the ik_max_word tokens through the pinyin_filter token filter
        }
      },
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_separate_first_letter": false, // whether to emit each character's first letter as its own token
          "keep_full_pinyin": true, // whether to keep the full pinyin of each character
          "keep_original": true, // whether to keep the original text
          "remove_duplicated_term": true // whether to remove duplicate terms
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "store": true,
        "index": true,
        "analyzer": "ik_pinyin"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}

GET /student3/_analyze
{
  "text": ["你好,我叫wunaiieq,很高兴见到你!"],
  "analyzer": "ik_pinyin"
}

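Output
Each term is emitted as the original Chinese, as the full pinyin of each character, and as pinyin initials, so a document with this text can be found by searching for any of those forms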

{
  "tokens" : [
    {
      "token" : "ni",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "你好",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "nh",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "hao",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "wo",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "我",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "w",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "jiao",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "叫",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "j",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "wu",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 4
    },
    {
      "token" : "nai",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "i",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 6
    },
    {
      "token" : "e",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 7
    },
    {
      "token" : "q",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 8
    },
    {
      "token" : "wunaiieq",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "ENGLISH",
      "position" : 8
    },
    {
      "token" : "hen",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "gao",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "很高",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "hg",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "gao",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "xing",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "高兴",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "gx",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "jian",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "dao",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "见到",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "jd",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "dao",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 15
    },
    {
      "token" : "ni",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 16
    },
    {
      "token" : "到你",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 16
    },
    {
      "token" : "dn",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 16
    }
  ]
}
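
A response of this size is easier to read once it is reduced to its token strings. The Python sketch below does that for a dict shaped like the _analyze response above and checks which candidate search terms appear among the produced tokens. The helper names are hypothetical, the sample contains only the first four tokens of the output above, and the check is a simplification: a real match query also analyzes the query text before comparing terms.

```python
def tokens_from_analyze(response):
    """Extract the token strings from an _analyze-style response dict."""
    return [entry["token"] for entry in response["tokens"]]


def matching_terms(response, candidates):
    """Return the candidates that are present among the produced tokens."""
    produced = set(tokens_from_analyze(response))
    return [term for term in candidates if term in produced]


# First four tokens of the response above (other fields omitted).
sample_response = {
    "tokens": [
        {"token": "ni", "position": 0},
        {"token": "你好", "position": 0},
        {"token": "nh", "position": 0},
        {"token": "hao", "position": 1},
    ]
}

print(tokens_from_analyze(sample_response))
print(matching_terms(sample_response, ["nh", "你好", "hao", "xyz"]))
```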


Original article: https://blog.csdn.net/wusuoweiieq/article/details/144373321
