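Elasticsearch (ES): Analyzers
Analyzer: a component that splits text into a sequence of words (also called terms or tokens)
Default analyzer: the standard analyzer
Effect on English: splits text on whitespace and punctuation, and converts every word to lowercase
Effect on Chinese: each character becomes its own token
Example
All uppercase English letters are converted to lowercase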
GET /_analyze
{
  "text": ["I AM WUNAIIEQ"],
  "analyzer": "standard"
}
Output: three lowercase tokens, i, am, and wunaiieq
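The behaviour described above can be approximated offline with a short Python sketch. This is only a rough imitation for illustration: the real standard analyzer follows Unicode text segmentation rules, while the sketch assumes ASCII words and the basic CJK block, and the function name approx_standard_analyze is made up for this example.

```python
import re

# Basic CJK Unified Ideographs block only (a simplifying assumption).
_CJK = r"\u4e00-\u9fff"
# A token is either one CJK character or a run of ASCII letters/digits.
_TOKEN = re.compile(rf"[{_CJK}]|[A-Za-z0-9]+")


def approx_standard_analyze(text: str) -> list[str]:
    """Roughly mimic the standard analyzer: split, then lowercase."""
    return [m.group(0).lower() for m in _TOKEN.finditer(text)]


print(approx_standard_analyze("I AM WUNAIIEQ"))  # ['i', 'am', 'wunaiieq']
print(approx_standard_analyze("你好啊"))          # ['你', '好', '啊']
```

Whitespace and punctuation never match the pattern, so they act as separators, which is why Chinese text ends up as one token per character.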
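IK analyzer
Description: a Java-based Chinese analyzer
Segmentation modes
ik_smart: coarsest segmentation (fewest splits)
ik_max_word: finest-grained segmentation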
Example
ik_smart: fewest splits
GET /_analyze
{
  "text": ["你好啊"],
  "analyzer": "ik_smart"
}
ik_max_word: finest-grained segmentation
GET /_analyze
{
  "text": ["你好啊"],
  "analyzer": "ik_max_word"
}
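To get a feel for the difference between the two modes, here is a toy Python sketch. It is not IK's actual algorithm: the dictionary TOY_WORDS is invented for this example, fewest_splits uses plain forward maximum matching as a stand-in for ik_smart, and finest_grained simply emits every dictionary word found at any offset as a stand-in for ik_max_word.

```python
# Tiny made-up dictionary; IK ships a far larger one.
TOY_WORDS = {"很高", "高兴", "见到", "到你"}


def finest_grained(text, words):
    """Emit every dictionary word at every offset (overlaps allowed)."""
    spans, covered = [], set()
    for start in range(len(text)):
        for end in range(start + 2, len(text) + 1):
            if text[start:end] in words:
                spans.append((start, end))
                covered.update(range(start, end))
    # Characters not covered by any word become single-character tokens.
    spans += [(i, i + 1) for i in range(len(text)) if i not in covered]
    return [text[s:e] for s, e in sorted(spans)]


def fewest_splits(text, words):
    """Forward maximum matching: take the longest word at each position."""
    longest = max(map(len, words))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                tokens.append(piece)
                i += size
                break
    return tokens


print(finest_grained("很高兴见到你", TOY_WORDS))  # ['很高', '高兴', '见到', '到你']
print(fewest_splits("很高兴见到你", TOY_WORDS))   # ['很高', '兴', '见到', '你']
```

The finest-grained variant produces overlapping tokens, which helps recall at search time, while the fewest-splits variant produces a single non-overlapping segmentation.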
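Pinyin analyzer
Description: returns the pinyin of each character plus the first-letter abbreviation of the whole phrase. By default the Chinese text is not kept, so after conversion only pinyin remains.
Purpose: usually combined with the IK analyzer to build a custom analyzer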
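The conversion described above can be illustrated with a minimal Python sketch. The lookup table TOY_PINYIN and the function to_pinyin_tokens are hypothetical and cover only a few characters; the real plugin relies on a full pinyin dictionary and has many more options.

```python
# Hypothetical mini lookup table for illustration only.
TOY_PINYIN = {"你": "ni", "好": "hao", "我": "wo", "叫": "jiao"}


def to_pinyin_tokens(text):
    """Return the full pinyin of each known character, then the joined first letters."""
    syllables = [TOY_PINYIN[ch] for ch in text if ch in TOY_PINYIN]
    initials = "".join(s[0] for s in syllables)
    # The original Chinese text is intentionally not included in the result.
    return syllables + ([initials] if initials else [])


print(to_pinyin_tokens("你好"))  # ['ni', 'hao', 'nh']
```

Because the Chinese text itself is dropped, a field analyzed this way can be searched by pinyin but not by the original characters, which is one reason to combine it with IK.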
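Custom analyzer
Description: a combination of several analysis components. You can write your own, or reuse the IK and pinyin plugins described above.
Example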
PUT /student3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin": {
          "tokenizer": "ik_max_word", // tokenize the text with IK's ik_max_word mode
          "filter": ["pinyin_filter"] // apply pinyin_filter to the tokens produced by ik_max_word
        }
      },
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_separate_first_letter": false, // whether to emit each character's first letter as a separate token
          "keep_full_pinyin": true, // whether to keep the full pinyin of each character
          "keep_original": true, // whether to keep the original text
          "remove_duplicated_term": true // whether to remove duplicate terms
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "store": true,
        "index": true,
        "analyzer": "ik_pinyin"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}
GET /student3/_analyze
{
  "text": ["你好,我叫wunaiieq,很高兴见到你!"],
  "analyzer": "ik_pinyin"
}
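Output
The same text can be matched through several kinds of terms: the Chinese words, their full pinyin, or the first-letter abbreviations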
{
"tokens" : [
{
"token" : "ni",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "你好",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "nh",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "hao",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "wo",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "我",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "w",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "jiao",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "叫",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "j",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "wu",
"start_offset" : 5,
"end_offset" : 13,
"type" : "ENGLISH",
"position" : 4
},
{
"token" : "nai",
"start_offset" : 5,
"end_offset" : 13,
"type" : "ENGLISH",
"position" : 5
},
{
"token" : "i",
"start_offset" : 5,
"end_offset" : 13,
"type" : "ENGLISH",
"position" : 6
},
{
"token" : "e",
"start_offset" : 5,
"end_offset" : 13,
"type" : "ENGLISH",
"position" : 7
},
{
"token" : "q",
"start_offset" : 5,
"end_offset" : 13,
"type" : "ENGLISH",
"position" : 8
},
{
"token" : "wunaiieq",
"start_offset" : 5,
"end_offset" : 13,
"type" : "ENGLISH",
"position" : 8
},
{
"token" : "hen",
"start_offset" : 14,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "gao",
"start_offset" : 14,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "很高",
"start_offset" : 14,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "hg",
"start_offset" : 14,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "gao",
"start_offset" : 15,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 11
},
{
"token" : "xing",
"start_offset" : 15,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 12
},
{
"token" : "高兴",
"start_offset" : 15,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 12
},
{
"token" : "gx",
"start_offset" : 15,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 12
},
{
"token" : "jian",
"start_offset" : 17,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 13
},
{
"token" : "dao",
"start_offset" : 17,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 14
},
{
"token" : "见到",
"start_offset" : 17,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 14
},
{
"token" : "jd",
"start_offset" : 17,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 14
},
{
"token" : "dao",
"start_offset" : 18,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 15
},
{
"token" : "ni",
"start_offset" : 18,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 16
},
{
"token" : "到你",
"start_offset" : 18,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 16
},
{
"token" : "dn",
"start_offset" : 18,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 16
}
]
}
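One way to read a response like this is to group the tokens by their position value: tokens that share a position are alternative forms of the same piece of text. The Python sketch below does that on a trimmed copy of the first three tokens above; the helper name tokens_by_position is an assumption made for this example.

```python
import json

# Trimmed copy of the first three tokens from the response above.
SAMPLE = """
{"tokens": [
  {"token": "ni",   "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
  {"token": "你好", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
  {"token": "nh",   "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0}
]}
"""


def tokens_by_position(response):
    """Map each position to the list of tokens emitted at that position."""
    grouped = {}
    for tok in response["tokens"]:
        grouped.setdefault(tok["position"], []).append(tok["token"])
    return grouped


print(tokens_by_position(json.loads(SAMPLE)))  # {0: ['ni', '你好', 'nh']}
```

Here the original word, a pinyin syllable, and the first-letter abbreviation all sit at position 0, so a search for any of them can hit the same place in the text.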
Original article: https://blog.csdn.net/wusuoweiieq/article/details/144373321