Chinese Word Segmentation Simulator

Published 2024-11-08 20:16. Tags: algorithms, Huawei, Java

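Problem Description

Given a continuous string with no spaces, consisting only of lowercase English letters and English punctuation marks (comma, semicolon, period), together with a dictionary, segment the string into words precisely.
Notes:

Precise segmentation: the resulting words must not overlap. For example, depending on the dictionary, "ilovechina" can be split into "i,love,china" or "ilove,china", but not into the overlapping "i,ilove,china", where "i" appears twice.
Punctuation marks are never words; they only separate clauses.
Dictionary: common words collected from an external knowledge base, e.g. dictionary = ["i", "love", "china", "lovechina", "ilove"].
Segmentation rule: left-to-right order takes priority, then longest match.
For "ilovechina", if the candidate words are [i, ilove, lo, love, ch, china, lovechina], the output is [ilove, china].
Wrong output: [i, lovechina]. Reason: "ilove" starts earlier, so it takes precedence over "lovechina".
Wrong output: [i, love, china]. Reason: "ilove" is longer than "i", so the longest-match rule selects it.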
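
The rule above can be sketched in a few lines. This is a minimal illustration, assuming a single clause with no punctuation; the class name `LongestMatchSketch` and the method `segment` are hypothetical and are not part of the original solution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class LongestMatchSketch {

    // Scans left to right. At each position it takes the longest dictionary
    // word starting there, and falls back to a single letter if none matches.
    static List<String> segment(String clause, Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (pos < clause.length()) {
            int end = pos + 1; // fallback: emit one letter
            for (int candidate = clause.length(); candidate > pos; candidate--) {
                if (dictionary.contains(clause.substring(pos, candidate))) {
                    end = candidate; // longest match found, stop searching
                    break;
                }
            }
            result.add(clause.substring(pos, end));
            pos = end;
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("i", "love", "china", "lovechina", "ilove"));
        System.out.println(String.join(",", segment("ilovechina", dictionary)));
        // prints: ilove,china
    }
}
```

Because the scan always starts at the leftmost unconsumed letter, "ilove" is chosen before "lovechina" is ever considered, and because candidates are tried from longest to shortest, "ilove" wins over "i".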

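Input Description

String length: 0 < length < 256
Dictionary length: 1 < length < 100000
The first line contains the sentence to segment, e.g. "ilovechina"
The second line contains the dictionary, e.g. "i,love,china,ch,na,ve,lo,this,is,the,word"

Output Description

Output the segmentation result in order, e.g. "i,love,china"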

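Example 1

Input:

ilovechina
i,love,china,ch,na,ve,lo,this,is,the,word

Output:

i,love,china

Example 2

Input:

iat
i,love,china,ch,na,ve,lo,this,is,the,word,beauti,tiful,ful

Output:

i,a,t

Note:

A single letter that is not in the dictionary and does not form a word with the letters after it is output on its own.

Example 3

Input:

ilovechina,thewordisbeautiful
i,love,china,ch,na,ve,lo,this,is,the,word,beauti,tiful,ful

Output:

i,love,china,the,word,is,beauti,ful

Note:
The punctuation marks are English punctuation marks. They only separate clauses and do not appear in the output.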
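
Splitting on punctuation can be sketched as follows. This is a small illustration, assuming the three marks named in the problem (comma, semicolon, period) are the only separators; `ClauseSplitSketch` and `clauses` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ClauseSplitSketch {

    // Splits a sentence on comma, semicolon, and period, dropping the empty
    // pieces produced by leading or consecutive punctuation marks.
    static List<String> clauses(String sentence) {
        List<String> out = new ArrayList<>();
        for (String piece : sentence.split("[,;.]")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(clauses("ilovechina,thewordisbeautiful"));
        // prints: [ilovechina, thewordisbeautiful]
    }
}
```

Each clause is then segmented independently, so a word can never span a punctuation mark.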

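Solution

Build an index: group the dictionary words by first letter and sort each group from longest to shortest. Then, for each clause, repeatedly take the first word in the matching group that is a prefix of the remaining text.

Source Code (Java)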

import java.util.*;

public class Tokenizer {

    // Index: first letter -> dictionary words that start with that letter.
    static Map<Character, List<String>> map = new HashMap<>();

    static {
        // Every lowercase letter is its own fallback token, so a letter that
        // matches no dictionary word is still emitted as a single letter.
        for (char c = 'a'; c <= 'z'; c++) {
            List<String> tokens = new ArrayList<>();
            tokens.add(String.valueOf(c));
            map.put(c, tokens);
        }
    }

    public static void main(String[] args) {
        // Line 1: sentence to segment. Line 2: comma-separated dictionary.
        Scanner input = new Scanner(System.in);
        String ss = input.nextLine();
        String[] dict = input.nextLine().split(",");
        for (String entry : dict) {
            // Skip empty entries and entries not starting with a lowercase letter.
            if (entry.isEmpty() || !map.containsKey(entry.charAt(0))) {
                continue;
            }
            map.get(entry.charAt(0)).add(entry);
        }
        // Longest words first, so the first prefix match is the longest match.
        for (List<String> tokens : map.values()) {
            tokens.sort((o1, o2) -> o2.length() - o1.length());
        }
        // Anything that is not a lowercase letter only separates clauses.
        String[] words = ss.split("[^a-z]");
        List<String> result = new ArrayList<>();
        for (String word : words) {
            while (word.length() > 0) {
                List<String> tokens = map.get(word.charAt(0));
                for (String token : tokens) {
                    // The single-letter fallback guarantees a match here.
                    if (word.startsWith(token)) {
                        result.add(token);
                        word = word.substring(token.length());
                        break;
                    }
                }
            }
        }
        System.out.println(String.join(",", result));
    }
}
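
As an alternative to the sorted first-letter index, a trie can find the longest match without testing every word in a bucket. The sketch below is not the original author's method; it swaps in a trie under the same assumptions (lowercase letters only, anything else is a separator). `TrieSegmenterSketch`, `add`, and `segment` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical trie-based variant, for illustration only.
final class TrieSegmenterSketch {

    private static final class Node {
        final Node[] next = new Node[26];
        boolean isWord;
    }

    private final Node root = new Node();

    // Inserts a dictionary word; entries with other characters are ignored.
    void add(String word) {
        if (!word.matches("[a-z]+")) {
            return;
        }
        Node node = root;
        for (char c : word.toCharArray()) {
            int idx = c - 'a';
            if (node.next[idx] == null) {
                node.next[idx] = new Node();
            }
            node = node.next[idx];
        }
        node.isWord = true;
    }

    // Length of the longest dictionary word starting at pos, or 1 if none.
    private int longestMatch(String text, int pos) {
        Node node = root;
        int best = 1;
        for (int i = pos; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 'a' || c > 'z') {
                break; // punctuation ends the clause
            }
            node = node.next[c - 'a'];
            if (node == null) {
                break; // no dictionary word continues with this letter
            }
            if (node.isWord) {
                best = i - pos + 1;
            }
        }
        return best;
    }

    // Segments a whole sentence, skipping punctuation between clauses.
    List<String> segment(String sentence) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (pos < sentence.length()) {
            char c = sentence.charAt(pos);
            if (c < 'a' || c > 'z') {
                pos++;
                continue;
            }
            int len = longestMatch(sentence, pos);
            result.add(sentence.substring(pos, pos + len));
            pos += len;
        }
        return result;
    }

    public static void main(String[] args) {
        TrieSegmenterSketch trie = new TrieSegmenterSketch();
        String dict = "i,love,china,ch,na,ve,lo,this,is,the,word,beauti,tiful,ful";
        for (String word : dict.split(",")) {
            trie.add(word);
        }
        System.out.println(String.join(",",
                trie.segment("ilovechina,thewordisbeautiful")));
        // prints: i,love,china,the,word,is,beauti,ful
    }
}
```

With a trie, the work at each position is bounded by the length of the longest dictionary word rather than by the number of words sharing a first letter, which may matter when the dictionary approaches the stated limit of 100000 entries.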

Original article: https://blog.csdn.net/TangKenny/article/details/143473879
