Kaggle Python Exercise: Strings and Dictionaries

🕗 Published 2024-10-18 15:54 | Tags: python, c#, programming languages

Problem: search for a specific word and locate it

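A researcher has gathered thousands of news articles, but she wants to focus her attention on articles that include a specific word. Complete the function below to help her filter her list of articles.

Your function should meet the following criteria:

Do not include documents where the keyword string shows up only as part of a larger word. For example, if she were looking for the keyword "closed", you would not include the string "enclosed".
She does not want you to distinguish upper case from lower case letters. So the phrase "Closed the case." would be included when the keyword is "closed".
Do not let periods or commas affect what is matched. "It is closed." would be included when the keyword is "closed". You can assume there are no other types of punctuation.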

Approach

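Read each string in the list and convert it to lowercase
Strip the punctuation ",.?" from both ends of the string with strip()
Remove the commas in the middle with replace(), then split the string into words with split()
Compare each word with the keyword; on a match, save the document's index in an initially empty list
Return the result list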

# doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
doc_list = ['The Learn Python Challenge Casino', 'They bought a car, and a horse', 'Casinoville?']
keyword = 'Casino'
indices = []  # indices of matching documents (avoids shadowing the built-in list)
for i in range(len(doc_list)):
    # Lowercase the whole document
    words = doc_list[i].lower()
    print(words)
    # Strip punctuation from both ends of the document
    words = words.strip('.,?')
    print(words)
    # Remove the remaining commas, then split on whitespace
    wordlist = words.replace(",", "").split()
    print(wordlist)
    for word in wordlist:
        if word == keyword.lower():
            indices.append(i)
            print(i)
            break  # record each document at most once
print(indices)

Implementation

def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword. 
    Returns list of the index values into the original list for all documents 
    containing the keyword.

    Example:
    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
    >>> word_search(doc_list, 'casino')
    [0]
    """
    indices = []
    keyword = keyword.lower()  # match case-insensitively
    for i in range(len(doc_list)):
        words = doc_list[i].lower()
        words = words.strip(',.?')
        # Remove periods and commas inside the document too, not only at its ends
        wordlist = words.replace(",", "").replace(".", "").split()
        for word in wordlist:
            if word == keyword:
                indices.append(i)
                break  # record each document at most once
    return indices

Official solution

def word_search(doc_list, keyword):
    # list to hold the indices of matching documents
    indices = [] 
    # Iterate through the indices (i) and elements (doc) of documents
    for i, doc in enumerate(doc_list):
        # Split the string doc into a list of words (according to whitespace)
        tokens = doc.split()
        # Make a transformed list where we 'normalize' each word to facilitate matching.
        # Periods and commas are removed from the end of each word, and it's set to all lowercase.
        normalized = [token.rstrip('.,').lower() for token in tokens]
        # Is there a match? If so, update the list of matching indices.
        if keyword.lower() in normalized:
            indices.append(i)
    return indices
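
As a quick sanity check, the official solution can be run against one document per requirement. This is only a sketch: the sample documents below are made up for illustration, and word_search is repeated (with the same logic) so the snippet runs on its own.

```python
def word_search(doc_list, keyword):
    # Same logic as the official solution, repeated so this snippet is self-contained
    indices = []
    for i, doc in enumerate(doc_list):
        normalized = [token.rstrip('.,').lower() for token in doc.split()]
        if keyword.lower() in normalized:
            indices.append(i)
    return indices

# Hypothetical sample documents, one per requirement
docs = [
    "The package was enclosed",  # keyword only appears inside a larger word
    "Closed the case",           # different capitalization
    "It is closed.",             # trailing period
    "Closed, then reopened",     # trailing comma
]
result = word_search(docs, "closed")
print(result)  # [1, 2, 3]
```

Document 0 is skipped because whole tokens are compared, so "enclosed" never equals "closed".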

Code walkthrough

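enumerate() is a Python built-in function that attaches an automatic counter to an iterable (such as a list, tuple, or string) while you iterate over it. It is most often used in for loops.
enumerate(iterable, start=0)

iterable: any object that can be iterated over, such as a list or a string.
start (optional): the value the counter starts from. It defaults to 0, but any other starting value can be given.
enumerate() returns an iterator; each iteration yields a tuple containing the current count and the corresponding element.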
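
A small sketch of enumerate() in action; the list contents are placeholder values chosen for illustration.

```python
docs = ["first doc", "second doc", "third doc"]

# Each iteration yields an (index, value) tuple
pairs = list(enumerate(docs))
print(pairs)  # [(0, 'first doc'), (1, 'second doc'), (2, 'third doc')]

# start=1 shifts the counter without changing the values
numbered = list(enumerate(docs, start=1))
print(numbered[0])  # (1, 'first doc')
```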
Adding a key-value pair to a dictionary:
dictionary[key] = value
key: the key to add to the dictionary.
value: the value stored under that key.
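
For illustration, here is how that assignment behaves for a new key and for an existing key. The dictionary name and values are assumptions that mirror the exercise.

```python
keyword_to_indices = {}

# Assigning to a key that does not exist yet creates the entry
keyword_to_indices["casino"] = [0, 1]
keyword_to_indices["they"] = [1]

# Assigning to an existing key overwrites its value
keyword_to_indices["they"] = [1, 2]
print(keyword_to_indices)  # {'casino': [0, 1], 'they': [1, 2]}
```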

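str.split() splits a string into a list of substrings around a separator. With no separator, it splits on runs of whitespace (spaces, tabs, or newlines) and ignores leading and trailing whitespace.
str.split(sep=None, maxsplit=-1)
sep (optional): the separator string. If it is omitted, the string is split on whitespace.
maxsplit (optional): the maximum number of splits. The default is -1, which means no limit.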
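
The difference between the default and an explicit separator is easiest to see side by side. The strings below are made-up examples.

```python
text = "They bought  a car,\tand a horse\n"

# No separator: split on runs of whitespace, ignoring leading/trailing whitespace
words = text.split()
print(words)  # ['They', 'bought', 'a', 'car,', 'and', 'a', 'horse']

# Explicit separator: consecutive separators produce empty strings
parts = "a,,b".split(",")
print(parts)  # ['a', '', 'b']

# maxsplit=1 stops after the first split
head = "a b c d".split(" ", 1)
print(head)  # ['a', 'b c d']
```

Note that "car," keeps its comma after splitting, which is why the official solution strips punctuation from each token afterwards.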

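str.rstrip() is a Python string method that removes the specified characters (whitespace by default) from the end of a string.
str.rstrip([chars])
chars (optional): the set of characters to remove. If it is omitted, all trailing whitespace (spaces, newlines, tabs, and so on) is removed.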
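
One detail worth seeing in code: chars is treated as a set of characters, not as a suffix. A minimal sketch with made-up strings:

```python
# Any trailing mix of '.' and ',' is removed, in whatever order they appear
a = "Casino.,.".rstrip(".,")
print(a)  # Casino

# Only the end of the string is affected; the leading '.' stays
b = ".hidden.".rstrip(".")
print(b)  # .hidden

# Without an argument, trailing whitespace is removed
c = "car \n".rstrip()
print(repr(c))  # 'car'
```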

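str.strip() is a Python string method that removes the specified characters (whitespace by default) from both ends of a string, that is, from the beginning and the end at the same time.
str.strip([chars])
chars (optional): the set of characters to remove. If it is omitted, all whitespace (spaces, newlines, tabs, and so on) is removed from both ends.
result = text.strip(",.?")  # remove ',', '.' and '?' from both ends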
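
Because strip() only touches the two ends, punctuation in the middle of a document survives. The sketch below (with a made-up sentence) shows why the hand-written approach also needs replace() before split().

```python
text = "?They bought a car, and a horse."

# Punctuation at both ends is removed, but the comma after "car" remains
stripped = text.strip(",.?")
print(stripped)  # They bought a car, and a horse

# Removing the inner comma before splitting gives clean words
words = stripped.replace(",", "").split()
print(words)  # ['They', 'bought', 'a', 'car', 'and', 'a', 'horse']
```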

Going further

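Now the researcher wants to supply multiple keywords to search for. Complete the function below to help her.

(You're encouraged to use the word_search function you just wrote when implementing this function. Reusing code in this way makes your programs more robust and readable, and it saves typing!)
1. Rewrite the search logic inline, looping over the keywords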

def multi_word_search(doc_list, keywords):
    """
    Takes list of documents (each document is a string) and a list of keywords.  
    Returns a dictionary where each key is a keyword, and the value is a list of indices
    (from doc_list) of the documents containing that keyword

    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car and a casino", "Casinoville"]
    >>> keywords = ['casino', 'they']
    >>> multi_word_search(doc_list, keywords)
    {'casino': [0, 1], 'they': [1]}
    """
    # dictionary mapping each keyword to the list of indices of matching documents
    dictionary = {}
    for keyword in keywords:
        indices = []
        # Iterate through the indices (i) and elements (doc) of documents
        for i, doc in enumerate(doc_list):
            # Split the string doc into a list of words (according to whitespace)
            tokens = doc.split()
            # Make a transformed list where we 'normalize' each word to facilitate matching.
            # Periods and commas are removed from the end of each word, and it's set to all lowercase.
            normalized = [token.rstrip('.,').lower() for token in tokens]
            # Is there a match? If so, update the list of matching indices.
            if keyword.lower() in normalized:
                indices.append(i)
        dictionary[keyword] = indices
    return dictionary

# Check your answer (q3 is only defined inside the Kaggle exercise notebook)
q3.check()

2. Call the word_search(doc_list, keyword) function implemented earlier

def multi_word_search(doc_list, keywords):
    """
    Takes list of documents (each document is a string) and a list of keywords.  
    Returns a dictionary where each key is a keyword, and the value is a list of indices
    (from doc_list) of the documents containing that keyword

    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car and a casino", "Casinoville"]
    >>> keywords = ['casino', 'they']
    >>> multi_word_search(doc_list, keywords)
    {'casino': [0, 1], 'they': [1]}
    """
    keyword_to_indices = {}
    for keyword in keywords:
        keyword_to_indices[keyword] = word_search(doc_list, keyword)
    return keyword_to_indices
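
The docstring example can be reproduced end to end as follows. This is a sketch rather than the official answer: word_search is repeated so the snippet runs on its own, and the dict comprehension is an equivalent, more compact way to write the same loop.

```python
def word_search(doc_list, keyword):
    # Same logic as the official word_search, repeated for self-containment
    indices = []
    for i, doc in enumerate(doc_list):
        normalized = [token.rstrip('.,').lower() for token in doc.split()]
        if keyword.lower() in normalized:
            indices.append(i)
    return indices

def multi_word_search(doc_list, keywords):
    # One word_search call per keyword, collected into a dictionary
    return {keyword: word_search(doc_list, keyword) for keyword in keywords}

doc_list = ["The Learn Python Challenge Casino.", "They bought a car and a casino", "Casinoville"]
result = multi_word_search(doc_list, ['casino', 'they'])
print(result)  # {'casino': [0, 1], 'they': [1]}
```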

Original article: https://blog.csdn.net/qq_38473254/article/details/142991757
