An RSS Feed Collector for Large-Model Articles, Implemented in Python

Published 2024-10-11 19:01. Tags: python, programming languages, artificial intelligence

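This tutorial walks through building a reliable RSS feed collector in Python. The tool automatically collects and filters articles from Zhihu, blogs, and other platforms that expose RSS feeds.

It is also the code that powers our own technical blog site.
The sections below break down the details; the complete source is at: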

https://github.com/llama-factory/llamafactory/blob/main/Python实现大模型文章的RSS订阅采集器.md

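Project overview

The RSS feed collector performs the following tasks:

Read settings from a YAML configuration file
Fetch RSS feeds from multiple sources
Parse each feed and extract the relevant fields
Filter content by relevance
Store the collected data in a MySQL database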

Environment setup

Install the following Python packages:

pip install requests PyYAML pymysql schedule tqdm

Project structure

The project consists of two main files:

collect.py - the main Python script
config.yaml - the configuration file

Example configuration file

sources:
  知乎:
    - https://www.zhihu.com/people/username/posts
  科学空间:
    - https://spaces.ac.cn/feed
  博客:
    - https://example.com/feed.xml

rss_templates:
  zhihu: "https://rsshub.example.com/zhihu/posts/people/{uid}"

database:
  host: "localhost"
  user: "username"
  password: "password"
  database: "dbname"
  table: "rss_feed_data"

workflow:
  base_url: "https://api.example.com/v1/workflows/run"
  api_key: "your-api-key"
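
The configuration lists a Zhihu profile page rather than a feed URL, and separately provides an rss_templates entry with a {uid} placeholder. The excerpts in this article do not show how the two are combined. A minimal sketch, assuming the uid is the path segment after /people/ (the helper name resolve_feed_url and the regular expression are hypothetical, not taken from the original source):

```python
import re

# Hypothetical helper: the original article does not show this step.
ZHIHU_PROFILE = re.compile(r"^https?://www\.zhihu\.com/people/([^/?#]+)")


def resolve_feed_url(url, rss_templates):
    """Return the feed URL to fetch for a configured source URL."""
    match = ZHIHU_PROFILE.match(url)
    if match and "zhihu" in rss_templates:
        # Fill the {uid} placeholder from config.yaml with the profile id
        return rss_templates["zhihu"].format(uid=match.group(1))
    # Anything else is assumed to already be a feed URL
    return url


templates = {"zhihu": "https://rsshub.example.com/zhihu/posts/people/{uid}"}
print(resolve_feed_url("https://www.zhihu.com/people/username/posts", templates))
print(resolve_feed_url("https://spaces.ac.cn/feed", templates))
```

URLs that do not match the Zhihu pattern pass through unchanged, so ordinary feeds such as the blog entries need no template.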

Detailed implementation

1. Create the base class structure

import logging
import sys
from logging.handlers import RotatingFileHandler


class RSSFeedCollector:
    def __init__(self):
        self.logger = self._setup_logger()

    def _setup_logger(self):
        logger = logging.getLogger("RSSFeedCollector")
        logger.setLevel(logging.INFO)
        if logger.handlers:
            # Already configured (e.g. by a second instance): avoid duplicate output
            return logger
        formatter = logging.Formatter(
            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
        )

        # Rotate the log file at 5 MB and keep 3 backups
        file_handler = RotatingFileHandler(
            "rss_collector.log", maxBytes=5 * 1024 * 1024, backupCount=3,
            encoding="utf-8",
        )
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

        # Also log to the console
        console_handler = logging.StreamHandler(sys.stdout)
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)

        return logger

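2. Load the configuration file

import yaml  # PyYAML; belongs at the top of collect.py

def load_config(self):
    # config.yaml contains non-ASCII keys, so read it explicitly as UTF-8
    with open("config.yaml", "r", encoding="utf-8") as file:
        config = yaml.safe_load(file)
    # Invert {source_type: [urls]} into {url: source_type}
    self.SOURCE = {
        url: source_type
        for source_type, urls in config["sources"].items()
        for url in urls
    }
    self.db_config = config["database"]
    # "table" is not a connection setting, so take it out of db_config
    self.table = self.db_config.pop("table")
    self.workflow_config = config["workflow"]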
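
To make the shape of self.SOURCE concrete, the snippet below applies the same comprehension to a plain dict that mirrors the sources section of the example configuration. The variable names are illustrative only:

```python
# Same shape as the "sources" section of config.yaml after yaml.safe_load
sources = {
    "知乎": ["https://www.zhihu.com/people/username/posts"],
    "科学空间": ["https://spaces.ac.cn/feed"],
    "博客": ["https://example.com/feed.xml"],
}

# Same comprehension as in load_config: one entry per URL, valued by its source type
source_by_url = {
    url: source_type
    for source_type, urls in sources.items()
    for url in urls
}

for url, source_type in source_by_url.items():
    print(f"{source_type}: {url}")
```

Because the URL is the key, a URL listed under two source types keeps only the last type. After table is popped, the remaining database keys (host, user, password, database) match keyword arguments that pymysql.connect accepts, which is presumably why the table name is removed first.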

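3. Fetch and parse RSS data

import xml.etree.ElementTree as ET  # these imports belong at the top of collect.py

import requests

def fetch_rss_data(self, url):
    try:
        # A timeout keeps one unresponsive feed from hanging the whole run
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # Parse the raw bytes so the XML declaration decides the encoding
        return ET.fromstring(response.content)
    except requests.RequestException as e:
        self.logger.error(f"Failed to fetch RSS data: {url}, error: {e}")
        raise RSSFetchError(f"Could not fetch RSS data: {url}") from e
    except ET.ParseError as e:
        self.logger.error(f"Failed to parse RSS XML: {url}, error: {e}")
        raise RSSParseError(f"Could not parse RSS XML: {url}") from e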

def parse_channel_info(self, channel):
    return {
        "channel_title": channel.findtext("title"),
        "channel_link": channel.findtext("link"),
        "channel_description": channel.findtext("description"),
        "language": channel.findtext("language"),
    }

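from email.utils import parsedate_to_datetime  # belongs at the top of collect.py

def parse_rss_item(self, item):
    pubDate = item.findtext("pubDate")
    if pubDate:
        try:
            # RSS 2.0 dates use the RFC 822 style, e.g. "Fri, 11 Oct 2024 19:01:00 +0800"
            dt = parsedate_to_datetime(pubDate)
            pubDate = dt.strftime("%Y-%m-%d %H:%M:%S")
        except (TypeError, ValueError):
            # Python < 3.10 raises TypeError for unparseable input, newer versions ValueError
            self.logger.warning(f"Could not parse date: {pubDate}")
            pubDate = None

    content = item.findtext("description") or ""
    # extract_text and smart_truncate are helper methods not shown in this excerpt
    description = self.smart_truncate(
        self.extract_text(content), length=200, max_length=250
    )

    return {
        "title": item.findtext("title"),
        "description": description,
        "link": item.findtext("link"),
        "pubDate": pubDate,
        "content": content,
    }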
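
parse_rss_item calls extract_text and smart_truncate, which this article does not show. The sketch below is one possible reading, written as plain functions using only the standard library. It assumes that extract_text strips HTML tags and that smart_truncate cuts at the first sentence end between length and max_length; both behaviours are guesses from the names and arguments, not the original implementation:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Strip HTML tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def smart_truncate(text, length=200, max_length=250):
    """Cut near `length`, preferring a sentence end before `max_length`."""
    if len(text) <= length:
        return text
    # Look for sentence-ending punctuation between length and max_length
    for i in range(length, min(max_length, len(text))):
        if text[i] in "。！？.!?":
            return text[: i + 1]
    # No sentence boundary in range: hard cut with an ellipsis
    return text[:length] + "..."


print(extract_text("<p>Hello <b>world</b></p>\n<p>again</p>"))
```

Note that this simple extractor also keeps the text inside script and style elements, which may need filtering for feeds that embed them.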

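4. Content filtering

from tqdm import tqdm  # belongs at the top of collect.py

def filter_relevant_blogs(self, data_list):
    results = []
    for blog in tqdm(data_list, desc="Filtering articles"):
        # check_blog_relevance is not shown in this excerpt
        is_relevant = self.check_blog_relevance(blog)
        if is_relevant:
            results.append(blog)

    filtered_count = len(data_list) - len(results)
    self.logger.info(f"Filtered out {filtered_count} irrelevant articles")
    return results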
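
check_blog_relevance is not shown in this article; given the workflow section of the configuration, it presumably calls an external API whose request format is not documented here. For local testing, a keyword match can stand in for it. The sketch below is that stand-in, not the original method, and the keyword list is purely hypothetical:

```python
# Hypothetical keyword list; tune it to your own topic.
KEYWORDS = ("llm", "large language model", "大模型", "transformer")


def check_blog_relevance(blog, keywords=KEYWORDS):
    """Return True if the title or description mentions any keyword."""
    text = f"{blog.get('title') or ''} {blog.get('description') or ''}".lower()
    return any(keyword.lower() in text for keyword in keywords)


posts = [
    {"title": "Fine-tuning an LLM on a single GPU", "description": ""},
    {"title": "Weekend hiking notes", "description": "Trails and weather"},
]
relevant = [post for post in posts if check_blog_relevance(post)]
print([post["title"] for post in relevant])
```

Plain substring matching is crude (a short keyword can match inside an unrelated word), so it is best treated as a placeholder while the real relevance check is unavailable.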

5. Database operations

import pymysql  # belongs at the top of collect.py

def insert_into_database(self, connection, cursor, data_list):
    if not data_list:
        self.logger.info("No relevant articles to insert into the database")
        return

    # The table name comes from config.yaml, not from user input.
    # ON DUPLICATE KEY UPDATE relies on a unique key on the table (presumably on link).
    insert_query = f"""
    INSERT INTO {self.table}
    (title, description, link, pubDate, content)
    VALUES (%s, %s, %s, %s, %s)
    ON DUPLICATE KEY UPDATE
    title = VALUES(title),
    description = VALUES(description),
    content = VALUES(content)
    """

    batch_size = 100
    success_count = 0
    error_count = 0

    for i in range(0, len(data_list), batch_size):
        batch = data_list[i:i + batch_size]
        try:
            cursor.executemany(insert_query, [
                (item["title"], item["description"], item["link"],
                 item["pubDate"], item["content"])
                for item in batch
            ])
            connection.commit()
            success_count += len(batch)
        except pymysql.MySQLError as e:
            # Roll back the whole batch and count every row in it as failed
            connection.rollback()
            error_count += len(batch)
            self.logger.error(f"Batch insert error: {e}")

    self.logger.info(f"Insert finished. Succeeded: {success_count}, failed: {error_count}")
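
The method above counts an entire batch as failed when a single row is rejected. One way to isolate the offending row is to retry a failed batch one row at a time. The sketch below illustrates that idea with the standard-library sqlite3 module swapped in for MySQL so that it runs without a database server; with pymysql the placeholders would be %s and the exception pymysql.MySQLError. The function name insert_with_fallback and the feed table are hypothetical:

```python
import sqlite3


def insert_with_fallback(connection, query, rows):
    """Try one batch insert; on failure, retry the rows one at a time."""
    cursor = connection.cursor()
    try:
        cursor.executemany(query, rows)
        connection.commit()
        return len(rows), 0
    except sqlite3.Error:
        # Undo whatever part of the batch was applied before the error
        connection.rollback()

    success, failed = 0, 0
    for row in rows:
        try:
            cursor.execute(query, row)
            connection.commit()
            success += 1
        except sqlite3.Error:
            connection.rollback()
            failed += 1
    return success, failed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed (title TEXT NOT NULL, link TEXT PRIMARY KEY)")
rows = [
    ("First post", "https://example.com/1"),
    (None, "https://example.com/2"),  # violates NOT NULL, so the batch fails
    ("Third post", "https://example.com/3"),
]
result = insert_with_fallback(conn, "INSERT INTO feed (title, link) VALUES (?, ?)", rows)
print(result)
```

In this run the batch is rolled back, then two rows are stored individually and one is reported as failed. The cost is one commit per row on the slow path, which only applies to batches that already failed.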

Running the collector

You can run the collector in two ways:

Run a single collection immediately:

python collect.py --now

Start the scheduled job (runs once every 3 days):

python collect.py
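
The article does not show how the --now flag and the 3-day schedule are wired together. A minimal sketch of such an entry point, assuming argparse for the flag and the schedule package from the install list for the timer; run_collection is a hypothetical placeholder for the real fetch, filter, and store pipeline:

```python
import argparse
import time


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="RSS feed collector")
    parser.add_argument(
        "--now", action="store_true", help="run one collection and exit"
    )
    return parser.parse_args(argv)


def run_collection():
    # Placeholder for the real pipeline (fetch, filter, store)
    print("collecting feeds...")


def main(argv=None, job=run_collection):
    args = parse_args(argv)
    if args.now:
        job()
        return

    import schedule  # third-party; imported only when the scheduler is used

    schedule.every(3).days.do(job)
    while True:
        # Run any job that is due, then sleep briefly before checking again
        schedule.run_pending()
        time.sleep(60)


# In collect.py, main() would be called under the usual __main__ guard.
```

With this layout the scheduled mode performs its first run three days after start-up; calling job() once before entering the loop would change that if an immediate first run is wanted.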

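Best practices and error handling

Log rotation: RotatingFileHandler keeps the log file from growing without bound.
Batched database operations: batch inserts improve throughput. The excerpt in section 5 rolls back and logs a failed batch; retrying such a batch row by row would isolate the offending record.
Graceful error handling: custom exception classes and try/except blocks cover the errors that can occur: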

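class RSSFetchError(Exception):
    """Raised when fetching RSS data fails."""
    pass

class RSSParseError(Exception):
    """Raised when parsing RSS data fails."""
    pass

Progress display: tqdm shows processing progress, which makes long runs easier to follow.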

Further optimization ideas

Proxy support: accept a proxies argument in fetch_rss_data to reduce the risk of IP bans:

def fetch_rss_data(self, url, proxies=None):
    try:
        # proxies maps a scheme ("http", "https") to a proxy URL; None means a direct connection
        response = requests.get(url, proxies=proxies, timeout=30)
        response.raise_for_status()
        return ET.fromstring(response.content)
    except requests.RequestException as e:
        self.logger.error(f"Failed to fetch RSS data: {url}, error: {e}")
        raise RSSFetchError(f"Could not fetch RSS data: {url}") from e

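Retry mechanism: on unstable networks, retry failed requests:

from retrying import retry  # third-party: pip install retrying

# Up to 3 attempts, waiting 2000 ms between them
@retry(stop_max_attempt_number=3, wait_fixed=2000)
def fetch_rss_data(self, url):
    ...  # original implementation unchanged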
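
If adding another dependency is undesirable, a fixed-wait retry can also be written with the standard library alone. The decorator below is a sketch of that alternative, not the retrying package's implementation; its parameter names are made up for this example:

```python
import functools
import time


def retry(max_attempts=3, wait_seconds=2.0, exceptions=(Exception,)):
    """Retry the wrapped function on the given exceptions, with a fixed wait."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(wait_seconds)

        return wrapper

    return decorator


attempts = []


@retry(max_attempts=3, wait_seconds=0, exceptions=(ConnectionError,))
def flaky_fetch():
    # Fails twice, then succeeds, to exercise the retry path
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(flaky_fetch(), len(attempts))
```

Restricting exceptions to transient failures matters here: retrying a feed that returns malformed XML would only repeat the same parse error.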

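Async support: use aiohttp and asyncio to fetch feeds concurrently:

import aiohttp  # third-party: pip install aiohttp

async def fetch_rss_data_async(self, url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            response.raise_for_status()  # fail on 4xx/5xx responses
            text = await response.text()
            return ET.fromstring(text)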
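
An async fetch function only pays off when several feeds are requested at once. The sketch below shows one way to do that with asyncio.gather and a semaphore that caps concurrency. The fetch coroutine is passed in as a parameter, and a fake one is used here so the example runs without network access; fetch_all and fake_fetch are hypothetical names:

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=5):
    """Run fetch(url) for every URL, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:  # keep one bad feed from cancelling the rest
                return url, exc

    return dict(await asyncio.gather(*(guarded(url) for url in urls)))


async def fake_fetch(url):
    # Stand-in for fetch_rss_data_async so the sketch runs offline
    await asyncio.sleep(0)
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"<rss from {url}>"


results = asyncio.run(
    fetch_all(["https://a.example/feed", "https://bad.example/feed"], fake_fetch)
)
for url, outcome in results.items():
    print(url, "->", outcome)
```

Failures are returned as exception objects alongside successful results, so the caller can log them per feed instead of losing the whole batch.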

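Summary

This project implements a complete RSS feed collector. It gathers RSS content from multiple sources automatically, filters it for relevance, and stores the result. Logging, error handling, and batched database writes keep the program reliable and efficient.

You can build on this base and adapt it to your own needs, for example by adding more sources, implementing more sophisticated filtering, or integrating it into other systems.

When making network requests, remember to add suitable delays between them and to follow each site's robots.txt rules, so that the crawler behaves responsibly. Also back up your database regularly in case something goes wrong.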
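
The advice about delays and robots.txt can be sketched with the standard library's urllib.robotparser. The example parses robots.txt text directly so that it runs offline; the user agent string and function names are hypothetical:

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "rss-collector-example"  # hypothetical user agent string


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_fetch_order(urls, rules, delay_seconds=1.0, sleep=time.sleep):
    """Yield only the URLs robots.txt allows, pausing between them."""
    first = True
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt: skip it
        if not first:
            sleep(delay_seconds)
        first = False
        yield url


rules = build_robot_rules("User-agent: *\nDisallow: /private/\n")
urls = ["https://example.com/feed.xml", "https://example.com/private/feed.xml"]
allowed = list(polite_fetch_order(urls, rules, delay_seconds=0))
print(allowed)
```

Against a live site, the rules would instead be loaded with set_url pointing at the site's robots.txt followed by read, once per host.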

Original article: https://blog.csdn.net/budahui/article/details/142798973
