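Scraping Resume Templates from a Website with Python (bs4/lxml + Coroutines)

🕗 Published 2024-12-12 19:40 · python, programming languages

Scraping Resume Templates from Chinaz (站长素材) with Python

Introduction

In this tutorial we scrape the free resume templates listed on the Chinaz material site (站长素材, sc.chinaz.com) with Python. We send HTTP requests with requests and parse the HTML first with BeautifulSoup and then with lxml, and compare the two approaches. The last part rewrites the scraper with asyncio and aiohttp to shorten the run time.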

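Environment setup

First make sure Python is installed. Then install the libraries used below (aiohttp and aiofiles are needed only for the asynchronous versions later in the tutorial):

pip install requests beautifulsoup4 lxml aiohttp aiofiles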

Method 1: BeautifulSoup

1. Import the libraries

import requests
from bs4 import BeautifulSoup
import os

2. Create a folder for the downloaded resume images

if not os.path.exists("resume_templates_images"):
    os.makedirs("resume_templates_images")

3. Scrape the first page

first_page_url = "https://sc.chinaz.com/jianli/free.html"
response = requests.get(first_page_url)
response.encoding = 'utf-8'

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    templates = soup.find_all('div', class_='box col3 ws_block')

    for template in templates:
        link = template.find('a', target='_blank')['href']
        img = template.find('img')['src']

        if img.startswith('//'):
            img = 'https:' + img

        title = template.find('p').find('a').text.strip()

        img_response = requests.get(img)
        if img_response.status_code == 200:
            img_name = f"{title.replace(' ', '_')}.jpg"
            img_path = os.path.join("resume_templates_images", img_name)
            with open(img_path, 'wb') as f:
                f.write(img_response.content)
        else:
            print(f"Failed to download image {img}, status code: {img_response.status_code}")
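
The loop above turns a protocol-relative image URL into an absolute one and builds the file name from the template title. A title may contain characters that are not valid in file names, so the two steps can be isolated into small helpers. The following is a minimal sketch; the helper names normalize_img_url and safe_filename, and the set of replaced characters, are assumptions made for illustration and are not part of the original script.

```python
import re


def normalize_img_url(src: str, scheme: str = "https") -> str:
    """Turn a protocol-relative URL ('//host/path') into an absolute one."""
    if src.startswith("//"):
        return f"{scheme}:{src}"
    return src


def safe_filename(title: str, ext: str = ".jpg") -> str:
    """Build a file name from a title.

    Runs of whitespace and characters that are invalid on common file
    systems are replaced by a single underscore (illustrative rule set).
    """
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title.strip())
    cleaned = cleaned.strip("_") or "untitled"
    return cleaned + ext


print(normalize_img_url("//example.com/a.jpg"))
print(safe_filename("Sales / Marketing resume"))
```

Inside the loop, these helpers would stand in for the startswith check and the title.replace(' ', '_') expression.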

4. Scrape pages 2 through 5

base_url = "https://sc.chinaz.com/jianli/free_"
for page_num in range(2, 6):
    url = f"{base_url}{page_num}.html"
    response = requests.get(url)
    response.encoding = 'utf-8'

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        templates = soup.find_all('div', class_='box col3 ws_block')

        for template in templates:
            link = template.find('a', target='_blank')['href']
            img = template.find('img')['src']

            if img.startswith('//'):
                img = 'https:' + img

            title = template.find('p').find('a').text.strip()

            img_response = requests.get(img)
            if img_response.status_code == 200:
                img_name = f"{title.replace(' ', '_')}.jpg"
                img_path = os.path.join("resume_templates_images", img_name)
                with open(img_path, 'wb') as f:
                    f.write(img_response.content)
            else:
                print(f"Failed to download image {img}, status code: {img_response.status_code}")
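
Steps 3 and 4 repeat the same loop body and differ only in the URL: the first page is free.html and later pages are free_<n>.html. One way to avoid the duplication is to generate all list-page URLs up front and run a single loop over them. This is only a sketch; the function name page_urls and its parameters are hypothetical.

```python
def page_urls(last_page: int, base: str = "https://sc.chinaz.com/jianli/free") -> list[str]:
    """Return the list-page URLs for pages 1..last_page.

    Page 1 is 'free.html'; later pages are 'free_<n>.html', matching the
    URLs used in steps 3 and 4.
    """
    urls = []
    for n in range(1, last_page + 1):
        suffix = ".html" if n == 1 else f"_{n}.html"
        urls.append(base + suffix)
    return urls


for url in page_urls(5):
    print(url)
```

The download loop from step 3 can then be placed inside one "for url in page_urls(5):" loop.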

Method 2: lxml

import os
import requests
from lxml import etree

first_page_url = "https://sc.chinaz.com/jianli/free.html"
response = requests.get(first_page_url)
response.encoding = 'utf-8'

if response.status_code == 200:
    tree = etree.HTML(response.text)
    templates = tree.xpath('//div[@class="box col3 ws_block"]')

    for template in templates:
        link = template.xpath('.//a[@target="_blank"]/@href')[0]
        img = template.xpath('.//img/@src')[0]

        if img.startswith('//'):
            img = 'https:' + img

        title = template.xpath('.//p/a[@class="title_wl"]/text()')[0].strip()

        img_response = requests.get(img)
        if img_response.status_code == 200:
            img_name = f"{title.replace(' ', '_')}.jpg"
            img_path = os.path.join("resume_templates_images", img_name)
            with open(img_path, 'wb') as f:
                f.write(img_response.content)
        else:
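            print(f"Failed to download image {img}, status code: {img_response.status_code}")

The flow is the same as in Method 1; only the element lookups change, and they now use lxml's xpath method.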

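Comparing the two methods

• Parsing speed: lxml is usually faster than BeautifulSoup, especially on large HTML documents.

• Ease of use: BeautifulSoup offers more intuitive lookup methods such as find and find_all, while lxml relies on xpath, which takes more time to learn.

• Flexibility: xpath is more flexible for locating elements in complex HTML structures, at the cost of more complex queries.

Running the code above shows that it takes a long time to finish. Is there a way to shorten the run time? The script below rewrites the scraper with asyncio and aiohttp.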

import asyncio
import aiohttp
from bs4 import BeautifulSoup
import os
import time  # used to measure the run time

# Create the resume_templates_images folder for the downloaded images
if not os.path.exists("resume_templates_images"):
    os.makedirs("resume_templates_images")

# Collects the template data from all pages
all_template_data = []

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def parse_page(session, url):
    soup = BeautifulSoup(await fetch(session, url), 'html.parser')
    templates = soup.find_all('div', class_='box col3 ws_block')

    for template in templates:
        link = template.find('a', target='_blank')['href']
        img = template.find('img')['src']

        if img.startswith('//'):
            img = 'https:' + img

        title = template.find('p').find('a').text.strip()

        async with session.get(img) as img_response:
            if img_response.status == 200:
                img_name = f"{title.replace(' ', '_')}.jpg"
                img_path = os.path.join("resume_templates_images", img_name)
                with open(img_path, 'wb') as f:
                    f.write(await img_response.read())

        all_template_data.append({
            'title': title,
            'img_url': img,
            'link': link
        })

async def main():
    start_time = time.time()  # record the start time

    async with aiohttp.ClientSession() as session:
        # Process the first page
        await parse_page(session, "https://sc.chinaz.com/jianli/free.html")

        # Process pages 2 through 5
        for page_num in range(2, 6):
            url = f"https://sc.chinaz.com/jianli/free_{page_num}.html"
            await parse_page(session, url)

        # Print the template data collected from all pages
        for idx, data in enumerate(all_template_data, 1):
            print(f"Template {idx}:")
            print(f"Name: {data['title']}")
            print(f"Image URL: {data['img_url']}")
            print(f"Template URL: {data['link']}")
            print("=" * 50)

    end_time = time.time()  # record the end time
    run_time = end_time - start_time  # compute the elapsed time
    print(f"Total run time: {run_time:.2f} s")

if __name__ == "__main__":
    asyncio.run(main())

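This script uses asyncio and aiohttp to scrape the resume templates asynchronously. The key points are:

• parse_page function: an asynchronous function that parses a page, extracts each template's link and image URL, and downloads the image.

• Asynchronous I/O: with asyncio and aiohttp, the program can run other tasks while it waits for a network response instead of blocking. The gain only appears when several tasks are in flight at once, which matters most when many pages have to be processed.

As written, however, the script still crawls the pages one after another, because main awaits each parse_page call before starting the next one. A faster approach is to run the page tasks concurrently:

• Concurrent requests: use asyncio.gather to start several parse_page tasks at the same time.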
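
The difference between awaiting coroutines one by one and handing them to asyncio.gather can be seen without any network access. The sketch below replaces the real request with asyncio.sleep; fake_fetch, run_sequential, run_concurrent and timed are hypothetical names used only for this illustration.

```python
import asyncio
import time


async def fake_fetch(page: int, delay: float = 0.2) -> str:
    """Stand-in for a network request: just waits `delay` seconds."""
    await asyncio.sleep(delay)
    return f"page {page}"


async def run_sequential(pages):
    # Each await finishes before the next request starts.
    return [await fake_fetch(p) for p in pages]


async def run_concurrent(pages):
    # All coroutines are scheduled together, so their waits overlap.
    return await asyncio.gather(*(fake_fetch(p) for p in pages))


def timed(coro_fn, pages):
    """Run coro_fn(pages) in a fresh event loop and measure the elapsed time."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(pages))
    return result, time.perf_counter() - start


seq_result, seq_time = timed(run_sequential, range(1, 6))
con_result, con_time = timed(run_concurrent, range(1, 6))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

With five simulated pages, the sequential run should take roughly five times the single delay, while the concurrent run should take roughly one delay.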

Modifying the code for concurrent requests

The main function can be changed as follows to issue the requests concurrently:

async def main():
    start_time = time.time()  # record the start time

    async with aiohttp.ClientSession() as session:
        # Process the first page
        tasks = [parse_page(session, "https://sc.chinaz.com/jianli/free.html")]

        # Process pages 2 through 5, concurrently
        for page_num in range(2, 6):
            url = f"https://sc.chinaz.com/jianli/free_{page_num}.html"
            tasks.append(parse_page(session, url))

        # Wait until all pages have been processed
        await asyncio.gather(*tasks)

        # Print the template data collected from all pages
        for idx, data in enumerate(all_template_data, 1):
            print(f"Template {idx}:")
            print(f"Name: {data['title']}")
            print(f"Image URL: {data['img_url']}")
            print(f"Template URL: {data['link']}")
            print("=" * 50)

    end_time = time.time()  # record the end time
    run_time = end_time - start_time  # compute the elapsed time
    print(f"Total run time: {run_time:.2f} s")


if __name__ == "__main__":
    asyncio.run(main())

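In this modified version, all page tasks are collected in a list and then run concurrently with asyncio.gather. Several requests are sent at the same time instead of waiting for one request to finish before sending the next, which speeds up the crawl as a whole.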
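
When many pages are gathered at once, it can be useful to cap how many requests run at the same time so the target site is not flooded. The original script does not do this; the following is a sketch of one possible approach using asyncio.Semaphore, with the request simulated by asyncio.sleep. The names limited_fetch and crawl, and the limit of 2, are assumptions for illustration.

```python
import asyncio


async def limited_fetch(page, sem, stats, delay=0.05):
    """Simulated request that only runs while holding the semaphore."""
    async with sem:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(delay)  # placeholder for the real session.get(...)
        stats["active"] -= 1
        return page


async def crawl(pages, limit=2):
    sem = asyncio.Semaphore(limit)  # at most `limit` requests in flight
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(limited_fetch(p, sem, stats) for p in pages))
    return results, stats["peak"]


results, peak = asyncio.run(crawl(range(1, 6), limit=2))
print(results, "peak concurrency:", peak)
```

In the real scraper, the same "async with sem:" block could wrap the body of parse_page.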
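The complete script below additionally writes the files asynchronously with aiofiles and wraps each coroutine in a task with asyncio.create_task: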

import asyncio
import aiohttp
from bs4 import BeautifulSoup
import os
import time
import aiofiles

# Create the resume_templates_images folder for the downloaded images
if not os.path.exists("resume_templates_images"):
    os.makedirs("resume_templates_images")

# Collects the template data from all pages
all_template_data = []

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()  # response body as a string

async def parse_page(session, url):
    soup = BeautifulSoup(await fetch(session, url), 'html.parser')
    templates = soup.find_all('div', class_='box col3 ws_block')

    for template in templates:
        link = template.find('a', target='_blank')['href']
        img = template.find('img')['src']

        if img.startswith('//'):
            img = 'https:' + img

        title = template.find('p').find('a').text.strip()

        async with session.get(img) as img_response:
            if img_response.status == 200:
                # The response body is the preview image itself, so keep an image
                # extension; a .rar suffix would not turn it into an archive.
                file_type = ".jpg"
                img_name = f"{title.replace(' ', '_')}{file_type}"  # change file_type to use another extension
                img_path = os.path.join("resume_templates_images", img_name)
                async with aiofiles.open(img_path, 'wb') as f:
                    await f.write(await img_response.read())  # read() returns the raw bytes

        all_template_data.append({
            'title': title,
            'img_url': img,
            'link': link
        })

async def main():
    start_time = time.time()  # record the start time

    async with aiohttp.ClientSession() as session:
        # Build the task list
        tasks = []

        # Process the first page
        task = asyncio.create_task(parse_page(session, "https://sc.chinaz.com/jianli/free.html"))
        tasks.append(task)

        # Process pages 2 through 5, concurrently
        for page_num in range(2, 6):
            url = f"https://sc.chinaz.com/jianli/free_{page_num}.html"
            task = asyncio.create_task(parse_page(session, url))
            tasks.append(task)

        # Wait for all pages to finish. asyncio.gather runs several awaitables
        # concurrently and waits for all of them to complete. Unlike asyncio.wait,
        # it also returns their results as a list.
        await asyncio.gather(*tasks)

        # Print the template data collected from all pages
        for idx, data in enumerate(all_template_data, 1):
            print(f"Template {idx}:")
            print(f"Name: {data['title']}")
            print(f"Image URL: {data['img_url']}")
            print(f"Template URL: {data['link']}")
            print("=" * 50)

    end_time = time.time()  # record the end time
    run_time = end_time - start_time  # compute the elapsed time
    print(f"Total run time: {run_time:.2f} s")

if __name__ == "__main__":
    asyncio.run(main())
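
Because the pages now finish in an unpredictable order, entries appended to the shared all_template_data list may no longer follow page order. Since asyncio.gather returns the results in the order the awaitables were passed in, one alternative is to let each page coroutine return its records. The sketch below illustrates this with a simulated parser; parse_page_stub and collect_all are hypothetical names, and the sleep only imitates pages completing out of order.

```python
import asyncio


async def parse_page_stub(page: int) -> list[dict]:
    """Hypothetical stand-in for parse_page that returns its records."""
    # Later pages finish first, to show that completion order does not matter.
    await asyncio.sleep(0.05 * (6 - page))
    return [{"title": f"template-{page}-{i}", "page": page} for i in range(2)]


async def collect_all() -> list[dict]:
    per_page = await asyncio.gather(*(parse_page_stub(p) for p in range(1, 6)))
    # gather returns results in the order the awaitables were passed in.
    return [record for records in per_page for record in records]


all_records = asyncio.run(collect_all())
print([r["page"] for r in all_records])
```

The records come back grouped by page in ascending order even though page 5 completes first.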

Original article: https://blog.csdn.net/F2022697486/article/details/144338843
