Life Is Short, I Use Python (Part 11): A Simple Introduction to Python Web Scraping

Published 2024-07-27 09:26 · Tags: python, web scraping, programming languages

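Python has many libraries for web scraping; the most commonly used include requests and BeautifulSoup. Below are detailed steps and examples showing how to use these libraries to scrape data.

1. Install the dependencies
First, make sure the requests and BeautifulSoup libraries are installed. If they are not, install them with the following commands: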

pip install requests
pip install beautifulsoup4

2. Fetch page content with requests
The requests library sends HTTP requests and receives the responses. The following example fetches the content of a web page:

import requests

url = 'https://example.com'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    print(html_content)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
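
In practice a request can also hang or fail before any status code comes back. The following is a minimal sketch of a more defensive fetch. The function name fetch_html and the 10-second default timeout are assumptions chosen for illustration and are not part of the original example.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper name used only in this sketch.
    """
    try:
        # Without a timeout, requests may wait indefinitely for a response.
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for connection errors,
        # timeouts, invalid URLs, and HTTP errors raised above.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling fetch_html('https://example.com') would then return the HTML text on success and None on any request error, so the caller only has to check for None.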

3. Parse HTML content with BeautifulSoup
BeautifulSoup is a library for parsing HTML and XML documents. The following example parses HTML content:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')

# Find all top-level heading (h1) tags
titles = soup.find_all('h1')
for title in titles:
    print(title.get_text())

# Find one specific tag by its class
specific_div = soup.find('div', {'class': 'specific-class'})
if specific_div:
    print(specific_div.get_text())
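
BeautifulSoup also accepts CSS selectors through select(), which can be more compact than nested find calls. The sketch below runs against a small inline HTML string, so it needs no network access. SAMPLE_HTML and its contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used only for this illustration.
SAMPLE_HTML = """
<html><body>
  <h1>First heading</h1>
  <div class="specific-class">
    <a href="/a">Link A</a>
    <a href="/b">Link B</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect the text of every h1 tag, with surrounding whitespace stripped.
headings = [h1.get_text(strip=True) for h1 in soup.find_all("h1")]

# Select every link that has an href attribute inside div.specific-class,
# and keep both the link text and the href value.
links = [
    (a.get_text(strip=True), a["href"])
    for a in soup.select("div.specific-class a[href]")
]

print(headings)
print(links)
```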

4. A complete example
The following example puts the pieces together and shows how to scrape titles and links from a news site:

import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')

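    # Find all story entries. Hacker News no longer uses the old 'storylink'
    # class; at the time of writing, each story title is a link placed
    # directly inside <span class="titleline">.
    stories = soup.select('span.titleline > a')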
    for story in stories:
        title = story.get_text()
        link = story['href']
        print(f"Title: {title}")
        print(f"Link: {link}\n")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
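
The href values scraped from a page may be relative rather than absolute. A minimal sketch of resolving them against the page URL with the standard library follows. The href values in the list are hypothetical examples, not data taken from the site.

```python
from urllib.parse import urljoin

base_url = "https://news.ycombinator.com/"

# Hypothetical href values: one relative, one already absolute.
hrefs = ["item?id=1", "https://example.com/post"]

# urljoin resolves relative links against the base URL and leaves
# absolute links unchanged.
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```

In the loop above, this would amount to replacing story['href'] with urljoin(url, story['href']).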

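5. Handling pagination
Some sites spread their data across multiple pages, so pagination has to be handled. The following example shows one way to do it: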

import requests
from bs4 import BeautifulSoup

base_url = 'https://example.com/page/'
page_number = 1

while True:
    url = f"{base_url}{page_number}"
    response = requests.get(url)

    if response.status_code != 200:
        break

    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')

    items = soup.find_all('div', {'class': 'item'})
    if not items:
        break

    for item in items:
        title = item.find('h2').get_text()
        print(title)

    page_number += 1
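
The loop above keeps going until the site returns a non-200 status or an empty page, and it sends requests back to back. One possible refinement, sketched below, caps the number of pages and pauses between requests. The names crawl_pages, fetch_page, max_pages, and delay_seconds are assumptions introduced for this sketch; fetch_page stands in for whatever code downloads and parses a single page.

```python
import time


def crawl_pages(fetch_page, max_pages=50, delay_seconds=1.0):
    """Collect items page by page until a page is empty or the cap is hit.

    fetch_page(page_number) is a caller-supplied function that returns a
    list of items; an empty list (or None) means there are no more pages.
    """
    collected = []
    for page_number in range(1, max_pages + 1):
        items = fetch_page(page_number)
        if not items:
            break
        collected.extend(items)
        # Pause between pages to avoid hammering the server.
        time.sleep(delay_seconds)
    return collected


# Offline demonstration with fake pages instead of real HTTP requests.
fake_pages = {1: ["title 1", "title 2"], 2: ["title 3"]}
titles = crawl_pages(lambda n: fake_pages.get(n, []), delay_seconds=0)
print(titles)
```

With a real site, fetch_page would wrap the requests.get and soup.find_all calls from the example above.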

6. Handling dynamic content
For dynamically generated content, such as content loaded via JavaScript, you can use the Selenium library. To install it:

pip install selenium

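Example of fetching dynamic content with Selenium:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://example.com'
driver = webdriver.Chrome()  # or use the driver for another browser

try:
    driver.get(url)

    # Wait up to 10 seconds for the JavaScript-rendered items to appear;
    # reading page_source immediately after get() may miss them.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div.item'))
    )

    html_content = driver.page_source
    soup = BeautifulSoup(html_content, 'html.parser')

    # Parse the content
    items = soup.find_all('div', {'class': 'item'})
    for item in items:
        title = item.find('h2').get_text()
        print(title)
finally:
    driver.quit()  # always close the browser, even if an error occurs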

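7. Scraping etiquette
Respect the site's robots.txt file: it defines which pages crawlers are allowed to fetch.
Set a reasonable interval between requests: avoid sending requests too frequently, which puts load on the server.
Set a User-Agent: add a User-Agent to the request headers to indicate that the request was sent by a browser.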

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers)
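
The robots.txt rule mentioned above can be checked in code with the standard library's urllib.robotparser. The sketch below parses an inline robots.txt so that it runs offline. The file contents, the crawler name my-crawler, and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines. For a live site, one would
# instead call parser.set_url('https://example.com/robots.txt') and
# then parser.read().
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("my-crawler", "https://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")

print(allowed)  # True: the path is not covered by any Disallow rule
print(blocked)  # False: the path falls under Disallow: /private/
```

A scraper could call can_fetch before each request and skip any URL for which it returns False.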

Original article: https://blog.csdn.net/cs1395293598/article/details/140646870
