
Practice Project: A Python News Crawler

News Crawler

The script below crawls a chinanews.com channel list page, collects each article's category and link, then fetches every article's title, publish time, source, and body text, and writes the results to a MySQL database.

# coding=utf-8
from bs4 import BeautifulSoup
import requests
import random
import pymysql
links = []  # (category, relative link) pairs collected from the list page
hea = {
    # request header to mimic a desktop browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
}
# Channel list pages of chinanews.com (defined but unused: main() only crawls the Taiwan page)
urls = [
    "https://www.chinanews.com/china.shtml",       # domestic
    "https://www.chinanews.com/society.shtml",     # society
    "https://www.chinanews.com/compatriot.shtml",  # Hong Kong/Macau
    "https://www.chinanews.com/wenhua.shtml",      # culture
    "https://www.chinanews.com/world.shtml",       # international
    "https://www.chinanews.com/cj/gd.shtml",       # finance
    "https://www.chinanews.com/sports.shtml",      # sports
    "https://www.chinanews.com/huaren.shtml"       # overseas Chinese
]
# Open the database connection (note: MySQL's default port is 3306; adjust host/port/credentials to your setup)
db = pymysql.connect(host='127.0.0.1', user='root', password='123456', port=3396, db='news_recommendation_system')
# Get a cursor for executing SQL
cursor = db.cursor()

def main():
    baseurl = 'https://www.chinanews.com/taiwan.shtml'  # list page to crawl
    # deleteDate()  # optionally clear the table first
    # 1. Crawl the list page and collect the article links
    getLink(baseurl)
    # 2. Follow each link, scrape the article, and save it to the database
    getInformationAndSave()
    # 3. Close the database connection
    db.close()

def getInformationAndSave():
    for link in links:
        url = "https://www.chinanews.com" + link[1]
        cur_html = requests.get(url, headers=hea)
        cur_html.encoding = "utf8"
        soup = BeautifulSoup(cur_html.text, 'html.parser')
        # Get the title
        title = soup.find('h1')
        if title is None:  # skip pages that do not follow the article layout
            continue
        title = title.text.strip()
        # Get the publish time and source
        tr = soup.find('div', class_='left-t').text.split()
        time = tr[0] + tr[1]
        recourse = tr[2]
        # Get the article body
        cont = soup.find('div', class_="left_zw")
        content = cont.text.strip()
        print(link[0] + "---" + title + "---" + time + "---" + recourse + "---" + url)
        saveDate(title, content, time, recourse, url)

def deleteDate():
    sql = "DELETE FROM news"
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the change
        db.commit()
    except Exception:
        # Roll back on error
        db.rollback()

def saveDate(title, content, time, recourse, url):
    try:
        # Use a parameterized query so quotes in the scraped text cannot break the SQL
        cursor.execute(
            "INSERT INTO news(news_title, news_content, type_id, news_creatTime, news_recourse, news_link) "
            "VALUES (%s, %s, %s, %s, %s, %s)",
            (title, content, random.randint(1, 8), time, recourse, url))  # type_id is assigned randomly in the original script
        db.commit()
        print("insert succeeded")
    except Exception:
        db.rollback()
        print("insert failed")

def getLink(baseurl):
    html = requests.get(baseurl, headers=hea)
    html.encoding = 'utf8'
    soup = BeautifulSoup(html.text, 'html.parser')
    for item in soup.select('div.content_list > ul > li'):
        # Skip list items that carry no link (date separators etc.)
        if item.a is None:
            continue
        data = []
        news_type = item.div.text[1:3]  # two-character category label from the bracketed prefix
        link = item.div.next_sibling.next_sibling.a['href']  # relative URL of the article
        data.append(news_type)
        data.append(link)
        links.append(data)

if __name__ == '__main__':
    main()
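
The INSERT in saveDate assumes an existing news table with the six columns named there. Below is a minimal one-off setup sketch: the column names come from the script itself, but the types, lengths, and primary key are assumptions, not part of the original post.

# setup_news_table.py -- one-off schema setup (column types are assumptions)
import pymysql

db = pymysql.connect(host='127.0.0.1', user='root', password='123456',
                     port=3396, db='news_recommendation_system', charset='utf8mb4')
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS news (
        news_id        INT AUTO_INCREMENT PRIMARY KEY,  -- assumed surrogate key
        news_title     VARCHAR(255),
        news_content   TEXT,
        type_id        INT,            -- 1-8, assigned randomly by saveDate()
        news_creatTime VARCHAR(64),    -- date + time string scraped from the page
        news_recourse  VARCHAR(128),   -- article source string
        news_link      VARCHAR(255)
    ) DEFAULT CHARSET = utf8mb4
""")
db.commit()
db.close()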


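Both requests.get calls run without a timeout, so a single hung connection can stall the whole crawl. A small hardened fetch helper is sketched below; the retry count, timeout, and backoff values are illustrative assumptions, not from the original post.

# fetch with timeout + simple retry; the specific values are illustrative
import time as time_mod

import requests

def fetch(url, headers, retries=3, timeout=10):
    """Return the decoded page text, or None if every attempt fails."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=timeout)
            resp.encoding = 'utf8'
            resp.raise_for_status()  # treat HTTP error codes as failures too
            return resp.text
        except requests.RequestException:
            time_mod.sleep(2 ** attempt)  # simple exponential backoff: 1s, 2s, 4s
    return None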


Original article: https://blog.csdn.net/roccreed/article/details/143499950
