[JS Reverse Engineering Course, Lesson 15: Scrapy Basics]

Published 2024-07-25 20:09 · Tags: javascript, scrapy, programming languages

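Introduction

What is a framework?

Put simply, a framework is a half-finished project: one that already integrates a range of features and is general enough to be reused across many tasks.

Scrapy is a well-known and powerful application framework written for crawling websites and extracting structured data. As a framework, it is a highly general project template that already bundles high-performance asynchronous downloading, request queues, parsing, persistence, and more (distributed crawling is available through extensions such as scrapy-redis). Learning a framework mostly means learning its characteristics and how to use each of its features.

How should a beginner approach a framework?

Just learn how to use the features the framework already provides. Do not dig into the framework's source code early on.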

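Installation

Linux/macOS:
      pip install scrapy (from any directory)

Windows (with a recent Python and pip, step e alone is often enough; steps a to d are the manual fallback if it fails):

      a. pip install wheel (from any directory)

      b. Download the Twisted wheel file from: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

      c. In a terminal, change to the download directory and run pip install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
      Note: if this step fails, try the .whl file for a different version (the cpXX tag has to match your Python version)

      d. pip install pywin32 (from any directory)

      e. pip install scrapy (from any directory)

After installation, type scrapy in a terminal and press Enter. If the terminal does not report that the command cannot be found, the installation succeeded.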

Basic usage

Creating a project

scrapy startproject <project_name>

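Project directory structure:

firstBlood   # outer project folder; opening this folder in PyCharm is recommended
    ├── firstBlood  # the project's Python package (same name as the project)
    │   ├── __init__.py
    │   ├── items.py  # item (data structure) definitions
    │   ├── middlewares.py  # all middlewares
    │   ├── pipelines.py  # all pipelines
    │   ├── settings.py  # crawler settings
    │   └── spiders  # spiders folder; spider code will be added here later
    │       └── __init__.py
    └── scrapy.cfg  # Scrapy project configuration; do not delete or modify it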

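Creating a spider file:
- cd project_name (enter the project directory)
- scrapy genspider <spider_name> <start_url> (the spider name is any name you choose)
  - (for example: scrapy genspider first www.xxx.com)
- On success, a .py spider file is generated in the spiders folder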

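Writing the spider file

Understanding the parts of a spider file

import scrapy


class BiliSpider(scrapy.Spider):
    # Name of the spider; it uniquely identifies this spider within the project
    name = 'bili'
    # Allowed domains (commented out here, so no domain restriction applies)
    # allowed_domains = ['www.baidu.com']
    # Start URL list: URLs placed here are requested automatically. By default Scrapy sends a GET request to each one
    start_urls = ['https://www.baidu.com/','https://www.sogou.com']
    # Data parsing happens here
    # The response parameter is the response object for one request
    # parse is called once per response, so the number of calls depends on the number of requests
    def parse(self, response):
        print(response)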

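Editing the configuration file: settings.py
- Do not obey the robots protocol: ROBOTSTXT_OBEY = False
- Set the log level to output: LOG_LEVEL = 'ERROR'
- Set the User-Agent: USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36'

Running the project

scrapy crawl <spider_name>: runs the spider and prints its log output (recommended). The spider name is the value of the spider's name attribute, for example scrapy crawl bili.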

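Data parsing

Note: if the terminal is still inside the first project's folder, run cd ../ to return to the parent directory before creating another project.
Create a new project for data parsing:
- Create the project: scrapy startproject <project_name>
- cd <project_name>
- Create the spider file: scrapy genspider <spider_name> www.xxx.com
Edit the configuration file: settings.py
- Do not obey the robots protocol: ROBOTSTXT_OBEY = False
- Set the log level to output: LOG_LEVEL = 'ERROR'
- Set the User-Agent: USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36'

Write the spider file: spiders/bilibili.py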

import scrapy

# Scrape the video titles from a bilibili search results page
class BiliSpider(scrapy.Spider):
    # Name of the spider; it uniquely identifies this spider within the project
    name = 'bili'
    # Allowed domains
    # allowed_domains = ['www.baidu.com']
    # Start URL list: URLs placed here are requested automatically. By default Scrapy sends a GET request to each one
    start_urls = ['https://search.bilibili.com/all?vt=40586385&keyword=%E7%9F%A5%E8%AF%86%E5%9B%BE%E8%B0%B1&from_source=webtop_search&spm_id_from=333.1007&search_source=5']
    # Data parsing happens here
    # The response parameter is the response object for the request
    # parse is called once per response
    def parse(self, response):
        # xpath can be called directly on the response object
        div_list = response.xpath('//*[@id="i_cecream"]/div/div[2]/div[2]/div/div/div/div[2]/div/div')
        for div in div_list:
            # Note: in Scrapy, xpath() returns Selector objects, not the extracted strings
            # extract() pulls the string data out of a Selector
            # title = div.xpath('./div/div[2]/div/div/a/h3/@title')[0].extract()
            # up_name = div.xpath('./div/div[2]/div/div/p/a/span[1]/text()')[0].extract()

            # extract_first() works like [0].extract(), but returns None instead of raising IndexError when nothing matches
            title = div.xpath('./div/div[2]/div/div/a/h3/@title').extract_first()
            up_name = div.xpath('./div/div[2]/div/div/p/a/span[1]/text()').extract_first()

            # Calling extract() directly on the xpath result returns a list of strings
            # up_name = div.xpath('./div/div[2]/div/div/p/a/span[1]/text()').extract()

            # extract(): use when the xpath matches multiple values
            # extract_first(): use when the xpath matches a single value
            # (newer Scrapy versions also provide getall() and get() as equivalent names)
            print(title,up_name)
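
Because extract_first() returns None when the XPath matches nothing, later string operations on the result can fail. The following is a minimal sketch of a small normalizing helper; clean_text is a hypothetical name, not part of Scrapy, and the sample values are hand-written stand-ins for what extract_first() might return.

```python
def clean_text(value, default=""):
    """Normalize a value returned by extract_first().

    Hypothetical helper (not part of Scrapy): returns `default` when the
    XPath matched nothing (value is None), otherwise the stripped string.
    """
    if value is None:
        return default
    return value.strip()


# Hand-written values standing in for extract_first() results
title = clean_text("  Knowledge graph intro \n")
up_name = clean_text(None, default="unknown")
print(title, up_name)
```

Inside parse, the same helper could wrap each extract_first() call, so that a changed page layout produces empty fields rather than an exception.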

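Persistent storage

Two approaches:

Persistent storage via terminal command
Persistent storage via pipelines (recommended)

Persistent storage via terminal command

This approach can only store the return value of the parse method, written to a file with one of the supported extensions.

Workflow:

In the spider file, collect all scraped data into the return value of the parse method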

import scrapy


class BiliSpider(scrapy.Spider):
    name = 'bili'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://search.bilibili.com/all?vt=40586385&keyword=%E7%9F%A5%E8%AF%86%E5%9B%BE%E8%B0%B1&from_source=webtop_search&spm_id_from=333.1007&search_source=5']
    # Terminal-command persistence: only the return value of parse can be stored, in a file with a supported extension
    def parse(self, response):
        div_list = response.xpath('//*[@id="i_cecream"]/div/div[2]/div[2]/div/div/div/div[2]/div/div')
        all_data = []
        for div in div_list:
            up_name = div.xpath('./div/div[2]/div/div/p/a/span[1]/text()').extract_first()
            title = div.xpath('./div/div[2]/div/div/a/h3/@title').extract_first()
            dic = {}
            dic['title'] = title
            dic['name'] = up_name
            all_data.append(dic)
        return all_data  # all_data holds all of the scraped records

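Store the return value of the parse method in a file with the chosen extension:
- scrapy crawl <spider_name> -o bilibili.csv

Summary:
- Pros: simple and convenient
- Cons: limited
  - Data can only be written to files, not to a database
  - The file extension must be one of the supported feed formats (json, jsonlines, jl, csv, xml, pickle, marshal); .csv is the usual choice
  - The data to store must be returned (or yielded) from the parse method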
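
To make the shape of the exported file concrete, here is a stand-alone sketch that uses Python's csv module to show roughly how a list of dicts like all_data maps to CSV rows. It does not call Scrapy's own exporter, and the sample records are made up for illustration.

```python
import csv
import io

# Made-up records in the same shape as all_data in the spider above
all_data = [
    {"title": "Knowledge graph basics", "name": "up_a"},
    {"title": "Graph databases, explained", "name": "up_b"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "name"])
writer.writeheader()        # first row: the column names
writer.writerows(all_data)  # one CSV row per dict

csv_text = buffer.getvalue()
print(csv_text)
```

Each dict key becomes a column and each dict becomes a row, which is why the spider has to return (or yield) dict-like records for the -o option to have something to write.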

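Persistent storage via pipelines

Pros: flexible; items can be written to any target (files or databases), and resources such as file handles or connections are opened once per crawl rather than once per item

Cons: more steps to code

Workflow

1. Parse the data in the spider file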

import scrapy
from biliSavePro.items import BilisaveproItem

class BiliSpider(scrapy.Spider):
    name = 'bili'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://search.bilibili.com/all?vt=40586385&keyword=%E7%9F%A5%E8%AF%86%E5%9B%BE%E8%B0%B1&from_source=webtop_search&spm_id_from=333.1007&search_source=5']

    def parse(self, response):
        div_list = response.xpath('//*[@id="i_cecream"]/div/div[2]/div[2]/div/div/div/div[2]/div/div')
        for div in div_list:
            up_name = div.xpath('./div/div[2]/div/div/p/a/span[1]/text()').extract_first()
            title = div.xpath('./div/div[2]/div/div/a/h3/@title').extract_first()

2. Wrap the parsed data in an Item object

2.1 Define the fields in items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BilisaveproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Define the fields: one field for each value parsed in the spider
    title = scrapy.Field()
    up_name = scrapy.Field()

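2.2 In the spider file, import the Item class, instantiate an item object, and store the parsed data in it

    def parse(self, response):
        div_list = response.xpath('//*[@id="i_cecream"]/div/div[2]/div[2]/div/div/div/div[2]/div/div')
        for div in div_list:
            up_name = div.xpath('./div/div[2]/div/div/p/a/span[1]/text()').extract_first()
            title = div.xpath('./div/div[2]/div/div/a/h3/@title').extract_first()
            # Create an item object
            item = BilisaveproItem()
            # Store the parsed data in the item
            item['title'] = title  # the parsed title goes into the item's title field
            item['up_name'] = up_name

            # Submit the item to the pipeline
            yield item

3. Submit the item object to the pipeline

# Submit the populated item object to the pipeline
yield item

4. Receive the item object in the pipeline (pipelines.py is the pipeline file)

The pipeline receives the item objects that the spider yields. Item instances are used here; recent Scrapy versions also accept plain dicts, but other values such as bare strings are not passed to pipelines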

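class BilisaveproPipeline:
    # process_item receives each item object passed from the spider file
    # The item parameter is the item object received by the pipeline
    def process_item(self, item, spider):
        print(item)
        return item

5. Persist the received data in the pipeline in whatever form you need

The data can be stored in a file or in a database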

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


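class BilisaveproPipeline:
    fp = None
    # Scrapy calls open_spider, process_item and close_spider by name; no parent class is required
    def open_spider(self,spider):
        # Called once when the spider starts, before any process_item call
        print('i am open_spider()')
        # The encoding is set explicitly so that non-ASCII titles are written correctly on every platform
        self.fp = open('bili.txt','w',encoding='utf-8')

    # Receives each item object submitted by the spider file
    # The item parameter is the item submitted by the spider file
    # process_item is called once for every item the spider submits
    def process_item(self, item, spider):
        # print(item['title'],item['up_name'])
        self.fp.write(item['up_name']+':'+item['title']+'\n')
        print('Data saved successfully!')
        return item

    def close_spider(self,spider):
        print('i am close_spider()')
        # Called once when the spider closes, after the last process_item call
        self.fp.close()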
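
As a variation on the file pipeline above, the sketch below writes one JSON object per line and tolerates fields that are None. JsonLinesPipeline, its path argument, and the sample item are assumptions made for illustration; the class needs nothing from Scrapy, so it can be driven by hand as shown at the end.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path="bili.jsonl"):
        # Without a from_crawler method, Scrapy builds the pipeline with no
        # arguments, so the default path would apply during a real crawl
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts;
        # missing (None or empty) values are replaced with an empty string
        record = {key: (value or "") for key, value in dict(item).items()}
        self.fp.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.fp.close()


# Drive the pipeline by hand, in the order Scrapy would call the methods
out_path = os.path.join(tempfile.mkdtemp(), "bili.jsonl")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "知识图谱入门", "up_name": None}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as fp:
    saved = [json.loads(line) for line in fp]
print(saved)
```

One JSON object per line keeps the file appendable and easy to read back record by record, which a single large JSON array would not be.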

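6. Enable the pipeline mechanism in the configuration file

Note: pipelines are disabled by default and must be enabled manually in the configuration file
Uncommenting ITEM_PIPELINES in settings.py enables the pipeline mechanism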
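
For the project in this section, the uncommented setting would look roughly like the fragment below. The dotted path assumes the project package is named biliSavePro (as in the import in the spider file) and that the pipeline class is BilisaveproPipeline; adjust both to your own names.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # "<package>.pipelines.<ClassName>": priority
    # Priorities are conventionally in the 0-1000 range; lower values run first
    "biliSavePro.pipelines.BilisaveproPipeline": 300,
}
```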

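Going deeper with pipelines

How to store data in a database
- Note: one pipeline class is responsible for storing data in one specific target. To store scraped data in several targets or databases, define several pipeline classes.
Question:
- With several pipeline classes, is an item submitted by the spider delivered to every pipeline class at once, or to a single one?
  - The spider submits the item only to the pipeline class with the highest priority. That class's process_item must end with return item, which passes the item on to the next pipeline class; only then can the next class receive the item and store it.
Pipeline classes: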

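import pymysql  # pip install pymysql
# pymysql lets a Python program connect to a local or remote MySQL database
class BiliprodbPipeline:
    conn = None  # MySQL connection object
    cursor = None  # cursor object
    def open_spider(self,spider):
        # Create the connection object
        self.conn = pymysql.Connect(
            host='127.0.0.1',  # IP address of the database server
            port=3306,  # default MySQL port
            user='root',  # MySQL user name
            password='boboadmin',  # MySQL password
            db='db001',
            charset='utf8'
        )
        # Create the cursor object, which executes SQL statements
        self.cursor = self.conn.cursor()
    # Store the data in the MySQL database
    def process_item(self, item, spider):
        # Parameterized query: pymysql escapes the values, so quotes inside a title cannot break the statement
        sql = 'insert into bili values (%s,%s)'
        self.cursor.execute(sql,(item['up_name'],item['title']))
        self.conn.commit()  # commit the transaction
        print('Data stored in MySQL......')
        return item  # the item is returned to the next pipeline class to run

    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()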

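# Persist the data in Redis
import json
from redis import Redis
class BiliprodbPipelineRedis:
    conn = None
    def open_spider(self,spider):
        self.conn = Redis(
            host='127.0.0.1',
            port=6379
        )
    def process_item(self, item, spider):
        # An item behaves like a dict, but Redis only stores bytes, strings and numbers,
        # so the item is serialized to a JSON string before it is pushed onto the list
        self.conn.lpush('bili',json.dumps(dict(item),ensure_ascii=False))
        print('Data stored in Redis......')
        return item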


import pymongo
class MongoPipeline:
    conn = None  # connection object
    db_sanqi = None  # database
    def open_spider(self,spider):
        self.conn = pymongo.MongoClient(
            host='127.0.0.1',
            port=27017
        )
        self.db_sanqi = self.conn['sanqi']
    def process_item(self,item,spider):
        self.db_sanqi['xiaoshuo'].insert_one({'title':item['title']})
        print('Inserted successfully!')
        return item
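
Scraped values such as titles can contain quotes, which is why passing them as SQL parameters is safer than building the statement with string formatting. The sketch below illustrates the idea with Python's built-in sqlite3 module as a stand-in for MySQL, so it runs without a database server; the table layout and the sample item are made up.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table `bili`
conn = sqlite3.connect(":memory:")
conn.execute("create table bili (up_name text, title text)")

# A made-up item whose title contains double quotes
item = {"up_name": "up_a", "title": 'He said "hello"'}

# Parameterized insert: the driver handles quoting of the values.
# sqlite3 uses ? placeholders; pymysql uses %s for the same purpose.
conn.execute("insert into bili values (?, ?)", (item["up_name"], item["title"]))
conn.commit()

rows = conn.execute("select up_name, title from bili").fetchall()
print(rows)
conn.close()
```

With pymysql the equivalent call would be cursor.execute(sql, params) with %s placeholders in sql.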

Configuration file:

ITEM_PIPELINES = {
   # The number after each pipeline class is its priority: the smaller the number, the higher the priority, and the earlier that pipeline class runs
   'biliProDB.pipelines.BiliprodbPipeline': 300,
   'biliProDB.pipelines.BiliprodbPipelineRedis': 301,
   'biliProDB.pipelines.MongoPipeline': 302
}
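
The ordering and the role of return item can be illustrated with a small pure-Python simulation. This is not Scrapy's actual implementation, only a sketch of the behaviour described above; the three stand-in classes and run_pipelines are hypothetical names.

```python
# Hypothetical stand-ins for pipeline classes; each records that it ran
class MysqlLike:
    def process_item(self, item, spider):
        item["seen_by"].append("mysql")
        return item  # hand the item to the next pipeline


class RedisLike:
    def process_item(self, item, spider):
        item["seen_by"].append("redis")
        return item


class MongoLike:
    def process_item(self, item, spider):
        item["seen_by"].append("mongo")
        return item


def run_pipelines(item, pipelines_with_priority):
    """Pass one item through the pipelines in ascending priority order."""
    ordered = sorted(pipelines_with_priority.items(), key=lambda pair: pair[1])
    for pipeline, _priority in ordered:
        item = pipeline.process_item(item, spider=None)
    return item


# Priorities mirror the ITEM_PIPELINES values, deliberately listed out of order
priorities = {MongoLike(): 302, MysqlLike(): 300, RedisLike(): 301}
result = run_pipelines({"title": "demo", "seen_by": []}, priorities)
print(result["seen_by"])
```

If one of the process_item methods forgot its return statement, the next pipeline in this simulation would receive None instead of the item, which mirrors why every pipeline except the last needs return item.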

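Scraping multimedia resources with Scrapy

Use a dedicated pipeline class, ImagesPipeline (it depends on the Pillow library; for non-image files such as video, FilesPipeline is the analogous class)

Workflow:

1. Extract the image/video links in the spider file
2. Wrap the extracted links in an item object and submit it to the pipeline

3. In the pipeline file, define a custom pipeline class whose parent is ImagesPipeline, and override three methods:

def get_media_requests(self, item, info): receives the item object submitted by the spider file and yields a network request for each image URL; the pipeline then downloads the binary image data

def file_path(self, request, response=None, info=None, *, item=None): sets the file name under which the image is saved
def item_completed(self, results, item, info): returns the item object to the next pipeline class

4. Enable the pipeline in the configuration file, and set IMAGES_STORE = 'girlsLib' to choose the folder where images are stored.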

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import scrapy
from itemadapter import ItemAdapter

from scrapy.pipelines.images import ImagesPipeline

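# The custom pipeline class must inherit from ImagesPipeline
class mediaPileline(ImagesPipeline):
    # Override three parent methods to request the binary image data and persist it
    # Sends a request for the image URL so that the image data can be fetched
    # The item parameter is the received item object
    def get_media_requests(self, item, info):
        img_src = item['src']
        yield scrapy.Request(img_src)
    # Sets the image file name (just return the name to store the image under)
    def file_path(self, request, response=None, info=None, *, item=None):
        imgName = request.url.split('/')[-1]
        print('Saving image as',imgName)
        return imgName
    # This method can be omitted if there is no next pipeline class
    def item_completed(self, results, item, info):
        return item  # passes the item received by this pipeline on to the next pipeline class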
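
Taking the last segment of the URL as the file name works for simple URLs, but a query string or a trailing slash would leak into (or empty) the name. The sketch below shows one possible way to derive the name from the URL path only; image_name_from_url, the hash fallback with a .jpg suffix, and the sample URLs are assumptions for illustration.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def image_name_from_url(url):
    """Hypothetical helper: derive a file name from an image URL.

    Only the path part of the URL is used, so a query string such as
    ?w=200 does not end up in the file name. When the path has no file
    name, a SHA-1 hash of the URL is used instead.
    """
    name = posixpath.basename(urlparse(url).path)
    if not name:
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".jpg"
    return name


print(image_name_from_url("https://www.xxx.com/imgs/cat.jpg?w=200"))
print(image_name_from_url("https://www.xxx.com/imgs/"))
```

Inside file_path, the helper would be called as image_name_from_url(request.url) in place of the split('/') expression.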

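Deep crawling with Scrapy

How to crawl multiple pages (whole-site crawling)

Sending requests manually:

# callback specifies the method that parses the response
yield scrapy.Request(url=new_url,callback=self.parse)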
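
Where new_url comes from depends on the site. A common case is a page number embedded in the URL; the sketch below generates such URLs from a template. page_urls and the listing URL are hypothetical and stand in for whatever paging scheme the target site uses.

```python
def page_urls(url_template, first_page, last_page):
    """Yield one URL per page number, from first_page to last_page inclusive."""
    for page in range(first_page, last_page + 1):
        yield url_template.format(page=page)


# Hypothetical listing URL. Inside a spider's parse method, each new_url
# would be passed to: yield scrapy.Request(url=new_url, callback=self.parse)
urls = list(page_urls("https://www.xxx.com/list?page={page}", 2, 4))
print(urls)
```

Since parse is also the callback for the later pages, the same URLs may be generated more than once; Scrapy's default duplicate filter skips requests for URLs it has already seen.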

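How to crawl data stored at depth
- "Depth" simply means that the data to scrape is not all on the same page (for example, a list page plus a detail page for each entry).
- Request parameter passing is required to implement this completely.
  - Request parameter passing:
    - The meta dictionary is passed along to the callback function given by callback, where it is available as response.meta:
```
yield scrapy.Request(meta={},url=detail_url,callback=self.parse_detail)
```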

import scrapy
from ..items import DeepproItem

class DeepSpider(scrapy.Spider):
    name = 'deep'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://wz.sun0769.com/political/index/politicsNewest']
    # Parse the data on the list (first) page
    def parse(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./span[3]/a/text()').extract_first()
            detail_url = 'https://wz.sun0769.com'+li.xpath('./span[3]/a/@href').extract_first()
            # print(title)
            item = DeepproItem()
            item['title'] = title
            # Send a request to the detail page URL
            # The meta parameter passes its dictionary to the callback function given by callback
            yield scrapy.Request(meta={'item':item},url=detail_url,callback=self.parse_detail)
    # Parse the data on the detail page
    def parse_detail(self,response):
        meta = response.meta  # receive the meta dictionary passed with the request
        item = meta['item']
        content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]//text()').extract()
        content = ''.join(content)
        # print(content)
        item['content'] = content

        yield item
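
The detail URL in the spider is built by string concatenation, which fails if the XPath matches nothing and extract_first() returns None. A minimal sketch of a more defensive alternative, using urljoin from the standard library, follows; build_detail_url is a hypothetical helper and the href value is made up.

```python
from urllib.parse import urljoin


def build_detail_url(page_url, href):
    """Hypothetical helper: resolve an extracted href against the page URL.

    Returns None when the XPath matched nothing, so the caller can skip
    the row instead of failing on string concatenation with None.
    """
    if href is None:
        return None
    return urljoin(page_url, href)


page_url = "https://wz.sun0769.com/political/index/politicsNewest"
# The href below is made up for illustration
print(build_detail_url(page_url, "/political/politics/index?id=1"))
print(build_detail_url(page_url, None))
```

Inside a spider, response.urljoin(href) performs the same join against the URL of the current response.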

Original article: https://blog.csdn.net/weixin_50556117/article/details/140598756
