[Web Scraping] Crawling JD Hot Sale (re.jd.com) Product Listings with Firecrawl (For Learning Purposes Only)
Project repository
Firecrawl is mostly used for building knowledge bases for LLMs: it is one step of the data-preparation pipeline for large models (you will run into it in Dify), and it also serves as one component of Retrieval Augmented Generation (RAG), currently the most popular pattern for LLM applications.
Online playground
After logging in with your email (or another method), click Dashboard.
You will then see your own API Key.
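Hard-coding the key into scripts makes it easy to leak; one option is to read it from an environment variable instead. A minimal sketch (the variable name FIRECRAWL_API_KEY is just a convention chosen here, not something the SDK requires):

import os
from firecrawl import FirecrawlApp

# Read the key from the environment instead of pasting it into the source file
api_key = os.environ["FIRECRAWL_API_KEY"]
app = FirecrawlApp(api_key=api_key)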
Analysis
1. Example with the Firecrawl Python SDK
How to use it:
1. Get your own API Key from the official site (the free tier allows 500 requests).
2. Open the Python SDK folder in the firecrawl repository.
You can create .py files inside it for testing (some of the files starting with test_ were written by the author for testing); a minimal install-and-test sketch follows this list.
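If you only want to call the hosted API, the full repository is not required; installing the SDK from PyPI is enough. Below is a minimal sketch, assuming the package is still published on PyPI as firecrawl-py and that scrape_url accepts a params dict in the same style as the crawl_url calls later in this post:

# Install first:  pip install firecrawl-py
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")  # replace with the key from your Dashboard

# Smoke test: scrape a single search page and print whatever the SDK returns
result = app.scrape_url(
    'https://re.jd.com/search?keyword=CCD&enc=utf-8',
    params={'formats': ['markdown']}
)
print(result)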
2. Analyzing the search URL (please do not crawl in bulk; it can disrupt the site's normal operation)
https://re.jd.com/search?keyword=CCD&enc=utf-8
Searching with a different keyword changes the keyword=CCD part of the address bar, so we can load a different results page simply by changing keyword=? in the URL. (You can try this first in the online playground.)
A simple version in code looks like this:
keyword = "Dangerous people"
try:
# Crawl a website:
crawl_status = app.crawl_url(
# 京东
f'https://re.jd.com/search?keyword={keyword}&enc=utf-8',
params={
'limit': 10,
'scrapeOptions': {'formats': ['markdown', 'html']}
},
)
3. Complete crawler code
Once your environment is set up, just replace the API Key and the keyword with your own.
from firecrawl import FirecrawlApp
import json
import urllib.parse
import re
from bs4 import BeautifulSoup


def get_value_in_html(text):
    """Extract product data from the HTML with BeautifulSoup."""
    soup = BeautifulSoup(text, 'html.parser')
    items = soup.find_all('li', {'clstag': lambda x: x and 'ri_same_recommend' in x})
    result = []
    for item in items:
        # Use the src of the img.img_k tag inside div.pic as the product image link
        pic_img = item.find('div', class_='pic').find('img', class_='img_k')['src']
        pic_img = f"https:{pic_img}"
        a_tag = item.find('div', class_='li_cen_bot').find('a')
        if a_tag is None:
            continue
        product_link = a_tag['href']
        price = a_tag.find('div', class_='commodity_info').find('span', class_='price')
        if price is not None:
            price = price.text.strip()
        else:
            price = ''
        title = a_tag.find('div', class_='commodity_tit')
        if title is not None:
            title = title.text.strip()
        else:
            title = ''
        comment_span = a_tag.find('div', class_='comment').find('span', class_='praise')
        if comment_span is not None:
            evaluation = comment_span.text.strip()
        else:
            evaluation = ''
        product_info = {
            'price': price,
            'title': title,
            'evaluation': evaluation
        }
        result.append({
            "pic_img": pic_img,
            "product_link": product_link,
            "product_info": product_info
        })
    return result


API_KEY = "..."
# Use the API_KEY shown on your Dashboard after logging in
app = FirecrawlApp(api_key=API_KEY)

if __name__ == '__main__':
    # Change the keyword to search for something else
    keyword = "Dangerous people"
    # keyword_encode = urllib.parse.quote(keyword)
    try:
        # Crawl the JD Hot Sale search page
        crawl_status = app.crawl_url(
            f'https://re.jd.com/search?keyword={keyword}&enc=utf-8',
            params={
                'limit': 10,
                'scrapeOptions': {'formats': ['markdown', 'html']}
            },
        )
        # markdown = crawl_status['data'][0]['markdown']
        # Extract fields from the HTML result (parsing done with bs4)
        html = crawl_status['data'][0]['html']
        response = {
            "result_list": get_value_in_html(html)
        }
        print(json.dumps(response, ensure_ascii=False))
    except Exception as e:
        print(f"Firecrawl connection failed: {e}")
4. Returned crawl data
The returned JSON data looks like the following.
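The actual items depend on the keyword and on what the site serves at the time; the shape of each entry follows get_value_in_html above. With placeholder values (not real crawl results) it looks roughly like this:

{
  "result_list": [
    {
      "pic_img": "https://img.example.com/placeholder.jpg",
      "product_link": "https://item.jd.com/placeholder.html",
      "product_info": {
        "price": "¥0.00",
        "title": "placeholder product title",
        "evaluation": "0+"
      }
    }
  ]
}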
Original article: https://blog.csdn.net/steamedobun/article/details/143887916