Week L8: Machine Learning | Random Forest

Published 2024-10-03 18:45. Tags: machine learning, random forest, artificial intelligence, neural networks, data mining

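This post is a study log from the 365天深度学习训练营 (365-day deep learning training camp)
Original author: K同学啊

Task:
● Understand the general principle of random forests; knowing roughly what they are is enough. (You will rarely need them later unless you specialize in classical, non-deep-learning machine learning, in which case the research literature covers the underlying mathematics.)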

1. What is a random forest?

Random Forest (RF) is an ensemble algorithm built from decision trees using the Bagging method, and it performs well in many situations.
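Bagging (bootstrap aggregating) trains each tree on its own bootstrap sample: rows drawn from the training set with replacement. The snippet below is a minimal sketch of that sampling step only, using the standard library; the helper name `bootstrap_sample` and the toy row list are assumptions for illustration, not part of the original project.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(15)   # fixed seed so the sketch is repeatable
rows = list(range(1000))  # stand-in for 1000 training rows (toy data)

sample = bootstrap_sample(rows, rng)
unique_fraction = len(set(sample)) / len(rows)

# The sample has the same size as the original but contains duplicates.
# On average about 1 - 1/e (roughly 63%) of the original rows appear in it.
print(len(sample), round(unique_fraction, 2))
```

Because every tree sees a slightly different sample, the trees make different errors, which is what makes combining them worthwhile.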

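Suppose the base learner (individual learner) of the random forest is a decision tree like the following:

[Figure: a single decision tree]
A random forest then has the structure shown below: it consists of many decision trees, and the trees are independent of each other. For a classification task, each new input sample is passed to every tree in the forest, each tree produces its own class prediction, and the forest returns the class that receives the most votes as its final result.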

[Figure: a random forest made up of many decision trees]
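The voting step described above can be sketched in a few lines. This is an illustration of plain majority voting, with a hypothetical helper `majority_vote` and made-up tree outputs. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this is the textbook idea, not that library's exact procedure.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees.

    Ties go to the class that appears first in the input.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical outputs of five trees for one input sample
votes = ["Rainy", "Sunny", "Rainy", "Cloudy", "Rainy"]
print(majority_vote(votes))  # Rainy
```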

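2. Loading the data

This project uses a synthetic weather dataset that simulates four weather types: rainy, sunny, cloudy, and snowy. The analysis handles outliers and explores the data with descriptive statistics, then builds a random forest model for prediction and plots the model's feature importances. It is intended for beginners learning how to carry out an end-to-end data analysis and build a machine learning model. The dataset fields are described below:

[Figure: dataset field descriptions]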

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

data = pd.read_csv('./L7/weather_classification_data.csv')
data

Output:

	Temperature	Humidity	Wind Speed	Precipitation (%)	Cloud Cover	Atmospheric Pressure	UV Index	Season	Visibility (km)	Location	Weather Type
0	14.0	73	9.5	82.0	partly cloudy	1010.82	2	Winter	3.5	inland	Rainy
1	39.0	96	8.5	71.0	partly cloudy	1011.43	7	Spring	10.0	inland	Cloudy
2	30.0	64	7.0	16.0	clear	1018.72	5	Spring	5.5	mountain	Sunny
3	38.0	83	1.5	82.0	clear	1026.25	7	Spring	1.0	coastal	Sunny
4	27.0	74	17.0	66.0	overcast	990.67	1	Winter	2.5	mountain	Rainy
...	...	...	...	...	...	...	...	...	...	...	...
13195	10.0	74	14.5	71.0	overcast	1003.15	1	Summer	1.0	mountain	Rainy
13196	-1.0	76	3.5	23.0	cloudy	1067.23	1	Winter	6.0	coastal	Snowy
13197	30.0	77	5.5	28.0	overcast	1012.69	3	Autumn	9.0	coastal	Cloudy
13198	3.0	76	10.0	94.0	overcast	984.27	0	Winter	2.0	inland	Snowy
13199	-5.0	38	0.0	92.0	overcast	1015.37	5	Autumn	10.0	mountain	Rainy

13200 rows × 11 columns

3. Data inspection and preprocessing

# Inspect column types and non-null counts
data.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13200 entries, 0 to 13199
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Temperature           13200 non-null  float64
 1   Humidity              13200 non-null  int64  
 2   Wind Speed            13200 non-null  float64
 3   Precipitation (%)     13200 non-null  float64
 4   Cloud Cover           13200 non-null  object 
 5   Atmospheric Pressure  13200 non-null  float64
 6   UV Index              13200 non-null  int64  
 7   Season                13200 non-null  object 
 8   Visibility (km)       13200 non-null  float64
 9   Location              13200 non-null  object 
 10  Weather Type          13200 non-null  object 
dtypes: float64(5), int64(2), object(4)
memory usage: 1.1+ MB

# List the unique values of each categorical column
characteristic = ['Cloud Cover','Season','Location','Weather Type']
for i in characteristic:
    print(f'{i}:')
    print(data[i].unique())
    print('-'*50)

Output:

Cloud Cover:
['partly cloudy' 'clear' 'overcast' 'cloudy']
--------------------------------------------------
Season:
['Winter' 'Spring' 'Summer' 'Autumn']
--------------------------------------------------
Location:
['inland' 'mountain' 'coastal']
--------------------------------------------------
Weather Type:
['Rainy' 'Cloudy' 'Sunny' 'Snowy']
--------------------------------------------------

import matplotlib.pyplot as plt
import seaborn as sns

# SimHei is only needed when plot labels contain Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
# Keep minus signs rendering correctly with that font
plt.rcParams['axes.unicode_minus'] = False

# Numeric columns mapped to the display names used in the plot titles
feature_map = {
    'Temperature': 'temperature',
    'Humidity': 'humidity (%)',
    'Wind Speed': 'wind speed',
    'Precipitation (%)': 'precipitation (%)',
    'Atmospheric Pressure': 'atmospheric pressure',
    'UV Index': 'UV index',
    'Visibility (km)': 'visibility'
}
plt.figure(figsize=(15, 10))

# One box plot per numeric column, laid out on a 2 x 4 grid
for i, (col, col_name) in enumerate(feature_map.items(), 1):
    plt.subplot(2, 4, i)
    sns.boxplot(y=data[col])
    plt.title(f'Box plot of {col_name}', fontsize=14)
    plt.ylabel('Value', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

Output:

[Figure: box plots of the seven numeric features]

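1. Temperature contains many values beyond what is physically plausible. Values above 60°C are treated as outliers here and need to be handled.
2. Humidity and precipitation are percentages, so values above 100% are treated as outliers and need to be handled.
3. High wind speeds may come from extreme events such as typhoons or tornadoes, so they are left as-is.
4. Atmospheric pressure outliers may be caused by high-altitude locations or weather phenomena such as low-pressure systems, so they are also left as-is.
5. Low visibility may be due to haze, rain, or snow. Such values are normal under those conditions, so they are left as-is.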
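For context, the points a box plot draws as outliers come from a purely statistical rule: by default seaborn marks values more than 1.5 × IQR beyond the quartiles. The list above replaces that rule with domain thresholds. The sketch below shows the statistical rule on its own; the helper `iqr_fences` and the toy temperature list are assumptions for illustration.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule used by box plots."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy temperatures in °C; 109 stands in for an implausible reading
temps = [4, 10, 14, 21, 27, 30, 38, 39, 109]

low, high = iqr_fences(temps)
outliers = [t for t in temps if t < low or t > high]
print(low, high, outliers)  # -22.0 74.0 [109]
```

A statistical fence and a domain threshold can disagree: a strong wind gust may sit outside the fence and still be a valid observation, which is why wind speed, pressure, and visibility are kept.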

# Count the rows that exceed each threshold and report their share of the dataset
checks = {
    'Temperature above 60°C': data['Temperature'] > 60,
    'Humidity above 100%': data['Humidity'] > 100,
    'Precipitation (%) above 100%': data['Precipitation (%)'] > 100,
}
for label, mask in checks.items():
    count = int(mask.sum())
    print(f"{label}: {count} rows ({round(count / data.shape[0] * 100, 2)}%)")

Output:

Temperature above 60°C: 207 rows (1.57%)
Humidity above 100%: 416 rows (3.15%)
Precipitation (%) above 100%: 392 rows (2.97%)

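The outliers make up a small share of the data, so they can either be deleted or capped (for example, setting percentages above 100% to 100%). To keep the dataset consistent and accurate, the rows are deleted here, which prevents them from distorting the analysis or the model training.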

# Keep only rows that satisfy all three thresholds
print("Shape before removal:", data.shape)
data = data[(data['Temperature'] <= 60) & (data['Humidity'] <= 100) & (data['Precipitation (%)'] <= 100)]
print("Shape after removal:", data.shape)

Output:

Shape before removal: (13200, 11)
Shape after removal: (12360, 11)
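The other option mentioned above, capping percentages at 100 instead of deleting rows, trades fewer lost rows for altered values. The sketch below compares the two on four made-up rows using plain Python; the helpers `drop_outliers` and `cap_percentages` are hypothetical names. With pandas, the capping step would typically be written with `Series.clip(upper=100)`.

```python
# Toy rows; only the first one satisfies every threshold
rows = [
    {"Temperature": 14.0, "Humidity": 73, "Precipitation (%)": 82.0},
    {"Temperature": 25.0, "Humidity": 109, "Precipitation (%)": 71.0},
    {"Temperature": 78.0, "Humidity": 64, "Precipitation (%)": 16.0},
    {"Temperature": 30.0, "Humidity": 90, "Precipitation (%)": 105.0},
]
LIMITS = {"Temperature": 60, "Humidity": 100, "Precipitation (%)": 100}


def drop_outliers(rows, limits):
    """Keep only rows whose values are within every upper limit."""
    return [r for r in rows if all(r[col] <= hi for col, hi in limits.items())]


def cap_percentages(rows, columns=("Humidity", "Precipitation (%)"), upper=100):
    """Return copies of the rows with the percentage columns capped at `upper`."""
    return [{**r, **{col: min(r[col], upper) for col in columns}} for r in rows]


dropped = drop_outliers(rows, LIMITS)
capped = cap_percentages(rows)
print(len(dropped), len(capped))  # 1 4
```

Capping keeps all four rows, but the 78°C temperature still needs its own treatment, which is one reason deleting is the simpler choice here.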

4. Data analysis

# Summary statistics for numeric and categorical columns
data.describe(include='all')

Output:

	Temperature	Humidity	Wind Speed	Precipitation (%)	Cloud Cover	Atmospheric Pressure	UV Index	Season	Visibility (km)	Location	Weather Type
count	12360.000000	12360.000000	12360.000000	12360.000000	12360	12360.000000	12360.000000	12360	12360.000000	12360	12360
unique	NaN	NaN	NaN	NaN	4	NaN	NaN	4	NaN	3	4
top	NaN	NaN	NaN	NaN	overcast	NaN	NaN	Winter	NaN	mountain	Snowy
freq	NaN	NaN	NaN	NaN	5726	NaN	NaN	5288	NaN	4535	3130
mean	18.071359	66.937460	9.356837	50.864968	NaN	1005.713743	3.791262	NaN	5.535801	NaN	NaN
std	15.804363	19.390333	6.318334	30.967846	NaN	38.300471	3.720638	NaN	3.377554	NaN	NaN
min	-24.000000	20.000000	0.000000	0.000000	NaN	800.120000	0.000000	NaN	0.000000	NaN	NaN
25%	4.000000	56.000000	5.000000	19.000000	NaN	994.587500	1.000000	NaN	3.000000	NaN	NaN
50%	21.000000	69.000000	8.500000	54.000000	NaN	1007.495000	2.000000	NaN	5.000000	NaN	NaN
75%	30.000000	81.000000	13.000000	79.000000	NaN	1016.750000	6.000000	NaN	7.500000	NaN	NaN
max	60.000000	100.000000	48.500000	100.000000	NaN	1199.210000	14.000000	NaN	20.000000	NaN	NaN

# Distribution of every column on a 3 x 4 grid
plt.figure(figsize=(20, 15))
plt.subplot(3, 4, 1)
sns.histplot(data['Temperature'], kde=True,bins=20)
plt.title('Temperature distribution')
plt.xlabel('Temperature')
plt.ylabel('Count')

plt.subplot(3, 4, 2)
sns.boxplot(y=data['Humidity'])
plt.title('Humidity (%) box plot')
plt.ylabel('Humidity (%)')

plt.subplot(3, 4, 3)
sns.histplot(data['Wind Speed'], kde=True,bins=20)
plt.title('Wind speed distribution')
plt.xlabel('Wind speed (km/h)')
plt.ylabel('Count')

plt.subplot(3, 4, 4)
sns.boxplot(y=data['Precipitation (%)'])
plt.title('Precipitation (%) box plot')
plt.ylabel('Precipitation (%)')

plt.subplot(3, 4, 5)
sns.countplot(x='Cloud Cover', data=data)
plt.title('Cloud cover distribution')
plt.xlabel('Cloud cover (description)')
plt.ylabel('Count')

plt.subplot(3, 4, 6)
sns.histplot(data['Atmospheric Pressure'], kde=True,bins=10)
plt.title('Atmospheric pressure distribution')
plt.xlabel('Pressure (hPa)')
plt.ylabel('Count')

plt.subplot(3, 4, 7)
sns.histplot(data['UV Index'], kde=True,bins=14)
plt.title('UV index distribution')
plt.xlabel('UV index')
plt.ylabel('Count')

plt.subplot(3, 4, 8)
Season_counts = data['Season'].value_counts()
plt.pie(Season_counts, labels=Season_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Season distribution')

plt.subplot(3, 4, 9)
sns.histplot(data['Visibility (km)'], kde=True,bins=10)
plt.title('Visibility distribution')
plt.xlabel('Visibility (km)')
plt.ylabel('Count')

plt.subplot(3, 4, 10)
sns.countplot(x='Location', data=data)
plt.title('Location distribution')
plt.xlabel('Location')
plt.ylabel('Count')

# The last panel spans grid cells 11 and 12
plt.subplot(3, 4, (11,12))
sns.countplot(x='Weather Type', data=data)
plt.title('Weather type distribution')
plt.xlabel('Weather type')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

Output:

[Figure: distribution plots for all eleven columns]

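● Temperature: most values fall in a plausible range (mainly 0°C to 40°C), and the extreme highs (>60°C) have been removed. The distribution is slightly left-skewed: the mean (18.1°C) is below the median (21°C), with a longer tail toward low temperatures.
● Humidity: values lie in a plausible range (20% to 100%). The median (69%) and mean (66.9%) are close, so the distribution is roughly symmetric.
● Wind speed: values concentrate in the low range (0-20 km/h) and extreme winds are rare. The distribution is right-skewed (mean 9.4 above median 8.5, maximum 48.5), so low wind speeds are the common case.
● Precipitation: values are spread fairly evenly, with a median of 54%, reflecting precipitation probability under a variety of weather conditions.
● Atmospheric pressure: most values sit in the typical range (990-1020 hPa). The full range runs from about 800 to 1199 hPa, so the extreme values seen in the box plots are still present because they were deliberately kept.
● UV index: mostly low (median 2), and very high values are rare (maximum 14), which suggests low UV risk most of the time.
● Visibility: concentrated around 5 km, which corresponds to moderate visibility in most cases.
● Cloud cover: overcast is the most frequent category (5726 of 12360 rows).
● Season: winter has the most rows (5288 of 12360). Because the dataset is synthetic, this likely reflects how the data was generated rather than a real collection season or regional climate.
● Location: rows come mostly from mountain and inland areas, which may influence the distribution of weather types and other features.
● Weather type: fairly balanced. The most frequent class (Snowy, 3130 rows) makes up about a quarter of the data, so no single class dominates.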
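A quick way to check skew direction from the `describe()` table is to compare the mean with the median: a mean below the median points to a longer left tail, a mean above it to a longer right tail. The sketch below applies Pearson's second skewness coefficient, 3 × (mean − median) / std, to the summary values printed earlier. It is a rough heuristic rather than a full skewness estimate, and the function name is an assumption.

```python
def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std


# (mean, median, std) copied from the describe() output above
summary = {
    "Temperature": (18.071359, 21.0, 15.804363),
    "Wind Speed": (9.356837, 8.5, 6.318334),
}

skews = {}
for name, (mean, median, std) in summary.items():
    skews[name] = pearson_median_skewness(mean, median, std)
    direction = "right" if skews[name] > 0 else "left"
    print(f"{name}: {skews[name]:+.2f} ({direction}-skewed)")
```

On these numbers temperature comes out negative (left-skewed) and wind speed positive (right-skewed).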

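5. Random forest

# Encode each categorical column as integers, keeping the fitted encoders
# so the integer codes can be mapped back to their labels later
new_data = data.copy()
label_encoders = {}
categorical_features = ['Cloud Cover', 'Season', 'Location', 'Weather Type']
for feature in categorical_features:
    le = LabelEncoder()
    new_data[feature] = le.fit_transform(data[feature])
    label_encoders[feature] = le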

# Print the integer code assigned to each label
for feature in categorical_features:
    print(f"Mapping for '{feature}':")
    for index, class_ in enumerate(label_encoders[feature].classes_):
        print(f"  {index}: {class_}")

Output:

Mapping for 'Cloud Cover':
  0: clear
  1: cloudy
  2: overcast
  3: partly cloudy
Mapping for 'Season':
  0: Autumn
  1: Spring
  2: Summer
  3: Winter
Mapping for 'Location':
  0: coastal
  1: inland
  2: mountain
Mapping for 'Weather Type':
  0: Cloudy
  1: Rainy
  2: Snowy
  3: Sunny
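The codes above are alphabetical because `LabelEncoder` sorts the distinct labels before numbering them. The following sketch reproduces that behavior in plain Python to make the mapping explicit; `fit_label_mapping` is a hypothetical helper, not a scikit-learn function.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer, numbering labels in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(values)))}


cloud_cover = ["partly cloudy", "clear", "overcast", "cloudy", "clear"]

mapping = fit_label_mapping(cloud_cover)
encoded = [mapping[value] for value in cloud_cover]

# Invert the mapping to turn integer codes back into labels
inverse = {index: label for label, index in mapping.items()}
decoded = [inverse[code] for code in encoded]

print(mapping)
print(encoded)  # [3, 0, 2, 1, 0]
```

The integers carry no real order (cloudy is not "less than" overcast). Tree-based models usually tolerate this because they split on thresholds, but for other model families one-hot encoding is the more common choice for nominal features.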

# Build the feature matrix x and the target y
x = new_data.drop(['Weather Type'],axis=1)
y = new_data['Weather Type']

# Split into training and test sets (70% / 30%)
x_train,x_test,y_train,y_test = train_test_split(x,y,
                                                 test_size=0.3,
                                                 random_state=15)

# Build and fit the random forest model
rf_clf = RandomForestClassifier(random_state=15)
rf_clf.fit(x_train, y_train)

# Predict on the test set and report per-class metrics
y_pred_rf = rf_clf.predict(x_test)
class_report_rf = classification_report(y_test, y_pred_rf)
print(class_report_rf)

Output:

              precision    recall  f1-score   support

           0       0.88      0.91      0.89       871
           1       0.93      0.91      0.92       983
           2       0.92      0.93      0.92       929
           3       0.92      0.90      0.91       925

    accuracy                           0.91      3708
   macro avg       0.91      0.91      0.91      3708
weighted avg       0.91      0.91      0.91      3708
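The columns of the report can be reproduced by hand, which helps when reading it: precision is the share of predictions for a class that are correct, recall is the share of true members of a class that are found, F1 is their harmonic mean, and support is the number of true members. The sketch below computes them on ten made-up labels (not the model's actual predictions); `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn


# Toy labels using the same 0-3 encoding as the report
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 3, 3]

report = {label: per_class_metrics(y_true, y_pred, label) for label in sorted(set(y_true))}
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (precision, recall, f1, support) in report.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
print(f"accuracy={accuracy:.2f}")
```

In the report above, the support column sums to 3708, which is the 30% test split of the 12360 cleaned rows.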

6. Analysis of results

# Rank the features by the importance scores of the fitted forest
feature_importances = rf_clf.feature_importances_
features_rf = pd.DataFrame({'Feature': x.columns, 'Importance': feature_importances})
features_rf.sort_values(by='Importance', ascending=False, inplace=True)
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=features_rf)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Random forest feature importances')
plt.show()

Output:
[Figure: bar chart of random forest feature importances]

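The random forest model reaches about 91% accuracy on the test set. The feature importance analysis indicates that the main factors driving the model are temperature, humidity, UV index, visibility, and atmospheric pressure.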
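The `feature_importances_` attribute used here is impurity-based, and such scores can favor features with many distinct values. A common cross-check is permutation importance: shuffle one feature column and measure how much accuracy drops. The sketch below hand-rolls that idea on a toy rule-based model so it runs without any dependencies; the model, data, and function names are all assumptions for illustration. scikit-learn offers a ready-made version in `sklearn.inspection.permutation_importance`.

```python
import random


def accuracy(model, X, y):
    """Share of rows for which the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_features, rng):
    """Accuracy drop after shuffling each feature column in turn."""
    baseline = accuracy(model, X, y)
    drops = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between feature j and the labels
        X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
        drops.append(baseline - accuracy(model, X_perm, y))
    return drops


def toy_model(row):
    """Toy classifier: uses temperature (feature 0) only and ignores humidity."""
    return "Snowy" if row[0] < 0 else "Sunny"


rng = random.Random(15)
# Feature 0: temperature in °C, feature 1: humidity in % (made-up values)
X = [[rng.uniform(-20, 40), rng.uniform(20, 100)] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y, n_features=2, rng=rng)
print(importances)
```

Shuffling the temperature column hurts accuracy, while shuffling humidity changes nothing because the toy model never reads it. Applied to the fitted forest on the test set, the same procedure would show whether the ranking in the bar chart holds up.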

Original article: https://blog.csdn.net/lihuhelihu/article/details/142213834
