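R5 Weather Recognition Study Notes
- 🍨 This post is a learning-record blog for the 🔗365天深度学习训练营 (365-Day Deep Learning Training Camp)
- 🍖 Original author: K同学啊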
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, Activation, Dropout
from tensorflow.python.keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error , mean_absolute_percentage_error , mean_squared_error
data = pd.read_csv(r"C:\Users\11054\Desktop\kLearning\R5\weatherAUS.csv")
df = data.copy()
data.head()
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |
5 rows × 23 columns
data.describe()
| | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 143975.000000 | 144199.000000 | 142199.000000 | 82670.000000 | 75625.000000 | 135197.000000 | 143693.000000 | 142398.000000 | 142806.000000 | 140953.000000 | 130395.00000 | 130432.000000 | 89572.000000 | 86102.000000 | 143693.000000 | 141851.00000 |
mean | 12.194034 | 23.221348 | 2.360918 | 5.468232 | 7.611178 | 40.035230 | 14.043426 | 18.662657 | 68.880831 | 51.539116 | 1017.64994 | 1015.255889 | 4.447461 | 4.509930 | 16.990631 | 21.68339 |
std | 6.398495 | 7.119049 | 8.478060 | 4.193704 | 3.785483 | 13.607062 | 8.915375 | 8.809800 | 19.029164 | 20.795902 | 7.10653 | 7.037414 | 2.887159 | 2.720357 | 6.488753 | 6.93665 |
min | -8.500000 | -4.800000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 980.50000 | 977.100000 | 0.000000 | 0.000000 | -7.200000 | -5.40000 |
25% | 7.600000 | 17.900000 | 0.000000 | 2.600000 | 4.800000 | 31.000000 | 7.000000 | 13.000000 | 57.000000 | 37.000000 | 1012.90000 | 1010.400000 | 1.000000 | 2.000000 | 12.300000 | 16.60000 |
50% | 12.000000 | 22.600000 | 0.000000 | 4.800000 | 8.400000 | 39.000000 | 13.000000 | 19.000000 | 70.000000 | 52.000000 | 1017.60000 | 1015.200000 | 5.000000 | 5.000000 | 16.700000 | 21.10000 |
75% | 16.900000 | 28.200000 | 0.800000 | 7.400000 | 10.600000 | 48.000000 | 19.000000 | 24.000000 | 83.000000 | 66.000000 | 1022.40000 | 1020.000000 | 7.000000 | 7.000000 | 21.600000 | 26.40000 |
max | 33.900000 | 48.100000 | 371.000000 | 145.000000 | 14.500000 | 135.000000 | 130.000000 | 87.000000 | 100.000000 | 100.000000 | 1041.00000 | 1039.600000 | 9.000000 | 9.000000 | 40.200000 | 46.70000 |
data.dtypes
Date object
Location object
MinTemp float64
MaxTemp float64
Rainfall float64
Evaporation float64
Sunshine float64
WindGustDir object
WindGustSpeed float64
WindDir9am object
WindDir3pm object
WindSpeed9am float64
WindSpeed3pm float64
Humidity9am float64
Humidity3pm float64
Pressure9am float64
Pressure3pm float64
Cloud9am float64
Cloud3pm float64
Temp9am float64
Temp3pm float64
RainToday object
RainTomorrow object
dtype: object
# Convert the Date column to datetime format
data['Date'] = pd.to_datetime(data['Date'])
data['year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.day
data.head()
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | year | Month | day |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No | 2008 | 12 | 1 |
1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No | 2008 | 12 | 2 |
2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No | 2008 | 12 | 3 |
3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No | 2008 | 12 | 4 |
4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No | 2008 | 12 | 5 |
5 rows × 26 columns
data.drop('Date',axis=1,inplace=True)
data.columns
Index(['Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
'Temp3pm', 'RainToday', 'RainTomorrow', 'year', 'Month', 'day'],
dtype='object')
data_select = data.select_dtypes(include=[np.number])
# Create the figure in the same cell as the heatmap so that figsize takes effect
plt.figure(figsize=(15,13))
ax = sns.heatmap(data_select.corr(), square=True, annot=True, fmt='.2f')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
# Set the style and palette
sns.set(style="whitegrid", palette="Set2")
# Create a layout with 1 row and 2 columns
fig, axes = plt.subplots(1, 2, figsize=(10, 4)) # figure size (10, 4)
# Title style for the charts
title_font = {'fontsize': 14, 'fontweight': 'bold', 'color': 'darkblue'}
# First chart: RainTomorrow
sns.countplot(x='RainTomorrow', data=data, ax=axes[0], edgecolor='black') # add bar outlines
axes[0].set_title('Rain Tomorrow', fontdict=title_font) # set the title
axes[0].set_xlabel('Will it Rain Tomorrow?', fontsize=12) # x-axis label
axes[0].set_ylabel('Count', fontsize=12) # y-axis label
axes[0].tick_params(axis='x', labelsize=11) # x-axis tick font size
axes[0].tick_params(axis='y', labelsize=11) # y-axis tick font size
# Second chart: RainToday
sns.countplot(x='RainToday', data=data, ax=axes[1], edgecolor='black') # add bar outlines
axes[1].set_title('Rain Today', fontdict=title_font) # set the title
axes[1].set_xlabel('Did it Rain Today?', fontsize=12) # x-axis label
axes[1].set_ylabel('Count', fontsize=12) # y-axis label
axes[1].tick_params(axis='x', labelsize=11) # x-axis tick font size
axes[1].tick_params(axis='y', labelsize=11) # y-axis tick font size
sns.despine() # remove the top and right spines
plt.tight_layout() # adjust the layout so the subplots do not overlap
plt.show()
x=pd.crosstab(data['RainTomorrow'],data['RainToday'])
x
RainTomorrow \ RainToday | No | Yes |
---|---|---|
No | 92728 | 16858 |
Yes | 16604 | 14597 |
y=x/x.transpose().sum().values.reshape(2,1)*100
y
RainTomorrow \ RainToday | No | Yes |
---|---|---|
No | 84.616648 | 15.383352 |
Yes | 53.216243 | 46.783757 |
y.plot(kind="bar",figsize=(4,3),color=['#006666','#d279a6']);
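The row-wise percentages above can be reproduced without the `transpose().sum().values.reshape(...)` chain. The following is a minimal sketch that rebuilds the table from the counts shown in the crosstab output; the variable names `counts` and `row_pct` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the crosstab above (rows: RainTomorrow, columns: RainToday).
counts = pd.DataFrame(
    [[92728, 16858], [16604, 14597]],
    index=pd.Index(["No", "Yes"], name="RainTomorrow"),
    columns=pd.Index(["No", "Yes"], name="RainToday"),
)

# Divide each row by its own total so that every row sums to 100.
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100
print(row_pct.round(2))
```

On the real data, `pd.crosstab(data['RainTomorrow'], data['RainToday'], normalize='index') * 100` should give the same row percentages in one call.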
Percentage of rainy days
x=pd.crosstab(data['Location'],data['RainToday'])
# Percentage of rainy and non-rainy days for each city
y=x/x.transpose().sum().values.reshape((-1, 1))*100
# Sort the cities by their percentage of rainy days
y=y.sort_values(by='Yes',ascending=True )
color=['#cc6699','#006699','#006666','#862d86','#ff9966' ]
y.Yes.plot(kind="barh",figsize=(15,20),color=color)
<Axes: ylabel='Location'>
data.columns
Index(['Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
'Temp3pm', 'RainToday', 'RainTomorrow', 'year', 'Month', 'day'],
dtype='object')
plt.figure(figsize=(8,6))
sns.scatterplot(data=data,x='Pressure9am',
y='Pressure3pm',hue='RainTomorrow');
plt.figure(figsize=(8,6))
sns.scatterplot(data=data,x='Humidity9am',
y='Humidity3pm',hue='RainTomorrow');
plt.figure(figsize=(8,6))
sns.scatterplot(x='MaxTemp', y='MinTemp',
data=data, hue='RainTomorrow');
Data preprocessing
# Percentage of missing data in each column
data.isnull().sum()/data.shape[0]*100
Location 0.000000
MinTemp 1.020899
MaxTemp 0.866905
Rainfall 2.241853
Evaporation 43.166506
Sunshine 48.009762
WindGustDir 7.098859
WindGustSpeed 7.055548
WindDir9am 7.263853
WindDir3pm 2.906641
WindSpeed9am 1.214767
WindSpeed3pm 2.105046
Humidity9am 1.824557
Humidity3pm 3.098446
Pressure9am 10.356799
Pressure3pm 10.331363
Cloud9am 38.421559
Cloud3pm 40.807095
Temp9am 1.214767
Temp3pm 2.481094
RainToday 2.241853
RainTomorrow 2.245978
year 0.000000
Month 0.000000
day 0.000000
dtype: float64
# Fill missing values with values drawn at random from the same column
lst=['Evaporation','Sunshine','Cloud9am','Cloud3pm']
for col in lst:
    fill_list = data[col].dropna()
    data[col] = data[col].fillna(pd.Series(np.random.choice(fill_list, size=len(data.index)), index=data.index))
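To make the random-sample imputation step easier to inspect, here is a minimal sketch on a toy series. The helper name `fill_with_random_observed`, the toy data, and the fixed seed are assumptions made for illustration; the notebook itself uses `np.random.choice` without a seed.

```python
import numpy as np
import pandas as pd

def fill_with_random_observed(series, rng):
    """Replace each NaN with a value drawn at random from the observed entries."""
    observed = series.dropna().to_numpy()
    missing = series.isna()
    filled = series.copy()
    # Draw exactly one replacement per missing entry.
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
toy = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])  # hypothetical data
filled = fill_with_random_observed(toy, rng)
print(filled.tolist())
```

Compared with filling by the median, sampling from the observed values roughly preserves the spread of a column, which matters here because these four columns are missing in about 38% to 48% of the rows.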
s = (data.dtypes == "object")
object_cols = list(s[s].index)
object_cols
['Location',
'WindGustDir',
'WindDir9am',
'WindDir3pm',
'RainToday',
'RainTomorrow']
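# data[i].mode()[0] returns the most frequent value (the mode) of the column
# Assign the result back to the column instead of calling fillna(inplace=True) on a column selection
for i in object_cols:
    data[i] = data[i].fillna(data[i].mode()[0])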
t = (data.dtypes == "float64")
num_cols = list(t[t].index)
num_cols
['MinTemp',
'MaxTemp',
'Rainfall',
'Evaporation',
'Sunshine',
'WindGustSpeed',
'WindSpeed9am',
'WindSpeed3pm',
'Humidity9am',
'Humidity3pm',
'Pressure9am',
'Pressure3pm',
'Cloud9am',
'Cloud3pm',
'Temp9am',
'Temp3pm']
# .median() returns the median of the column
for i in num_cols:
    data[i] = data[i].fillna(data[i].median())
data.isnull().sum()
Location 0
MinTemp 0
MaxTemp 0
Rainfall 0
Evaporation 0
Sunshine 0
WindGustDir 0
WindGustSpeed 0
WindDir9am 0
WindDir3pm 0
WindSpeed9am 0
WindSpeed3pm 0
Humidity9am 0
Humidity3pm 0
Pressure9am 0
Pressure3pm 0
Cloud9am 0
Cloud3pm 0
Temp9am 0
Temp3pm 0
RainToday 0
RainTomorrow 0
year 0
Month 0
day 0
dtype: int64
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for i in object_cols:
    data[i] = label_encoder.fit_transform(data[i])
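Because a single `LabelEncoder` is refitted for every column in the loop above, only the classes of the last column remain available for decoding afterwards. The sketch below mimics label encoding in plain Python (sorted distinct labels mapped to consecutive integers) and keeps one mapping per column. It is an illustration of the idea, not scikit-learn's implementation, and the column contents are hypothetical.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}

# Hypothetical mini-columns standing in for the real DataFrame columns.
columns = {
    "RainToday": ["No", "Yes", "No"],
    "WindGustDir": ["W", "NE", "W"],
}

mappings = {}   # one mapping per column, kept for later decoding
encoded = {}
for name, values in columns.items():
    mappings[name] = fit_label_mapping(values)
    encoded[name] = [mappings[name][v] for v in values]

# Invert a stored mapping to recover the original labels.
inverse = {code: label for label, code in mappings["WindGustDir"].items()}
decoded = [inverse[c] for c in encoded["WindGustDir"]]
print(encoded, decoded)
```

With scikit-learn the same effect could be had by storing a separate fitted encoder per column in a dictionary, which is worth doing if the integer codes ever need to be mapped back to names such as locations or wind directions.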
X = data.drop(['RainTomorrow','day'],axis=1).values
y = data['RainTomorrow'].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=101)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
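The scaler is fitted on the training split only and then applied to both splits. As a rough illustration of what that means, the NumPy sketch below applies the min-max formula by hand to a hypothetical one-column dataset; it is not the scikit-learn implementation.

```python
import numpy as np

# Hypothetical single-feature data; the notebook scales all feature columns at once.
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[15.0], [40.0]])

# "fit": learn the per-column minimum and maximum from the training split only.
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)

# "transform": apply the same affine map to both splits.
X_train_scaled = (X_train_toy - col_min) / (col_max - col_min)
X_test_scaled = (X_test_toy - col_min) / (col_max - col_min)

# The test value 40 exceeds the training maximum, so it maps to 1.5 (outside [0, 1]).
print(X_train_scaled.ravel(), X_test_scaled.ravel())
```

Fitting on the training split alone keeps test-set statistics out of preprocessing; the price is that test values can fall slightly outside the [0, 1] range.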
model = Sequential()
model.add(Dense(units=24,activation='tanh'))
model.add(Dense(units=18,activation='tanh'))
model.add(Dense(units=23,activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(units=12,activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(units=1,activation='sigmoid'))
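To get a feel for how small this network is, the parameter count of the stacked Dense layers can be worked out by hand. The sketch below assumes 23 input features (the 25 columns listed earlier minus `RainTomorrow` and `day`); the helper `dense_param_count` is hypothetical. Dropout layers add no trainable parameters.

```python
def dense_param_count(n_inputs, layer_units):
    """Return per-layer and total trainable parameters of stacked Dense layers."""
    counts = []
    fan_in = n_inputs
    for units in layer_units:
        counts.append(fan_in * units + units)  # weights + biases
        fan_in = units
    return counts, sum(counts)

# Assumption: 23 input features (25 columns minus 'RainTomorrow' and 'day').
per_layer, total = dense_param_count(23, [24, 18, 23, 12, 1])
print(per_layer, total)
```

Under that assumption the model has 1,764 trainable parameters, which is tiny relative to the roughly 109,000 training rows.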
from tensorflow.python.keras.optimizers import adam_v2
optimizer = adam_v2.Adam(1e-3)
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=["accuracy"])
early_stop = EarlyStopping(monitor='val_loss',
mode='min',
min_delta=0.001,
verbose=1,
patience=25,
restore_best_weights=True)
Model training
model.fit(x=X_train,
y=y_train,
validation_data=(X_test, y_test), verbose=1,
callbacks=[early_stop],
epochs = 10,
batch_size = 32
)
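PS: the `isinstance(ds, input_lib.DistributedDatasetInterface)` check requires a change to the library source code; making it return False directly is enough. This error is typically associated with importing from the private `tensorflow.python.keras` package, and importing from the public `tensorflow.keras` package instead usually avoids the need to patch the source.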
Epoch 1/10
3410/3410 [==============================] - 6s 2ms/step - loss: 0.3992 - accuracy: 0.8299 - val_loss: 0.3842 - val_accuracy: 0.8312
Epoch 2/10
3410/3410 [==============================] - 6s 2ms/step - loss: 0.3829 - accuracy: 0.8364 - val_loss: 0.3731 - val_accuracy: 0.8402
Epoch 3/10
3410/3410 [==============================] - 5s 2ms/step - loss: 0.3805 - accuracy: 0.8376 - val_loss: 0.3828 - val_accuracy: 0.8282
Epoch 4/10
3410/3410 [==============================] - 6s 2ms/step - loss: 0.3783 - accuracy: 0.8378 - val_loss: 0.3887 - val_accuracy: 0.8346
Epoch 5/10
3410/3410 [==============================] - 6s 2ms/step - loss: 0.3758 - accuracy: 0.8393 - val_loss: 0.3715 - val_accuracy: 0.8412
Epoch 6/10
3410/3410 [==============================] - 6s 2ms/step - loss: 0.3747 - accuracy: 0.8392 - val_loss: 0.3800 - val_accuracy: 0.8381
Epoch 7/10
3410/3410 [==============================] - 6s 2ms/step - loss: 0.3729 - accuracy: 0.8402 - val_loss: 0.3655 - val_accuracy: 0.8419
Epoch 8/10
3410/3410 [==============================] - 6s 2ms/step - loss: 0.3727 - accuracy: 0.8401 - val_loss: 0.3649 - val_accuracy: 0.8422
Epoch 9/10
3410/3410 [==============================] - 6s 2ms/step - loss: 0.3717 - accuracy: 0.8408 - val_loss: 0.3669 - val_accuracy: 0.8416
Epoch 10/10
3410/3410 [==============================] - 6s 2ms/step - loss: 0.3709 - accuracy: 0.8408 - val_loss: 0.3765 - val_accuracy: 0.8403
<tensorflow.python.keras.callbacks.History at 0x27751cb3b50>
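With `patience=25` and only 10 epochs, the early-stopping callback cannot trigger in this run. The following simplified simulation of patience-based stopping, applied to the `val_loss` values from the log above, illustrates this. It is a sketch of the general idea and not the Keras implementation; the function name `simulate_early_stopping` is made up for this example.

```python
def simulate_early_stopping(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based; stop_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # counts as an improvement
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# val_loss values copied from the training log above.
val_losses = [0.3842, 0.3731, 0.3828, 0.3887, 0.3715,
              0.3800, 0.3655, 0.3649, 0.3669, 0.3765]

print(simulate_early_stopping(val_losses, patience=25, min_delta=0.001))
print(simulate_early_stopping(val_losses, patience=2, min_delta=0.001))
```

In this simulation the first call returns `(None, 7)`: training is never stopped, and epoch 8 is not counted as an improvement because its gain over epoch 7 (0.0006) is smaller than `min_delta`. With a hypothetical `patience=2` the run would stop at epoch 4, with epoch 2 as the best epoch so far.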
Visualizing the results
import matplotlib.pyplot as plt
acc = model.history.history['accuracy']
val_acc = model.history.history['val_accuracy']
loss = model.history.history['loss']
val_loss = model.history.history['val_loss']
epochs_range = range(10)
plt.figure(figsize=(14, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()
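To put the roughly 84% validation accuracy in context, it helps to compare it with a trivial classifier that always predicts the majority class. The sketch below uses the counts from the RainTomorrow / RainToday crosstab earlier in the post. That crosstab leaves out rows with missing values, whereas the training data has them filled in, so the figure is only an approximation of the baseline on the actual splits.

```python
# Counts copied from the RainTomorrow x RainToday crosstab earlier in the post.
no_tomorrow = 92728 + 16858    # rows where RainTomorrow == 'No'
yes_tomorrow = 16604 + 14597   # rows where RainTomorrow == 'Yes'

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_acc = max(no_tomorrow, yes_tomorrow) / (no_tomorrow + yes_tomorrow)
print(f"majority-class baseline accuracy: {baseline_acc:.4f}")
```

The baseline is about 0.78, so the network's gain over always predicting "No" is about six percentage points. Because the classes are imbalanced, per-class precision and recall (for example via the `classification_report` imported at the top) would give a fuller picture than accuracy alone.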
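Personal summary
Applications of RNNs and their variants in sequence data processing
1. RNN
An RNN (Recurrent Neural Network) is a class of neural network models designed for sequence data. Unlike traditional feed-forward networks (such as fully connected MLPs or convolutional CNNs), an RNN passes information between the time steps of a sequence and can therefore "remember" earlier inputs. This makes RNNs effective for tasks that depend on context or temporal order, such as natural language processing (NLP), time series forecasting, and speech recognition.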
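2. Basic structure of an RNN
The basic structure of an RNN can be summarized as follows:
Sequential processing: at each time step, the RNN receives the current input and combines it with the hidden state from the previous time step to update the current hidden state.
Recurrent structure: the same update is applied at every time step, and the updated hidden state is then used to produce that step's output.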
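The update described above can be written in a few lines of NumPy. This is a minimal sketch of a vanilla RNN forward pass, assuming the common formulation h_t = tanh(W_x x_t + W_h h_{t-1} + b); the sizes, the random weights, and the names `rnn_forward`, `W_x`, `W_h` are illustrative assumptions.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b, h0):
    """Run a vanilla RNN over a sequence and return all hidden states."""
    h = h0
    states = []
    for x_t in inputs:                       # one iteration per time step
        # The same W_x, W_h and b are reused at every step.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5     # hypothetical sizes
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
inputs = rng.normal(size=(seq_len, input_dim))

states = rnn_forward(inputs, W_x, W_h, b, h0=np.zeros(hidden_dim))
print(states.shape)  # one hidden state per time step
```

Since the loop reuses the same three arrays at every step, the number of parameters does not depend on the sequence length.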
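3. Key features of an RNN
Recurrence: an RNN repeatedly feeds its past hidden state back into the network and combines it with the current input to determine the new hidden state; in effect, the network is unrolled "in a loop" along the time axis.
Parameter sharing: an RNN uses the same set of weights at every time step of the sequence. This differs from an ordinary multilayer perceptron (MLP), where each layer has its own set of weights.
4. Strengths and limitations of RNNs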
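Strengths:
Suited to sequence data: compared with a traditional fully connected network, an RNN handles variable-length sequence inputs better and captures the temporal dependencies in a sequence.
Parameter sharing: because the weights are reused across time steps, the parameter count does not grow with the sequence length.
Limitations:
Long-term dependencies: in a classic RNN, as the sequence grows longer, information from early inputs often fails to reach later time steps, because gradients tend to vanish or explode during training.
Training efficiency: because the sequence must be unrolled and trained with backpropagation through time (BPTT), time steps are processed one after another, and training is usually slower than for highly parallel convolutional networks.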
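A toy calculation shows why long sequences are a problem. In BPTT, the gradient that reaches an early time step is scaled roughly once per step it travels back. The sketch below replaces that per-step scaling with a single constant factor, which is a deliberate simplification of the real product of Jacobian matrices.

```python
def backprop_factor(per_step_factor, n_steps):
    """Magnitude of a gradient after being scaled by the same factor n_steps times."""
    return per_step_factor ** n_steps

for factor in (0.9, 1.0, 1.1):
    print(factor, backprop_factor(factor, 100))
```

Over 100 steps, a factor of 0.9 shrinks the gradient to about 2.7e-5 (vanishing), while a factor of 1.1 grows it to more than 10,000 (exploding). Only a factor very close to 1 keeps the signal usable.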
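5. Common RNN variants: LSTM and GRU
To overcome these limitations, recurrent architectures with "gating" mechanisms were proposed; the most typical are LSTM and GRU.
LSTM (Long Short-Term Memory):
Memory cell (cell state): an LSTM stores long-term information in its cell state.
Gating: an LSTM introduces a forget gate, an input gate, and an output gate to control the flow of information, which helps preserve gradient information over long spans and mitigates the vanishing-gradient problem.
Applications: in many NLP tasks, LSTMs generally outperform plain RNNs.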
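The gating described above can be sketched as a single LSTM time step in NumPy. This follows the standard textbook formulation with the four weight blocks stacked into one matrix `W`; the gate ordering inside `W`, the sizes, and the random weights are assumptions made for illustration and do not correspond to any particular library's layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * hidden, input + hidden); b has shape (4 * hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate cell content
    c_t = f * c_prev + i * g                # additive cell-state update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4   # hypothetical sizes
W = rng.normal(scale=0.5, size=(4 * hidden_dim, input_dim + hidden_dim))
b = np.zeros(4 * hidden_dim)
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

The key line is the additive update `c_t = f * c_prev + i * g`: when the forget gate stays close to 1, the cell state and its gradient can pass through many steps with little attenuation.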
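GRU (Gated Recurrent Unit):
Simplified structure: a GRU is structurally simpler than an LSTM, with only an update gate and a reset gate; despite the simpler structure, it can still retain long-term dependencies to some extent.
Performance: on some tasks a GRU performs on par with an LSTM while training faster.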
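One way to see where the GRU's speed advantage comes from is to compare parameter counts. The sketch below assumes the textbook formulations, in which an LSTM has four weight blocks (three gates plus the candidate) and a GRU has three (two gates plus the candidate), each acting on the concatenated input and previous hidden state. The sizes are hypothetical, and some library implementations add extra bias terms, so exact counts can differ.

```python
def gated_rnn_params(input_dim, hidden_dim, n_blocks):
    """Parameters of n_blocks weight blocks, each mapping [x_t, h_prev] to hidden_dim units plus a bias."""
    return n_blocks * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

input_dim, hidden_dim = 23, 64   # hypothetical sizes
lstm_params = gated_rnn_params(input_dim, hidden_dim, n_blocks=4)  # 3 gates + candidate
gru_params = gated_rnn_params(input_dim, hidden_dim, n_blocks=3)   # 2 gates + candidate
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width, and correspondingly less computation per time step.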
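6. Example applications of RNNs
Natural language processing (NLP):
- Sentiment analysis
- Text classification
- Machine translation
- Text generation
Time series forecasting:
- Stock prediction
- Temperature prediction
- Signal processing
Speech recognition or synthesis:
- Processing audio sequences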
Original article: https://blog.csdn.net/Inface0443/article/details/145194055