yjs13: pandas data discretization + merging data tables
1. Data discretization
1.1 What is data discretization?
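In pandas, data discretization (also called binning or bucketing) is the process of converting continuous data into discrete data. In other words, a numeric column with a continuous range is divided into several intervals (or "bins") according to some rule, and each data point is mapped to the interval it falls in.
Discretization is typically applied to continuous numeric features, either to simplify analysis or to prepare them for classification tasks. In practice, the data is split either into fixed intervals or, using statistics such as quantiles, into parts of roughly equal size.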
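To make the difference between fixed intervals and quantile-based intervals concrete, here is a minimal sketch. The sample values are hypothetical and chosen to be skewed, so that the two strategies give visibly different bin counts.

```python
import pandas as pd

# Hypothetical skewed sample: eight small values and one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Fixed intervals: three bins of equal width over the range of the data.
equal_width = pd.cut(values, 3)

# Quantile-based intervals: three bins holding roughly equal numbers of points.
equal_count = pd.qcut(values, 3)

# Count the points per bin, listed in interval order.
print(equal_width.value_counts().sort_index())  # counts: 8, 0, 1
print(equal_count.value_counts().sort_index())  # counts: 3, 3, 3
```

With equal-width bins the outlier stretches the range, so almost everything lands in the first bin; quantile-based bins adapt their edges to the data instead.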
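1.2 How to discretize data
1.2.1 Roughly equal-sized bins: pd.qcut()
data1 = pd.qcut(data.column1, k)  # data.column1: the column to bin; k: the number of bins
To view the overall counts per bin:
data1.value_counts()
Notes:
1. It is pd.qcut, not data.qcut: qcut is a top-level pandas function, not a method of the data.
2. It is data.value_counts(), called on the binned result, not pd.value_counts().
3. The input is normally not a whole table but a single attribute of the table, that is, one column.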
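The three notes above can be illustrated with a small sketch. The `scores` Series is made-up data, and `retbins=True` is used only to expose the bin edges that pd.qcut computed.

```python
import pandas as pd

# Hypothetical single column of data (a Series, not a whole DataFrame).
scores = pd.Series([55, 61, 67, 72, 78, 83, 88, 94], name="score")

# qcut is called on the pandas module; retbins=True also returns the bin edges.
quartiles, edges = pd.qcut(scores, 4, retbins=True)

# value_counts is called on the binned result.
counts = quartiles.value_counts().sort_index()

print(edges)   # five edges, from the minimum (55) to the maximum (94)
print(counts)  # two values in each of the four bins
```

Because the edges come from quantiles of the data, they change whenever the data changes; only the number of bins is fixed in advance.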
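1.2.2 Binning by specified intervals: pd.cut()
data2 = pd.cut(data.column1, bins)  # data.column1: the column to bin; bins: the list of bin edges
For example, with bins=[0,30,50,100] the intervals are (0,30], (30,50], ... Each interval is open on the left and closed on the right by default, so a value equal to the first edge (0 here) falls outside every bin and becomes NaN.
To view the overall counts per bin: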
data2.value_counts()
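The sketch below shows how explicit bin edges behave at the boundaries. The `ages` values and the label names are hypothetical; `include_lowest` and `labels` are standard pd.cut parameters.

```python
import pandas as pd

# Hypothetical values, including edge cases at and beyond the bin edges.
ages = pd.Series([0, 15, 30, 31, 50, 100, 120])
bins = [0, 30, 50, 100]

# Default intervals are left-open, right-closed: (0, 30], (30, 50], (50, 100].
# 0 sits on the open left edge and 120 lies past the last edge, so both become NaN.
binned = pd.cut(ages, bins)
print(binned)

# include_lowest=True closes the first interval on the left, so 0 is kept.
binned_closed = pd.cut(ages, bins, include_lowest=True)
print(binned_closed.isna().sum())  # only 120 is left unbinned

# labels= replaces the interval names with readable categories.
labeled = pd.cut(ages, bins, labels=["low", "mid", "high"])
print(labeled.value_counts().sort_index())  # low: 2, mid: 2, high: 1
```

This is why the full example further down uses -1 as its first edge: with a first edge of 0, rows whose value is exactly 0 would not be assigned to any bin.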
1.3 Implementing one-hot encoding
data_one = pd.get_dummies(data1.column1)  # data1.column1: the column to encode
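As a small illustration of what pd.get_dummies returns, the sketch below encodes a made-up `continent` column. Passing `dtype=int` is a choice made here to get 0/1 values; recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical table with one categorical column.
df = pd.DataFrame({"continent": ["AS", "EU", "AF", "EU"]})

# One indicator column per distinct value, in sorted order: AF, AS, EU.
one_hot = pd.get_dummies(df["continent"], dtype=int)
print(one_hot)

# prefix= labels the new columns; pd.concat attaches them to the original table.
dummies = pd.get_dummies(df["continent"], prefix="continent", dtype=int)
encoded = pd.concat([df, dummies], axis=1)
print(encoded.columns.tolist())
```

Each row has exactly one 1, in the column matching its original value.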
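2. Merging data tables
2.1 Using pd.concat()
data_concat = pd.concat([data1, data2, ...], axis=0)  # axis is 0 or 1
axis=0 (the default) stacks the tables vertically: rows are appended and columns are aligned by name.
axis=1 places the tables side by side: columns are appended and rows are aligned by index.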
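The effect of the axis argument is easiest to see from the resulting shapes. The two tiny tables below are hypothetical; `join="inner"` is shown as one way to keep only the shared columns.

```python
import pandas as pd

# Two hypothetical tables that share column "a" and row label "r2".
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])
right = pd.DataFrame({"a": [5, 6], "c": [7, 8]}, index=["r2", "r3"])

# axis=0: rows are stacked; the result has the union of the columns (a, b, c).
stacked = pd.concat([left, right], axis=0)
print(stacked.shape)  # (4, 3)

# axis=1: tables sit side by side; the result has the union of the row labels.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side.shape)  # (3, 4)

# join="inner" keeps only the columns present in every table.
shared_only = pd.concat([left, right], axis=0, join="inner")
print(shared_only.columns.tolist())  # ['a']
```

Cells with no matching column (for axis=0) or no matching row label (for axis=1) are filled with NaN, which is why integer columns can turn into floats after concatenation.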
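2.2 Using pd.merge()
data = pd.merge(t_left, t_right, on=["key1", "key2"], how="left")  # how is one of 'left', 'right', 'inner', 'outer'
t_left: the table used as the left table
t_right: the table used as the right table
on: the key column(s) to join on
how: the join type, with the same meaning as the corresponding SQL joins (see any database reference for details). 'left' keeps every row of the left table, 'right' keeps every row of the right table, 'inner' (the default) keeps only rows whose keys appear in both tables, and 'outer' keeps all rows from both, filling missing values with NaN.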
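The four join types can be compared on a pair of small tables. The table contents and the names `left`, `right`, and `keys_by_how` are made up for this sketch; `indicator=True` is a standard pd.merge option that records where each row came from.

```python
import pandas as pd

# Hypothetical tables: keys "b" and "c" appear in both, "a" and "d" in only one.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

keys_by_how = {}
for how in ["left", "right", "inner", "outer"]:
    merged = pd.merge(left, right, on="key", how=how)
    keys_by_how[how] = merged["key"].tolist()
    print(how, keys_by_how[how])
# left  -> ['a', 'b', 'c']
# right -> ['b', 'c', 'd']
# inner -> ['b', 'c']
# outer -> ['a', 'b', 'c', 'd']

# indicator=True adds a _merge column: left_only, right_only, or both.
audit = pd.merge(left, right, on="key", how="outer", indicator=True)
print(audit)
```

Unlike pd.concat, pd.merge matches rows by the values in the key columns, not by the index, so the result gets a fresh 0-based index.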
Code:
# Data discretization + merging data tables
import numpy as np
import pandas as pd

data1 = pd.read_csv("./百度云笔记/data/drinks.csv").head(20)
print(data1)

# Quantile-based discretization (roughly equal-sized bins)
print("Split the wine_servings column of data1 into 5 roughly equal-sized bins:")
data11 = pd.qcut(data1.wine_servings, 5)
print(data11)
print("Counts per bin:")
print(data11.value_counts())

# Discretization with explicitly specified bin edges
# The first edge is -1 so that rows with a value of 0 fall inside (-1, 20]
bins = [-1, 20, 30, 50, 100, 200, 300, 400]
print("Split according to the intervals given by bins:")
data12 = pd.cut(data1.wine_servings, bins)
print(data12)
print("Counts per bin:")
print(data12.value_counts())

# One-hot encoding
data_one = pd.get_dummies(data1.country)
print("One-hot encode the country column:")
print(data_one)
print("=================================================================================================")

# Merging data tables
# Column names are kept in Chinese: 语文 = Chinese, 数学 = Math, 英语 = English, 政治 = Politics
# pd.concat:
data_1 = np.array([[20, 80, 90], [90, 99, 87], [80, 80, 87]])
data_2 = np.array([[20, 66, 90], [89, 90, 90], [99, 100, 88], [90, 100, 97]])
data_left = pd.DataFrame(data_1, columns=["语文", "数学", "英语"],
                         index=["stu_A", "stu_B", "stu_C"])
data_right = pd.DataFrame(data_2, columns=["语文", "政治", "英语"],
                          index=["stu_A", "x", "y", "stu_B"])
print(data_left)
print(data_right)
data_con1 = pd.concat([data_left, data_right], axis=0)  # stack rows vertically
data_con2 = pd.concat([data_left, data_right], axis=1)  # place tables side by side
print("Concatenate with axis=0 (rows stacked, columns aligned by name):")
print(data_con1)
print("Concatenate with axis=1 (tables side by side, rows aligned by index):")
print(data_con2)

# pd.merge: join on the two shared columns
data_m_l = pd.merge(data_left, data_right, on=["语文", "英语"], how='left')
data_m_r = pd.merge(data_left, data_right, on=["语文", "英语"], how='right')
data_m_i = pd.merge(data_left, data_right, on=["语文", "英语"], how='inner')
data_m_o = pd.merge(data_left, data_right, on=["语文", "英语"], how='outer')
print("Left join:")
print(data_m_l)
print("Right join:")
print(data_m_r)  # print the right-join result, not the original data_right
print("Inner join:")
print(data_m_i)
print("Outer join:")
print(data_m_o)
Result:
country beer_servings ... total_litres_of_pure_alcohol continent
0 Afghanistan 0 ... 0.0 AS
1 Albania 89 ... 4.9 EU
2 Algeria 25 ... 0.7 AF
3 Andorra 245 ... 12.4 EU
4 Angola 217 ... 5.9 AF
5 Antigua & Barbuda 102 ... 4.9 NaN
6 Argentina 193 ... 8.3 SA
7 Armenia 21 ... 3.8 EU
8 Australia 261 ... 10.4 OC
9 Austria 279 ... 9.7 EU
10 Azerbaijan 21 ... 1.3 EU
11 Bahamas 122 ... 6.3 NaN
12 Bahrain 42 ... 2.0 AS
13 Bangladesh 0 ... 0.0 AS
14 Bangladesh 143 ... 6.3 NaN
15 Bangladesh 142 ... 14.4 EU
16 Bangladesh 295 ... 10.5 EU
17 Bangladesh 263 ... 6.8 NaN
18 Bangladesh 34 ... 1.1 AF
19 Bhutan 23 ... 0.4 AS
[20 rows x 6 columns]
Split the wine_servings column of data1 into 5 roughly equal-sized bins:
0 (-0.001, 6.6]
1 (45.0, 195.2]
2 (13.6, 45.0]
3 (195.2, 312.0]
4 (13.6, 45.0]
5 (13.6, 45.0]
6 (195.2, 312.0]
7 (6.6, 13.6]
8 (195.2, 312.0]
9 (45.0, 195.2]
10 (-0.001, 6.6]
11 (45.0, 195.2]
12 (6.6, 13.6]
13 (-0.001, 6.6]
14 (13.6, 45.0]
15 (13.6, 45.0]
16 (195.2, 312.0]
17 (6.6, 13.6]
18 (6.6, 13.6]
19 (-0.001, 6.6]
Name: wine_servings, dtype: category
Categories (5, interval[float64, right]): [(-0.001, 6.6] < (6.6, 13.6] < (13.6, 45.0] <
(45.0, 195.2] < (195.2, 312.0]]
Counts per bin:
(13.6, 45.0] 5
(-0.001, 6.6] 4
(6.6, 13.6] 4
(195.2, 312.0] 4
(45.0, 195.2] 3
Name: wine_servings, dtype: int64
Split according to the intervals given by bins:
0 (-1, 20]
1 (50, 100]
2 (-1, 20]
3 (300, 400]
4 (30, 50]
5 (30, 50]
6 (200, 300]
7 (-1, 20]
8 (200, 300]
9 (100, 200]
10 (-1, 20]
11 (50, 100]
12 (-1, 20]
13 (-1, 20]
14 (30, 50]
15 (30, 50]
16 (200, 300]
17 (-1, 20]
18 (-1, 20]
19 (-1, 20]
Name: wine_servings, dtype: category
Categories (7, interval[int64, right]): [(-1, 20] < (20, 30] < (30, 50] < (50, 100] < (100, 200] <
(200, 300] < (300, 400]]
Counts per bin:
(-1, 20] 9
(30, 50] 4
(200, 300] 3
(50, 100] 2
(100, 200] 1
(300, 400] 1
(20, 30] 0
Name: wine_servings, dtype: int64
One-hot encode the country column:
Afghanistan Albania Algeria ... Bahrain Bangladesh Bhutan
0 1 0 0 ... 0 0 0
1 0 1 0 ... 0 0 0
2 0 0 1 ... 0 0 0
3 0 0 0 ... 0 0 0
4 0 0 0 ... 0 0 0
5 0 0 0 ... 0 0 0
6 0 0 0 ... 0 0 0
7 0 0 0 ... 0 0 0
8 0 0 0 ... 0 0 0
9 0 0 0 ... 0 0 0
10 0 0 0 ... 0 0 0
11 0 0 0 ... 0 0 0
12 0 0 0 ... 1 0 0
13 0 0 0 ... 0 1 0
14 0 0 0 ... 0 1 0
15 0 0 0 ... 0 1 0
16 0 0 0 ... 0 1 0
17 0 0 0 ... 0 1 0
18 0 0 0 ... 0 1 0
19 0 0 0 ... 0 0 1
[20 rows x 15 columns]
=================================================================================================
语文 数学 英语
stu_A 20 80 90
stu_B 90 99 87
stu_C 80 80 87
语文 政治 英语
stu_A 20 66 90
x 89 90 90
y 99 100 88
stu_B 90 100 97
Concatenate with axis=0 (rows stacked, columns aligned by name):
语文 数学 英语 政治
stu_A 20 80.0 90 NaN
stu_B 90 99.0 87 NaN
stu_C 80 80.0 87 NaN
stu_A 20 NaN 90 66.0
x 89 NaN 90 90.0
y 99 NaN 88 100.0
stu_B 90 NaN 97 100.0
Concatenate with axis=1 (tables side by side, rows aligned by index):
语文 数学 英语 语文 政治 英语
stu_A 20.0 80.0 90.0 20.0 66.0 90.0
stu_B 90.0 99.0 87.0 90.0 100.0 97.0
stu_C 80.0 80.0 87.0 NaN NaN NaN
x NaN NaN NaN 89.0 90.0 90.0
y NaN NaN NaN 99.0 100.0 88.0
Left join:
语文 数学 英语 政治
0 20 80 90 66.0
1 90 99 87 NaN
2 80 80 87 NaN
Right join:
语文 数学 英语 政治
0 20 80.0 90 66
1 89 NaN 90 90
2 99 NaN 88 100
3 90 NaN 97 100
Inner join:
语文 数学 英语 政治
0 20 80 90 66
Outer join:
语文 数学 英语 政治
0 20 80.0 90 66.0
1 90 99.0 87 NaN
2 80 80.0 87 NaN
3 89 NaN 90 90.0
4 99 NaN 88 100.0
5 90 NaN 97 100.0
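Problems encountered:
1. In general, the higher-level pandas operations, such as missing-value detection, data discretization, one-hot encoding, and merging tables, are called as pd.xxx(). Note, however, that filling NaN values, replacing other missing-value markers, dropping missing values, and counting the values per bin are all called as data.xxx().
2. At first I did not understand how "bins" in pd.cut divides the data.
3. I was unsure what exactly the left, right, inner, and outer join types in pd.merge mean.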
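The pd.xxx() versus data.xxx() split from problem 1 can be sketched as follows. The table and its "?" missing-value marker are hypothetical. The split is a rule of thumb rather than a strict rule, since some operations such as isna exist in both forms.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a NaN and a "?" placeholder for missing data.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "?", "c"]})

# Top-level function: the data is passed in as an argument.
mask = pd.isna(df["x"])
print(mask.tolist())  # [False, True, False]

# Methods: called on the data itself.
filled = df["x"].fillna(0)         # fill NaN with a value
cleaned = df.replace("?", np.nan)  # turn the "?" marker into NaN
dropped = cleaned.dropna()         # drop rows that contain any NaN
print(len(dropped))  # 2

# Some operations are available both ways and give the same result.
print(mask.equals(df["x"].isna()))  # True
```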
Original article: https://blog.csdn.net/weixin_59924168/article/details/142847625