A Tour of Cool Python Libraries: The Third-Party Library Pandas (020)

49. The pandas.merge_asof function

49-1. Syntax

# 49. The pandas.merge_asof function
pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=('_x', '_y'), tolerance=None, allow_exact_matches=True, direction='backward')
Perform a merge by key distance.

This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.

For each row in the left DataFrame:

A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.

A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

Optionally match on equivalent keys with ‘by’ before searching with ‘on’.

Parameters:
left
DataFrame or named Series
right
DataFrame or named Series
on
label
Field name to join on. Must be found in both DataFrames. The data MUST be ordered. Furthermore this must be a numeric column, such as datetimelike, integer, or float. On or left_on/right_on must be given.

left_on
label
Field name to join on in left DataFrame.

right_on
label
Field name to join on in right DataFrame.

left_index
bool
Use the index of the left DataFrame as the join key.

right_index
bool
Use the index of the right DataFrame as the join key.

by
column name or list of column names
Match on these columns before performing merge operation.

left_by
column name
Field names to match on in the left DataFrame.

right_by
column name
Field names to match on in the right DataFrame.

suffixes
2-length sequence (tuple, list, …)
Suffix to apply to overlapping column names in the left and right side, respectively.

tolerance
int or Timedelta, optional, default None
Select asof tolerance within this range; must be compatible with the merge index.

allow_exact_matches
bool, default True
If True, allow matching with the same ‘on’ value (i.e. less-than-or-equal-to / greater-than-or-equal-to)

If False, don’t match the same ‘on’ value (i.e., strictly less-than / strictly greater-than).

direction
‘backward’ (default), ‘forward’, or ‘nearest’
Whether to search for prior, subsequent, or closest matches.

Returns:
DataFrame

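49-2. Parameters

49-2-1. left (required): the left DataFrame (or named Series).

49-2-2. right (required): the right DataFrame (or named Series).

49-2-3. on (optional, default None): the column to merge on. It must exist in both DataFrames, both must be sorted by it, and it must be numeric or datetime-like. It is required unless left_on/right_on (or left_index/right_index) are given instead.

49-2-4. left_on (optional, default None): the merge column in the left DataFrame.

49-2-5. right_on (optional, default None): the merge column in the right DataFrame.

49-2-6. left_index (optional, default False): boolean; whether to use the index of the left DataFrame as the merge key.

49-2-7. right_index (optional, default False): boolean; whether to use the index of the right DataFrame as the merge key.

49-2-8. by (optional, default None): a column or list of columns that must match exactly before the as-of search on the key is performed. The by columns must exist in both DataFrames; the effect is similar to a partitioned join in SQL.

49-2-9. left_by (optional, default None): the grouping (exact-match) column in the left DataFrame.

49-2-10. right_by (optional, default None): the grouping (exact-match) column in the right DataFrame.

49-2-11. suffixes (optional, default ('_x', '_y')): the suffixes appended to column names that appear in both DataFrames.

49-2-12. tolerance (optional, default None): the maximum allowed distance between the left key and the matched right key. Use an integer for integer keys or a Timedelta for datetime-like keys; left rows with no right key inside this range get NaN.

49-2-13. allow_exact_matches (optional, default True): boolean; whether a right row whose key equals the left key counts as a match. If False, only strictly smaller keys (backward) or strictly larger keys (forward) can match.

49-2-14. direction (optional, default 'backward'): the search direction. 'backward' takes the last right key at or before the left key, 'forward' takes the first right key at or after it, and 'nearest' takes the closest key in either direction.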
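To make the direction and allow_exact_matches parameters concrete, here is a minimal sketch on integer keys. The frame and column names (left, right, key, left_val, right_val) are illustrative choices and do not come from the article.

```python
import pandas as pd

# Both frames are already sorted by "key", as merge_asof requires.
left = pd.DataFrame({"key": [1, 5, 10], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})

# Default direction='backward': last right key <= left key.
backward = pd.merge_asof(left, right, on="key")

# 'forward': first right key >= left key (key 10 has no match, so NaN).
forward = pd.merge_asof(left, right, on="key", direction="forward")

# 'nearest': closest right key in either direction.
nearest = pd.merge_asof(left, right, on="key", direction="nearest")

# Backward search that ignores equal keys: key 1 no longer matches itself.
strict = pd.merge_asof(left, right, on="key", allow_exact_matches=False)

for name, frame in [("backward", backward), ("forward", forward),
                    ("nearest", nearest), ("strict", strict)]:
    print(name)
    print(frame, end="\n\n")
```

Unmatched left rows are kept, as in a left join, with NaN in the columns that come from the right frame.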

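49-3. Purpose

Performs an ordered, approximate-match ("as-of") merge. It is especially well suited to time-series data: when the timestamps in two DataFrames do not match exactly, this function finds the closest matching row for each left row and merges on it. The key does not have to be a timestamp; any sorted numeric key works.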
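Because the match is found by searching ordered keys, both inputs have to be sorted by the key first. The sketch below, with made-up frames and column names (t, v, w), shows what happens with an unsorted left frame and the usual fix with sort_values.

```python
import pandas as pd

left = pd.DataFrame({"t": [3, 1, 2], "v": ["c", "a", "b"]})  # not sorted by "t"
right = pd.DataFrame({"t": [0, 2], "w": [10, 20]})           # sorted by "t"

error = None
try:
    pd.merge_asof(left, right, on="t")
except ValueError as exc:
    # pandas refuses to merge when a key column is not sorted
    error = exc
    print("merge_asof failed:", exc)

# Sorting the left frame by the key makes the merge valid.
result = pd.merge_asof(left.sort_values("t"), right, on="t")
print(result)
```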

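49-4. Return value

Returns a new DataFrame containing the merged result.

49-5. Notes

49-5-1. Features

49-5-1-1. Approximate time matching: pandas.merge_asof() can merge two DataFrames on a timestamp column even when the timestamps do not match exactly. It finds the closest match in the specified direction.

49-5-1-2. Direction control: the caller can choose the search direction: backward, forward, or nearest.

49-5-1-3. Tolerance: a tolerance can be set to limit the maximum distance (for example, the maximum time difference) to a match.

49-5-1-4. Grouped merge: rows can be grouped by the given columns, and the as-of match is then made within each group.

49-5-1-5. Index merge: the index can be used as the merge key, not only a column.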
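The tolerance and grouped-merge features above can be combined. The following is a small sketch with invented trade and quote data (the tickers AAA and BBB, the column names, and the timestamps are all assumptions made for illustration).

```python
import pandas as pd

trades = pd.DataFrame({
    "time": pd.to_datetime(["2024-07-13 09:00:03",
                            "2024-07-13 09:00:05",
                            "2024-07-13 09:00:20"]),
    "ticker": ["AAA", "BBB", "AAA"],
    "qty": [10, 20, 30],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2024-07-13 09:00:01",
                            "2024-07-13 09:00:02",
                            "2024-07-13 09:00:04"]),
    "ticker": ["AAA", "BBB", "AAA"],
    "bid": [1.0, 2.0, 1.5],
})

# For each trade, take the latest quote of the same ticker that is
# at most 5 seconds older than the trade.
merged = pd.merge_asof(
    trades, quotes,
    on="time",
    by="ticker",
    tolerance=pd.Timedelta("5s"),
)
print(merged)
```

The last trade finds an AAA quote that is 16 seconds old, which is outside the tolerance, so its bid is NaN.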

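49-5-2. Return value

49-5-2-1. Merge column: the column used for the merge (for example, the timestamp column).

49-5-2-2. Original columns: all columns from the left and right DataFrames. Columns with the same name get the suffixes given by the suffixes parameter.

49-5-2-3. Matched columns: the values from the matched right-hand rows. Left rows without a match get NaN in these columns.

49-6. Usage

49-6-1. Data preparation

None

49-6-2. Code example

# 49. The pandas.merge_asof function
import pandas as pd
# Create sample DataFrames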
df1 = pd.DataFrame({
    'time': pd.to_datetime(['2024-07-13 01:00', '2024-07-13 02:00', '2024-07-13 03:00']),
    'value1': [10, 20, 30]
})
df2 = pd.DataFrame({
    'time': pd.to_datetime(['2024-07-13 01:30', '2024-07-13 02:30']),
    'value2': [100, 200]
})
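# Merge on approximate time matches with merge_asof
# (suffixes has no effect here, because the two frames share no non-key column names)
result = pd.merge_asof(df1, df2, on='time', direction='nearest', suffixes=('_left', '_right'))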
print(result)

49-6-3. Output

# 49. The pandas.merge_asof function
#                  time  value1  value2
# 0 2024-07-13 01:00:00      10     100
# 1 2024-07-13 02:00:00      20     100
# 2 2024-07-13 03:00:00      30     200

50. The pandas.concat function

50-1. Syntax

# 50. The pandas.concat function
pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)
Concatenate pandas objects along a particular axis.

Allows optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

Parameters:
objs
a sequence or mapping of Series or DataFrame objects
If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.

axis
{0/’index’, 1/’columns’}, default 0
The axis to concatenate along.

join
{‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on other axis (or axes).

ignore_index
bool, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

keys
sequence, default None
If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.

levels
list of sequences, default None
Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.

names
list, default None
Names for the levels in the resulting hierarchical index.

verify_integrity
bool, default False
Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.

sort
bool, default False
Sort non-concatenation axis if it is not already aligned. One exception to this is when the non-concatenation axis is a DatetimeIndex and join=’outer’ and the axis is not already aligned. In that case, the non-concatenation axis is always sorted lexicographically.

copy
bool, default True
If False, do not copy data unnecessarily.

Returns:
object, type of objs
When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned.

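50-2. Parameters

50-2-1. objs (required): a list or dict of the DataFrame or Series objects to concatenate.

50-2-2. axis (optional, default 0): the axis to concatenate along. 0 stacks the objects vertically (along rows); 1 places them side by side (along columns).

50-2-3. join (optional, default 'outer'): how to handle the labels on the other axis. 'outer' keeps the union of the labels; 'inner' keeps only the labels shared by all objects.

50-2-4. ignore_index (optional, default False): if True, the labels along the concatenation axis are discarded and replaced with a new integer index 0, ..., n - 1.

50-2-5. keys (optional, default None): used to build a hierarchical index. If given, the result has a multi-level index with the keys as the outermost level.

50-2-6. levels (optional, default None): the levels to use when building the hierarchical index; must be used together with keys.

50-2-7. names (optional, default None): the names of the levels of the hierarchical index; must be used together with keys.

50-2-8. verify_integrity (optional, default False): if True, checks whether the new concatenated axis contains duplicate labels and raises a ValueError if it does.

50-2-9. sort (optional, default False): if True, sorts the non-concatenation axis when it is not already aligned. No sorting is done by default.

50-2-10. copy (optional, default None): if False, data is not copied unnecessarily.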
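The join, ignore_index, and verify_integrity parameters are easier to see on two frames whose columns only partly overlap. This is a minimal sketch; the frames and column names are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Outer join (default): union of columns, NaN where a frame lacks a column.
outer = pd.concat([df1, df2])

# Inner join: only the column shared by both frames ("B") survives.
inner = pd.concat([df1, df2], join="inner")

# Both frames use row labels 0 and 1; ignore_index builds a fresh 0..3 index.
reset = pd.concat([df1, df2], ignore_index=True)

# verify_integrity rejects the duplicated row labels instead of keeping them.
caught = None
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as exc:
    caught = exc
    print("verify_integrity failed:", exc)

print(outer, inner, reset, sep="\n\n")
```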

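50-3. Purpose

Concatenates multiple DataFrame or Series objects along the specified axis.

50-4. Return value

Returns a new DataFrame or Series, depending on the input objects and the parameter settings.

50-5. Notes

The type and structure of the return value depend mainly on the following factors:

50-5-1. Type of the input objects: the inputs can be DataFrame or Series objects.

50-5-2. Concatenation axis (axis): whether the objects are concatenated along rows (axis=0) or along columns (axis=1).

50-5-3. Join type (join): whether an inner or an outer join is used on the other axis.

50-5-4. Ignoring the index (ignore_index): whether a new integer index is generated.

50-5-5. Hierarchical index (keys, levels, names): if these parameters are given, the result has a multi-level index on the concatenation axis.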
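The factors listed above can be checked directly on two small Series. This is a sketch with invented names (s1, s2, and the level names source and row).

```python
import pandas as pd

s1 = pd.Series([1, 2], name="x")
s2 = pd.Series([3, 4], name="y")

# All inputs are Series and axis=0: the result is a Series.
stacked = pd.concat([s1, s2])

# axis=1: the result is a DataFrame whose columns are the Series names.
side_by_side = pd.concat([s1, s2], axis=1)

# keys adds an outer index level; names labels both levels.
keyed = pd.concat([s1, s2], keys=["first", "second"], names=["source", "row"])

print(type(stacked).__name__)       # Series
print(type(side_by_side).__name__)  # DataFrame
print(keyed)
```

With keys set, a single element can be selected with a tuple such as keyed.loc[("second", 0)].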

50-6. Usage

50-6-1. Data preparation

None

50-6-2. Code example

# 50. The pandas.concat function
import pandas as pd
# Create sample DataFrames
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
}, index=[0, 1, 2])
df2 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
}, index=[3, 4, 5])
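# Vertical concatenation
result = pd.concat([df1, df2], axis=0)
print(result, end='\n\n')
# Horizontal concatenation; with axis=1, ignore_index=True relabels the columns 0..n-1
result = pd.concat([df1, df2], axis=1, ignore_index=True)
print(result, end='\n\n')
# Hierarchical (multi-level) index
result = pd.concat([df1, df2], keys=['df1', 'df2'])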
print(result)

50-6-3. Output

# 50. The pandas.concat function
#     A   B
# 0  A0  B0
# 1  A1  B1
# 2  A2  B2
# 3  A3  B3
# 4  A4  B4
# 5  A5  B5

#      0    1    2    3
# 0   A0   B0  NaN  NaN
# 1   A1   B1  NaN  NaN
# 2   A2   B2  NaN  NaN
# 3  NaN  NaN   A3   B3
# 4  NaN  NaN   A4   B4
# 5  NaN  NaN   A5   B5

#         A   B
# df1 0  A0  B0
#     1  A1  B1
#     2  A2  B2
# df2 3  A3  B3
#     4  A4  B4
#     5  A5  B5

51. The pandas.get_dummies function

51-1. Syntax

# 51. The pandas.get_dummies function
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Convert categorical variable into dummy/indicator variables.

Each variable is converted in as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.

Parameters:
data
array-like, Series, or DataFrame
Data of which to get dummy indicators.

prefix
str, list of str, or dict of str, default None
String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

prefix_sep
str, default ‘_’
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.

dummy_na
bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.

columns
list-like, default None
Column names in the DataFrame to be encoded. If columns is None then all the columns with object, string, or category dtype will be converted.

sparse
bool, default False
Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).

drop_first
bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the first level.

dtype
dtype, default bool
Data type for new columns. Only a single dtype is allowed.

Returns:
DataFrame
Dummy-coded data. If data contains other columns than the dummy-coded one(s), these will be prepended, unaltered, to the result.

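51-2. Parameters

51-2-1. data (required): the input data to convert; an array-like, Series, or DataFrame.

51-2-2. prefix (optional, default None): the prefix string used to name the dummy columns. For a DataFrame, pass a list with one prefix per encoded column, or a dict mapping column names to prefixes.

51-2-3. prefix_sep (optional, default '_'): the separator between the prefix and the category value. For example, with prefix A and category value cat, the resulting column is named A_cat.

51-2-4. dummy_na (optional, default False): if True, adds an indicator column for NaN/missing values, so that missing values are treated as a category of their own. If False, NaN values are ignored.

51-2-5. columns (optional, default None): the columns to encode. If not given, all columns with object, string, or category dtype are converted.

51-2-6. sparse (optional, default False): if True, the dummy columns are backed by a SparseArray, which can save memory on large data sets.

51-2-7. drop_first (optional, default False): if True, drops the first level of each categorical variable, producing k-1 dummy columns from k levels. This avoids perfect multicollinearity and is common in regression models.

51-2-8. dtype (optional, default None): the dtype of the generated dummy columns. When it is left unset, the columns are bool; pandas versions before 2.0 produced uint8 instead.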
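The effect of dummy_na and dtype is easiest to see on a Series that contains a missing value. A minimal sketch, with an invented Series s:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan, "a"])

# Default: the NaN row gets no indicator, so every dummy in that row is off.
default = pd.get_dummies(s)

# dummy_na=True adds one extra column that flags the missing value.
with_na = pd.get_dummies(s, dummy_na=True)

# dtype=int yields 0/1 integers rather than the default indicator dtype.
as_int = pd.get_dummies(s, dtype=int)

print(default, with_na, as_int, sep="\n\n")
```

Passing dtype explicitly is also a simple way to get the same column type across pandas versions whose defaults differ.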

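51-3. Purpose

Converts categorical variables into dummy (indicator) variables. Each column of categorical data is turned into several binary (0/1) columns, which makes the data easier to use in machine-learning models.

51-4. Return value

Returns a DataFrame that contains all non-categorical columns of the original DataFrame, unchanged, together with the dummy columns generated for each categorical variable.

51-5. Notes

None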

51-6. Usage

51-6-1. Data preparation

None

51-6-2. Code example

# 51. The pandas.get_dummies function
# 51-1. Basic usage
import pandas as pd
# The same sample DataFrame is reused by all four examples
df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['c', 'c', 'b'],
    'C': [1, 2, 3]
})
print('Original DataFrame:')
print(df, end='\n\n')
result = pd.get_dummies(df)
print('Basic usage:')
print(result, end='\n\n')

# 51-2. Custom prefixes and prefix separator
result = pd.get_dummies(df, prefix=['colA', 'colB'], prefix_sep='-')
print('Custom prefixes and prefix separator:')
print(result, end='\n\n')

# 51-3. Drop the first dummy column of each variable
result = pd.get_dummies(df, drop_first=True)
print('Drop the first dummy column:')
print(result, end='\n\n')

# 51-4. Generate dummies for specific columns only
result = pd.get_dummies(df, columns=['A'])
print('Dummies for specific columns:')
print(result)

51-6-3. Output

# 51. The pandas.get_dummies function
# 51-1. Basic usage
# Original DataFrame:
#    A  B  C
# 0  a  c  1
# 1  b  c  2
# 2  a  b  3

# Basic usage:
#    C    A_a    A_b    B_b    B_c
# 0  1   True  False  False   True
# 1  2  False   True  False   True
# 2  3   True  False   True  False

# 51-2. Custom prefixes and prefix separator
# Custom prefixes and prefix separator:
#    C  colA-a  colA-b  colB-b  colB-c
# 0  1    True   False   False    True
# 1  2   False    True   False    True
# 2  3    True   False    True   False

# 51-3. Drop the first dummy column of each variable
# Drop the first dummy column:
#    C    A_b    B_c
# 0  1  False   True
# 1  2   True   True
# 2  3  False  False

# 51-4. Generate dummies for specific columns only
# Dummies for specific columns:
#    B  C    A_a    A_b
# 0  c  1   True  False
# 1  c  2  False   True
# 2  b  3   True  False


Original article: https://blog.csdn.net/ygb_1024/article/details/140375808
