自学内容网 自学内容网

PySpark之金融数据分析(Spark RDD、SQL练习题)

目录

一、数据来源

二、PySpark RDD编程

1、查询特定日期的资金流入和流出情况

2、活跃用户分析

三、PySpark SQL编程

1、按城市统计2014年3月1日的平均余额

2、统计每个城市总流量前3高的用户

四、总结


一、数据来源

本文使用的数据来源于天池大赛数据集,由蚂蚁金服提供,包含用户基本信息、申购赎回记录、收益率、银行间拆借利率等多个维度,本文通过PySpark实现对该数据集的简单分析。

数据来源:天池-资金流入流出预测-挑战Baseline

二、PySpark RDD编程

数据都已上传到HDFS的/data目录下,对于分析结果,保存至/output目录下。

1、查询特定日期的资金流入和流出情况

使用user_balance_table,计算出所有用户在每一天的总资金流入和总资金流出量。

输出格式如下:

<日期> <资金流入量> <资金流出量>

代码如下:

from pyspark.sql.functions import col, mean, round, row_number, sum
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from datetime import datetime

# 创建对象
conf = SparkConf().setAppName("data analysis")
sc = SparkContext(conf=conf)

# 读取CSV文件为RDD
lines_rdd = sc.textFile("/data/user_balance_table.csv")
# 获取表头
header = lines_rdd.first()
data_rdd = lines_rdd.filter(lambda row: row != header)
# 每行转为列表
split_rdd = data_rdd.map(lambda line: line.split(","))

# 提取列索引
extracted_rdd = split_rdd.map(lambda fields: (fields[1], (int(fields[4]), int(fields[8]))))
# 分组聚合,分别计算流入和流出总额
aggregated_rdd = extracted_rdd.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
# 结果转换
result_rdd = aggregated_rdd.map(lambda x: "\t".join([x[0], str(x[1][0]), str(x[1][1])]))

# 插入表头行
final_rdd = sc.parallelize(["日期\t资金流入量\t资金流出量"]) + result_rdd
final_rdd.saveAsTextFile("/output/result1.txt")

print("已完成1、查询特定日期的资金流入和流出情况")

输出结果如下:

日期资金流入量资金流出量
20140805394780870221706539
20140808233903717311648757
20140811331550471418603336
20140814257702660211939431
20140820308378692202452782
20140823141412027199377531
20140826306945089285478563
20140829267554713273756380
20140830199708772196374134
20140827302194801468164147
20140821251763517219963356
20140818298499146259169016
20140815244551620236516007
20140812258493673309754858
20140803173825397127125217
20140728371762756345986909
20140725181641088262874791
20140722243084133369043423
20140719210318023155464283
20140716394890140234775948
20140713179759885199459990
20140710283095921326009240
20140707272182847317612569
20140409383347565289330278
20140403363877120266605457
20140331398884905423852634
20140328225966355405443946
20140831275090213292943033
20140825309574223312413411
20140822246316056179349206
20140816215059736219214339
20140813261506619303975517
20140810259534870189909225
20140804330640884322907524
20140729228093046303480103
20140723265461894308353077
20140717253011280298279385
20140714254797524284753279
20140711208671021240050748
20140708224240103340453063
20140630334054112456547794
20140321282351818259655286
20140324313180334437825259
20140402355347118272612066
20140414309853269415986984
20140417355792647265341592
20140420191259529161057781
20140501193045106143362755
20140504303087562413222034
20140513275241493257918375
20140519259077930293791406
20140522344636549251108485
20140603270887462385622582
20140612332365185236467885
20140615166080126116623756
20140323167456369186443311
20140326272935544450254233
20140329160250985155006056
20140407196936223176966561
20140410386567460286914864
20140416387847838255914640
20140422285248757268810141
20140425220927432227764292
20140506318002728341108696
20140515313367089238307643
20140524160073254154409868
20140527299049555321965845
20140530226547701312802179
20140605380042567355645445
20140614181574530229916191
20140623232227670373779624
20140415428681231285293076
20140427146837951191915377
20140502125336258121222064
20140505370924149309330781
20140508392838756273187499
20140514305474522316225442
20140517170983868145854372
20140526344890868274707572
20140529320549863313827800
20140622147180819361285652
20140625264663201547295931
20140628153826161283400236
20140706199569025195530758
20140709278005555269642881
20140712177644343149081488
20140718208959595208671287
20140724277044480347622431
20140727151406251166610652
20140730209917272250117716
20140802189092130172250225
20140217706118060405809267
2013112016616300286951218
20131111147986439227124951
201311026213230026705506
201310246953083058183921
201310158570430435099198
201310128190113631322983
201310137114998131850475
2013102910205014443052468
20131104300027403130970051
2013111016481347185293644
2013111317455497679121440
2013111611808570528996272
2013111913136541970308235
2013112212852868782717921
2013112514714360270113127
20131203268666856101662045
20131209174652812133355299
2013121412904559652234651
2013120812488690376314292
20131205190825287118256819
2013112410239499185555947
2013112122116693874733962
20131118195766477136916782
20131112148469647131263524
20131106181474910131707820
2013102812999337555089464
201310256993645352225802
2013102214561149696172537
201310195314844326684697
20140216331324628183057224
20131216196489931156157326
2013121316467269090778214
20131210179471612147147341
2013120713888260571916047
201311238910901848882131
20140223337053711195214052
20140313346286910316973542
20140310497338076308040624
20140317339008082440624592
20140314315897431311575572
20140311430500816496039886
20140302276202230246199417
20140225563505889299155352
20140309244752519206312503
20140315287407002242799048
20140318435479117410240726
20140221353981687178914448
20140218593563145271125324
20140212763009770354585919
20140209341911234262954049
20140206227860276111635336
2014020520904399095407704
20140208358255468183432237
20140211818803205243058162
20140220478925191230855410
20140207578818221273482213
201402016428717261527060
20140117345426853127385210
20140116374947083153167426
20140119185082022117629351
20140125440854623185524856
20140127657199484280645861
20131220133867400153678963
20131217193980797116537843
20140112228255344141505812
2013122111200077983151594
2013122215454321783998249
2013122819280800661383220
20140102434956739190155450
20140105206030707156781996
20140113447207050178923772
20140110237797636117259153
20140107589726496137972793
20131224305406334118607911
201308244436726814316598
201310118267475727666473
20131002115627087233874
201309302839805030181353
201309213240634026596179
201309063892318633417903
201308314765530322012016
201308284660231933696861
201308164168353633739844
201308134872091928790328
20130804457452548263965
201307295351207618599364
201307235813614724404051
201310107209461352244531
201308205367550930131225
20130817176705194674983
201308115551954315372680
201308054363220315797507
201307304748124313048582
201307244842251836258592
201307255743341838212836
20130731545696379534040
201308032159578920701797
201308065086618416401387
201308154242122252659327
201308185970309215218300
201308274526802860137633
201308305648884420630752
201309083879690914919122
201309119894445971674082
201309143097544336312660
201310041990768923412350
201309043336670827452600
201309073407618320344424
201309109468414337128363
201309192477804811418512
20131003101011946223263
201310097242229933999376
20130711440751973508800
20130714226153032784107
20130720204390794601143
201307182423450511765356
201307154812855513107943
20130712341839048492573
20130706367512721616635
20130710306965062597169
20130713151647173482829
20130719336801249244769
201307224044889619144267
20140817149978271139564084
20140824130195484191080151
20140809160262764163611708
20140806288821016282346594
20140731191728916277194379
20140704211649838264494550
20140412177642053123295320
20140406129477254139576683
20140325314345006312710515
20140322191700135138039412
20140828245082751297893861
20140819266401973254929877
20140807247646474253659514
20140801374884735252540858
20140726128268053282653341
20140720176449304174462836
20140705169383796272535138
20140702384555819328950951
20140327266231082359071642
20140330205533934264714811
20140405202336542163199682
20140408354770149250015131
20140411237829882277077434
20140423313677307278470936
20140426151625415158122962
20140429330607104307578349
20140507417327518239372999
20140510287240171147248328
20140516231967423282094916
20140525166943526231004758
20140528276134813415891684
20140531146823669142862063
20140606301413900274862067
20140609366114374287520152
20140618335262709421920230
20140621177999186231003775
20140624245450766428471509
20140627264282703399444352
20140320365011495336076380
20140401453320585277429358
20140404251895894200192637
20140413208172985178934722
20140419268729366146374940
20140428324937272327724735
20140503185094488199568043
20140509281479009247743971
20140512325108597293952908
20140518164419642153440019
20140521297799722250223726
20140602158219402170409506
20140608302171269169525332
20140611327661453246127540
20140617270350693502560223
20140620251582530286583065
20140626297628039418742109
20140629158801540261170799
20140701384428753374164541
20140418239300383225952909
20140421301134667295635256
20140424318358891224536754
20140430260091330281835975
20140511182424063152945581
20140520453955303260040720
20140523249546195229000787
20140601183489775149829253
20140604274460744303978838
20140607187801995146203577
20140610354031597298190025
20140613216923770386799040
20140616387308484492349489
20140619338609087284956260
20140703297894643289009780
20140715334810012261722182
20140721378088594434191479
20140213626698794362757986
2013111711178662242358155
2013111416799605852554150
2013110823174947192856307
2013110518090481482689864
2013103010966326063908701
201310276800754346520570
2013102110603029045781403
201310189815145015938355
2013101413031530040652625
201310178720321723274764
201310204776668150884342
2013102310609311273788462
201310265575250649357211
2013110115090494566933119
2013110716796346483306452
2013112813976042561453591
2013120615219715982625571
2013121214036000792090275
2013121520588238754757146
20140215255393179121463693
20131211239916977110314592
20131202345313258128465086
20131130116865492119660311
2013112713477412886129498
2013111512427070458210254
2013110912646702845885771
201311036870787030895754
201310319092670155607833
2013101610094809151261982
2013120421945637269981329
2013120113788999294632233
20131129122536344123982166
2013112625803770265802983
20140226680801145284819682
20140301362865580211279011
20140304524146340250562978
20140319359014064429298917
20140316269387391390584547
20140307380139779291087220
20140308243274169140323202
20140305454295491209072753
20140227672909288492786036
20140224656317045473470156
201402048365364644927333
20140222282022706134735118
20140228428721754322030204
20140303505305862513017360
20140306561787770243149884
20140312377007897416491268
201402038432984846109194
201401318732417548132389
201402025791276124395720
20140214490710434263261899
20140219408367966323990451
20140210647465140431753411
20140126528402520331752519
201309293443384840295202
20140129952479658215164666
20140123481450097209024328
20140120335761027442996084
20140114356907128159778389
20130721211423942681331
20140122498730606292856188
20140128843424742248334316
20140130227916211105288105
20140124614103894345829001
20140121412854611302825136
20140118233200688142869842
20140115391408718177618247
2013122916147334755129989
20131226220287409241852564
20140103342074805127714255
20131223198747367138478245
20140106442494042190917629
20140109280698487140391237
20131218180215804202220872
20131219233631357164023344
20131225380076148172755480
20131231265851934134232530
20140108264025160213880074
2014011124340343871530182
2014010418608591099869074
2014010133092656577367755
20131230286358778179826624
20131227211118967163522548
201309254727416874838162
2013100810980691277856096
201310052636468613966453
201309274216612127639603
201309247425085956680792
201309184292660847203221
201309153705968325020715
201309126857368445147220
201309094931247345621186
201309038050788039328144
201308254330628818117808
201308228913073725000672
2013081910918420928339260
201308102061530726614408
201308074390808129708706
201308015336996218864468
201307264472181739192369
201308294751866635944968
201309015623980259339949
201308267449654120637986
201308233787857525680323
201308143054116715031683
201308084449349029551691
201308023806453640150769
201307271719445115058893
20130728362553827683211
201308093342518630131015
201308129452050228586669
201308213818444629983342
2013090214084473917785524
201309053871650521800904
201309177620481558260798
201309202836576215188254
201309235465816058285879
201309266871869348467529
201310011913749912813267
201310074279773328894400
201309137165594660512675
2013091616165621045184589
201309226751416947962055
201309283999579858105720
201310063873065313464528
201307172901568210911513
20130705116487492763587
20130702290373902554548
20130708572582668347729
20130709267989413473059
20130703272707705953867
20130701324883485525022
20130704183211856410729
2013070789622323982735
201307165062284711864981

2、活跃用户分析

使用 user_balance_table ,定义活跃用户为在指定月份内有至少5天记录的用户,统计2014年8月的活跃用户总数。

输出格式如下:

<活跃用户总数>

代码如下:

# 提取user_id、report_date字段
mapped_rdd = split_rdd.map(lambda fields: (fields[0], datetime.strptime(fields[1], "%Y%m%d")))
# 筛选出2014年8月的记录
filtered_rdd = mapped_rdd.filter(lambda x: x[1].year == 2014 and x[1].month == 8)
# 聚合
grouped_rdd = filtered_rdd.groupByKey()
# 统计记录数量,过滤活跃用户
active_users_rdd = grouped_rdd.filter(lambda x: len(list(x[1])) >= 5)

# 统计活跃用户总数
total_active_users = active_users_rdd.map(lambda x: x[0]).distinct().count()

print("已完成2、活跃用户分析\n2014年8月期间的活跃用户总数为:", total_active_users)

三、PySpark SQL编程

1、按城市统计2014年3月1日的平均余额

计算每个城市在2014年3月1日的用户平均余额(tBalance),按平均余额降序排列。

输出格式如下:

<城市ID> <平均余额>

代码如下:

spark = SparkSession.builder.getOrCreate()
# 从HDFS读取CSV文件
df_bal = spark.read.csv("/data/user_balance_table.csv", header=True, inferSchema=True)
df_city = spark.read.csv("/data/user_profile_table.csv", header=True, inferSchema=True)

df = df_bal.select("user_id", "report_date", "tBalance")\
           .filter(col("report_date") == "20140301")\
           .join(df_city.select("user_id","city"), on="user_id")

result_df = df.groupBy("city").agg(mean(col("tBalance")).cast("int").alias("tBalance_mean"))\
           .sort(col("tBalance_mean").desc())

# 将结果转换为RDD,并进行格式调整(添加表头,并以tab分隔每列)
data_rdd = result_df.rdd.map(lambda row: "\t".join(map(str, row)))
result2 = sc.parallelize(["城市ID\t平均余额"] + data_rdd.collect())
result2.saveAsTextFile("/output/result2.txt")
print("已完成1、按城市统计2014年3月1日的平均余额")

输出结果如下:

城市ID平均余额
62819492795923
63019492650775
60819492643912
64819492087617
64119491929838
64121491896363
65819491526555

2、统计每个城市总流量前3高的用户

统计每个城市中每个用户在2014年8月的总流量(定义为total purchase_amt+total_redeem_amt),并输出每个城市总流量排名前三的用户ID及其总流量。

输出格式如下:

<城市ID> <用户ID> <总流量>

代码如下:

windowSpec = Window.partitionBy("city").orderBy(col("total_amt").desc())
df = df_bal.select("user_id", "report_date", "total_purchase_amt", "total_redeem_amt")\
           .filter(df_bal.report_date.like("201408%"))\
           .join(df_city.select("user_id","city"), on="user_id")

result_df = df.groupBy("city", "user_id") \
            .agg((sum(col("total_purchase_amt")) + sum(col("total_redeem_amt"))).alias("total_amt"))\
            .filter(col("total_amt") > 0)\
            .withColumn("rn", row_number().over(windowSpec))\
            .filter(col("rn") <= 3)\
            .sort(col("city"), col("rn"))\
            .select("city", "user_id", "total_amt")

# 输出结果
data_rdd = result_df.rdd.map(lambda row: "\t".join(map(str, row)))
result3 = sc.parallelize(["城市ID\t用户ID\t总流量"] + data_rdd.collect())
result3.saveAsTextFile("/output/result3.txt")
print("已完成2、统计每个城市总流量前3高的用户")

sc.stop()
spark.stop()

输出结果如下:

城市ID用户ID总流量
608194927235108475680
60819492774676065458
60819491894555304049
628194915118149311909
628194911397124293438
628194925814104428054
63019492429109171121
63019492682595374030
63019491093274016744
641194966275162566
64119492103049933641
64119491676949383506
641214922585200516731
641214914472138262790
64121492514770594902
64819491202651161825
648194967049626204
64819491487734488733
6581949949438854436
65819492687623449539
65819492176121136440

四、总结

RDD编程主要练习了filter、map、reduceByKey、saveAsTextFile、groupByKey等算子的使用,Spark SQL编程主要练习了DataFrame操作、聚合函数、窗口函数等内容。


原文地址:https://blog.csdn.net/weixin_44458771/article/details/145260559

免责声明:本站文章内容转载自网络资源,如侵犯了原著者的合法权益,可联系本站删除。更多内容请关注自学内容网(zxcms.com)!