pandas-数据统计函数,max,min,value_counts,agg,apply,map

张开发
2026/4/11 11:08:46 15 分钟阅读

分享文章

pandas-数据统计函数,max,min,value_counts,agg,apply,map
数据准备test-data.csv中的数据订单ID,商品类别,子类别,销售地区,销售月份,单价,销量,折扣,销售额,利润1001,电子产品,手机,华东,2024-01,5999,2,0.05,11398.1,1199.81002,家居用品,沙发,华北,2024-01,3999,1,0.0,3999.0,799.81003,电子产品,笔记本,华南,2024-01,7999,1,0.1,7199.1,1599.81004,服装,男装,华东,2024-01,599,3,0.2,1437.6,239.61005,家居用品,灯具,华北,2024-02,899,2,0.0,1798.0,359.61006,电子产品,平板,华南,2024-02,3499,2,0.05,6648.1,699.81007,服装,女装,华东,2024-02,799,4,0.1,2876.4,559.21008,家居用品,床垫,华北,2024-02,2999,1,0.0,2999.0,599.81009,电子产品,耳机,华南,2024-03,1299,5,0.15,5517.25,899.151010,服装,童装,华东,2024-03,399,6,0.2,1149.6,239.41011,家居用品,桌椅,华北,2024-03,1599,2,0.0,3198.0,639.61012,电子产品,手表,华南,2024-03,1999,3,0.1,5397.3,1079.71013,服装,男装,华东,2024-04,699,2,0.05,1328.1,209.71014,家居用品,窗帘,华北,2024-04,499,4,0.1,1796.4,359.21015,电子产品,手机,华南,2024-04,5999,1,0.0,5999.0,1199.81016,服装,女装,华东,2024-04,899,3,0.15,2285.85,419.551017,家居用品,地毯,华北,2024-05,799,2,0.0,1598.0,319.61018,电子产品,笔记本,华南,2024-05,8999,1,0.05,8549.05,1799.81019,服装,童装,华东,2024-05,499,5,0.2,1996.0,399.01020,家居用品,灯具,华南,2024-05,999,3,0.1,2697.3,539.11021,电子产品,平板,华东,2024-06,3799,2,0.0,7598.0,759.81022,服装,男装,华北,2024-06,799,3,0.1,2157.3,419.71023,家居用品,沙发,华南,2024-06,4299,1,0.05,4084.05,859.81024,电子产品,耳机,华东,2024-06,1499,4,0.15,5096.6,899.41025,服装,女装,华北,2024-07,999,2,0.0,1998.0,399.6importpandasaspdimportnumpyasnp dfpd.read_csv(./test-data.csv)订单ID商品类别子类别销售地区销售月份单价销量折扣销售额利润01001电子产品手机华东2024-01599920.0511398.101199.8011002家居用品沙发华北2024-01399910.003999.00799.8021003电子产品笔记本华南2024-01799910.107199.101599.8031004服装男装华东2024-0159930.201437.60239.6041005家居用品灯具华北2024-0289920.001798.00359.6051006电子产品平板华南2024-02349920.056648.10699.8061007服装女装华东2024-0279940.102876.40559.2071008家居用品床垫华北2024-02299910.002999.00599.8081009电子产品耳机华南2024-03129950.155517.25899.1591010服装童装华东2024-0339960.201149.60239.40101011家居用品桌椅华北2024-03159920.003198.00639.60111012电子产品手表华南2024-03199930.105397.301079.70121013服装男装华东2024-0469920.051328.10209.70131014家居用品窗帘华北2024-0449940.101796.40359.20141015电子产品手机华南2024-04599910.005999.001199.80151016服装女装华东2024-0489930.152285.85419.55161017家居用品地毯华北2024-0579920.001598.00319.60171018电子产品笔记本华南2024-05899910.058549.051799.80181019服装童装华东2024-0549950.201996.00399.00191020家居用品灯具华南2024-0599930.102697.30539.10201021电子产品平板华东2024-06379920.007598.00759.80211022服装男装华北2024-0679930.102157.30419.70221023家居用品沙发华南2024-06429910.054084.05859.80231024电子产品耳机华东2024-06149940.155096.60899.40241025服装女装华北2024-0799920.001998.00399.60统计函数 sum,max,min,mean,value_counts# 总销售额df[销售额].sum()100801.1# 最大的销售额df[销售额].max()11398.1# 最小的销售额df[销售额].min()1149.6# 平均销售额df[销售额].mean()4032.0440000000003# 统计每种商品出现的次数df[商品类别].value_counts()商品类别 电子产品 9 家居用品 8 服装 8 Name: count, dtype: int64# 商品类别中 出现次数最多的商品,也就是中位数df[商品类别].mode()0 电子产品 Name: 商品类别, dtype: object# 销售额列 的总数和利润列的总数df[[销售额,利润]].sum()销售额 100801.1 利润 17499.3 dtype: float64分组groupby# 按“商品类别”分组对“销量”求和category_salesdf.groupby(商品类别)[销量].sum()category_sales商品类别 家居用品 16 服装 28 电子产品 21 Name: 销量, dtype: int64# 按“商品类别”分组对“销售额”求和df.groupby(商品类别)[销售额].sum()商品类别 家居用品 22169.75 服装 15228.85 电子产品 63402.50 Name: 销售额, dtype: float64# 多个分组字段用列表传入df.groupby([商品类别,销售地区])[利润].sum()商品类别 销售地区 家居用品 华北 3077.60 华南 1398.90 服装 华东 2066.45 华北 819.30 电子产品 华东 2859.00 华南 7278.05 Name: 利润, dtype: float64# numeric_onlyTrue 只处理数值列df.groupby(销售月份).mean(numeric_onlyTrue)订单ID单价销量折扣销售额利润销售月份2024-011002.54649.01.750.08756008.4500959.75002024-021006.52049.02.250.03753580.3750554.60002024-031010.51324.04.000.11253815.5375714.46252024-041014.52024.02.500.07502852.3375547.06252024-051018.52824.02.750.08753710.0875764.37502024-061022.52599.02.500.07504733.9875734.67502024-071025.0999.02.000.00001998.0000399.6000# 也 可以先挑选出列 再进行统计df[[订单ID,单价,销量,折扣,销售额,利润,销售月份]].groupby(销售月份).mean()订单ID单价销量折扣销售额利润销售月份2024-011002.54649.01.750.08756008.4500959.75002024-021006.52049.02.250.03753580.3750554.60002024-031010.51324.04.000.11253815.5375714.46252024-041014.52024.02.500.07502852.3375547.06252024-051018.52824.02.750.08753710.0875764.37502024-061022.52599.02.500.07504733.9875734.67502024-071025.0999.02.000.00001998.0000399.6000高级操作map()用于Series单列对每个元素执行函数映射转换。apply()可用于Series逐元素 或DataFrame逐行 / 列执行更复杂的自定义逻辑。agg()用于分组后的数据对分组结果执行聚合操作如求和、均值等。mapmap()处理单列Series的元素级映射作用对 Series 中的每个元素应用函数或字典映射返回新的 Series长度不变。适用场景值替换、简单转换如分类值编码、单位转换。category_map{电子产品:电子,家居用品:家居,服装:服饰}# 用 map() 替换 df[类别简称] df[商品类别].map(category_map)df[类别简称]df[商品类别].map({电子产品:电子,家居用品:家居,服装:服饰})df[[商品类别,类别简称]].head(3)商品类别类别简称0电子产品电子1家居用品家居2电子产品电子# 定义转换函数defprice_level(price):return高价ifprice5000else低价# 用 map() 应用函数df[价格等级]df[单价].map(price_level)df[[价格等级,单价]].head(3)价格等级单价0高价59991低价39992高价7999applyapply()处理复杂逻辑支持 Series/DataFrame作用对 Series类似 map()但可处理更复杂的逻辑如多条件判断。对 DataFrame按行axis1或列axis0应用函数处理跨列 / 跨行数据。# 定义复杂逻辑函数defprofit_rating(profit):ifprofit1000:return高利润elifprofit500:return中利润else:return低利润# 用 apply() 应用到 Seriesdf[利润评级]df[利润].apply(profit_rating)df[[利润,利润评级]].head(5)利润利润评级01199.8高利润1799.8中利润21599.8高利润3239.6低利润4359.6低利润# 定义跨列计算函数row 代表一行数据通过 row[列名] 访问值defprofit_rate(row):returnrow[利润]/row[销售额]# 利润率 利润 / 销售额# 按行应用axis1 表示行df[利润率]df.apply(profit_rate,axis1)df[[利润,销售额,利润率]].head(3)利润销售额利润率01199.811398.10.1052631799.83999.00.20000021599.87199.10.222222aggagg()分组后的聚合操作作用对 groupby() 分组后的结果应用聚合函数如求和、均值返回聚合后的统计量长度通常缩短。适用场景分组统计如按类别求总销量、平均利润。# 挑选出df.groupby(商品类别)[利润].agg([sum,mean,max])summeanmax商品类别家居用品4476.50559.562500859.8服装2885.75360.718750559.2电子产品10137.051126.3388891799.8# 按 商品类别 分组对 销量 求总和对 利润 求总和与均值。df.groupby(商品类别).agg(总销量(销量,sum),总利润(利润,sum),平均利润(利润,mean))总销量总利润平均利润商品类别家居用品164476.50559.562500服装282885.75360.718750电子产品2110137.051126.338889df.groupby([商品类别,销售地区]).agg(总销售额(销售额,sum),最大销量(销量,max))总销售额最大销量商品类别销售地区家居用品华北15388.404华南6781.353服装华东11073.556华北4155.303电子产品华东24092.704华南39309.805

更多文章