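1. How can I perform aggregation with Pandas?
2. No DataFrame after aggregation! What happened?
3. How can I aggregate mainly string columns (to `list`s, `tuple`s, `strings with separator`)?
4. How can I aggregate counts?
5. How can I create a new column filled by aggregated values?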
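I see these recurring questions asking about various facets of the Pandas aggregate functionality. Most of the information regarding aggregation and its various use cases is currently fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next instalment in a series of helpful user guides: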
- How to pivot a dataframe,
- Pandas concat
- How do I operate on a DataFrame with a Series for every column?
- Pandas Merging 101
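Please note that this post is not meant to be a replacement for the documentation about aggregation and about groupby, so please read that as well!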
2 Answers
e3bfsja21#
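Question 1
How can I perform aggregation with Pandas?
Expanded aggregation documentation.
Aggregating functions are the ones that reduce the dimension of the returned objects. This means the output Series/DataFrame has fewer rows than the original, or the same number.
Some common aggregating functions are tabulated below:
Aggregation by filtered columns and Cython-implemented functions: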
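The original snippet is missing at this point, so here is only a minimal sketch of the idea. The sample frame and its column names (`A`, `B`, `C`, `D`) are assumptions made for illustration, not the data from the original answer.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "two", "one"],
    "C": [1, 2, 3, 4, 5],
    "D": [10, 20, 30, 40, 50],
})

# Select ("filter") column C after groupby, then reduce it with sum.
# as_index=False keeps the grouping keys A and B as ordinary columns.
df1 = df.groupby(["A", "B"], as_index=False)["C"].sum()
print(df1)
```

Each distinct (`A`, `B`) pair becomes one row, so five input rows collapse to four.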
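An aggregating function is applied to every column that is not named in the `groupby` call (here the grouping columns are `A` and `B`).
You can also restrict the aggregation to certain columns by passing a list of them after the `groupby` call: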
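A short sketch of both variants follows. The frame is again hypothetical (the extra column `E` is only there to show that it is left out when a column list is given).

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "two", "one"],
    "C": [1, 2, 3, 4, 5],
    "D": [10, 20, 30, 40, 50],
    "E": [7, 8, 9, 10, 11],
})

# No column selection: every non-key column (C, D, E) is summed.
df2 = df.groupby(["A", "B"], as_index=False).sum()

# Column list after groupby: only C and D are summed, E is ignored.
# Note the double brackets; a list is passed, not a tuple.
df3 = df.groupby(["A", "B"], as_index=False)[["C", "D"]].sum()

print(df2)
print(df3)
```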
`DataFrameGroupBy.agg` gives the same results.
To apply several functions to one column, pass a list of `tuple`s, each holding the new column name and the aggregating function:
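As a rough illustration on hypothetical data: the first call shows `agg("sum")` as a drop-in for `.sum()`, the second shows the list-of-tuples form on a single selected column. The output names `average` and `total` are my own choice. The tuple form has worked in the pandas releases I know of, but it is worth checking against your version.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "two", "one"],
    "C": [1, 2, 3, 4, 5],
})

# agg with a function name behaves like calling the method directly.
via_method = df.groupby(["A", "B"], as_index=False)["C"].sum()
via_agg = df.groupby(["A", "B"], as_index=False)["C"].agg("sum")

# Several functions on one column: (new column name, function) pairs.
df4 = (
    df.groupby(["A", "B"])["C"]
      .agg([("average", "mean"), ("total", "sum")])
      .reset_index()
)
print(df4)
```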
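To apply several functions to several columns at once, you can likewise pass a `list` of `tuple`s.
The result then has a `MultiIndex` in its columns.
To turn it into ordinary columns, flatten the `MultiIndex` using `map` with `join`.
Another solution is to pass a list of aggregating functions, flatten the `MultiIndex`, and then adjust the column names with `str.replace`: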
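The sketch below walks through the list-of-function-names variant on hypothetical data: aggregate, inspect the `MultiIndex`, flatten it, then rename. The replacement names `total` and `average` are only examples.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "two", "one"],
    "C": [1, 2, 3, 4, 5],
    "D": [10, 20, 30, 40, 50],
})

# Two functions on two columns give two-level column labels.
df5 = df.groupby(["A", "B"])[["C", "D"]].agg(["mean", "sum"])
print(df5.columns)  # a MultiIndex of (column, function) pairs

# Flatten: join each (column, function) tuple with an underscore.
df5.columns = df5.columns.map("_".join)
flat = df5.reset_index()

# Optional renaming of the function suffixes via the .str accessor.
renamed = flat.copy()
renamed.columns = (
    renamed.columns.str.replace("sum", "total").str.replace("mean", "average")
)
print(renamed)
```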
To specify the aggregating function for each column separately, pass a `dictionary`.
You can also pass a custom function:
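A minimal sketch of both options, on hypothetical data. The helper `first_plus_last` is an invented example of a custom reducer, not something from the original answer.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "two", "one"],
    "C": [1, 2, 3, 4, 5],
    "D": [10, 20, 30, 40, 50],
})

# Dictionary: one aggregating function per column, renamed afterwards.
df6 = (
    df.groupby(["A", "B"], as_index=False)
      .agg({"C": "sum", "D": "mean"})
      .rename(columns={"C": "C_total", "D": "D_average"})
)

def first_plus_last(x):
    """Hypothetical custom reducer: first value of the group plus the last."""
    return x.iat[0] + x.iat[-1]

# A custom function receives each group's values as a Series
# and must return a single value.
df7 = df.groupby(["A", "B"])["C"].agg(first_plus_last).reset_index()

print(df6)
print(df7)
```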
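Question 2
No DataFrame after aggregation! What happened?
When you aggregate by two or more columns, first check the `Index` and `type` of the resulting Pandas object.
There are two solutions for turning a `MultiIndex Series` into columns:
- add the parameter `as_index=False`
- use `Series.reset_index`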
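To make this concrete, here is a small sketch on hypothetical data (column names are assumptions). It first shows the `Series` result and its index, then both fixes.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "two", "one"],
    "C": [1, 2, 3, 4, 5],
})

# Selecting one column and grouping by two keys returns a Series
# whose index is a MultiIndex built from A and B.
s = df.groupby(["A", "B"])["C"].sum()
print(type(s))
print(s.index)

# Fix 1: keep the keys as columns from the start.
out1 = df.groupby(["A", "B"], as_index=False)["C"].sum()

# Fix 2: move the index levels back into columns afterwards.
out2 = s.reset_index()

print(out1)
print(out2)
```

Both `out1` and `out2` are DataFrames with columns `A`, `B`, and `C`.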
If you group by one column, you get a `Series` with an `Index`. The solution is the same as for the `MultiIndex Series`.
Question 3
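How can I aggregate mainly string columns (to `list`s, `tuple`s, `strings with separator`)?
Instead of an aggregating function, you can pass `list`, `tuple`, or `set` to convert the column.
An alternative is to use `GroupBy.apply`: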
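A sketch of these calls on a hypothetical frame (the column names `A` and `B` and the values are assumptions):

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "a"],
    "B": ["x", "y", "z", "x", "z"],
})

# Passing the built-in list / tuple collects each group's values.
as_lists = df.groupby("A")["B"].agg(list).reset_index()
as_tuples = df.groupby("A")["B"].agg(tuple).reset_index()

# GroupBy.apply with list gives the same result as agg(list) here.
via_apply = df.groupby("A")["B"].apply(list).reset_index()

print(as_lists)
print(as_tuples)
```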
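To convert to strings with a separator, use `.join` only if the column is a string column.
If it is a numeric column, use a lambda function with `astype` to convert the values to `string`s.
Another solution is to convert the column to strings before `groupby`: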
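The three cases might look like this on hypothetical data, where `B` is a string column and `D` a numeric one (both names are assumptions):

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "a"],
    "B": ["x", "y", "z", "x", "z"],
    "D": [1, 2, 3, 4, 5],
})

# String column: str.join can be passed directly.
joined = df.groupby("A")["B"].agg(",".join).reset_index()

# Numeric column: cast each group's values to str inside a lambda first.
nums = (
    df.groupby("A")["D"]
      .agg(lambda x: ",".join(x.astype(str)))
      .reset_index()
)

# Alternative: cast the column once, before grouping.
pre_cast = (
    df.assign(D=df["D"].astype(str))
      .groupby("A")["D"]
      .agg(",".join)
      .reset_index()
)

print(joined)
print(nums)
```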
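To convert all columns, do not pass a list of columns after `groupby`. With a plain `.join`, the result has no column `D`, because "nuisance" columns are excluded automatically, which means all numeric columns are dropped. Newer pandas releases raise an error here instead of silently dropping the column. So it is necessary to convert all columns to strings first, and then you get all columns: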
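A minimal sketch of the all-columns version, again on hypothetical data. Casting inside the lambda means string and numeric columns are handled the same way.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "a"],
    "B": ["x", "y", "z", "x", "z"],
    "D": [1, 2, 3, 4, 5],
})

# No column list after groupby: the lambda runs on every non-key column.
# astype(str) makes the join valid for the numeric column D as well.
all_cols = (
    df.groupby("A")
      .agg(lambda x: ",".join(x.astype(str)))
      .reset_index()
)
print(all_cols)
```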
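Question 4
How can I aggregate counts?
The function `GroupBy.size` returns the `size` of each group.
The function `GroupBy.count` excludes missing values. It can be applied to multiple columns to count the non-missing values in each: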
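The difference only shows up when values are missing, so the hypothetical frame below contains a few `NaN`s (the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "a"],
    "C": [1.0, np.nan, 3.0, np.nan, 5.0],
    "D": [1.0, 2.0, np.nan, 4.0, 5.0],
})

# size counts rows per group, missing values included.
sizes = df.groupby("A").size()

# count on one column counts only the non-missing values.
counts_c = df.groupby("A")["C"].count()

# count on the whole groupby gives one non-missing count per column.
counts_all = df.groupby("A").count()

print(sizes)
print(counts_c)
print(counts_all)
```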
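A related function is `Series.value_counts`. It returns an object containing the counts of unique values in descending order, so that the first element is the most frequently occurring one. It excludes `NaN` values by default.
If you want the same output as with `groupby` + `size`, add `Series.sort_index`: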
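A sketch of the comparison on a hypothetical column `A` that contains one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN is dropped by both approaches by default.
df = pd.DataFrame({"A": ["b", "a", "b", "c", "b", "a", np.nan]})

# Counts sorted by frequency, most common value first.
vc = df["A"].value_counts()

# Sorting by index puts the counts in the same order as groupby + size.
vc_sorted = df["A"].value_counts().sort_index()
by_size = df.groupby("A").size()

print(vc)
print(vc_sorted)
print(by_size)
```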
Question 5
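How can I create a new column filled by aggregated values?
The method `GroupBy.transform` returns an object that is indexed the same as (and has the same size as) the one being grouped. See the Pandas documentation for more information.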
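For instance, a per-group sum can be written back as a new column. This is a minimal sketch on hypothetical data; the new column name `C_sum_by_A` is my own.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "C": [1, 2, 3, 4, 5],
})

# transform returns one value per original row (the group's sum),
# aligned on the original index, so it can be assigned directly.
df["C_sum_by_A"] = df.groupby("A")["C"].transform("sum")
print(df)
```

The frame keeps its five rows; every `foo` row gets 9 and every `bar` row gets 6.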
ddrv8njm2#
If you are coming from an R or SQL background, here are three examples that will teach you everything you need to do aggregation the way you are already familiar with:
Let us first create a Pandas dataframe
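The creation snippet itself is not shown here. A minimal sketch that builds a frame with the values from the table in this answer, assuming it is called `df` (the name is an assumption):

```python
import pandas as pd

# Values taken from the table in this answer; the variable name df is assumed.
df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "a"],
    "key2": ["c", "c", "d", "d", "e"],
    "value1": [1, 2, 2, 3, 3],
    "value2": [9, 8, 7, 6, 5],
})
print(df)
```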
Here is what the table we created looks like:
| key1 | key2 | value1 | value2 |
| ------------ | ------------ | ------------ | ------------ |
| a | c | 1 | 9 |
| a | c | 2 | 8 |
| a | d | 2 | 7 |
| b | d | 3 | 6 |
| a | e | 3 | 5 |
1. Aggregating With Row Reduction Similar to SQL `Group By`
1.1 If Pandas version `>=0.25`
Check your Pandas version by running `print(pd.__version__)`. If your Pandas version is 0.25 or above, named aggregation (keyword arguments passed to `agg`) will work. The resulting data table will look like this:
| key1 | key2 | mean_of_value1 | sum_of_value2 | count_of_value1 |
| ------------ | ------------ | ------------ | ------------ | ------------ |
| a | c | 1.5 | 17 | 2 |
| a | d | 2.0 | 7 | 1 |
| a | e | 3.0 | 5 | 1 |
| b | d | 3.0 | 6 | 1 |
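The table above could be produced along these lines. This is a sketch rather than the original code: it assumes the frame is named `df`, uses named aggregation (available from Pandas 0.25), and uses `count` for `count_of_value1`.

```python
import pandas as pd

# Data from the table earlier in this answer; the name df is assumed.
df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "a"],
    "key2": ["c", "c", "d", "d", "e"],
    "value1": [1, 2, 2, 3, 3],
    "value2": [9, 8, 7, 6, 5],
})

# Named aggregation: output_column=(input_column, function).
df_agg = (
    df.groupby(["key1", "key2"])
      .agg(
          mean_of_value1=("value1", "mean"),
          sum_of_value2=("value2", "sum"),
          count_of_value1=("value1", "count"),
      )
      .reset_index()
)
print(df_agg)
```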
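The SQL equivalent of this is a `SELECT` of `key1`, `key2`, `AVG(value1)`, `SUM(value2)`, and `COUNT(value1)` with `GROUP BY key1, key2`.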
1.2 If Pandas version `<0.25`
If your Pandas version is older than 0.25, then running the above code will give you the following error:
TypeError: aggregate() missing 1 required positional argument: 'arg'
Now, to do the aggregation for both `value1` and `value2`, you need different code. The resulting table will look like this:
| key1 | key2 | value1_mean | value1_count | value2_sum |
| ------------ | ------------ | ------------ | ------------ | ------------ |
| a | c | 1.5 | 2 | 17 |
| a | d | 2.0 | 1 | 7 |
| a | e | 3.0 | 1 | 5 |
| b | d | 3.0 | 1 | 6 |
Renaming the columns needs to be done separately using the below code:
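The original snippets are not shown, so this is only a sketch of how the older-style aggregation and the separate renaming step might look. It assumes a frame named `df` with the data from this answer. The dictionary form of `agg` and the list-comprehension flattening are my choice of technique.

```python
import pandas as pd

# Data from the table earlier in this answer; the name df is assumed.
df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "a"],
    "key2": ["c", "c", "d", "d", "e"],
    "value1": [1, 2, 2, 3, 3],
    "value2": [9, 8, 7, 6, 5],
})

# Dictionary form: column -> function or list of functions.
df_agg = df.groupby(["key1", "key2"]).agg(
    {"value1": ["mean", "count"], "value2": "sum"}
)

# The columns are now (column, function) tuples; join them with "_".
df_agg.columns = ["_".join(col) for col in df_agg.columns]
flat = df_agg.reset_index()

# Renaming is a separate step.
renamed = flat.rename(columns={
    "value1_mean": "mean_of_value1",
    "value1_count": "count_of_value1",
    "value2_sum": "sum_of_value2",
})
print(renamed)
```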
2. Create a Column Without Reduction in Rows (`EXCEL - SUMIF, COUNTIF`)
If you want to do a SUMIF, COUNTIF, etc., as you would in Excel, where there is no reduction in rows, then you need to do this instead.
The resulting data frame will look like this with the same number of rows as the original:
| key1 | key2 | value1 | value2 | Total_of_value1_by_key1 |
| ------------ | ------------ | ------------ | ------------ | ------------ |
| a | c | 1 | 9 | 8 |
| a | c | 2 | 8 | 8 |
| a | d | 2 | 7 | 8 |
| b | d | 3 | 6 | 3 |
| a | e | 3 | 5 | 8 |
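A sketch that reproduces the `Total_of_value1_by_key1` column from the table above, assuming the frame is named `df` and using `transform`, which keeps one output value per input row:

```python
import pandas as pd

# Data from the table earlier in this answer; the name df is assumed.
df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "a"],
    "key2": ["c", "c", "d", "d", "e"],
    "value1": [1, 2, 2, 3, 3],
    "value2": [9, 8, 7, 6, 5],
})

# Each row receives the sum of value1 over all rows sharing its key1,
# much like SUMIF in a spreadsheet.
df["Total_of_value1_by_key1"] = df.groupby("key1")["value1"].transform("sum")
print(df)
```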
3. Creating a RANK Column `ROW_NUMBER() OVER (PARTITION BY ORDER BY)`
Finally, there might be cases where you want to create a rank column, which is the SQL equivalent of `ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC)`.
Here is how you do that.
Note: we make the code multi-line by adding `\` at the end of each line.
Here is what the resulting data frame looks like:
| key1 | key2 | value1 | value2 | RN |
| ------------ | ------------ | ------------ | ------------ | ------------ |
| a | c | 1 | 9 | 4 |
| a | c | 2 | 8 | 3 |
| a | d | 2 | 7 | 2 |
| b | d | 3 | 6 | 1 |
| a | e | 3 | 5 | 1 |
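One way to obtain the `RN` column shown above is to sort and then number the rows within each group. This is a sketch assuming the frame is named `df`; `cumcount` is my choice for the row numbering.

```python
import pandas as pd

# Data from the table earlier in this answer; the name df is assumed.
df = pd.DataFrame({
    "key1": ["a", "a", "a", "b", "a"],
    "key2": ["c", "c", "d", "d", "e"],
    "value1": [1, 2, 2, 3, 3],
    "value2": [9, 8, 7, 6, 5],
})

# Sort by value1 DESC, value2 ASC, then number rows within each key1 group.
# cumcount starts at 0, hence the + 1. The result is aligned back to the
# original rows by index when assigned.
df["RN"] = df.sort_values(["value1", "value2"], ascending=[False, True]) \
             .groupby("key1") \
             .cumcount() + 1
print(df)
```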
In all the examples above, the final data table will have a table structure and won't have the pivot structure that you might get in other syntaxes.
Other aggregating operators:
| Function | Description |
| ------------ | ------------ |
| `mean()` | Compute mean of groups |
| `sum()` | Compute sum of group values |
| `size()` | Compute group sizes |
| `count()` | Compute count of group |
| `std()` | Standard deviation of groups |
| `var()` | Compute variance of groups |
| `sem()` | Standard error of the mean of groups |
| `describe()` | Generates descriptive statistics |
| `first()` | Compute first of group values |
| `last()` | Compute last of group values |
| `nth()` | Take nth value, or a subset if n is a list |
| `min()` | Compute min of group values |
| `max()` | Compute max of group values |