复杂的Pandas在数据透视表中输出垂直统计数据

rjzwgtxy  于 2023-04-28  发布在  其他
关注(0)|答案(2)|浏览(125)

我有一个大型数据集的运行统计数据多年存储在excel我想处理在pythonPandas和目前的输出与分组的距离范围作为行和年作为列,每个单元格的表将包含计数,平均速度,和最小速度。我已经能够创建输出的排序得到那里,但输出格式不好

import pandas as pd
    pd.set_option('display.max_rows', 1000)
    pd.set_option('display.max_columns', 100)
    # set max width to 100
    pd.set_option('display.width', 100)
    # Replace "file_path.xlsx" with the actual path to your Excel file
    df = pd.read_excel(r"C:\xxxxxx\OneDrive\Documents\garmin running data1.xlsx",
                       sheet_name='jun 15 2018 to apr 19 2023')
    df = df[df['Activity Type'] == 'Running']

# Define the bins for grouping distances and labels for each bin

    bins = [0, .5, 2.4, 3.3, 4.6, 5.4, 6.5, 13, 19.5, 20.2, 30]
    labels = ['false start', '2 miles', '5k', '4 miles', '5 miles', '10K', 'Half', 'Long Train', '20', 'Marathon']

# Extract the year from the date field

    df['Year'] = df['Date'].dt.year

# Create a new column that assigns each distance to a bin with named labels

    df['Distances'] = pd.cut(df['Distance'], bins=bins, labels=labels)
    
    df['Avg Pace'] = pd.to_numeric(df['Avg Pace'], errors='coerce')
    df = df.dropna(subset=['Avg Pace'])

# Create a pivot table that shows the average of average_paces for each distance bin

    pivot_table = pd.pivot_table(df, values='Avg Pace', index=['Distances'], columns=['Year'], aggfunc=('mean', 'count', 'min'))
    #pivot_table['Avg Pace'] = pd.to_datetime(pivot_table['Avg Pace'] / 60 * 86400, unit='s').dt.strftime('%M:%S.%f')

# Divide all the values in the pivot table by 60 and multiply by 86400

    pivot_table = pivot_table / 60 * 86400

# Format the values as mm:ss.0

    pivot_table = pivot_table.applymap(lambda x: pd.to_datetime(x, unit='s').strftime('%M:%S.%S') if not pd.isna(x) else '')

看上面-我想要的是这样的

Year           2018      2019      2020      2021    
Distances                                                                                     
           mean        8:59.59      7:45.45    8:15.22  8:34.36
    5k     min         8:05.15      7:13.45    8:05.22  8:04.36
           count         50     75        85        35    

           mean        8:59.59      7:45.45    8:15.22  8:34.36
    10k    min         8:05.15      7:13.45    8:05.22  8:04.36
           count         50     75        85        35

我得到的是:

count                                                        mean            \
Year             2018      2019      2020      2021      2022      2023      2018      2019   
Distances                                                                                     
false start  00:00.00  00:00.00  00:00.00  00:00.00  00:00.00  00:00.00                       
2 miles      00:00.00  24:00.00  12:00.00  23:59.59  36:00.00  24:00.00  09:14.14  19:00.00   
5k           24:00.00  48:00.00  24:00.00  24:00.00  35:59.59  48:00.00  10:11.11  09:18.18   
4 miles      12:00.00  36:00.00  36:00.00  00:00.00  00:00.00  12:00.00  09:12.12  08:45.45   
5 miles      11:59.59  36:00.00  00:00.00  48:00.00  24:00.00  48:00.00  09:11.11  08:49.49   
10K          47:59.59  24:00.00  48:00.00  24:00.00  12:00.00  36:00.00  09:02.02  08:37.37   
Half         00:00.00  48:00.00  48:00.00  24:00.00  48:00.00  48:00.00  09:13.13  08:34.34   
Long Train   24:00.00  36:00.00  24:00.00  00:00.00  24:00.00  00:00.00  09:08.08  08:21.21   
20           00:00.00  12:00.00  24:00.00  00:00.00  00:00.00  00:00.00            09:00.00   
Marathon     00:00.00  24:00.00  24:00.00  00:00.00  48:00.00  00:00.00            09:34.34   

                                                          min                                \
Year             2020      2021      2022      2023      2018      2019      2020      2021   
Distances                                                                                     
false start                                                                                   
2 miles      13:04.04  14:27.27  09:30.30  08:58.58  08:20.20  08:48.48  09:34.34  08:49.49   
5k           08:56.56  09:03.03  08:41.41  08:26.26  09:34.34  07:27.27  08:16.16  08:13.13   
4 miles      09:00.00  09:06.06  08:36.36  08:30.30  08:28.28  07:35.35  08:05.05  08:18.18   
5 miles      08:48.48  08:59.59  08:36.36  08:27.27  08:26.26  07:43.43  07:45.45  08:17.17   
10K          08:54.54  09:10.10  08:35.35  08:34.34  07:31.31  07:16.16  08:17.17  09:02.02   
Half         08:29.29  08:50.50  08:54.54  08:29.29  09:06.06  07:31.31  08:23.23  08:50.50   
Long Train   07:49.49            08:37.37            09:08.08  07:33.33  07:49.49             
20           08:34.34                                          08:44.44  08:34.34             
Marathon     08:30.30            09:40.40                      09:34.34  08:30.30             

                                 
Year             2022      2023  
Distances                        
false start                      
2 miles      09:19.19  08:58.58  
5k           07:35.35  07:37.37  
4 miles      08:04.04  08:09.09  
5 miles      07:39.39  08:09.09  
10K          07:43.43  08:28.28  
Half         08:33.33  08:23.23  
Long Train   08:03.03            
20                               
Marathon     08:53.53
rta7y2nd

rta7y2nd1#

您可以通过以下方式执行此操作:

import pandas as pd

import pandas as pd

data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
        'Activity Type': ['Running', 'Running', 'Swimming', 'Running', 'Running', 'Running'],
        'Distance': [2.5, 5.7, 0.8, 10.1, 3.4, 21.0],
        'Avg Pace': ['6:30', '7:10', '2:00', '8:20', '7:45', '9:00']}

df = pd.DataFrame(data)

df['Date'] = pd.to_datetime(df['Date'])

df = df[df['Activity Type'] == 'Running']
bins = [0, .5, 2.4, 3.3, 4.6, 5.4, 6.5, 13, 19.5, 20.2, 30]
labels = ['false start', '2 miles', '5k', '4 miles', '5 miles', '10K', 'Half', 'Long Train', '20', 'Marathon']

df['Year'] = df['Date'].dt.year
df['Distances'] = pd.cut(df['Distance'], bins=bins, labels=labels)

df[['Minutes', 'Seconds']] = df['Avg Pace'].str.split(':', n=1, expand=True)
df[['Minutes', 'Seconds']] = df[['Minutes', 'Seconds']].apply(pd.to_numeric, errors='coerce')
df['Avg Pace'] = 1 / (df['Minutes'] + df['Seconds'] / 60) / (df['Distance'] / 1.60934)

df['Avg Pace'] = pd.to_numeric(df['Avg Pace'], errors='coerce')
df = df.dropna(subset=['Avg Pace'])
pivot_table = pd.pivot_table(df, values='Avg Pace', index=['Distances', 'Year'], aggfunc=('mean', 'count', 'min'))
pivot_table = pivot_table / 60 * 86400
pivot_table = pivot_table.applymap(lambda x: pd.to_datetime(x, unit='s').strftime('%M:%S.%S') if not pd.isna(x) else '')
pivot_table = pivot_table.unstack(level=-1)
pivot_table = pivot_table.stack(level=0)

print(pivot_table)

也就是说

Year                   2022
Distances                  
false start count  00:00.00
            mean           
            min            
2 miles     count  00:00.00
            mean           
            min            
5k          count  24:00.00
            mean   02:22.22
            min    02:22.22
4 miles     count  24:00.00
            mean   01:27.27
            min    01:27.27
5 miles     count  00:00.00
            mean           
            min            
10K         count  24:00.00
            mean   00:56.56
            min    00:56.56
Half        count  24:00.00
            mean   00:27.27
            min    00:27.27
Long Train  count  00:00.00
            mean           
            min            
20          count  00:00.00
            mean           
            min            
Marathon    count  24:00.00
            mean   00:12.12
            min    00:12.12
sr4lhrrt

sr4lhrrt2#

我是这样做的:

import pandas as pd

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 100)

df = pd.read_excel(r"C:\xxxx\OneDrive\Documents\garmin running data1.xlsx",
                   sheet_name='jun 15 2018 to apr 19 2023')
df = df[df['Activity Type'] == 'Running']

bins = [0, .5, 2.4, 3.3, 4.6, 5.4, 6.5, 13, 19.5, 20.2, 30]
labels = ['false start', '2 miles', '5k', '4 miles', '5 miles', '10K', 'Half', 'Long Train', '20', 'Marathon']

df['Year'] = df['Date'].dt.year
df['Distances'] = pd.cut(df['Distance'], bins=bins, labels=labels)

df['Avg Pace'] = pd.to_numeric(df['Avg Pace'], errors='coerce')
df = df.dropna(subset=['Avg Pace'])

pivot_table_min = pd.pivot_table(df, values='Avg Pace', index=['Distances'], columns=['Year'], aggfunc='min')
pivot_table_mean = pd.pivot_table(df, values='Avg Pace', index=['Distances'], columns=['Year'], aggfunc='mean')
pivot_table_count = pd.pivot_table(df, values='Avg Pace', index=['Distances'], columns=['Year'], aggfunc='count')

min_formatter = lambda x: f"{pd.to_datetime(x * 60 * 24, unit='s').strftime('%M:%S.%S')}" if not pd.isna(x) else ''
mean_formatter = lambda x: f"{pd.to_datetime(x * 60 * 24, unit='s').strftime('%M:%S.%S')}" if not pd.isna(x) else ''
count_formatter = lambda x: f"{int(x)}" if not pd.isna(x) else ''

pivot_table_min = pivot_table_min.applymap(min_formatter)
pivot_table_mean = pivot_table_mean.applymap(mean_formatter)
pivot_table_count = pivot_table_count.applymap(count_formatter)

# Create a MultiIndex DataFrame to achieve the desired output format
formatted_pivot_table = pd.DataFrame(columns=pivot_table_mean.columns, index=pd.MultiIndex.from_product([pivot_table_mean.index, ['mean', 'min', 'count']], names=['Distances', None]))

# Fill the values for each row and column
for row in pivot_table_mean.index:
    for column in pivot_table_mean.columns:
        formatted_pivot_table.loc[(row, 'mean'), column] = pivot_table_mean.loc[row, column]
        formatted_pivot_table.loc[(row, 'min'), column] = pivot_table_min.loc[row, column]
        formatted_pivot_table.loc[(row, 'count'), column] = pivot_table_count.loc[row, column]

print(formatted_pivot_table)

最后我创建了3个透视表,并通过一些重新格式化将它们合并在一起

相关问题