There is already an answer that handles a relatively simple DataFrame, given here. However, the frame I have has multiple columns and a large number of rows. The DataFrame contains three DataFrames concatenated along axis=0 (the bottom of one joins the top of the next), separated by a single row of NaN values. How can I create the three DataFrames by splitting the data along the NaN rows?
3 Answers

uz75evzq1#
As in the answer you linked, you want to create a column that identifies the group number; then you can apply that same solution. To do so, you need to test whether all of a row's values are NaN. I don't know whether pandas has a built-in row-wise test for that, but it does have a test for whether a Series is entirely NaN. So you apply it across the transpose of the DataFrame, so that each "Series" is actually one of your rows:
df["group_no"] = df.isnull().all(axis=1).cumsum()
At this point you can use the same technique from that answer to split the frame. You may want to call .dropna() at the end, since the NaN separator rows are still present in the result.
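Putting the pieces together, here is a minimal sketch of this approach (the column name group_no and the toy frame are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame: two blocks separated by one all-NaN row
df = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
})

# The counter increases at every all-NaN separator row
df["group_no"] = df.isnull().all(axis=1).cumsum()

# Split on the group label; dropping group_no first lets
# dropna(how="all") remove the separator rows themselves
parts = [
    g.drop(columns="group_no").dropna(how="all")
    for _, g in df.groupby("group_no")
]
```

Each element of `parts` is one NaN-free block, with its original index preserved.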
j8ag8udp2#
Ran into the same problem in 2022. Here's how I split the rows on NaN rows; the caveat is that it relies on run-length encoding from pip install python-rle:
import rle
import numpy as np
import pandas as pd
def nanchucks(df):
# It chucks NaNs outta dataframes
# True if whole row is NaN
df_nans = pd.isnull(df).sum(axis="columns").astype(bool)
values, counts = rle.encode(df_nans)
df_nans = pd.DataFrame({"values": values, "counts": counts})
df_nans["cum_counts"] = df_nans["counts"].cumsum()
df_nans["start_idx"] = df_nans["cum_counts"].shift(1)
df_nans.loc[0, "start_idx"] = 0
df_nans["start_idx"] = df_nans["start_idx"].astype(int) # np.nan makes it a float column
df_nans["end_idx"] = df_nans["cum_counts"] - 1
# Only keep the chunks of data w/o NaNs
df_nans = df_nans[df_nans["values"] == False]
indices = []
for idx, row in df_nans.iterrows():
indices.append((row["start_idx"], row["end_idx"]))
return [df.loc[df.index[i[0]]: df.index[i[1]]] for i in indices]
An example:
sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})

print(nanchucks(sample_df1))
# [     a    b    c
#  0  1.0  1.0  1.0
#  1  2.0  2.0  2.0,
#       a    b    c
#  3  3.0  3.0  3.0
#  4  4.0  4.0  4.0]

print(nanchucks(sample_df2))
# [     a    b    c
#  0  1.0  1.0  1.0
#  1  2.0  2.0  2.0,
#       a    b    c
#  4  4.0  4.0  4.0]
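For reference, the same run-length grouping can also be expressed in plain pandas, without the python-rle dependency. This is a sketch, not part of the original answer:

```python
import numpy as np
import pandas as pd

def split_on_nan_rows(df):
    """Split df into chunks separated by rows containing any NaN.

    Plain-pandas equivalent of the run-length-encoding trick:
    consecutive NaN-free rows share a group id.
    """
    is_nan_row = df.isnull().any(axis=1)
    # A new group starts after every NaN row
    group_id = is_nan_row.cumsum()
    # Keep only NaN-free rows, grouped by the id they fall under
    return [g for _, g in df[~is_nan_row].groupby(group_id[~is_nan_row])]

sample = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
})
chunks = split_on_nan_rows(sample)
```

Like `nanchucks`, this drops every row that contains at least one NaN and returns the NaN-free runs in order.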
dzjeubhm3#
Improving on the other answers to support multiple rows with NaNs:
from IPython.display import display
import numpy as np
import pandas as pd

def split_df_if_row_full_nans(df, reset_header=False):
    # grouping
    df = (df
          .assign(_nan_all_cols=df.isnull().all(axis=1))
          .assign(_group_no=lambda df_: df_._nan_all_cols.cumsum())
          .query('_nan_all_cols == False')   # drop the all-NaN separator rows
          .drop(columns=['_nan_all_cols'])
          .reset_index(drop=True)
          )
    # splitting
    dfs = {df.iloc[rows[0], 0]: (df
                                 .iloc[rows]
                                 .drop(columns=['_group_no'])
                                 )
           for _, rows in df.groupby('_group_no').groups.items()}
    if reset_header:
        # promote each chunk's first row to the header
        for k, v in dfs.items():
            dfs[k] = (v
                      .rename(columns=v.iloc[0])
                      .drop(index=v.index[0])
                      )
            # TODO: this part seems to only work if the length of the df is > 1
            # dfs[k].set_index(dfs[k].columns[0], drop=True, inplace=True)
    return dfs

sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})

for df in split_df_if_row_full_nans(sample_df1, reset_header=True).values():
    display(df)
#    1.0  1.0  1.0
# 1    2    2    2
#    3.0  3.0  3.0
# 3    4    4    4

for df in split_df_if_row_full_nans(sample_df2, reset_header=True).values():
    display(df)
#    1.0  1.0  1.0
# 1    2    2    2
# 2  NaN    3  NaN
# 3    3  NaN    3
# 4    4    4    4
Note: this method uses .isnull().all(axis=1), i.e. it splits only where all values in a row are NaN.
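A quick illustration of the difference between the two row tests, on a toy one-row frame:

```python
import numpy as np
import pandas as pd

row = pd.DataFrame({"a": [np.nan], "b": [1.0]})

# .all(axis=1): True only when every value in the row is NaN
all_nan = row.isnull().all(axis=1)
# .any(axis=1): True as soon as a single value is NaN
any_nan = row.isnull().any(axis=1)
```

So this answer keeps rows that are only partially NaN, whereas an `.any(axis=1)`-based split (like the python-rle answer above) would discard them.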