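I am currently trying to parse Excel files that contain structured information. The data I am interested in sits in a subrange of an Excel sheet. Basically, the sheet contains key-value pairs, where the keys are usually named in a predictable way (found with a regex). The keys are all in the same column, and the values are to the right of the keys in the sheet.
The regex pattern `pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'` predictably matches the keys, so if I can find the column and the rows where the keys are located, I can find the subrange of interest and parse it further.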
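To make the matching behaviour concrete, here is a small sketch that applies the pattern to a few cell values taken from the example sheet shown further down. The list name `cells` and the choice of sample strings are illustrative assumptions; `re.search` is used because, like `Series.str.contains`, it matches anywhere in the string.

```python
import re

# Pattern from the question.
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'

# Illustrative sample of cell values from the example sheet (keys and non-keys).
cells = ['Temperature (C)', 'Strength', 'Temperature (F)', 'Comment', 'Other', 'Jack']

# Keep only the strings in which the pattern occurs somewhere.
matches = [cell for cell in cells if re.search(pattern, cell)]
print(matches)
```

Only the four key strings survive the filter; `'Other'` and `'Jack'` do not match any alternative in the pattern.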
Goals:
1. Get a list of the row indices that match the regex (e.g. `[5, 6, 8, 9]`)
2. Find the column that contains the keys matching the regex (e.g. `Unnamed: 3`)
When I read the Excel file with `df_original = pd.read_excel(filename, sheet_name=sheet)`, the DataFrame looks like this:
import numpy as np
import pandas as pd

df_original = pd.DataFrame({
    'Unnamed: 0': ['Value', 'Name', np.nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', np.nan, np.nan],
    'Unnamed: 1': ['High', 'New York', np.nan, '5000', '5250', '4600', '2500', '4950', np.nan, np.nan],
    'Unnamed: 2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'Unnamed: 3': ['Other', 125, 127, np.nan, np.nan, 'Temperature (C)', 'Strength', np.nan, 'Temperature (F)', 'Comment'],
    'Unnamed: 4': ['Other 2', 25, 14.125, np.nan, np.nan, np.nan, '1500', np.nan, np.nan, np.nan],
    'Unnamed: 5': [np.nan, np.nan, np.nan, np.nan, np.nan, 25, np.nan, np.nan, 77, 'Looks OK'],
    'Unnamed: 6': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Add water'],
})
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------|
| 0 | Value | High | nan | Other | Other 2 | nan | nan |
| 1 | Name | New York | nan | 125 | 25 | nan | nan |
| 2 | nan | nan | nan | 127 | 14.125 | nan | nan |
| 3 | Mark | 5000 | nan | nan | nan | nan | nan |
| 4 | Molly | 5250 | nan | nan | nan | nan | nan |
| 5 | Jack | 4600 | nan | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | nan | Strength | 1500 | nan | nan |
| 7 | Lena | 4950 | nan | nan | nan | nan | nan |
| 8 | nan | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
This code finds the rows of interest and solves goal 1.
# Drop all-NaN columns first; the .str accessor fails on a purely numeric (float) column.
df = df_original.dropna(how='all', axis=1)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
# One boolean column per DataFrame column: True where the cell matches the pattern.
mask = np.column_stack([df[col].str.contains(pattern, regex=True, na=False) for col in df])
row_range = df.loc[mask.any(axis=1)].index.to_list()
print(row_range)
[5, 6, 8, 9]
display(df.loc[row_range])
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+-----------------+--------------+--------------+--------------|
| 5 | Jack | 4600 | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | Strength | 1500 | nan | nan |
| 8 | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
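What is the simplest way to solve goal 2? Basically, I want to find the columns that contain at least one value matching the regex pattern. The desired output would be `['Unnamed: 3']`. There may be some simple way to solve goals 1 and 2 at the same time. For example: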
col_of_interest = 'Unnamed: 3' # <- find this value
col_range = df_original.columns[df_original.columns.to_list().index(col_of_interest): ]
print(col_range)
Index(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'], dtype='object')
target = df_original.loc[row_range, col_range]
display(target)
+----+-----------------+--------------+--------------+--------------+
| | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+-----------------+--------------+--------------+--------------|
| 5 | Temperature (C) | nan | 25 | nan |
| 6 | Strength | 1500 | nan | nan |
| 8 | Temperature (F) | nan | 77 | nan |
| 9 | Comment | nan | Looks OK | Add water |
+----+-----------------+--------------+--------------+--------------+
2 Answers
jobtbby31#
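One option is `xlsx_cells` from pyjanitor. It reads each cell as a single row, which gives you more freedom to manipulate the data. It may be handy for your use case as an alternative:
1. Read in the data.
2. Filter for the entries that match the pattern.
3. Find the values that are on the same row as `vals` and whose column is greater than the column in `vals`.
4. Merge `vals` and `result` to get the final output.

hjzp0vay2#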
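Try one of the following two options:

Option 1 (assuming there is no non-`NaN` data below the "[Tt]emperature (C)" row that we do *not* want to include)
Explanation: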
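A minimal sketch of what the steps of Option 1 describe. It is reconstructed from this explanation rather than quoted from the answer, and it assumes `df_original` is the sample frame from the question. The names `idx` and `col` come from the explanation; `hits` is an illustrative name. `na=False` is passed so that non-string and missing cells count as non-matches, which matters because, depending on the pandas version, `stack` may keep `NaN` cells.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame from the question.
df_original = pd.DataFrame({
    'Unnamed: 0': ['Value', 'Name', nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', nan, nan],
    'Unnamed: 1': ['High', 'New York', nan, '5000', '5250', '4600', '2500', '4950', nan, nan],
    'Unnamed: 2': [nan] * 10,
    'Unnamed: 3': ['Other', 125, 127, nan, nan, 'Temperature (C)', 'Strength', nan,
                   'Temperature (F)', 'Comment'],
    'Unnamed: 4': ['Other 2', 25, 14.125, nan, nan, nan, '1500', nan, nan, nan],
    'Unnamed: 5': [nan, nan, nan, nan, nan, 25, nan, nan, 77, 'Looks OK'],
    'Unnamed: 6': [nan, nan, nan, nan, nan, nan, nan, nan, nan, 'Add water'],
})

pattern = r'[Tt]emperature'

# Stack into one long Series indexed by (row label, column name),
# then flag every cell whose text contains the pattern.
hits = df_original.stack().str.contains(pattern, regex=True, na=False)

# idxmax on a boolean Series returns the label of the first True.
idx, col = hits.idxmax()

# Take everything to the right of and below that cell, then drop empty rows.
res = df_original.loc[idx:, col:].dropna(how='all')
print(idx, col)
print(res)
```

With the sample data, the slice starts at row `5` and column `Unnamed: 3`, and row `7` disappears because it is entirely `NaN` within the selected columns.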
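- First, `df.stack` adds the column names as a level to the index and puts all the data in a single column.
- Next, `Series.str.contains` finds the matches for `r'[Tt]emperature'`. We chain `Series.idxmax` to "[r]eturn the row label of the maximum value". That is the first `True`, so we get back `(5, 'Unnamed: 3')`, stored in `idx` and `col` respectively.
- Now we know where to start the selection from `df`: at index `5` and column `Unnamed: 3`. If we simply want all the data from there on (to the right and downwards), we can use `df_original.loc[idx:, col:]` and, finally, drop all remaining rows that contain only `NaN` values.

Option 2 (there may be data below the "[Tt]emperature (C)" row that we do not want to include)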
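A minimal sketch of Option 2, again reconstructed from the accompanying explanation rather than quoted from the answer, and assuming `df_original` is the sample frame from the question. `tmp` is the name used in the explanation; `rows` and `first_col` are illustrative names.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame from the question.
df_original = pd.DataFrame({
    'Unnamed: 0': ['Value', 'Name', nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', nan, nan],
    'Unnamed: 1': ['High', 'New York', nan, '5000', '5250', '4600', '2500', '4950', nan, nan],
    'Unnamed: 2': [nan] * 10,
    'Unnamed: 3': ['Other', 125, 127, nan, nan, 'Temperature (C)', 'Strength', nan,
                   'Temperature (F)', 'Comment'],
    'Unnamed: 4': ['Other 2', 25, 14.125, nan, nan, nan, '1500', nan, nan, nan],
    'Unnamed: 5': [nan, nan, nan, nan, nan, 25, nan, nan, 77, 'Looks OK'],
    'Unnamed: 6': [nan, nan, nan, nan, nan, nan, nan, nan, nan, 'Add water'],
})

pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'

# Boolean Series over all cells, indexed by (row label, column name).
tmp = df_original.stack().str.contains(pattern, regex=True, na=False)

# Keep only the index entries of the matching cells: a MultiIndex of (row, column) pairs.
tmp = tmp[tmp].index

rows = tmp.get_level_values(0)          # every matching row label
first_col = tmp.get_level_values(1)[0]  # column of the first match

# Select exactly the matching rows, from the key column to the right.
res = df_original.loc[rows, first_col:]
print(res)
```

Because the rows are selected by label instead of by an open-ended slice, unrelated rows below or between the keys are left out without relying on `dropna`.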
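Explanation:
- This is basically the same procedure as in Option 1, except that we retrieve all the matching index values, not just the first one (for "[Tt]emperature (C)"). After `tmp[tmp].index`, `tmp` is a MultiIndex of (row, column) pairs, one per match.
- Next, we use these values as coordinates for `df.loc`. For the row selection we need all the values, so we use `index.get_level_values`; for the column we only need the first value (they should of course all be the same: `Unnamed: 3`).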