Pandas: find the positions of cells matching a regex

dwbf0jvd · published 2022-11-18 · in: Other

I am currently trying to parse Excel files that contain structured information. The data I am interested in lives in a sub-range of an Excel sheet. Essentially, the sheet contains key-value pairs where the keys are usually named in a predictable way (findable with a regex). The keys sit in a single column, and the paired values sit to the right of the keys in the sheet.
The regex pattern pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment' matches the keys predictably, so if I can find the column the keys are in and the rows they occupy, I can locate the sub-range of interest and parse it further.
Goals:
1. Get a list of the row indices matching the regex (e.g. [5, 6, 8, 9])
2. Find the column that contains the keys matching the regex (e.g. Unnamed: 3)

When I read the Excel file with df_original = pd.read_excel(filename, sheet_name=sheet), the DataFrame looks like this:

import numpy as np
import pandas as pd

df_original = pd.DataFrame({'Unnamed: 0': ['Value', 'Name', np.nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', np.nan, np.nan],
                            'Unnamed: 1': ['High', 'New York', np.nan, '5000', '5250', '4600', '2500', '4950', np.nan, np.nan],
                            'Unnamed: 2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                            'Unnamed: 3': ['Other', 125, 127, np.nan, np.nan, 'Temperature (C)', 'Strength', np.nan, 'Temperature (F)', 'Comment'],
                            'Unnamed: 4': ['Other 2', 25, 14.125, np.nan, np.nan, np.nan, '1500', np.nan, np.nan, np.nan],
                            'Unnamed: 5': [np.nan, np.nan, np.nan, np.nan, np.nan, 25, np.nan, np.nan, 77, 'Looks OK'],
                            'Unnamed: 6': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Add water'],
                           })

+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
|    | Unnamed: 0   | Unnamed: 1   |   Unnamed: 2 | Unnamed: 3      | Unnamed: 4   | Unnamed: 5   | Unnamed: 6   |
|----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------|
|  0 | Value        | High         |          nan | Other           | Other 2      | nan          | nan          |
|  1 | Name         | New York     |          nan | 125             | 25           | nan          | nan          |
|  2 | nan          | nan          |          nan | 127             | 14.125       | nan          | nan          |
|  3 | Mark         | 5000         |          nan | nan             | nan          | nan          | nan          |
|  4 | Molly        | 5250         |          nan | nan             | nan          | nan          | nan          |
|  5 | Jack         | 4600         |          nan | Temperature (C) | nan          | 25           | nan          |
|  6 | Tom          | 2500         |          nan | Strength        | 1500         | nan          | nan          |
|  7 | Lena         | 4950         |          nan | nan             | nan          | nan          | nan          |
|  8 | nan          | nan          |          nan | Temperature (F) | nan          | 77           | nan          |
|  9 | nan          | nan          |          nan | Comment         | nan          | Looks OK     | Add water    |
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+

This code finds the rows of interest and solves goal 1:

df = df_original.dropna(how='all', axis=1)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
mask = np.column_stack([df[col].str.contains(pattern, regex=True, na=False) for col in df])
row_range = df.loc[(mask.any(axis=1))].index.to_list()

print(df.loc[(mask.any(axis=1))].index.to_list())

[5, 6, 8, 9]

display(df.loc[row_range])

+----+--------------+--------------+-----------------+--------------+--------------+--------------+
|    | Unnamed: 0   |   Unnamed: 1 | Unnamed: 3      |   Unnamed: 4 | Unnamed: 5   | Unnamed: 6   |
|----+--------------+--------------+-----------------+--------------+--------------+--------------|
|  5 | Jack         |         4600 | Temperature (C) |          nan | 25           | nan          |
|  6 | Tom          |         2500 | Strength        |         1500 | nan          | nan          |
|  8 | nan          |          nan | Temperature (F) |          nan | 77           | nan          |
|  9 | nan          |          nan | Comment         |          nan | Looks OK     | Add water    |
+----+--------------+--------------+-----------------+--------------+--------------+--------------+

What would be the simplest way to solve goal 2? Basically, I want to find the column(s) that contain at least one value matching the regex pattern. The desired output would be ['Unnamed: 3']. There may also be a simple approach that solves goals 1 and 2 at the same time. For example:

col_of_interest = 'Unnamed: 3' # <- find this value
col_range = df_original.columns[df_original.columns.to_list().index(col_of_interest): ]
print(col_range)

Index(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'], dtype='object')

target = df_original.loc[row_range, col_range]
display(target)

+----+-----------------+--------------+--------------+--------------+
|    | Unnamed: 3      |   Unnamed: 4 | Unnamed: 5   | Unnamed: 6   |
|----+-----------------+--------------+--------------+--------------|
|  5 | Temperature (C) |          nan | 25           | nan          |
|  6 | Strength        |         1500 | nan          | nan          |
|  8 | Temperature (F) |          nan | 77           | nan          |
|  9 | Comment         |          nan | Looks OK     | Add water    |
+----+-----------------+--------------+--------------+--------------+
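For goal 2 specifically, a minimal sketch of one possible approach (my own addition, not from the post): since the boolean mask built above is column-aligned with df, reducing it along axis 0 marks the columns that contain at least one match. Demonstrated here on a small toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cleaned df from the question.
df = pd.DataFrame({'A': ['x', np.nan],
                   'B': ['Temperature (C)', 'Comment']})
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'

# Same per-column matching as in the question; NaN cells count as no match.
mask = np.column_stack([df[col].str.contains(pattern, regex=True, na=False)
                        for col in df])

# A column qualifies if any of its cells matched.
cols_of_interest = df.columns[mask.any(axis=0)].to_list()
print(cols_of_interest)  # ['B']
```

The same mask therefore answers both goals: mask.any(axis=1) gives the rows, mask.any(axis=0) the columns.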

jobtbby3 1#

One option is xlsx_cells from pyjanitor; it reads every cell as an individual row, which gives you more freedom to manipulate the data. It may be convenient for your use case as an alternative:

# pip install pyjanitor
import pandas as pd
import janitor as jn

Read in the data:

df = jn.xlsx_cells('test.xlsx', include_blank_cells=False)
df.head()
     value internal_value coordinate  row  column data_type  is_date number_format
0    Value          Value         A2    2       1         s    False       General
1     High           High         B2    2       2         s    False       General
2    Other          Other         D2    2       4         s    False       General
3  Other 2        Other 2         E2    2       5         s    False       General
4     Name           Name         A3    3       1         s    False       General

Filter for the cells that match the pattern:

bools = df.value.str.startswith(('Temperature', 'Strength', 'Comment'), na = False)

vals = df.loc[bools, ['value', 'row', 'column']]

vals
              value  row  column
16  Temperature (C)    7       4
20         Strength    8       4
24  Temperature (F)   10       4
26          Comment   11       4

Find the values that are on the same rows as vals and whose column number is greater than the column in vals:

bools = df.column.gt(vals.column.unique().item()) & df.row.between(vals.row.min(), vals.row.max())

result = df.loc[bools, ['value', 'row', 'column']]
result
        value  row  column
17         25    7       6
21       1500    8       5
25         77   10       6
27   Looks OK   11       6
28  Add water   11       7

Merge vals and result to get the final output:

(vals
.drop(columns='column')
.rename(columns={'value':'val'})
.merge(result.drop(columns='column'))
) 
               val  row      value
0  Temperature (C)    7         25
1         Strength    8       1500
2  Temperature (F)   10         77
3          Comment   11   Looks OK
4          Comment   11  Add water
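If a key-to-values mapping is handier downstream, the merged frame can be collapsed with groupby. A sketch on a toy stand-in for the merged frame above (the merged variable name and data are illustrative, not from the answer):

```python
import pandas as pd

# Toy stand-in for the merged vals/result frame shown above.
merged = pd.DataFrame({'val': ['Temperature (C)', 'Strength', 'Comment', 'Comment'],
                       'row': [7, 8, 11, 11],
                       'value': [25, 1500, 'Looks OK', 'Add water']})

# One list of values per key; keys spanning several cells (e.g. 'Comment')
# keep all of their values.
kv = merged.groupby('val', sort=False)['value'].agg(list).to_dict()
print(kv)
```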

hjzp0vay 2#

Try one of the following two options:

Option 1 (assuming there is no non-NaN data below the "[Tt]emperature (C)" rows that we do *not* want to include):

pattern = r'[Tt]emperature'
idx, col = df_original.stack().str.contains(pattern, regex=True, na=False).idxmax()
res = df_original.loc[idx:, col:].dropna(how='all')

print(res)

        Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5  Temperature (C)        NaN         25        NaN
6         Strength       1500        NaN        NaN
8  Temperature (F)        NaN         77        NaN
9          Comment        NaN   Looks OK  Add water

Explanation

  • First, we use df.stack to add the column names as a level to the index and get all the data into a single column.
  • Now we can apply Series.str.contains to find matches for r'[Tt]emperature'. We chain Series.idxmax to "[r]eturn the row label of the maximum value". That is, this will be the first True, so we get back (5, 'Unnamed: 3'), stored in idx and col respectively.
  • Now we know where to start selecting from df, namely at index 5 and column Unnamed: 3. If we simply want all the data from there onwards (to the right and down), we can use df_original.loc[idx:, col:] and, finally, drop all remaining rows that contain only NaN values.
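The stack/idxmax mechanics can be sketched on a tiny toy frame (my own illustration, not from the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 'Temperature (C)'],
                   'b': ['x', 'y']})

# stack() moves the column labels into a second index level,
# giving one long Series of cell values.
stacked = df.stack()
hits = stacked.str.contains(r'[Tt]emperature', regex=True, na=False)

# idxmax returns the label of the first True: a (row, column) pair.
idx, col = hits.idxmax()
print(idx, col)  # 1 a
```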
Option 2 (there is potential data below the "[Tt]emperature (C)" rows that we do *not* want to include):
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
tmp = df_original.stack().str.contains(pattern, regex=True, na=False)
tmp = tmp[tmp].index

res = df_original.loc[tmp.get_level_values(0), tmp.get_level_values(1)[1]:]
print(res)

        Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5  Temperature (C)        NaN         25        NaN
6         Strength       1500        NaN        NaN
8  Temperature (F)        NaN         77        NaN
9          Comment        NaN   Looks OK  Add water

Explanation

  • Basically, the procedure here is the same as in option 1, except that we want to retrieve all the matching index values, not just the first one (for "[Tt]emperature (C)"). After tmp[tmp].index we end up with:
MultiIndex([(5, 'Unnamed: 3'),
            (6, 'Unnamed: 3'),
            (8, 'Unnamed: 3'),
            (9, 'Unnamed: 3')],
           )
  • In the next step, we use these values as coordinates for df.loc. That is, for the index selection we need all the values, so we use index.get_level_values; for the columns, we only need a single value (they should, of course, all be the same: Unnamed: 3).
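The coordinate extraction can be illustrated by rebuilding that MultiIndex directly (a minimal sketch):

```python
import pandas as pd

# The MultiIndex produced by tmp[tmp].index in the answer above.
tmp = pd.MultiIndex.from_tuples([(5, 'Unnamed: 3'), (6, 'Unnamed: 3'),
                                 (8, 'Unnamed: 3'), (9, 'Unnamed: 3')])

rows = tmp.get_level_values(0)          # all row labels
first_col = tmp.get_level_values(1)[0]  # one column label suffices
print(rows.to_list(), first_col)  # [5, 6, 8, 9] Unnamed: 3
```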
