Extracting multiple sub-tables within a single Excel sheet to pandas DataFrames

Posted by sczxawaw on 2023-10-22 in Other


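I have an Excel sheet that looks like the one below.

I want to be able to extract each table into a pandas DataFrame in my Python script (e.g. df1 = table_header, df2 = table_header_2). This question has already been addressed here and here. The first answer is hidden behind a paywall. In the second, I believe @Rotem provided a very convincing solution, but when applying it I ran into problems detecting the start and index of the first table. I could probably solve those problems with a bit of help, but there is another idea I would like to explore.
If I know the names of the table headers, can expect them to be present in every table, and know that I can find their indices with openpyxl, can I then perform some kind of edge detection, similar to the one @Rotem used in the second link, to extract all the cells attached to a table header? Is there an easier way than iterating over rows/columns and detecting a change in the number of non-None values? Note that even though I know the names of the headers, I don't necessarily know their indices, because the size of the tables may change. This solution seems to do something very much along these lines, but I don't understand how all the cells of the related, attached table are extracted. I find myself somewhat at a loss here.
Thanks in advance for your suggestions.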

excel

Source: https://stackoverflow.com/questions/77108380/extracting-multiple-sub-tables-within-a-single-excel-sheet-to-pandas-dataframe

2 Answers


cygmwpex1#

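-# This answer applies if the data is NOT contained in an 'Excel Table' #-
I'm adding this as a second answer so that comments on it don't get added to the long thread on the other answer.
If the data is not contained in an Excel Table, it is necessary to find the "top left cell" (TLC) and the "bottom right cell" (BRC) of each section.
In this example, using the same data, the code looks for the "header" names. I use "table header 1" and "table header 2" as the delimiters of the two sections (I changed the name in cell "A1" to "table header 1"). The headers are added to the list section_headers, which contains all the header names used in the worksheet.
1. Given that both data sets in the example have their TLC in column A, I only search that column. If that is not the case in your actual sheet, you may need to include other columns (if only specific columns can hold a TLC) or the whole used range (if a TLC may appear anywhere).
2. The code checks the value of each cell in column A, from A1 down to the last used row. If it finds a value that matches one of the headers in the list section_headers, it tries to find the extent of that section: starting one row below the header, it checks each cell across the columns until it reaches an empty cell (i.e. one whose value is Python None). It then does the same going down the rows.
3. Once it has the last column and row (i.e. the BRC), it uses the same kind of function as in the other answer to convert the range to a DataFrame.
This code determines the last column and last row from the first cell under the header (so for "table header 1" this is cell "A2"). It therefore assumes that the data is uniform across rows and columns and matches the extent measured from that cell.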

from openpyxl import load_workbook
from openpyxl.utils import get_column_interval
from openpyxl.utils.cell import get_column_letter as gcl
from openpyxl.utils.cell import coordinate_from_string as cfs
import pandas as pd

def convert_rng_to_df(tlc, l_col, l_row, sheet):
    first_col = cfs(tlc)[0]
    first_row = cfs(tlc)[1]

    rng = f"{first_col}{first_row+1}:{l_col}{l_row}"

    data_rows = []
    for row in sheet[rng]:
        data_rows.append([cell.value for cell in row])

    return pd.DataFrame(data_rows, columns=get_column_interval(first_col, l_col))

filename = 'foo.xlsx'
wb = load_workbook(filename)
ws = wb['Sheet1']

### Add the names of each section header to this list
section_headers = ['table header 1', 'table header 2']

last_col = ''
last_row = ''
df_dict = {}  # Dictionary to hold the dataframes
for cell in ws['A']:  # Looping Column A only
    if cell.value in section_headers:
        tblname = cell.value  # Header of the Data Set found
        tlc = cell.coordinate  # Top Left Cell of the range
        start_row = cfs(tlc)[1]  # Row number of the header cell
        for x in range(1, ws.max_column+1):  # Find the last used column for the data in this section
            if cell.offset(row=1, column=x).value is None:
                last_col = gcl(x)
                break
        for y in range(1, ws.max_row+1):  # Find the last used row for the data in this section
            if cell.offset(row=y, column=1).value is None:
                last_row = (start_row + y) - 1
                break

        print(f"Range to convert for '{tblname}' is: '{tlc}:{last_col}{last_row}'")
        df_dict[tblname] = convert_rng_to_df(tlc, last_col, last_row, ws)  # Convert to dataframe

print("\n")
### Print the DataFrames
for table_name, df in df_dict.items():
    print(f"DataFrame from '{table_name}'")
    print(df)
    print("----------------------------------\n")

Output of this code

Range to convert for 'table header 1' is: 'A1:B8'
Range to convert for 'table header 2' is: 'A10:C15'

DataFrame from 'table header 1'
        A        B
0  value1  value11
1  value2  value12
2  value3  value13
3  value4  value14
4  value5  value15
5  value6  value16
6  value7  value17
----------------------------------

DataFrame from 'table header 2'
        A        B        C
0  valueA  valueAA  valueBA
1  valueB  valueAB  valueBB
2  valueC  valueAC  valueBC
3  valueD  valueAD  valueBD
4  valueE  valueAF  valueBE
----------------------------------
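
For readers who want to see the boundary scan in isolation, here is a minimal sketch that runs on a plain list-of-lists grid instead of a worksheet, so it needs neither openpyxl nor a file. The helper name find_section_bounds and the sample grid are made up for illustration. Unlike the code above, it scans down the header's own column and treats the edge of the grid as an empty cell. Indices are 0-based here.

```python
from typing import List, Optional, Tuple

Grid = List[List[Optional[str]]]


def find_section_bounds(grid: Grid, header_row: int, header_col: int) -> Tuple[int, int]:
    """Return (last_row, last_col), 0-based and inclusive, for the block under a header.

    Scans right along the row below the header, then down the header's column,
    stopping at the first None or at the edge of the grid.
    """
    def value(r: int, c: int) -> Optional[str]:
        # Cells outside the grid are treated as empty
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            return grid[r][c]
        return None

    first_data_row = header_row + 1
    last_col = header_col
    while value(first_data_row, last_col + 1) is not None:
        last_col += 1

    last_row = header_row
    while value(last_row + 1, header_col) is not None:
        last_row += 1

    return last_row, last_col


# Hypothetical sample data shaped like the sheet in the question
grid = [
    ["table header 1", "Column1", None],
    ["value1", "value11", None],
    ["value2", "value12", None],
    [None, None, None],
    ["table header 2", "Column1", "Column2"],
    ["valueA", "valueAA", "valueBA"],
    ["valueB", "valueAB", "valueBB"],
]
section_headers = ["table header 1", "table header 2"]

bounds = {}
for r, row in enumerate(grid):
    if row[0] in section_headers:  # only the first column is searched
        bounds[row[0]] = find_section_bounds(grid, r, 0)

print(bounds)
```

As in the answer, this assumes each section is a solid rectangle: a blank cell in the first data row or in the header's column would cut the detected range short.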


mm9b1k5b2#

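-# This answer applies if the data IS contained in an 'Excel Table' #-
You can use openpyxl to get the table information (coordinates or range) and a generic function to read that range into a DataFrame.
To make things clearer, I changed the example tables to have unique values and headers.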

from openpyxl import load_workbook
from openpyxl.utils import get_column_interval
import pandas as pd
from openpyxl.utils.cell import coordinate_from_string as cfs

def convert_rng_to_df(tbl_coords, sheet):
    col_start = cfs(tbl_coords.split(':')[0])[0]
    col_end = cfs(tbl_coords.split(':')[1])[0]

    data_rows = []
    for row in sheet[tbl_coords]:
        data_rows.append([cell.value for cell in row])

    return pd.DataFrame(data_rows, columns=get_column_interval(col_start, col_end))

filename = 'foo.xlsx'
wb = load_workbook(filename)
ws = wb['Sheet1']

### Dictionary to hold the dfs for each table
df_dict = {}

### Get the table coordinates from the worksheet table dictionary
for tblname, tblcoord in ws.tables.items():
    print(f'Table Name: {tblname}, Coordinate: {tblcoord}')
    df_dict[tblname] = convert_rng_to_df(tblcoord, ws)  # Convert to dataframe

### Print the DataFrames
for table_name, df in df_dict.items():
    print(f"DataFrame from Table '{table_name}'")
    print(df)
    print("----------------------------------\n")

This gives the following output:

Table Name: Table1, Coordinate: A1:B8
Table Name: Table2, Coordinate: A10:C15
DataFrame from Table 'Table1'
              A        B
0  table header  Column1
1        value1  value11
2        value2  value12
3        value3  value13
4        value4  value14
5        value5  value15
6        value6  value16
7        value7  value17
----------------------------------

DataFrame from Table 'Table2'
                A        B        C
0  table header 2  Column1  Column2
1          valueA  valueAA  valueBA
2          valueB  valueAB  valueBB
3          valueC  valueAC  valueBC
4          valueD  valueAD  valueBD
5          valueE  valueAF  valueBE
----------------------------------
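
In the output above, each table's own header row ends up as row 0 of the data, with the sheet's column letters as column names. If you would rather use that first row as the column names, something like the following might work. This is only a sketch: promote_first_row is a hypothetical helper, not part of the answer's code, and the small DataFrame stands in for one entry of df_dict.

```python
import pandas as pd


def promote_first_row(df: pd.DataFrame) -> pd.DataFrame:
    """Use the first row of df as column names and drop that row from the data."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    return out


# Stand-in for df_dict['Table1'] as produced by convert_rng_to_df
raw = pd.DataFrame(
    [["table header", "Column1"], ["value1", "value11"], ["value2", "value12"]],
    columns=["A", "B"],
)

table1 = promote_first_row(raw)
print(table1)
```

This assumes every table really has a header row as its first row. If a table has none, the first data row would be consumed as column names.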
