如何读取一个csv文件与Pandas更多的分隔符在行中的头?

lmyy7pcs  于 2023-06-03  发布在  其他
关注(0)|答案(4)|浏览(269)

我想从Pandas中使用read_csv函数读取csv文件,该文件在标题中的行中有更多的分隔符。Pandas认为第一列是多索引的。'NAME'列可以有任意数量的分隔符,受影响的列可以是任何一个(我们不知道哪一个受影响),甚至不止一个。
我尝试调优read_csv的关键字参数,但没有成功。我使用Python 3.7.0和Pandas 0.25.0。但是,Excel可以正确读取文件。

import pandas

with open('test.csv', mode='w') as csv_file:
    csv_file.write('A,NAME,B\n')
    csv_file.write('a, Peter, Parker, b\n')

df = pandas.read_csv('test.csv', header=0, delimiter=',')
print(df)

预期输出:

A            NAME   B
0  a   Peter, Parker   b

实际产量:

A     NAME   B
a   Peter   Parker   b

其他示例:

import pandas

with open('test.csv', mode='w') as csv_file:
    csv_file.write('A,NAME,B,PLACE\n')
    csv_file.write('a, Peter, Parker, b, Queens, New York City\n')

df = pandas.read_csv('test.csv', header=0, delimiter=',')
print(df)

预期输出:

A            NAME   B                 PLACE
0  a   Peter, Parker   b Queens, New York City

实际产量:

A NAME        B           PLACE
a  Peter   Parker    b   Queens   New York City
xzlaal3s

xzlaal3s1#

是不是有点像

df = pandas.read_csv('test.csv', header=0, delimiter=',')
df = df.reset_index()
df["NAME"] = df["A"] + ", " + df["NAME"]
df["A"] = df["Unnamed: 0"]
df = df.drop("Unnamed: 0", axis=1)

可能吗?它不是完全回答这个问题,但可以为您的df做的伎俩。
编辑:另一种可能性,如果文件也是.xls/.xlsx,pd.read_excel("name.xls")应该可以解决您的问题

carvr3hs

carvr3hs2#

解决方法:

with open('test.csv', mode='w') as csv_file:  
    csv_file.write('A,NAME,B\n')  
    csv_file.write('a, Peter, Parker, b\n')  
    csv_file.write('aa, John, Lee, Mary, bb\n')            

df=pd.DataFrame(columns=["A","NAMES","B"])                                                                           

with open("test.csv") as ff:  
   for line in ff:  
       A,N= line.split(",",maxsplit=1)  
       N,B= N.rsplit(",",maxsplit=1)  
       df.loc[len(df.index)]= [A.strip(),N.strip(),B.strip()] 

df.drop(0,axis="index")

    A            NAMES   B
1   a    Peter, Parker   b
2  aa  John, Lee, Mary  bb
rvpgvaaj

rvpgvaaj3#

# Read first line as a list using ',' as delimiter
with open('test.csv', 'r') as f:
  header = f.readline().replace('\n', '').split(',')
# Read file skipping first line (header) using two character delimiter ', '
df = pandas.read_csv('test.csv', skiprows = 1, header = None, delimiter=', ', engine = 'python')
header = header + ["missing-column"] # In your example file header has only 3 columns but data has 4
# Assign header list as dataframe columns names
df.columns = header
print(df)
disbfnqx

disbfnqx4#

最初的想法

我同意一些评论者的观点,即问题在于文件最初保存的格式。以这种方式在文件中保存字符串不是正确的CSV格式。它们应该用双引号括起来。在我看来,解决这个问题的最好方法是首先正确地保存它。
当然,考虑到你的情况,这可能是不可能的,所以让我们看看我们能做些什么。

您的示例

下面是.csv文件表示,我认为您的第一个示例应该是:

A,NAME,B
a,Peter, Parker,b

我也看到了同样的行为,当加载到pandas:

A     NAME  B
a   Peter   Parker  b
  • (此外,Excel也没有为我正确加载它,这很奇怪,但离题了。)*

立足解决方案

我看到两个关键信息可以帮助我们:
1.由标题中的项数提供给我们的预期列数。这告诉我们应该在输出的每一行中的项目总数。
1.我们知道第一个头必须与每行第一部分的数据相匹配。是第一个例子中的唯一索引项,还是第二个例子中的第一个多索引项,等等。
我不明白这两个假设会给我们带来什么。这是因为我们知道我们有多少列,应该有多少列,但我们不知道哪些列应该在哪里折叠。我不认为在一个全面的一般情况下这是可知的。
除非...
这个假设是最不可靠的,但是我们 * 可能 * 能够假设在被错误地移动到下一个列标签的子字符串上会有一个前导空格。这是你提供的两个例子中的情况,它可以保存我们的皮肤。如果这个假设成立,那么你可以这样做:

前导空格可以假设的解决方案

我们要一根一根地重建柱子。我们将第一列放在第一个标题中,然后查看下一个标题是否有前导空格。如果是,我们可以添加到当前的一个,然后检查下一个。这应该适用于任意数量的断开列,只要它们都有一个前导空格。
代码:

# Import IO so we can import the string directly. This makes it much more clear
# the data that we are working with without us having to load a file that is
# unseen in the code
import io

raw_csv_string = """A,NAME,B,PLACE
a,Peter, Parker,b,Queens, New York City"""

# A buffer that pandas can read from as if it was a file
string_buffer = io.StringIO(raw_csv_string)

import pandas as pd

df = pd.read_csv(string_buffer)

def re_collapse_broken_comma_columns(df:pd.DataFrame) -> pd.DataFrame:
    # Store the columns that we should expect to see
    true_headers = df.columns
    
    # Pull the data that got incorrectly placed in the index out of the index
    flattened_index = df.reset_index()
    
    # index helper that will keep track of where we are as we move through the
    # names of the true headers
    true_header_i = 0
    # Helper boolean for adding new columns
    direct_add = True
    
    # Empty dataframe that our cleaned data will go in
    output = pd.DataFrame()
    
    for i, column in enumerate(flattened_index.columns):
        # Store the data in this particular column
        data = flattened_index[column]
    
        # Check to make sure we still have true headers to put data into
        if true_header_i < len(true_headers):
            # Store the name for conveinience
            current_true_column = true_headers[true_header_i]
        else:
            raise RuntimeError(
                "We ran out of true headers. Cannot be fixed with this method."
                + " There is a comma cleaved column in the data that needs to be"
                + " recombined which didn't fit the criteria for combination."
            )
    
        # If direct_add we are safe to place a column directly in a true_header
        # with no priors
        if direct_add:
            output[current_true_column] = data
        # Else we must place in the previous true_header
        else:
            # Insert the column we lost
            output[current_true_column] = output[current_true_column] + "," + data
    
        # Data in the next column if there is one
        next = None
        if i + 1 < len(flattened_index.columns):
            next = flattened_index[flattened_index.columns[i + 1]]
    
        # If there is no next, we are done here
        if next is None:
            break
    
        #
        whitespace_test = next.str.contains("\A\s", na=False)
    
        if whitespace_test.iloc[0] & (len(whitespace_test.unique() == 1)):
            direct_add = False
        else:
            direct_add = True
            true_header_i += 1
    
    # Final validation
    if len(true_headers) != len(output.columns):
        raise RuntimeError(
            "Recombination occurred too often. We have fewer columns in the"
            + " output than were expected."
        )

    return output

output = re_collapse_broken_comma_columns(df)

返回的输出 Dataframe 在您提供的两个示例以及我运行的其他一些测试用例中具有正确的形状。同样,如果假设不能成立,它就会破裂,但我认为这是我们能做的最好的。

相关问题