csv Pandas,如果没有就分开

mdfafbf1  于 2023-01-28  发布在  其他
关注(0)|答案(2)|浏览(128)

我已经创建了一个程序,这样我就可以将数据集的列拆分为4列,但是我的一些数据集只有2列,所以当到达该部分时,会抛出一个错误,我相信if else语句可以帮助解决这个问题。
下面是我的程序的代码:

import pandas as pd
import os
   
# reading csv file from url
filepath = "C:/Users/username/folder1/folder2/folder3/b 2 col.csv"
file_encoding = 'cp1252' 
data = pd.read_csv(filepath , header=None, names = list(range(0,4)) , encoding=file_encoding)

data.columns =['ID', 'Name',  'S ID', 'SName']

# new data frame with split value columns
new = data["Name"].str.split(",", n = 1, expand = True)

# making separate first name column from new data frame
data["Last Name"]= new[0]

# making separate last name column from new data frame
data["First Name"]= new[1]

# new data frame with split value columns (2)
new = data["SName"].str.split(",", n = 1, expand = True)

# making separate first name column from new data frame
data["S Last Name"]= new[0]

# making separate last name column from new data frame
data["S First Name"]= new[1]

# Saving File name as its path
filename = os.path.basename(filepath) + ".xlsx"
data.to_excel(filename, index=False)

data

本部分负责分割第二组数据

# new data frame with split value columns (2)
new = data["SName"].str.split(",", n = 1, expand = True)

问题是我的CSV不是都有四列,所以如果我可以在这里实现一个if else来检查数据是否存在,然后继续else跳过并移动到下一节:

# Saving File name as its path
filename = os.path.basename(filepath) + ".xlsx"
data.to_excel(filename, index=False)

data

我相信这个程序能用我的数据集
链接到我的数据集示例:https://drive.google.com/drive/folders/1nkLgo5tSFsxOTCa5EMWZlezDFi8AyaDq?usp=sharing
谢谢你帮忙

jdgnovmf

jdgnovmf1#

IIUC,假设(.csv)文件在同一个文件夹中,下面是pandas.concat的一个命题:

import pandas as pd
import os

filepath = "C:/Users/username/folder1/folder2"
file_encoding = "cp1252"

list_df = []

for filename in os.listdir(filepath):
    if filename.endswith(".csv"):
        df = pd.read_csv(os.path.join(filepath, filename),
                         header=None, encoding=file_encoding, on_bad_lines="skip")
        df = (pd.concat([df.iloc[:, i:i+5].pipe(lambda df_: df_.rename(columns={col:i for i, col in enumerate(df_.columns)}))
                        for i in range(0, df.shape[1], 5)], axis=0)
                  .set_axis(["ID", "FullName", "Street No", "Street Add 1", "Street Add 2"], axis=1)
                  .dropna(how="all"))
        df.insert(0, "filename", filename) #comment this line if don't want to show the filename as a column
        
        list_df.append(df)
                
out = (pd.concat(list_df, ignore_index=True)
           .pipe(lambda df_: df_.join(df_["FullName"]
                                        .str.split(", ", expand=True)
                                        .rename(columns={0: "FirstName", 1: "LastName"}))))

输出:

print(out.head())
​
          filename   ID       FullName    Street No  Street Add 1  Street Add 2 FirstName LastName
0  a 4 col upd.csv  NaN           Name          NaN           NaN           NaN      Name     None
1  a 4 col upd.csv  1.0   Bruce, Wayne  Street No 1  Street Add 1  Street Add 2     Bruce    Wayne
2  a 4 col upd.csv  2.0  James, Gordon  Street No 2  Street Add 2  Street Add 3     James   Gordon
3  a 4 col upd.csv  3.0   Fish, Mooney  Street No 3  Street Add 3  Street Add 4      Fish   Mooney
4  a 4 col upd.csv  4.0   Selina, Kyle  Street No 4  Street Add 4  Street Add 5    Selina     Kyle
0vvn1miw

0vvn1miw2#

您可以像这样将2列拆分为4列。

#if there are some missing columns
data['First Name'] = np.nan
data['Last Name'] = np.nan
data['S First Name'] = np.nan
data['S Last Name'] = np.nan
#if there is not missing values remove above

data[['First Name', 'Last Name']] = data.Name.astype(str).str.split(",", expand=True)

data[['S First Name', 'S Last Name']] = data.SName.astype(str).str.split(",", expand=True)

相关问题