Pandas read_csv中是否有一个函数或聪明的方法可以检查每行的第一个条目,如果满足条件,则将其添加到Dataframe

fdbelqdn  于 2023-01-19  发布在  其他
关注(0)|答案(2)|浏览(85)

□ □我在做什么:
我正在用panda把一个csv文件读入python。一个有效的条目总是以一个县开始。

问题

我遇到的问题是csv文件中有一些问题,导致1.3k行无效。有些行以太多逗号结尾,有些行以无意义开头,有些行有额外的BS,有些行在输入的中途返回,导致下一行无意义等等
∮我的问题∮
正如我之前所说的,每一个有效的行都应该以一个包含在counts中的条目开始。所以我想可能有一些方法来验证每行的第一个条目,并检查该条目是否“在counts中”。如果是,则添加它。如果不是,则忽略它。
有没有可能在.read_csv函数中使用一个参数来解决这个问题?有没有什么我可以用lambda函数来实现的魔法(就像我上面尝试的那样)?也许你可以建议一个更好的方法?

请勿建议:

手动/直接修改CSV文件。是的,我知道我可以这样做,但这不是我所要求的,对于大型数据集或调用请求来说,这是不合理或不可行的。我的目标是学习和解决这个问题
□ □我尝试过的:
我以为这样会有用:

counties = ["Bedfordshire", "Berkshire", "Bristol", ... , "Worcestershire"]

df = pd.read_csv('data/libraries.csv', skiprows=(lambda x: x not in counties))

可惜不是。
我也尝试过使用内置参数来解决这个问题,例如lineterminator ='\n'、skipblanklines=True和其他一些参数来尝试“约束”读取并防止错误,但这不起作用
csv的小样本:

Library service,Library name,In use 2010,In use 2016,Type of library,Type of closed library,Closed,New building,Replace existing,Notes,Weekly hours open,Weekly hours staffed
Barking and Dagenham,Barking,Yes,Yes,LAL,,,,,,72.0,72.0
Barking and Dagenham,Castle Green,Yes,No,,XL,Mar-13,,,Closed as a public library 31.3.2013. Secondary School library Jo Richardson School,,
Barking and Dagenham,Dagenham,No,Yes,LAL,,,Oct-10,Yes,Replaced Fanshawe and Rectory libraries,56.0,56.0
...
Surrey,Stanwell  ,yes,yes,CRL,,,2012,Yes,Rebuilt on site of old library. Dec 2011 - Jan 2012 Closed 4 weeks to move to new library.,,
Opening hours reduced from 30 to 27 from 5 Sep 2016. Now closed at lunch (1pm-2pm),27.0,yes,,,,,,,,,
Surrey,Stoneleigh ,yes,yes,CRL,,,,,CRL from Feb 2013,0.0,0.0
Surrey,Sunbury ,yes,yes,LAL,,,,,Opening hours reduced from 42 to 39.5 from 5 Sep 2016,39.5,39.5
Surrey,Tattenhams ,yes,yes,CRL,,,,,April 2010 closed for 5 weks for building work.,,
CRL from Nov 2012,0.0,no,,,,,,,,,
...
Worcestershire,Wythall Library,yes,no,,XLR,2016,,,,,
Worcestershire,Wythall Library,yes,yes,LAL,,,2016,"yes, replaced Wythall library",,31.5,31.5
York,Acomb,yes,yes,CL,,,,,,58.5,58.5
York,Bishopthorpe,yes,yes,CL,,,,,,21.5,21.5

在中间部分,您可以看到一行以“Opening hours reduced from...”开头,另一行以“CRL from Novo..."开头。我认为这些人为因素会导致大量问题

ql3eal8s

ql3eal8s1#

可以使用一个 Package 器来过滤这些行,然后再将其提供给read_csv

from io import StringIO

def filter_valid(fname, valid):
    def wrapper(fname, valid):
        first = True
        with open(fname) as f:
            for line in f:
                if first or line.split(',', 1)[0] in valid:
                    yield line
                first = False
    return StringIO(''.join(wrapper(fname, valid)))

df = pd.read_csv(filter_valid('libraries.csv', set(counties)), on_bad_lines='skip')

print(df)

输出:

Library service     Library name In use 2010 In use 2016 Type of library Type of closed library  Closed  New building               Replace existing                                                                                       Notes  Weekly hours open  Weekly hours staffed
0          Surrey       Stanwell           yes         yes             CRL                    NaN     NaN        2012.0                            Yes  Rebuilt on site of old library. Dec 2011 - Jan 2012 Closed 4 weeks to move to new library.                NaN                   NaN
1          Surrey      Stoneleigh          yes         yes             CRL                    NaN     NaN           NaN                            NaN                                                                           CRL from Feb 2013                0.0                   0.0
2          Surrey         Sunbury          yes         yes             LAL                    NaN     NaN           NaN                            NaN                                       Opening hours reduced from 42 to 39.5 from 5 Sep 2016               39.5                  39.5
3          Surrey      Tattenhams          yes         yes             CRL                    NaN     NaN           NaN                            NaN                                             April 2010 closed for 5 weks for building work.                NaN                   NaN
4  Worcestershire  Wythall Library         yes          no             NaN                    XLR  2016.0           NaN                            NaN                                                                                         NaN                NaN                   NaN
5  Worcestershire  Wythall Library         yes         yes             LAL                    NaN     NaN        2016.0  yes, replaced Wythall library                                                                                         NaN               31.5                  31.5
abithluo

abithluo2#

感谢@Barmar给我的建议:

# Create an empty list to store the valid rows
valid_rows = []

# Open the CSV file and read the rows
with open("data/libraries.csv", "r") as file:
    reader = csv.reader(file)
    header = True
    for row in reader:
        if header:
            columns = row
            header = False
        elif row[0] in counties:
            valid_rows.append(row)

df = pd.DataFrame(valid_rows, columns=columns)

我意识到我可以做的另一件事是

# Define the custom function to apply to the first column
def validate_county(value):
    if value in counties:
        return value
    else:
        return pd.NaT
    
df = pd.read_csv('data/libraries.csv', converters={0: validate_county})

然而,这只是将NaT条目添加到第一列,并不阻止将整个条目输入到 Dataframe 中,因此绕过了csv导致的错误。不能使用error_bad_lines=False,也不能阻止跳过行。可能有人对后者有建议,我在Barmer建议之前就在使用后者,但前者按预期工作

相关问题