Counting book inventory by user

Posted by ymdaylpp on 2021-06-01 in Hadoop

I have a table containing data for 2014. Its structure is as follows; each user can issue a different number of book categories.

User-id|Book-Category
1      |Thrill        
2      |Thrill       
3      |Mystery       
3      |Mystery

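The requirement is to find, for each user, the count of each book category. This data is already present in CSV files, but it is split year-wise (one file per year), so I have to add up the values across all the files. For example: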

data for 2014
u-id|book|count
1   |b1  |2  
1   |b2  |4
...  ...  ...

data for 2015
u-id|book|count
1   |b1  |21
2   |b3  |12  
// Same format as above, available through 2018. (User 1 with book b1 should end up with a count of 23.)

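I wrote a Python script that simply builds a dictionary and iterates over every row: if the key (u-id + book category) already exists, it adds the count to the stored value; otherwise it inserts the key-value pair. The script does this for each year's file. Because some of the files are larger than 1.5 GB, the script ran for 7 to 8 hours and I had to stop it.
Code: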

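    import csv

    # Running totals keyed by (user id, book category).
    # A tuple key avoids collisions that string concatenation can cause
    # (e.g. '1' + '12' and '11' + '2' would both become '112').
    totals = {}

    with open('data_2012.csv', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            key = (row['a'], row['b'])
            # Add to the existing total, or start from 0 for a new key.
            totals[key] = totals.get(key, 0) + int(row['c'])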

## Like this, iterating over the year-wise files and finally writing the data to a different file. 'a' and 'b' are named in the first (header) line of the data files for easy access.
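
The year-by-year loop described here could look roughly like the following. This is a minimal sketch, not the original script: the function names `aggregate_counts` and `write_totals` and the output layout are assumptions made for illustration, and the column names `a`, `b`, and `c` are taken to mean user id, book category, and count.

```python
import csv
from collections import defaultdict


def aggregate_counts(paths, user_col="a", book_col="b", count_col="c"):
    """Sum counts per (user, book) across several year-wise CSV files."""
    totals = defaultdict(int)
    for path in paths:
        with open(path, newline="") as csvfile:
            # DictReader streams one row at a time, so a large file
            # is never loaded into memory as a whole.
            for row in csv.DictReader(csvfile):
                totals[(row[user_col], row[book_col])] += int(row[count_col])
    return dict(totals)


def write_totals(totals, out_path):
    """Write the combined totals to a new CSV file (hypothetical layout)."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["a", "b", "c"])
        for (user, book), count in sorted(totals.items()):
            writer.writerow([user, book, count])
```

Called with one path per year, for instance `aggregate_counts(["data_2014.csv", "data_2015.csv"])` (hypothetical file names), the sample rows for user 1 and book b1 would add up to 2 + 21 = 23.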

Is there a more elegant way to do this in Python, or should I write a MapReduce job instead?
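
To make the MapReduce option concrete, here is a sketch of the same aggregation written as a map step and a reduce step in plain Python. It only imitates the shape of such a job locally (map, sort by key, reduce) and is not a Hadoop job. The names `map_row` and `reduce_sorted`, and the in-memory `sorted` call standing in for the shuffle phase, are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_row(row):
    """Map step: turn one CSV row (a dict) into a ((user, book), count) pair."""
    return (row["a"], row["b"]), int(row["c"])


def reduce_sorted(pairs):
    """Reduce step: sum the counts of consecutive pairs sharing a key.

    `pairs` must already be sorted by key, as a shuffle phase would ensure.
    """
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


# Tiny in-memory example using the sample rows from the question.
rows = [
    {"a": "1", "b": "b1", "c": "2"},
    {"a": "1", "b": "b2", "c": "4"},
    {"a": "1", "b": "b1", "c": "21"},
    {"a": "2", "b": "b3", "c": "12"},
]
# Sorting by key stands in for the shuffle between map and reduce.
mapped = sorted(map(map_row, rows), key=itemgetter(0))
result = dict(reduce_sorted(mapped))
```

Because the total for each key depends only on the rows carrying that key, and addition can be done in any order, partial sums computed per file or per machine can be combined afterwards without changing the result.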

hadoop python pandas csv

Source: https://stackoverflow.com/questions/53526580/counting-book-inventory-by-user
