Progress (in bytes) when reading a CSV from a URL with pandas

8zzbczxx posted on 2023-09-27 in Other

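Some of the CSV files I need to read are very large (multiple GB), so I am trying to implement a progress bar that shows how many bytes out of the total have been read while pandas reads a CSV file from a URL.
I am trying to implement something like this: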

    import contextlib
    import urllib.request

    import pandas as pd
    import requests
    from tqdm import tqdm

    url = "https://public.tableau.com/views/PPBOpenDataDownloads/UseOfForce-All.csv?:showVizHome=no"

    # Request the file once to learn its size from the response headers
    response = requests.get(url, params=None, stream=True)
    response.raise_for_status()
    total_size = int(response.headers.get('Content-Length', 0))

    block_size = 1000  # passed to chunksize below, so this counts rows, not bytes
    df = []
    last_position = 0
    cur_position = 1

    with tqdm(desc=url, total=total_size,
              unit='iB',
              unit_scale=True,
              unit_divisor=1024
              ) as bar:
        with contextlib.closing(urllib.request.urlopen(url=url)) as rd:
            # Create TextFileReader
            reader = pd.read_csv(rd, chunksize=block_size)
            for chunk in reader:
                df.append(chunk)
                # Here I would like to calculate the current file position: cur_position
                bar.update(cur_position - last_position)
                last_position = cur_position

Is there a way to get the file position from a pandas TextFileReader? Perhaps a TextFileReader equivalent of ftell in C++?

rlcwz9us #1

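This is not thoroughly tested, but you can implement a custom class with a read() method that reads the requests response line by line and updates the tqdm bar: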

    import pandas as pd
    import requests
    from tqdm import tqdm

    url = "https://public.tableau.com/views/PPBOpenDataDownloads/UseOfForce-All.csv?:showVizHome=no"


    class TqdmReader:
        """Minimal file-like object that feeds pandas and updates a tqdm bar."""

        def __init__(self, resp):
            total_size = int(resp.headers.get("Content-Length", 0))
            self.resp = resp
            self.bar = tqdm(
                desc=resp.url,
                total=total_size,
                unit="iB",
                unit_scale=True,
                unit_divisor=1024,
            )
            self.reader = self.read_from_stream()

        def read_from_stream(self):
            # iter_lines() strips line endings, so add one back before counting
            for line in self.resp.iter_lines():
                line += b"\n"
                self.bar.update(len(line))
                yield line

        def read(self, n=0):
            # pandas asks for up to n bytes; this returns one line per call instead
            try:
                return next(self.reader)
            except StopIteration:
                return b""  # empty bytes signal end of file


    with requests.get(url, params=None, stream=True) as resp:
        df = pd.read_csv(TqdmReader(resp))

    print(len(df))

Output:

    https://public.tableau.com/views/PPBOpenDataDownloads/UseOfForce-All.csv?:showVizHome=no: 100%|██████████████████████████████████████████████████████████████████████████████| 2.09M/2.09M [00:00<00:00, 2.64MiB/s]
    7975
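
The reader above counts reassembled lines rather than the raw bytes that pandas pulls from the stream. As a variation, here is a minimal sketch that counts bytes inside a raw stream wrapper, which is roughly what an ftell() equivalent would report. `CountingRaw` and `on_bytes` are hypothetical names, and the `io.BytesIO` payload is only a stand-in for a streamed HTTP body such as `resp.raw`; in real use the callback would typically be a tqdm bar's `update` method.

```python
import io
from typing import BinaryIO, Callable

import pandas as pd


class CountingRaw(io.RawIOBase):
    """Hypothetical wrapper: forwards reads and reports the byte count of each one."""

    def __init__(self, raw: BinaryIO, on_bytes: Callable[[int], None]) -> None:
        super().__init__()
        self.raw = raw
        self.on_bytes = on_bytes  # e.g. bar.update for a tqdm bar
        self.position = 0  # total bytes handed to the parser so far

    def readable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        # Fill the caller's buffer with at most len(b) bytes from the source
        data = self.raw.read(len(b))
        n = len(data)
        b[:n] = data
        self.position += n
        self.on_bytes(n)
        return n  # 0 signals end of stream


# Stand-in for a streamed HTTP body (assumption: real code would wrap resp.raw)
payload = b"a,b\n" + b"1,2\n" * 1000
seen = []  # collects the size of every read, in place of a progress bar
reader = CountingRaw(io.BytesIO(payload), seen.append)
df = pd.read_csv(reader)
print(len(df), reader.position)
```

Subclassing `io.RawIOBase` lets pandas treat the wrapper like an ordinary binary file. The counts only line up with Content-Length when the body is sent uncompressed, since that header describes the bytes on the wire.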

xuo3flqw #2

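Here is another example that runs the pandas chunked CSV reader and shows some progress information when the total length or number of records is unknown.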

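  • It is not always possible to know in advance the size or total number of rows of a CSV or other pandas reader format
  • This example has a simple chunk-filtering loop that extracts some rows from a larger dataset to create a smaller dataset that fits in RAM
  • The dataset in the example is the StackOverflow dump
  • We use tqdm with Jupyter notebook support to display an HTML progress bar, which looks cleaner than a text-mode progress bar in a notebook
  • Because we do not know the real end of the file while operating on chunks, we do not know how long the whole operation will take. Passing the tqdm(total=) argument changes this and gives you an automatic estimate, but the total must be obtained outside the pandas reader
  • Whether or not we know the total, we can always display status information such as the elapsed time and how many rows have been processed so far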

    import pandas as pd
    from pandas.io.parsers.readers import TextFileReader
    from tqdm.auto import tqdm

    # Hypothetical tag filter: the original answer uses tags_regex without showing its value
    tags_regex = r"python|pandas"

    chunk_size = 2**16  # 64k rows at a time

    result_df: pd.DataFrame = None
    matched_chunks: list[pd.DataFrame] = []
    match_count = row_count = 0

    with tqdm() as progress_bar:
        reader: TextFileReader
        with pd.read_csv("csv/Posts.csv", chunksize=chunk_size) as reader:
            chunk: pd.DataFrame
            for chunk in reader:
                # Make Tags column regex friendly
                chunk["Tags"] = chunk["Tags"].fillna("")

                # Find posts in this chunk that match our tag filter
                matched_chunk = chunk.loc[chunk["Tags"].str.contains(tags_regex, case=False, regex=True)]
                matched_chunks.append(matched_chunk)
                match_count += len(matched_chunk)
                row_count += len(chunk)

                last = chunk.iloc[-1]

                # Show the date the filter has progressed to.
                # We are finished when reaching 2023-06
                progress_bar.set_postfix({
                    "Date": last["CreationDate"],
                    "Matches": match_count,
                    "Total rows": f"{row_count:,}",
                })

                # Display rows read as a progress bar,
                # but we do not know the end
                progress_bar.update(len(chunk))

    result_df = pd.concat(matched_chunks)
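
When the CSV is a local file, as in the example above, one way to obtain a total outside the pandas reader is to take the size from the file system and the byte position from the underlying file handle. The following is only a sketch under assumptions: the temporary file and its columns are made up so the snippet runs anywhere, and the reported position reflects how much pandas has buffered, so it can run ahead of the rows actually yielded.

```python
import os
import tempfile

import pandas as pd

# Build a small throwaway CSV so the sketch is self-contained (hypothetical data)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,value\n")
    tmp.writelines(f"{i},{i * 2}\n" for i in range(10_000))
    path = tmp.name

total_bytes = os.path.getsize(path)  # the total comes from the file system, not pandas
positions = []  # byte offset observed after each chunk
rows = 0

with open(path, "rb") as f:
    with pd.read_csv(f, chunksize=1_000) as reader:
        for chunk in reader:
            rows += len(chunk)
            # Bytes pulled from the file so far. With a bar created as
            # tqdm(total=total_bytes, unit="B", unit_scale=True), this is
            # where one would call bar.update(f.tell() - bar.n).
            positions.append(f.tell())

os.remove(path)
print(f"{rows:,} rows, {positions[-1]:,} of {total_bytes:,} bytes")
```

Opening the file in binary mode keeps `f.tell()` in bytes, which matches the unit of `os.path.getsize`.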

Full code here
