postgresql: Fetching large portions of a table from Postgres with pandas and SQLAlchemy

Posted by e0uiprwp on 2023-06-22 in PostgreSQL


I need to extract a large chunk (8M+ rows) from a big table (200M+ rows) in a Postgres database.
My current setup looks like this:

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine(
    url="MY_DB_STRING",
    echo=False,
    execution_options={'stream_results': True},  # stream rows instead of buffering the whole result
    pool_pre_ping=True,
    pool_recycle=3600
)
session = scoped_session(sessionmaker(bind=engine))
query = """
SELECT *
FROM MY_TABLE
WHERE status = True
"""
dfs = []
# Read the result set in chunks of 500,000 rows
for chunk in pd.read_sql_query(sql=query, con=session.connection(), chunksize=500000):
    dfs.append(chunk)
combined_df = pd.concat(dfs, ignore_index=True)
session.close()

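This setup works on smaller dummy data, but takes hours on the real table. Annoyingly, it can also get stuck while fetching the second chunk. How can I modify this setup to fetch all 8M+ rows efficiently and reliably?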

postgresql

Source: https://stackoverflow.com/questions/76420323/fetching-large-portions-of-a-table-from-postgres-with-pandas-and-sql-alchemy

1 Answer


svujldwt1#

This isn't a complete answer, but I didn't want to put a whole code block in a comment.

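I think you need to balance your chunksize against how you use stream results. There is some information about stream results and yield_per in the SQLAlchemy documentation, under "streaming-with-a-fixed-buffer-via-yield-per". The default buffer size seems to be 1000 rows, which would mean you are working extra hard to fill each chunk of 500,000 rows.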
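To put rough numbers on that mismatch, here is a back-of-the-envelope sketch. It assumes a simplified model in which rows arrive in fixed-size buffers and every chunk starts from an empty buffer; the real SQLAlchemy and driver behaviour may differ. The helper name `buffer_refills` is hypothetical and exists only for this illustration.

```python
import math


def buffer_refills(total_rows: int, chunk_size: int, buffer_size: int) -> int:
    """Count buffer fills needed to deliver total_rows in chunks of chunk_size.

    Simplified model (an assumption): each chunk is assembled from
    fixed-size buffers and starts with an empty buffer.
    """
    full_chunks, remainder = divmod(total_rows, chunk_size)
    refills = full_chunks * math.ceil(chunk_size / buffer_size)
    if remainder:
        refills += math.ceil(remainder / buffer_size)
    return refills


TOTAL_ROWS = 8_000_000  # rows matched by the query in the question
CHUNKSIZE = 500_000     # pandas chunksize from the question

# Buffer of 1000 rows (the default the answer suspects) versus a buffer
# sized to match the pandas chunk.
default_buffer = buffer_refills(TOTAL_ROWS, CHUNKSIZE, buffer_size=1_000)
matched_buffer = buffer_refills(TOTAL_ROWS, CHUNKSIZE, buffer_size=CHUNKSIZE)
print(default_buffer, matched_buffer)
```

Under this model, a 1000-row buffer is refilled 500 times per chunk, while a buffer matched to the chunk size is filled once per chunk. Whether that difference accounts for the hours-long runtime is not something this sketch can tell you.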

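Warning: I don't know how this will affect memory on your PostgreSQL server and/or your application server, so please read the linked documentation first and judge for yourself.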

Having read my warning, you might try this:

CHUNKSIZE = 500000
dfs = []
# Match the SQLAlchemy row buffer (yield_per) to the pandas chunksize
conn = session.connection().execution_options(yield_per=CHUNKSIZE)
for chunk in pd.read_sql_query(sql=query, con=conn, chunksize=CHUNKSIZE):
    dfs.append(chunk)
combined_df = pd.concat(dfs, ignore_index=True)
session.close()

Answered on 2023-06-22
