Reading large data from Cassandra into a Python DataFrame (memory error)

Posted by llmtgqce on 2021-06-14 in Cassandra

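I am trying to read 2048-dimensional feature vectors (1 million records) from Cassandra into a pandas DataFrame, and it crashes every time.
I have 32 GB of RAM, but I still cannot read all of the data into memory; the Python program crashes on every attempt to load it. I need all of the data in memory at once for my machine learning algorithm (the data is 18 GB as CSV).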
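A rough sizing of the numbers above helps explain why 32 GB may not be enough. The sketch below is only back-of-the-envelope arithmetic: it assumes every feature is a single float, and that the driver hands each vector back as a Python list of float objects (the table schema is not shown in the question, so this is an assumption). A dense float64 matrix of this shape is about 15.3 GiB and a float32 one about 7.6 GiB, while a list-of-floats representation costs several times more per value.

```python
import struct
import sys

N_ROWS = 1_000_000   # records, from the question
DIM = 2048           # feature-vector length, from the question


def matrix_bytes(n_rows, dim, bytes_per_value):
    """Bytes needed to hold n_rows * dim values at a given cost per value."""
    return n_rows * dim * bytes_per_value


def gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


float64_bytes = matrix_bytes(N_ROWS, DIM, 8)   # dense NumPy float64
float32_bytes = matrix_bytes(N_ROWS, DIM, 4)   # dense NumPy float32

# Assumption: vectors arrive as Python lists of float objects. Each value then
# costs one float object plus one pointer slot in the list (list headers ignored).
py_bytes_per_value = sys.getsizeof(1.0) + struct.calcsize("P")
python_list_bytes = matrix_bytes(N_ROWS, DIM, py_bytes_per_value)

for label, size in [
    ("dense float64 array", float64_bytes),
    ("dense float32 array", float32_bytes),
    ("Python lists of floats", python_list_bytes),
]:
    print(f"{label:<26}{gib(size):8.1f} GiB")
```

Under these assumptions, the final dense matrix would fit in 32 GB, but intermediate Python objects and repeated DataFrame copies would not.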

import pandas as pd

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory

auth_provider = PlainTextAuthProvider(username=CASSANDRA_USER, password=CASSANDRA_PASS)
cluster = Cluster(contact_points=[CASSANDRA_HOST], port=CASSANDRA_PORT,
    auth_provider=auth_provider)

session = cluster.connect(CASSANDRA_DB)
session.row_factory = dict_factory

query = "SELECT * FROM Table"

df = pd.DataFrame()

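# Append one row per iteration (the pattern this question asks about).
# Each step copies the whole frame, so time and memory churn grow quickly.
for row in session.execute(query):
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)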

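Is this the right way to read the data into a DataFrame? Is there another, more memory-efficient way to read all of the data into a DataFrame?
As a last resort, the options I am considering are: 1) reducing the dimensionality of the feature vectors, 2) adding more RAM.
I cannot store the data in CSV or any other file system, because I have other operations to perform in Cassandra.
The program crashes every time, and the message says it is caused by a memory problem.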

cassandra python DataFrame pandas

Source: https://stackoverflow.com/questions/57579023/read-large-data-from-cassandra-to-python-dataframe-memory-error

1 Answer

yqlxgs2m1#

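I ran into a similar problem when reading data from SQL Server (over an ODBC connection) into a DataFrame. It appears to be an issue on the pandas side: the DataFrame took more than 10 times as much RAM as the data occupied in the original database.
Using an H2O DataFrame was more efficient (in my case it took about 2-3 times the space in RAM).
Also take a look at this post. If you can read the data in chunks, that will help.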
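One way to check this kind of overhead on your own data is to measure it on a small sample. The sketch below is illustrative only: the column names `id` and `features` are hypothetical, and the rows are random stand-ins for what a dict-style row factory might return. It compares a DataFrame whose vector column holds Python lists (which pandas stores as `object` dtype) with the same values packed into a float32 NumPy array.

```python
import sys

import numpy as np
import pandas as pd

n_rows, dim = 1_000, 64                      # small sample, not the full table
rng = np.random.default_rng(0)
vectors = rng.random((n_rows, dim))          # float64 sample data

# (a) Rows as dicts holding Python lists, similar to dict-style result rows.
rows = [{"id": i, "features": vectors[i].tolist()} for i in range(n_rows)]
df_obj = pd.DataFrame(rows)                  # "features" becomes an object column


def deep_list_bytes(values):
    """Size of a list plus the float objects it points to."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


object_bytes = sum(deep_list_bytes(v) for v in df_obj["features"])

# (b) The same values as one contiguous float32 array.
arr32 = vectors.astype(np.float32)

print("features dtype in DataFrame:", df_obj["features"].dtype)
print("object column bytes:", object_bytes)
print("float32 array bytes:", arr32.nbytes)
print("ratio:", round(object_bytes / arr32.nbytes, 1))
```

The ratio you see depends on the column types involved, so it is worth running this on a sample of the real rows before deciding between pandas, a plain NumPy array, or another frame library.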
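As a minimal sketch of the chunked idea, assuming the row count and vector length are known up front, the rows can be streamed into one preallocated float32 array instead of growing a DataFrame. The helper `load_vectors`, the `features` column name, and the `fake_rows` generator are hypothetical; the generator only stands in for a Cassandra result set so the example runs without a cluster.

```python
import numpy as np


def load_vectors(rows, n_rows, dim, column="features", dtype=np.float32):
    """Fill a preallocated (n_rows, dim) array from an iterable of dict rows."""
    out = np.empty((n_rows, dim), dtype=dtype)
    count = 0
    for row in rows:
        if count >= n_rows:
            raise ValueError("received more rows than n_rows")
        out[count] = row[column]     # copies one vector; the row can then be freed
        count += 1
    return out[:count]               # trim if fewer rows arrived than expected


def fake_rows(n, dim):
    """Stand-in for a paged result set: yields one dict per row, lazily."""
    for i in range(n):
        yield {"id": i, "features": [float(i)] * dim}


matrix = load_vectors(fake_rows(5, 4), n_rows=5, dim=4)
print(matrix.shape, matrix.dtype)

# With the DataStax driver, the same helper could consume a paged query.
# Table and column names here are placeholders:
#   from cassandra.query import SimpleStatement
#   stmt = SimpleStatement("SELECT features FROM my_table", fetch_size=5000)
#   matrix = load_vectors(session.execute(stmt), n_rows=1_000_000, dim=2048)
```

Because only one page of rows is alive at a time, peak memory stays close to the size of the final array, which for float32 is roughly half that of float64.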

2021-06-14
