cassandra.cluster.NoHostAvailable when querying a large amount of data: "Unable to complete the operation against any hosts"

tez616oj  posted on 2021-06-09  in  Cassandra

I use the following code to query data from Cassandra:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import SimpleStatement
import pandas as pd

cluster = Cluster(contact_points=['192.168.2.4'],port=9042)
session = cluster.connect()

def testContectRemoteDatabase():
    contact_points = ['192.168.2.4']
    auth_provider = PlainTextAuthProvider(username='XXX', password='XX')
    cluster = Cluster(contact_points=contact_points, auth_provider=auth_provider)
    session = cluster.connect()
    cql_str = 'select * from DB1.mytable ;'
    simple_statement = SimpleStatement(cql_str, consistency_level=ConsistencyLevel.ONE,fetch_size=2000000)
    execute_result = session.execute(simple_statement, timeout=None)
    result = execute_result._current_rows
    cluster.shutdown()
    df = pd.DataFrame(result)
    df.to_csv('./my_test.csv', index=False, mode='w', header=True)

if __name__ == '__main__':
    testContectRemoteDatabase()

With fetch_size=1000000 there is no error, but with fetch_size=2000000 I get this error message:

Traceback (most recent call last):
  File "test.py", line 24, in <module>
    testContectRemoteDatabase()
  File "test.py", line 17, in testContectRemoteDatabase
    execute_result = session.execute(simple_statement, timeout=None)
  File "cassandra\cluster.py", line 2618, in cassandra.cluster.Session.execute
  File "cassandra\cluster.py", line 4877, in cassandra.cluster.ResponseFuture.result
cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 192.168.2.4:9042 datacenter1>: ConnectionShutdown('errors=Connection heartbeat timeout after 30 seconds, last_host=192.168.2.4:9042')})

How can I fix this?

icomxhvb  #1

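An unbounded SELECT that does a full table scan is not going to work. Cassandra is designed for OLTP workloads.
Your query is very expensive: it requires a single coordinator to retrieve all partitions from all nodes in the cluster. It might work on a single-node cluster with a small number of partitions, but beyond that point you will find that your code does not scale.
With dozens of nodes and millions of partitions distributed around the ring, a single coordinator node cannot cope, and the replicas cannot respond within the timeout.
I recommend using Spark for analytics queries. The Spark connector for Cassandra optimises analytics queries and handles them much better. It also scales. Cheers!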
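To picture what a scalable full scan looks like: instead of one unbounded SELECT, the ring is split into token ranges and each range is queried on its own, which is roughly how the Spark connector divides the work. Below is a minimal Python sketch of that idea, not the connector's actual implementation. It assumes the default Murmur3Partitioner and a hypothetical single-column partition key named id; neither is stated in the question.

```python
from typing import Iterator, Tuple

# Token bounds of Murmur3Partitioner (assumed here; it is Cassandra's default).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def token_ranges(splits: int) -> Iterator[Tuple[int, int]]:
    """Yield `splits` contiguous, inclusive (start, end) token ranges
    that together cover the whole ring exactly once."""
    if splits < 1:
        raise ValueError("splits must be >= 1")
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 possible tokens
    step = total // splits
    start = MIN_TOKEN
    for i in range(splits):
        # The last range absorbs the remainder of the integer division.
        end = MAX_TOKEN if i == splits - 1 else start + step - 1
        yield start, end
        start = end + 1


def range_query(keyspace: str, table: str, partition_key: str) -> str:
    """Build the CQL for one sub-scan; (start, end) are bound at execution time.

    `partition_key` is a hypothetical column name supplied by the caller.
    """
    return (
        f"SELECT * FROM {keyspace}.{table} "
        f"WHERE token({partition_key}) >= ? AND token({partition_key}) <= ?"
    )
```

With the Python driver, the statement would be prepared once with session.prepare(range_query('DB1', 'mytable', 'id')) and then executed once per (start, end) pair, so that no single request has to cover the whole table.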

5vf7fwbs  #2

As Erick described, your code is far from optimal from Cassandra's point of view, and it also won't work once you have more data than available memory.
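The memory concern comes from how the question's code reads the result. fetch_size is the size of one page, so a value of 2000000 asks the server for two million rows in a single response, and _current_rows only ever holds that one page. A sketch of the alternative, assuming the driver's default named-tuple rows: keep the page size modest and iterate over the result set, which fetches further pages as needed, writing each row to the CSV as it arrives. The function names and parameters below are illustrative, and this still performs a full table scan with the costs described above.

```python
import csv


def write_rows_to_csv(rows, path):
    """Stream named-tuple rows into a CSV file; return the number of rows written.

    The header is taken from the first row's `_fields`, so an empty result
    produces an empty file.
    """
    count = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:  # one row at a time, never the whole table in memory
            if count == 0:
                writer.writerow(row._fields)
            writer.writerow(row)
            count += 1
    return count


def export_table(host, username, password, cql, path, page_size=5000):
    """Hypothetical helper: run `cql` with paging and stream the rows to `path`."""
    # Imported here so the CSV helper above can be used without the driver installed.
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    auth_provider = PlainTextAuthProvider(username=username, password=password)
    cluster = Cluster(contact_points=[host], auth_provider=auth_provider)
    try:
        session = cluster.connect()
        # fetch_size is rows per page; 5000 is the driver's default.
        statement = SimpleStatement(cql, fetch_size=page_size)
        # Iterating the result set requests the next page transparently.
        return write_rows_to_csv(session.execute(statement), path)
    finally:
        cluster.shutdown()
```

Called as export_table('192.168.2.4', 'XXX', 'XX', 'select * from DB1.mytable', './my_test.csv'), this would replace the pandas step in the question without building a DataFrame in memory first.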
If you just need to export data from the DB to CSV or another format, don't reinvent the wheel; use DSBulk instead. It is as simple as:

dsbulk unload -k keyspace -t table -u user -p password -url filename

For examples, see the following blog posts:
https://www.datastax.com/blog/2019/03/datastax-bulk-loader-introduction-and-loading
https://www.datastax.com/blog/2019/04/datastax-bulk-loader-more-loading
https://www.datastax.com/blog/2019/04/datastax-bulk-loader-common-settings
https://www.datastax.com/blog/2019/06/datastax-bulk-loader-unloading
https://www.datastax.com/blog/2019/07/datastax-bulk-loader-counting
https://www.datastax.com/blog/2019/12/datastax-bulk-loader-examples-loading-other-locations
