I'm migrating from Cassandra on AWS machines to Astra Cassandra, but there are some problems:
I can't insert a column of about 2 million characters (1.77 MB) into Astra Cassandra, and I have even larger data to insert, around 20 million characters. Does anyone know how to solve this?
I'm inserting it through a Python application (cassandra-driver==3.17.0); this is the error stack I get:
start.sh[5625]: [2022-07-12 15:14:39,336]
INFO in db_ops: error = Error from server: code=1500
[Replica(s) failed to execute write]
message="Operation failed - received 0 responses and 2 failures: UNKNOWN from 0.0.0.125:7000, UNKNOWN from 0.0.0.181:7000"
info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0, 'failures': 2}
If I use half as many characters, it works.
Table description from the new Astra Cassandra CQL console:
token@cqlsh> describe mykeyspace.series;
CREATE TABLE mykeyspace.series (
type text,
name text,
as_of timestamp,
data text,
hash text,
PRIMARY KEY ((type, name, as_of))
) WITH additional_write_policy = '99PERCENTILE'
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.UnifiedCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair = 'BLOCKING'
AND speculative_retry = '99PERCENTILE';
Old Cassandra table description:
ansible@cqlsh> describe mykeyspace.series;
CREATE TABLE mykeyspace.series (
type text,
name text,
as_of timestamp,
data text,
hash text,
PRIMARY KEY ((type, name, as_of))
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Sample data:
{"type": "OP", "name": "book", "as_of": "2022-03-17", "data": [{"year": 2022, "month": 3, "day": 17, "hour": 0, "quarter": 1, "week": 11, "wk_year": 2022, "is_peak": 0, "value": 1.28056854009628e-08}, .... ], "hash": "84421b8d934b06488e1ac464bd46e83ccd2beea5eb2f9f2c52428b706a9b2a10"}
where this JSON contains 27,000 entries inside the data array, like:
{"year": 2022, "month": 3, "day": 17, "hour": 0, "quarter": 1, "week": 11, "wk_year": 2022, "is_peak": 0, "value": 1.28056854009628e-08}
Python part of the code:
def insert_to_table(self, table_name, **kwargs):
    try:
        ...
        elif table_name == "series":
            self.session.execute(
                self.session.prepare("INSERT INTO series (type, name, as_of, data, hash) VALUES (?, ?, ?, ?, ?)"),
                (
                    kwargs["type"],
                    kwargs["name"],
                    kwargs["as_of"],
                    kwargs["data"],
                    kwargs["hash"],
                ),
            )
        return True
    except Exception as error:
        current_app.logger.error('src/db/db_ops.py insert_to_table() table_name = %s error = %s', table_name, error)
        return False
Thanks a lot!
2 Answers

Answer #1 (cyej8jka):
You are hitting the configured limit on the maximum mutation size. On Cassandra this defaults to 16 MB, while on Astra DB it is currently 4 MB (it may be increased in the future, but performing inserts with very large cell sizes remains strongly discouraged).
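One quick way to see where a given insert stands against that ceiling is to measure the encoded size of the string you are about to bind; a minimal sketch (the function name is just illustrative):

import json

def payload_size_mb(data_entries):
    # Size of the JSON string that would be bound to the 'data' text column.
    payload = json.dumps(data_entries)
    return len(payload.encode("utf-8")) / (1024 * 1024)

# For the failing insert this returns roughly 1.77; on Astra DB the whole
# mutation is capped at roughly 4 MB.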
A more flexible way to store this data is to change the data model and split the one row holding the huge string into several rows, one per entry (27,000 or so of them). With the right partitioning you can still retrieve the whole series with a single query (paged between database and driver for your convenience, which also helps you avoid the annoying timeouts you might otherwise hit when reading such a large single row).
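What follows is a minimal sketch of such a remodel; the table name series_entries, the entry_seq clustering column, and the loose variable names (typ, name, as_of, digest, entries) are assumptions for illustration, not part of the original schema:

import json

# session: an already-connected cassandra.cluster.Session.
# One row per entry; all entries of a series share one partition, so a single
# (transparently paged) SELECT still returns the whole series, in order.
session.execute("""
    CREATE TABLE IF NOT EXISTS mykeyspace.series_entries (
        type text,
        name text,
        as_of timestamp,
        entry_seq int,   -- position of the entry in the original data array
        entry text,      -- one JSON entry: a few hundred bytes, not megabytes
        hash text,
        PRIMARY KEY ((type, name, as_of), entry_seq)
    )
""")

insert_ps = session.prepare(
    "INSERT INTO mykeyspace.series_entries (type, name, as_of, entry_seq, entry, hash) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)
for i, entry in enumerate(entries):   # entries: the 27,000-element list
    session.execute(insert_ps, (typ, name, as_of, i, json.dumps(entry), digest))

select_ps = session.prepare(
    "SELECT entry FROM mykeyspace.series_entries WHERE type = ? AND name = ? AND as_of = ?"
)
rows = session.execute(select_ps, (typ, name, as_of))   # the driver pages for you
data = [json.loads(r.entry) for r in rows]

For 27,000 entries per series you would probably not loop sequentially but push the inserts through cassandra.concurrent.execute_concurrent_with_args, which keeps several requests in flight at once.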
As a side note, I suggest you create the prepared statement only once, outside the insert_to_table function (cache it on the instance, or something similar). Inside the insert function you then just call self.session.execute(already_prepared_statement, (value1, value2, ...)), which noticeably improves performance.
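A sketch of that refactor, with an illustrative attribute name (insert_series_ps):

class DbOps:
    def __init__(self, session):
        self.session = session
        # Prepared once per process: re-preparing on every call costs an
        # extra round trip and defeats the purpose of prepared statements.
        self.insert_series_ps = session.prepare(
            "INSERT INTO series (type, name, as_of, data, hash) VALUES (?, ?, ?, ?, ?)"
        )

    def insert_to_table(self, table_name, **kwargs):
        try:
            if table_name == "series":
                self.session.execute(
                    self.insert_series_ps,
                    (kwargs["type"], kwargs["name"], kwargs["as_of"],
                     kwargs["data"], kwargs["hash"]),
                )
            return True
        except Exception:
            return False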
One final point: I believe the driver can connect to Astra DB only from version 3.24.0 onward, so I'm not sure how you are using version 3.17; as far as I know, 3.17 does not understand the cloud argument to the Cluster constructor. In any case, I suggest upgrading the driver to the latest version (3.25.0 at the time of writing).
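For reference, connecting with a recent driver looks roughly like this (the bundle path and credentials are placeholders):

# Requires cassandra-driver >= 3.24.0, which introduced the cloud argument.
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cloud_config = {"secure_connect_bundle": "/path/to/secure-connect-mydb.zip"}
auth_provider = PlainTextAuthProvider("<client_id>", "<client_secret>")

cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect("mykeyspace")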
Answer #2 (5vf7fwbs):

Some of the details you posted in the question aren't quite right.
In the schema you posted, the data column is of type text, but the sample data you posted looks like you are inserting key/value pairs, and oddly enough they appear to be formatted like a CQL collection type.
If it really is a string, it would have to be formatted like this:
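Something along these lines, assuming the array is serialized into one quoted string with json.dumps (session and insert_ps are carried over from the question's code):

import json
from datetime import datetime

# The "data" column is text, so the whole array must be bound as ONE
# serialized string, not as a dict or list of key/value pairs.
entries = [{"year": 2022, "month": 3, "day": 17, "hour": 0, "quarter": 1,
            "week": 11, "wk_year": 2022, "is_peak": 0,
            "value": 1.28056854009628e-08}]   # ... 27,000 of these ...
data_str = json.dumps(entries)                # -> '[{"year": 2022, ...}]'

session.execute(insert_ps, ("OP", "book", datetime(2022, 3, 17), data_str,
                            "84421b8d934b06488e1ac464bd46e83ccd2beea5eb2f9f2c52428b706a9b2a10"))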
Check your data and your code, then try again. Cheers!