Production: Cassandra write latency spikes frequently

Posted by oknwwptz on 2021-06-15 in Cassandra

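In our production cluster, cluster-level write latency regularly jumps from about 7 ms to 4 s. As a result, clients see a large number of read and write timeouts. This repeats every few hours.
Observations:
Cluster write latency (99th percentile): 4 s
Local write latency (99th percentile): 10 ms
Read/write consistency: LOCAL_ONE
Total nodes: 7
I enabled tracing for a few minutes with settraceprobability and found that most of the time is spent in internode communication: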

session_id                           | event_id                             | activity                                                                                                                    | source        | source_elapsed | thread
--------------------------------------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+---------------+----------------+------------------------------------------
 4267dca2-bb79-11e8-aeca-439c84a4762c | 429c3314-bb79-11e8-aeca-439c84a4762c | Parsing  SELECT * FROM table1 WHERE uaid = '506a5f3b' AND messageid >= '01;'  | cassandranode3 |              7 |                     SharedPool-Worker-47
 4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a20-bb79-11e8-aeca-439c84a4762c |                                                                                                         Preparing statement | Cassandranode3 |             47 |                     SharedPool-Worker-47
 4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a21-bb79-11e8-aeca-439c84a4762c |                                                                                            reading data from /Cassandranode1 | Cassandranode3 |            121 |                     SharedPool-Worker-47
 4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38610-bb79-11e8-aeca-439c84a4762c |                                                                       REQUEST_RESPONSE message received from /cassandranode1 | cassandranode3 |          40614 | MessagingService-Incoming-/Cassandranode1
 4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38611-bb79-11e8-aeca-439c84a4762c |                                                                                     Processing response from /Cassandranode1 | Cassandranode3 |          40626 |                      SharedPool-Worker-5
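To make the gap in the trace above easier to see, here is a minimal Python sketch that turns the cumulative source_elapsed values (microseconds since the trace started on the source node) into per-step deltas. The activity labels and numbers are copied from the trace rows; the function name `step_deltas` is only illustrative. All five events were recorded on the same source node, so subtracting consecutive values is meaningful here.

```python
def step_deltas(events):
    """Convert cumulative source_elapsed values into per-step durations.

    events: list of (activity, source_elapsed_us) tuples in trace order,
    all recorded on the same source node.
    """
    deltas = []
    previous = 0
    for activity, elapsed_us in events:
        deltas.append((activity, elapsed_us - previous))
        previous = elapsed_us
    return deltas


# Values taken from the trace rows shown above (microseconds).
events = [
    ("Parsing SELECT", 7),
    ("Preparing statement", 47),
    ("reading data from /Cassandranode1", 121),
    ("REQUEST_RESPONSE message received from /cassandranode1", 40614),
    ("Processing response from /Cassandranode1", 40626),
]

deltas = step_deltas(events)
slowest = max(deltas, key=lambda item: item[1])
print(slowest)
```

Nearly all of the roughly 40.6 ms is spent between sending the read to the remote node and receiving its response. Note that this interval covers the network round trip plus whatever the remote node was doing in the meantime, so a stall on the remote side would look the same as a slow network from the coordinator's point of view.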

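I checked the connectivity between the Cassandra nodes but found no problems. The Cassandra logs are flooded with read timeout exceptions; this is a very busy cluster, with about 30k reads and 10k writes per second.
Warning in system.log: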

WARN  [SharedPool-Worker-28] 2018-09-19 01:39:16,999 SliceQueryFilter.java:320 - Read 122 live and 266 tombstone cells in system.schema_columns for key: system (see tombstone_warn_threshold). 2147483593 columns were requested, slices=[-]

During a spike the cluster simply stalls, and even simple commands such as "use system_traces" fail.

cassandra@cqlsh:system_traces> select * from sessions ;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
Schema metadata was not refreshed. See log for details.

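I verified the schema version on all nodes and it is the same everywhere, but while the issue is occurring Cassandra cannot even read the metadata.
Has anyone run into a similar problem? Any suggestions?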

daupos2t  #1

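(Based on the data in the comments above.) Long full GC pauses can definitely cause this. Add -XX:+DisableExplicitGC. You are getting full GCs because of calls to System.gc(), most likely from a silly RMI DGC task that runs at regular intervals whether or not it is needed. With a larger heap, that is very expensive. It is safe to disable.
Check the header of your GC log and make sure the minimum heap size is not set. I would also recommend setting -XX:G1ReservePercent=20.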
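One way to check whether explicit System.gc() calls are behind the pauses is to scan the GC log for full collections and look at their cause. The sketch below is only an illustration: it assumes JDK 8 style log lines of the form "[Full GC (cause)  before->after(total), N secs]", and the sample lines are made up for the example, not taken from the cluster in question. The function name `full_gc_pauses` is hypothetical.

```python
import re

# Assumed JDK 8 style full GC line, for example:
#   ... [Full GC (System.gc())  12G->3G(16G), 4.0123456 secs]
FULL_GC = re.compile(r"\[Full GC \((.+?)\)\s+.*?, ([\d.]+) secs\]")


def full_gc_pauses(lines):
    """Return (cause, pause_seconds) for every full GC line found."""
    pauses = []
    for line in lines:
        match = FULL_GC.search(line)
        if match:
            pauses.append((match.group(1), float(match.group(2))))
    return pauses


# Made-up sample lines, for illustration only.
sample_log = [
    "2018-09-19T01:39:12.100+0000: 100.000: [Full GC (System.gc())  12G->3G(16G), 4.0123456 secs]",
    "2018-09-19T01:45:00.000+0000: 448.000: [GC pause (G1 Evacuation Pause) (young), 0.0312345 secs]",
    "2018-09-19T02:39:12.100+0000: 3700.000: [Full GC (System.gc())  11G->3G(16G), 3.8 secs]",
    "2018-09-19T03:10:00.000+0000: 5548.000: [Full GC (Allocation Failure)  15G->9G(16G), 6.5 secs]",
]

pauses = full_gc_pauses(sample_log)
explicit = [p for p in pauses if p[0] == "System.gc()"]
print(f"{len(explicit)} of {len(pauses)} full GCs were explicit System.gc() calls")
for cause, seconds in pauses:
    print(f"{cause}: {seconds:.1f} s")
```

If most multi-second full GCs carry the System.gc() cause and recur at a steady interval, that matches the explanation in the answer, and -XX:+DisableExplicitGC should make them disappear. Full GCs with other causes would point to heap sizing or G1 tuning instead.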
