Hadoop times out when trying to write to Cassandra in an AWS multi-region setup

Posted by but5z9lq on 2021-06-02 in Hadoop

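I am running a multi-DC Cassandra cluster (open source, not DSE) in AWS, with one DC (us-west-2) used for analytics and the other (us-east) used for transactional storage. I am using NetworkTopologyStrategy with the EC2 snitch, and a consistency level of LOCAL_ONE in the Hadoop configuration. Hadoop can read from Cassandra without any problem, but attempts to write produce timeout exceptions.
Running nodetool status shows that the DCs are configured correctly: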

    Datacenter: us-west-2
    =====================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    -- Address Load Owns Host ID Token Rack
    UN x.x.x.x 1.01 GB 9.9% 9e7f4393-7ac9-4559-b3ff-de48be50016f -9127921345534057723 2a
    UN x.x.x.x 1001.16 MB 11.4% d0760383-c3dd-474c-9261-239b71dba3f1 -9221279003374097975 2b
    UN x.x.x.x 1.05 GB 11.7% 3f09fbf5-0d85-4283-9009-0ec0e29223c0 -9140104347498952504 2c
    Datacenter: us-east
    ===================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    -- Address Load Owns Host ID Token Rack
    UN x.x.x.x 1.1 GB 11.3% 5bbd2de4-e1d2-4a17-9f40-034f60b35954 -9061054426204373981 1b
    UN x.x.x.x 1.15 GB 11.5% e34c590e-6176-45b2-a8f9-18b4a9a80032 -9216519687724118609 1c
    UN x.x.x.x 1.18 GB 10.9% fa0b0a1a-f156-40fc-a267-970d1eb9cddb -9207673937991303291 1a
    UN x.x.x.x 1.46 GB 10.7% b18ae406-c9ec-42b7-a365-b0c6e2fe582f -9206671929961171506 1a
    UN x.x.x.x 1.13 GB 11.4% 1ac9c1c5-55ad-4048-b1ba-3b9768933ecc -9146100851344467112 1c
    UN x.x.x.x 1.53 GB 11.2% dad665bb-68d9-4811-b421-f33333261867 -9178920986366339267 1b

Stack trace when using ColumnFamilyOutputFormat:

    java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
        at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:224)
    Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
        at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
        at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
        at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
        at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:123)
        at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:215)
    Caused by: java.net.ConnectException: Connection timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
        ... 4 more

... and when using CqlOutputFormat:

    java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
        at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:271)
    Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
        at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
        at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
        at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
        at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:123)
        at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:262)
    Caused by: java.net.ConnectException: Connection timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
        ... 4 more

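Both traces ultimately point to AbstractColumnFamilyOutputFormat.createAuthenticatedClient(host, port, conf).
I then opened up that source and added some detail to the exception so that it would print the host name it was connecting to, which produced the following trace: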

    java.io.IOException: java.lang.Exception: Unable to connect to host [hostname]
        at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:271)
    Caused by: java.lang.Exception: Unable to connect to host [hostname]
        at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:139)
        at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:262)
    Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
        at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
        at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
        at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
        at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:124)
        ... 1 more
    Caused by: java.net.ConnectException: Connection timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
        ... 4 more

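The problem is that [hostname] is a machine that is not in the analytics cluster (it is in us-east). Why can't the client figure this out automatically, especially when reads work fine? It seems to be trying every node in the ring, regardless of DC.
For the record, writes fail with CqlOutputFormat, with ColumnFamilyOutputFormat, and through Pig with both CqlStorage and CassandraStorage.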
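Since every trace above bottoms out in java.net.Socket.connect, one way to narrow this down is to probe each node's Thrift port directly from a Hadoop worker. The following is only a minimal sketch: the class name PortProbe and the 3-second timeout are made up for illustration, and 9160 is Cassandra's default rpc_port, which may differ in a given cluster.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Cassandra or Hadoop.
final class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, connect timeouts, and unresolvable hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        int thriftPort = 9160; // assumption: default rpc_port, adjust if it was changed
        // Pass the addresses reported by nodetool status as arguments.
        for (String host : args) {
            String state = canConnect(host, thriftPort, 3000) ? "reachable" : "unreachable";
            System.out.println(host + ":" + thriftPort + " -> " + state);
        }
    }
}
```

Run from a Hadoop node, this would show which ring members the writers can actually open a socket to, independent of the Cassandra client code.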


Answer #1 (qgzx9mmu)

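The problem boiled down to two things:
1. For multi-region EC2 setups, Cassandra requires broadcast_address to be set to the public IP and listen_address to the internal IP. In most cases you want rpc_address to be the internal IP as well, but this can break Cassandra's Hadoop client, which determines the endpoints to talk to based on broadcast_address.
2. Cassandra's Hadoop client (specifically RingCache) does not support data-center-aware node discovery. Instead, it tries to discover all nodes in the ring, including non-local ones. It respects the consistency level for the actual writes, but in our case it never got that far because of #1.
I filed a ticket and submitted a patch to address these issues:
https://issues.apache.org/jira/browse/cassandra-7252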
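To illustrate what data-center-aware discovery means here, below is a minimal, self-contained sketch that keeps only the endpoints belonging to the local DC. It is not the patch from the ticket and does not use RingCache or any Cassandra API; the class LocalDcFilter, the endpoint-to-DC map, and the IP addresses are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Cassandra code.
final class LocalDcFilter {
    /** Keeps only endpoints whose data center equals localDc, preserving map order. */
    static List<String> localEndpoints(Map<String, String> endpointToDc, String localDc) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : endpointToDc.entrySet()) {
            if (localDc.equals(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up addresses standing in for the ring shown by nodetool status.
        Map<String, String> ring = new LinkedHashMap<>();
        ring.put("10.0.1.10", "us-west-2");
        ring.put("10.0.1.11", "us-west-2");
        ring.put("10.1.2.20", "us-east");

        // Only the analytics DC's nodes remain as connection candidates.
        System.out.println(localEndpoints(ring, "us-west-2")); // prints [10.0.1.10, 10.0.1.11]
    }
}
```

A client that filtered its candidate list this way would never attempt a connection to the us-east nodes, which is the behavior the question expected.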


Answer #2 (mefy6pfw)

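I would try setting write_request_timeout_in_ms in cassandra.yaml to a very high number and see whether that helps. There may be a problem with a node itself, where it is not responding but still shows as up. If it still times out, restart the Cassandra service on the node you suspect is causing the problem.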
