I am using elasticsearch-hadoop:7.7.0 to write my data from Hive into es. But I found that when the DataFrame and the number of partitions get too large, exceptions like the following are thrown:
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: es_rejected_execution_exception: rejected execution of processing of [2255855262][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[rs_test_index][0]] containing [1000] requests, target allocation id: 8nqY2sdDTDmZRKU-jPKcdQ, primary term: 1 on EsThreadPoolExecutor[name = es-node/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@11b7b906[Running, pool size = 48, active threads = 48, queued tasks = 220, completed tasks = 921584808]]
This is probably caused by the sizing of the es write thread pool. I then set es.batch.write.retry.count=1 so that the RDD partitions that failed to write to es because of the exception above would be retried. To improve efficiency I also increased the number of partitions, but then I hit another exception:
Attempted to get executor loss reason for executor id 782 at RPC address XX.XX.XX.XX:47562, but got no response. Marking as slave lost.
java.io.IOException: Connection from /XX.XX.XX.XX:42842 closed
and the spark application exited without finishing all of the write work.
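For context, the write goes through the Spark SQL integration of elasticsearch-hadoop, roughly like the sketch below (simplified; the Hive table name and the es host are placeholders, and the option values are the ones I am experimenting with):

```scala
import org.apache.spark.sql.SparkSession

// Simplified sketch of the write job; host, table and some option values are placeholders.
val spark = SparkSession.builder()
  .appName("hive-to-es")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT * FROM my_hive_table")   // placeholder Hive table

df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-node")                    // placeholder host
  .option("es.port", "9200")
  .option("es.batch.size.entries", "1000")          // matches the [1000] requests per bulk in the log
  .option("es.batch.size.bytes", "1mb")
  .option("es.batch.write.retry.count", "1")        // retry setting mentioned above
  .option("es.batch.write.retry.wait", "10s")
  .mode("append")
  .save("rs_test_index")                            // index from the exception message
```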
So I would like to know whether there is any relationship between the number of partitions, the size of the DataFrame, and the number of es shards and replicas. If there is, how can I find the best number of partitions to get the best efficiency?
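What I am currently considering, without being sure it is the right approach, is to cap the number of writer tasks at a small multiple of the primary shard count so the write thread pool queue (capacity 200 in the log) is not flooded. Something like this, where numShards and writersPerShard are just assumed placeholders and df is the DataFrame from the sketch above:

```scala
// Hypothetical mitigation: limit the number of concurrent writer tasks,
// since each Spark task sends its own bulk requests to the shards.
val numShards = 1                 // placeholder; read from the index settings of rs_test_index
val writersPerShard = 4           // assumption: a small multiple of the shard count
val limited = df.coalesce(numShards * writersPerShard)

limited.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-node")  // placeholder host
  .mode("append")
  .save("rs_test_index")
```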