How to enable direct join (Java)

uqjltbpv  posted on 2021-07-13  in Spark

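I have Cassandra 3.11.9, Spark 3.0.1, and Spark Cassandra Connector (SCC) 3.0.0 as a dependency. I am trying to use the direct join feature of SCC 3.0.0, but when I run the join on the dataset below, Spark still picks a broadcast hash join.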

// sp is the SparkSession; dfexplist is a Dataset<Row> with an "experimentid" column.
Dataset<Row> metlistinitial = sp.read().format("org.apache.spark.sql.cassandra")
        .option("keyspace", "mdb")
        .option("table", "experiment")
        .load()
        .select(col("experimentid"), col("description"))
        .join(dfexplist, "experimentid")
        .filter(col("description").notEqual("Unidentified"));
// Print the physical plan to see which join strategy Spark chose.
metlistinitial.explain();

== Physical Plan ==
*(1) Project [experimentid#6, description#7]
+- *(1) BroadcastHashJoin [experimentid#6], [experimentid#4], Inner, BuildRight
   :- *(1) Project [experimentid#6, description#7]
   :  +- *(1) Filter NOT (description#7 = Unidentified)
   :     +- BatchScan[experimentid#6, description#7] Cassandra Scan: mdb.experiment
 - Cassandra Filters: []
 - Requested Columns: [experimentid,description]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true])), [id=#19]
      +- LocalTableScan [experimentid#4]

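Is there something I should do to enable direct join with the Cassandra table? Right now the join takes about 8 minutes, and I would like to see whether a direct join is faster.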

1qczuiv0


Just found it! It seems I only needed to add

.config("spark.sql.extensions","com.datastax.spark.connector.CassandraSparkExtensions")

to the Spark configuration. The performance is amazing: the join now takes only 8 seconds!
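
For context, here is a minimal sketch of where that setting might go when the session is built, followed by the same join as in the question. It is not the asker's actual code. The app name, the `local[*]` master, the `127.0.0.1` contact point and the sample experiment ids are assumptions made for illustration. It also assumes `experimentid` is the partition key of `mdb.experiment`, and that Spark and the Spark Cassandra Connector are on the classpath.

```java
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DirectJoinExample {
    public static void main(String[] args) {
        SparkSession sp = SparkSession.builder()
                .appName("direct-join-example")                          // hypothetical name
                .master("local[*]")                                      // assumption: local run
                .config("spark.cassandra.connection.host", "127.0.0.1")  // assumption: local Cassandra node
                // The setting from the answer: registers the connector's Catalyst extensions.
                .config("spark.sql.extensions",
                        "com.datastax.spark.connector.CassandraSparkExtensions")
                .getOrCreate();

        // Hypothetical stand-in for dfexplist: a small list of keys to look up.
        Dataset<Row> dfexplist = sp
                .createDataset(Arrays.asList("exp-1", "exp-2"), Encoders.STRING())
                .toDF("experimentid");

        Dataset<Row> joined = sp.read().format("org.apache.spark.sql.cassandra")
                .option("keyspace", "mdb")
                .option("table", "experiment")
                .load()
                .select(col("experimentid"), col("description"))
                .join(dfexplist, "experimentid")
                .filter(col("description").notEqual("Unidentified"));

        // Inspect the plan to check which join strategy was selected.
        joined.explain();

        sp.stop();
    }
}
```

With the extension registered, the plan printed by `explain()` would be expected to show a Cassandra direct join node in place of `BroadcastHashJoin`. If it does not, the connector's `directJoinSetting` read option (`on`, `off`, or `auto`, as far as I know) may be worth checking, since in `auto` mode the connector decides based on the relative size of the two sides.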
