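I have Cassandra 3.11.9, Spark 3.0.1 and Spark Cassandra Connector 3.0.0 (as dependencies). I'm trying to use the direct join feature of SCC 3.0.0, but when I join on the dataset below, Spark plans a broadcast hash join instead.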
Dataset<Row> metlistinitial = sp.read().format("org.apache.spark.sql.cassandra")
        .option("keyspace", "mdb")
        .option("table", "experiment")
        .load()
        .select(col("experimentid"), col("description"))
        .join(dfexplist, "experimentid")
        .filter(col("description").notEqual("Unidentified"));
metlistinitial.explain();
== Physical Plan ==
*(1) Project [experimentid#6, description#7]
+- *(1) BroadcastHashJoin [experimentid#6], [experimentid#4], Inner, BuildRight
   :- *(1) Project [experimentid#6, description#7]
   :  +- *(1) Filter NOT (description#7 = Unidentified)
   :     +- BatchScan[experimentid#6, description#7] Cassandra Scan: mdb.experiment
 - Cassandra Filters: []
 - Requested Columns: [experimentid,description]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true])), [id=#19]
      +- LocalTableScan [experimentid#4]
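Is there something I need to enable to get a direct join with the Cassandra table? The join currently takes about 8 minutes, and I'd like to see whether a direct join would be faster.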
1 Answer
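Just found it! It seems I only needed to add one setting to the Spark configuration. The performance is amazing: the join now takes only 8 seconds!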
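The setting itself is not preserved in the answer above. As far as I know, SCC 3.0.0 only plans a direct join when its Catalyst extensions are registered with Spark, which is done by setting spark.sql.extensions to com.datastax.spark.connector.CassandraSparkExtensions, so that is presumably the line that was added. A minimal configuration sketch in Java, assuming Spark and the connector are on the classpath; the class name, app name, master and Cassandra host are placeholders and not from the original post:

```java
import org.apache.spark.sql.SparkSession;

public class DirectJoinSession {
    public static void main(String[] args) {
        SparkSession sp = SparkSession.builder()
                .appName("direct-join-example")                          // hypothetical app name
                .master("local[*]")                                      // assumption: local run
                .config("spark.cassandra.connection.host", "127.0.0.1")  // assumption: local Cassandra node
                // Registers the connector's Catalyst rules, which are what
                // allow a join against a Cassandra table to be planned as a direct join.
                .config("spark.sql.extensions",
                        "com.datastax.spark.connector.CassandraSparkExtensions")
                .getOrCreate();

        // Sanity check: print the registered extension class name.
        System.out.println(sp.conf().get("spark.sql.extensions"));

        sp.stop();
    }
}
```

With the extensions registered, re-running metlistinitial.explain() should show whether the BroadcastHashJoin node has been replaced. The connector also has a per-read directJoinSetting option (on, off, auto) that can be passed with .option(...) if the automatic choice does not pick the direct join; whether it is needed here is not stated in the post.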