Starting an H2O context with rsparkling on Databricks

eyh26e7m · published 2021-07-14 in Spark

Question

I want to use H2O's Sparkling Water on a multi-node Azure Databricks cluster, both interactively via RStudio and as jobs via R notebooks. I can start an H2O cluster and a Sparkling Water context in rocker/verse:4.0.3 and databricksruntime/rbase:latest (as well as databricksruntime/standard) Docker containers on my local machine, but currently not on a Databricks cluster. It looks like a classic classpath problem.

Error : java.lang.ClassNotFoundException: ai.h2o.sparkling.H2OConf
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
    at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at sparklyr.StreamHandler.handleMethodCall(stream.scala:106)
    at sparklyr.StreamHandler.read(stream.scala:61)
    at sparklyr.BackendHandler.$anonfun$channelRead0$1(handler.scala:58)
    at scala.util.control.Breaks.breakable(Breaks.scala:42)
    at sparklyr.BackendHandler.channelRead0(handler.scala:39)
    at sparklyr.BackendHandler.channelRead0(handler.scala:14)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:321)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:295)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

What I have tried

Setup: single-node Azure Databricks cluster, 7.6 ML (includes Apache Spark 3.0.1, Scala 2.12), with a Standard_F4s driver (my use case is multi-node, but I am keeping things as simple as possible)
Setting options(), e.g., options(rsparkling.sparklingwater.version = "2.3.11") or options(rsparkling.sparklingwater.version = "3.0.1"). Setting config, e.g.,

conf$`sparklyr.shell.jars` <- c("/databricks/spark/R/lib/h2o/java/h2o.jar")

Or sc <- sparklyr::spark_connect(method = "databricks", version = "3.0.1", config = conf, jars = c("/databricks/spark/R/lib/h2o/java/h2o.jar")) (or "~/R/x86_64-pc-linux-gnu-library/3.6/h2o/java/h2o.jar" or "~/R/x86_64-pc-linux-gnu-library/3.6/rsparkling/java/sparkling_water_assembly.jar" as the .jar location under RStudio on Databricks)
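For completeness, here is the config-based attempt spelled out end to end (a sketch of what I ran; the jar path is the one present on my cluster and may differ on yours, and this only runs on a Databricks cluster):

```r
library(sparklyr)

# Build a Spark config and point sparklyr at the h2o.jar shipped with the
# h2o R package on the cluster. This did NOT resolve the
# ClassNotFoundException for ai.h2o.sparkling.H2OConf, since h2o.jar does
# not contain the Sparkling Water classes.
conf <- sparklyr::spark_config()
conf$`sparklyr.shell.jars` <- c("/databricks/spark/R/lib/h2o/java/h2o.jar")

sc <- sparklyr::spark_connect(method = "databricks",
                              version = "3.0.1",
                              config = conf)
```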
Following these instructions: http://docs.h2o.ai/sparkling-water/3.0/latest-stable/doc/deployment/rsparkling_azure_dbc.html
which pick Spark 3.0.2 for Sparkling Water 3.32.1.1-1-3.0;
Spark 3.0.2 is not available as a cluster runtime, so I went with 3.0.1 in my approach, which produced:

Error in h2o_context(sc) : could not find function "h2o_context"

Dockerfile that works on my local machine


# get the base image (https://hub.docker.com/r/databricksruntime/standard; https://github.com/databricks/containers/blob/master/ubuntu/standard/Dockerfile)

FROM databricksruntime/standard

# not needed if using `FROM databricksruntime/r-base:latest` at top

ENV DEBIAN_FRONTEND noninteractive

# go into the repo directory

RUN . /etc/environment \
  # Install Linux dependencies here
  && apt-get update \
  && apt-get install libcurl4-openssl-dev libxml2-dev libssl-dev -y \
  # not needed if using `FROM databricksruntime/r-base:latest` at top
  && apt-get install r-base -y

# install specific R packages

RUN R -e 'install.packages(c("httr", "xml2"))'

# sparklyr and Spark

RUN R -e 'install.packages(c("sparklyr"))'

# h2o

# RSparkling 3.32.0.5-1-3.0 requires H2O of version 3.32.0.5.

RUN R -e 'install.packages(c("statmod", "RCurl"))'
RUN R -e 'install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zermelo/5/R")'

# rsparkling

# RSparkling 3.32.0.5-1-3.0 is built for 3.0.

RUN R -e 'install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.32.0.5-1-3.0/R")'

# connect to H2O cluster with Sparkling Water context

RUN R -e 'library(sparklyr); sparklyr::spark_install("3.0.1", hadoop_version = "3.2"); Sys.setenv(SPARK_HOME = "~/spark/spark-3.0.1-bin-hadoop3.2"); library(rsparkling); sc <- sparklyr::spark_connect(method = "databricks", version = "3.0.1"); sparklyr::spark_version(sc); h2oConf <- H2OConf(); hc <- H2OContext.getOrCreate(h2oConf)'

mcvgt66p1#

In my case, I needed to install a "Library" on the Databricks workspace, cluster, or job. I could either upload it or have Databricks fetch it from Maven coordinates.
In the Databricks workspace:
Click the Home icon
Click "Shared" > "Create" > "Library"
Click "Maven" (as the "Library Source")
Click the "Search Packages" link next to the "Coordinates" box
Click the dropdown box and select "Maven Central"
Enter ai.h2o.sparkling-water-package into the "Query" box
Choose the most recent "Artifact Id" and "Release" matching your rsparkling version (for me, ai.h2o:sparkling-water-package_2.12:3.32.0.5-1-3.0) and click "Select" under "Options"
Click "Create" to create the library
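If you would rather attach the package programmatically than click through the workspace UI, the same Maven coordinate can be supplied as a library specification, e.g. in the request body of the Databricks Libraries API's install endpoint or in a job definition (a sketch; the cluster_id value is a placeholder you must fill in):

```json
{
  "cluster_id": "<your-cluster-id>",
  "libraries": [
    {
      "maven": {
        "coordinates": "ai.h2o:sparkling-water-package_2.12:3.32.0.5-1-3.0"
      }
    }
  ]
}
```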
Thankfully, this required no change to my Databricks R notebook when run as a Databricks job:


# install specific R packages

install.packages(c("httr", "xml2"))

# sparklyr and Spark

install.packages(c("sparklyr"))

# h2o

# RSparkling 3.32.0.5-1-3.0 requires H2O of version 3.32.0.5.

install.packages(c("statmod", "RCurl"))
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zermelo/5/R")

# rsparkling

# RSparkling 3.32.0.5-1-3.0 is built for 3.0.

install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.0/3.32.0.5-1-3.0/R")

# connect to H2O cluster with Sparkling Water context

library(sparklyr)
sparklyr::spark_install("3.0.1", hadoop_version = "3.2")
Sys.setenv(SPARK_HOME = "~/spark/spark-3.0.1-bin-hadoop3.2")
sparklyr::spark_default_version()
library(rsparkling)

SparkR::sparkR.session()
sc <- sparklyr::spark_connect(method = "databricks", version = "3.0.1")
sparklyr::spark_version(sc)

# next command will not work without adding https://mvnrepository.com/artifact/ai.h2o/sparkling-water-package_2.12/3.32.0.5-1-3.0 file as "Library" to Databricks cluster

h2oConf <- H2OConf()
hc <- H2OContext.getOrCreate(h2oConf)
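As a quick sanity check that the library actually made the class visible (a hypothetical diagnostic, not part of the original notebook; it only runs against a live Databricks cluster), you can ask the JVM backend to resolve the class that was previously missing:

```r
library(sparklyr)
sc <- sparklyr::spark_connect(method = "databricks")

# With the sparkling-water-package Library attached to the cluster, this
# should return a reference to the Java Class object; without it, it throws
# the same ClassNotFoundException shown in the question.
sparklyr::invoke_static(sc, "java.lang.Class", "forName",
                        "ai.h2o.sparkling.H2OConf")
```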
