Sentence similarity with Spark NLP on Google Dataproc works only with a single sentence and fails when multiple sentences are provided

aelbi1ox posted on 2021-05-17 in Spark

I deployed the following Colab Python code (see the link below) to Dataproc on Google Cloud. It works only when the input list is an array containing a single item; when the input list contains two items, the PySpark job dies on the line "for r in result.collect()" in the get_similarity method below with the following error:

java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:745)
input_list=["no error"]                 <---- works
input_list=["this", "throws EOF error"] <---- does not work

Link to the Colab notebook for sentence similarity with Spark NLP: https://colab.research.google.com/github/johnsnowlabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/sentence_similarity.ipynb#scrollto=6e0y5wt4

import numpy as np
import pandas as pd

def get_similarity(input_list):
    # Build a single-column Spark DataFrame from the input sentences
    df = spark.createDataFrame(pd.DataFrame({'text': input_list}))
    result = light_pipeline.transform(df)
    embeddings = []
    # Collect the sentence embeddings back to the driver
    for r in result.collect():
        embeddings.append(r.sentence_embeddings[0].embeddings)
    embeddings_matrix = np.array(embeddings)
    # Pairwise dot products of the sentence embeddings
    return np.matmul(embeddings_matrix, embeddings_matrix.transpose())
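For context, a minimal usage sketch (assuming spark and light_pipeline are already set up as in the notebook); the sentences are illustrative only. A single-item list returns a 1x1 matrix, while a multi-item list like the one below is exactly the case that fails on Dataproc:

# Sketch only: illustrative input; the multi-item call is the failing case.
sentences = ["sign up for our mailing list", "subscribe to the newsletter"]
similarity_matrix = get_similarity(sentences)
# similarity_matrix is an n x n numpy array whose [i][j] entry is the dot
# product of the embeddings of sentence i and sentence j.
print(similarity_matrix.shape)  # expected (2, 2) when the job succeeds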

I have also tried setting "dfs.datanode.max.transfer.threads" to 8192 in the Hadoop cluster configuration, but still without success:

hadoop_config.set('dfs.datanode.max.transfer.threads', "8192")
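(The question does not show where hadoop_config comes from; in PySpark it is typically obtained from the active SparkContext, roughly as sketched below.)

# Sketch only: one common way to get a Hadoop Configuration object in PySpark;
# the original hadoop_config may have been created differently.
hadoop_config = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_config.set('dfs.datanode.max.transfer.threads', "8192")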

How can I make this code work when input_list has more than one item in the array?
