Pandas UDF (PySpark) - incorrect type error

Posted by lawou6xi on 2021-05-27 in Spark


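I'm trying to do entity extraction with spaCy and a pandas UDF (PySpark), but I get an error.
A regular UDF works without errors, but it is slow. What am I doing wrong?
The model is loaded on every call to avoid this load error: Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Working UDF: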

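import spacy
from pyspark.sql import functions as F, types as T

def __get_entities(x):
    # x is a single string; the model is reloaded on every call
    global nlp
    nlp = spacy.load("en_core_web_lg")
    ents = []

    doc = nlp(x)

    # Collect the labels of PERSON and ORG entities only
    for ent in doc.ents:
        if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
            ents.append(ent.label_)

    return ents

get_entities_udf = F.udf(__get_entities, T.ArrayType(T.StringType()))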

UDF that raises the error:

def __get_entities(x):

    global nlp
    nlp = spacy.load("en_core_web_lg")
    ents=[]

    doc = nlp(x)

    for ent in doc.ents:
        if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
            ents.append(ent.label_)

    return pd.Series(ents)

get_entities_udf = F.pandas_udf(lambda x: __get_entities(x), "array<string>", F.PandasUDFType.SCALAR)

Error message:

TypeError: Argument 'string' has incorrect type (expected str, got Series)

Sample Spark DataFrame:

df = spark.createDataFrame([
  ['John Doe'],
  ['Jane Doe'],
  ['Microsoft Corporation'],
  ['Apple Inc.'],
]).toDF("name",)

Creating the new column:

df_new = df.withColumn('entity',get_entities_udf('name'))

apache-spark pyspark user-defined-functions pandas spacy

Source: https://stackoverflow.com/questions/63681625/pandas-udf-pyspark-incorrect-type-error

1 Answer


dvtswwa31#

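You need to treat the input as a pd.Series rather than a single value.
I got it working with a small refactoring of the code. Note that x.apply is pandas-specific and applies a function to each element of a pd.Series.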
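
To see why the original call fails, here is a minimal pandas-only sketch that needs neither Spark nor spaCy. The function needs_str is a hypothetical stand-in for nlp(x), assumed only to reject anything that is not a single str; it is not part of spaCy.

```python
import pandas as pd

def needs_str(text):
    """Hypothetical stand-in for nlp(text): accepts one str, nothing else."""
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    return [text.upper()]

# A scalar pandas UDF receives a whole batch of rows as one pd.Series.
batch = pd.Series(["John Doe", "Apple Inc."])

# Passing the batch directly mirrors what the failing UDF did.
try:
    needs_str(batch)
    error = None
except TypeError as exc:
    error = str(exc)

# Series.apply calls the function once per element and returns a new Series.
result = batch.apply(needs_str)
print(error)
print(result.tolist())
```

The refactored UDF follows the same pattern: a per-string function, wrapped by a function that applies it to the Series.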

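from pyspark.sql.functions import pandas_udf, PandasUDFType

def entities(x):
    # x is a single string (one element of the batch)
    global nlp
    import spacy
    nlp = spacy.load("en_core_web_lg")
    ents = []

    doc = nlp(x)

    for ent in doc.ents:
        if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
            ents.append(ent.label_)
    return ents

def __get_entities(x):
    # x is a pd.Series; apply the per-string function to each element
    return x.apply(entities)

get_entities_udf = pandas_udf(lambda x: __get_entities(x), "array<string>", PandasUDFType.SCALAR)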

df_new = df.withColumn('entity',get_entities_udf('name'))

df_new.show()

+--------------------+--------+
|                name|  entity|
+--------------------+--------+
|            John Doe|[PERSON]|
|            Jane Doe|[PERSON]|
|Microsoft Corpora...|   [ORG]|
|          Apple Inc.|   [ORG]|
+--------------------+--------+
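
Since the question mentions speed, note that the code above still calls spacy.load once per row. The following is only a sketch of one possible variation, not something from the original answer: it loads the model lazily and processes each batch with nlp.pipe. The names get_nlp, entity_labels, and the nlp and wanted parameters are made up for illustration, and the sketch assumes the column contains no nulls. Whether the cached model survives across tasks depends on how Spark ships the function to the workers, which is not verified here.

```python
import pandas as pd

_NLP = None  # module-level cache for the loaded model (illustrative)

def get_nlp(model_name="en_core_web_lg"):
    """Load the spaCy model on first use and reuse it afterwards."""
    global _NLP
    if _NLP is None:
        import spacy  # imported lazily, as in the answer above
        _NLP = spacy.load(model_name)
    return _NLP

def entity_labels(texts, nlp=None, wanted=("PERSON", "ORG")):
    """Return a Series of label lists, one list per input string."""
    if nlp is None:
        nlp = get_nlp()
    # nlp.pipe processes an iterable of strings and yields one Doc per string
    docs = nlp.pipe(texts.tolist())
    labels = [[ent.label_ for ent in doc.ents if ent.label_ in wanted]
              for doc in docs]
    return pd.Series(labels, index=texts.index)
```

It could presumably be registered the same way as above, for example pandas_udf(lambda s: entity_labels(s), "array<string>", PandasUDFType.SCALAR). The optional nlp argument exists only so the function can be tried with a substitute pipeline when the model is not installed.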

Answered 2021-05-27
