UDF (PySpark) - TypeError

lawou6xi posted on 2021-05-27 in Spark

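I am trying to do entity extraction with spaCy and a pandas UDF (PySpark), but I get an error.
A regular UDF works without errors, but it is slow. What am I doing wrong?
The model is loaded on every call to avoid this load error: Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Working UDF: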

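import spacy
import pyspark.sql.functions as F
import pyspark.sql.types as T

def __get_entities(x):

    # Reloaded on every call to avoid the "Can't find model" error
    global nlp
    nlp = spacy.load("en_core_web_lg")
    ents = []

    doc = nlp(x)

    # Keep only the labels of PERSON and ORG entities
    for ent in doc.ents:
        if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
            ents.append(ent.label_)

    return ents

get_entities_udf = F.udf(__get_entities, T.ArrayType(T.StringType()))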
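
Since reloading the model on every call is the likely reason this version is slow, one possible alternative is to load it lazily and cache it once per Python worker process. The sketch below only illustrates that caching pattern with `functools.lru_cache`. `get_model` and `tokens` are hypothetical names, and the body of `get_model` is a stand-in for `spacy.load(name)` so that the snippet runs without spaCy. Whether this also avoids the "Can't find model" error is an assumption that depends on the model being installed on every executor.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=None)
def get_model(name):
    # Hypothetical stand-in for spacy.load(name); swap in the real call.
    global load_calls
    load_calls += 1
    return lambda text: text.split()


def tokens(text):
    # The model is loaded on the first call and served from the cache afterwards.
    nlp = get_model("en_core_web_lg")
    return nlp(text)


for name in ["John Doe", "Jane Doe", "Apple Inc."]:
    tokens(name)
```

After the loop, `load_calls` is 1 even though `tokens` ran three times, because the load runs once per process rather than once per row.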

UDF that raises the error:

import pandas as pd

def __get_entities(x):

    global nlp
    nlp = spacy.load("en_core_web_lg")
    ents = []

    doc = nlp(x)

    for ent in doc.ents:
        if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
            ents.append(ent.label_)

    return pd.Series(ents)

get_entities_udf = F.pandas_udf(lambda x: __get_entities(x), "array<string>", F.PandasUDFType.SCALAR)

Error message:

TypeError: Argument 'string' has incorrect type (expected str, got Series)

Sample Spark DataFrame:

df = spark.createDataFrame([
  ['John Doe'],
  ['Jane Doe'],
  ['Microsoft Corporation'],
  ['Apple Inc.'],
]).toDF("name",)

Creating the new column:

df_new = df.withColumn('entity',get_entities_udf('name'))

dvtswwa3 #1

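You need to treat the input as a pd.Series rather than a single value.
I got it working by refactoring the code a little. Note that x.apply is pandas-specific and applies a function to each element of a pd.Series.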

from pyspark.sql.functions import pandas_udf, PandasUDFType

def entities(x):
    # Handles a single string value
    global nlp
    import spacy
    nlp = spacy.load("en_core_web_lg")
    ents = []

    doc = nlp(x)

    for ent in doc.ents:
        if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
            ents.append(ent.label_)
    return ents

def __get_entities(x):
    # x is a pd.Series here, so apply entities() to each element
    return x.apply(entities)

get_entities_udf = pandas_udf(lambda x: __get_entities(x), "array<string>", PandasUDFType.SCALAR)

df_new = df.withColumn('entity',get_entities_udf('name'))

df_new.show()

+--------------------+--------+
|                name|  entity|
+--------------------+--------+
|            John Doe|[PERSON]|
|            Jane Doe|[PERSON]|
|Microsoft Corpora...|   [ORG]|
|          Apple Inc.|   [ORG]|
+--------------------+--------+
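
The difference between the two call styles can be reproduced with pandas alone, without Spark or spaCy. This is a minimal sketch under stated assumptions: `entities` here is a toy stand-in that, like the spaCy call, accepts only a single `str`, and its suffix rule is purely illustrative and not how spaCy assigns labels.

```python
import pandas as pd


def entities(text):
    # Stand-in for the spaCy call: it only accepts a single str.
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    # Toy rule, purely illustrative: company suffixes are treated as ORG.
    return ["ORG"] if text.endswith(("Inc.", "Corporation")) else ["PERSON"]


def get_entities(batch: pd.Series) -> pd.Series:
    # A scalar pandas UDF is called with a whole pd.Series per batch,
    # so the per-string function is applied element by element.
    return batch.apply(entities)


names = pd.Series(["John Doe", "Apple Inc."])

try:
    entities(names)  # what the failing UDF effectively did
except TypeError as exc:
    error = str(exc)

result = get_entities(names)
```

Passing the whole Series to the per-string function raises the `TypeError`, while `get_entities` returns a Series with one list of labels per input row, which is the shape a scalar pandas UDF declared as "array<string>" is expected to return.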
