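I'm trying to do entity extraction with spaCy and a pandas UDF (PySpark), but I get an error.
A regular UDF works without errors, but it is slow. What am I doing wrong?
The model is loaded on every call to avoid the load error: Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Working UDF: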
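    import spacy
    from pyspark.sql import functions as F, types as T

    def __get_entities(x):
        # The model is reloaded on every call, which is what makes this slow.
        global nlp
        nlp = spacy.load("en_core_web_lg")
        ents = []
        doc = nlp(x)
        for ent in doc.ents:
            if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
                ents.append(ent.label_)
        return ents

    get_entities_udf = F.udf(__get_entities, T.ArrayType(T.StringType()))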
UDF that raises the error:
    import pandas as pd

    def __get_entities(x):
        global nlp
        nlp = spacy.load("en_core_web_lg")
        ents = []
        # With a scalar pandas UDF, x is a pd.Series here, not a single string.
        doc = nlp(x)
        for ent in doc.ents:
            if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
                ents.append(ent.label_)
        return pd.Series(ents)

    get_entities_udf = F.pandas_udf(lambda x: __get_entities(x), "array<string>", F.PandasUDFType.SCALAR)
Error message:
    TypeError: Argument 'string' has incorrect type (expected str, got Series)
Example Spark DataFrame:
    df = spark.createDataFrame([
        ['John Doe'],
        ['Jane Doe'],
        ['Microsoft Corporation'],
        ['Apple Inc.'],
    ]).toDF("name",)
Creating the new column:
    df_new = df.withColumn('entity', get_entities_udf('name'))
1 Answer
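You need to treat the input as a pd.Series rather than a single value. I got it working with a small refactor of the code. Note that x.apply is pandas-specific and applies a function to each element of a pd.Series.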
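The refactored code is not reproduced in this answer, so here is a minimal sketch of what such a refactor might look like, not the answerer's exact code. The names entities_for_series, make_entities_udf and _get_nlp, and the optional nlp parameter, are hypothetical and introduced only for illustration. The sketch assumes en_core_web_lg is installed on every executor, and it caches the model per Python worker rather than reloading it for every row, which may also help with the slowness mentioned in the question.

```python
import pandas as pd

WANTED_LABELS = {"PERSON", "ORG"}
_nlp = None  # cached once per Python worker process


def _get_nlp():
    """Load the spaCy model lazily, once per worker (hypothetical helper)."""
    global _nlp
    if _nlp is None:
        import spacy  # imported here so it is resolved on the executor
        _nlp = spacy.load("en_core_web_lg")
    return _nlp


def entities_for_series(texts: pd.Series, nlp=None) -> pd.Series:
    """Map a Series of strings to a Series of label lists, one list per row."""
    if nlp is None:
        nlp = _get_nlp()

    def labels(text):
        # Nulls reach the UDF as None/NaN; return an empty list for them.
        if not isinstance(text, str):
            return []
        return [e.label_ for e in nlp(text).ents if e.label_ in WANTED_LABELS]

    # Series.apply calls labels() per element, so nlp() always receives a str.
    return texts.apply(labels)


def make_entities_udf():
    """Wrap the function as a scalar pandas UDF (requires pyspark and pyarrow)."""
    from pyspark.sql import functions as F

    def _udf(texts):
        return entities_for_series(texts)

    return F.pandas_udf(_udf, "array<string>", F.PandasUDFType.SCALAR)
```

With this in place, the new column would be created as df_new = df.withColumn('entity', make_entities_udf()('name')). Keeping the pandas logic in a plain function that accepts the model as an argument makes it possible to check it locally with a stand-in model, without a Spark session.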