How to apply a custom function to a PySpark DataFrame column

Posted by kwvwclae on 2024-01-06 in Spark


@pandas_udf(StringType())
def convert_num(y):
    try:
        if y.endswith('K')==True:
            y = list(y)
            y.remove(y[''.join(y).find('K')])
            if ''.join(y).startswith('€')==True:
                y.remove(y[''.join(y).find('€')])
            else:
                pass
            try :
                return str(int(''.join(y))*1000)
            except:
                return y
        elif y.endswith('M')==True:
            y = list(y)
            y.remove(y[''.join(y).find('M')])
            if ''.join(y).startswith('€')==True:
                y = list(y)
                y.remove(y[''.join(y).find('€')])
            else:
                pass
            try :
                return str(float(''.join(y))*1000000)
            except:
                return y
    except:
        return y

I turned the function above into a pandas UDF.
My Spark DataFrame has a column named Value, and I want to apply this function to convert it.
I am calling it like this:

from pyspark.sql.functions import *
df.select(convert_num(df.Value).alias('converted')).take(5)

But it returns the same value instead of converting it. You can see the result below.

Row(Player_name='T. Almada', Images='https://cdn.sofifa.net/players/245/371/24_60.png', Age=22, National_team='Argentina', Positions="['CAM', 'CM', 'CF']", Overall=79, Potential_overall=87, Current_club='Atlanta United', Current_contract='2022 ~ 2025', **Value='€39.5M'**, Wage='€10K', Total_stats=2050, **converted_amount='€39.5M'**)

What am I doing wrong?

pyspark

Source: https://stackoverflow.com/questions/77718771/how-to-apply-custom-function-to-a-pyspark-dataframe-column

2 Answers

gwbalxhn1#

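The problem is that the @pandas_udf decorator means convert_num(y) receives y as a pandas Series, but your code treats y as a single string.
From a debugging perspective, the nested try/except blocks that all return y in their except branches also make it hard to tell where an error comes from: if you get the same column value back, an exception was raised somewhere, but you cannot tell which except block caught it.
Note that if you remove the outer try/except block, running df.select(convert_num(df.Value)).take(5) raises: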

AttributeError: 'Series' object has no attribute 'endswith'
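
As a small illustration of that error, the following sketch uses plain pandas, outside Spark, to show why a string method fails on a Series and two ways to work element-wise instead. The sample values are taken from the example rows in this thread; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series(["€39.5M", "€390K"])

# A Series does not expose str methods directly, which mirrors the error above.
try:
    s.endswith("M")
    failed = False
except AttributeError:
    failed = True

# Element-wise string methods are available through the .str accessor.
ends_with_m = s.str.endswith("M")

# .apply calls an ordinary Python function once per element.
lengths = s.apply(len)
```

Inside a pandas UDF the argument behaves like `s` here, so either the `.str` accessor or `.apply` is needed to reach the individual strings.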

You can fix this by restructuring convert_num so that it treats the input as a Series and returns a Series, while still applying the same string logic to each element:

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def convert_num(s):
    # s is a pandas Series, so convert each string element separately.
    def convert_string(y):
        # Strip the one-letter suffix and an optional leading euro sign.
        digits = y[:-1]
        if digits.startswith('€'):
            digits = digits[1:]
        try:
            if y.endswith('K'):
                return str(int(digits) * 1000)
            elif y.endswith('M'):
                return str(float(digits) * 1000000)
        except ValueError:
            pass
        # No K/M suffix, or the number could not be parsed: keep the original string.
        return y
    return s.apply(convert_string)

Example PySpark DataFrame df:

+-----------+------+
|Player_name| Value|
+-----------+------+
|    PlayerA|€39.5M|
|    PlayerB| €390K|
+-----------+------+

Output after converting the Value column:

df.select(convert_num(df.Value).alias('converted'))
+----------+
| converted|
+----------+
|39500000.0|
|    390000|
+----------+
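
Regarding the earlier point about nested try/except blocks, one way to make failures easier to trace is to validate the input with a regular expression instead of catching exceptions. The following is only a sketch under assumptions not stated in the thread: values look like an optional €, a non-negative decimal number, and a K or M suffix. The names `convert_value` and `make_convert_num` are hypothetical. Unlike the output above, it prints whole results without a trailing `.0`.

```python
import re

# Assumed format: optional euro sign, decimal number, K or M suffix.
_VALUE_RE = re.compile(r"€?(\d+(?:\.\d+)?)([KM])")
_MULTIPLIER = {"K": 1_000, "M": 1_000_000}


def convert_value(y):
    """Convert one string such as '€39.5M' to '39500000'.

    Anything that does not match the assumed format (including None)
    is returned unchanged.
    """
    if not isinstance(y, str):
        return y
    match = _VALUE_RE.fullmatch(y)
    if match is None:
        return y
    result = float(match.group(1)) * _MULTIPLIER[match.group(2)]
    # Drop the fractional part only when the result is a whole number.
    return str(int(result)) if result.is_integer() else str(result)


def make_convert_num():
    """Wrap convert_value in a pandas UDF (hypothetical helper).

    The imports are deferred so convert_value can be tried without Spark.
    """
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    @pandas_udf(StringType())
    def convert_num(s):
        # s is a pandas Series; apply the scalar helper to each element.
        return s.apply(convert_value)

    return convert_num
```

With Spark available, `df.select(make_convert_num()(df.Value).alias('converted'))` would be used in the same way as the answer's `convert_num`.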


bsxbgnwa2#

Since someone asked in the comments how to implement a custom transformation in Spark without using pandas, here is a simple example:

>>> from pyspark.sql.functions import col, udf
>>> df = session.createDataFrame((("Jhon Doe", 1995), ("Elvis Kribs", 1998)))
>>> df = df.toDF("Name", "YOB")
>>> df.printSchema()
root
 |-- Name: string (nullable = true)
 |-- YOB: long (nullable = true)
>>> @udf
... def custom_transformation(val): return val + " UDF Value!"
...
>>> df.withColumn("TransformedValue", custom_transformation("Name")).show(truncate=False)
+-----------+----+----------------------+
|Name       |YOB |TransformedValue      |
+-----------+----+----------------------+
|Jhon Doe   |1995|Jhon Doe UDF Value!   |
|Elvis Kribs|1998|Elvis Kribs UDF Value!|
+-----------+----+----------------------+

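Note the @udf decorator: the decorated function takes a single value and returns the transformed value, which becomes part of the DataFrame.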
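
One detail worth keeping in mind: Spark passes a SQL NULL to a Python UDF as `None`, so `val + " UDF Value!"` would raise a `TypeError` for a row whose Name is null. A minimal null-safe sketch follows. The names `add_suffix` and `make_add_suffix_udf` are hypothetical, and the return type is spelled out even though `udf` defaults to a string return type.

```python
def add_suffix(val, suffix=" UDF Value!"):
    """Append suffix to val, passing None (SQL NULL) through unchanged."""
    if val is None:
        return None
    return val + suffix


def make_add_suffix_udf():
    """Wrap add_suffix as a Spark UDF (hypothetical helper).

    The imports are deferred so add_suffix can be tested without Spark.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(add_suffix, StringType())
```

With Spark available, this would be applied as `df.withColumn("TransformedValue", make_add_suffix_udf()("Name"))`.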
Note: the Spark optimizer generally cannot optimize the code inside a UDF.
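
Because of that, it can be worth expressing a conversion like the one in the question with built-in column functions instead of a UDF. The sketch below is an illustration under assumptions, not part of either answer: it assumes values shaped like `€39.5M` or `€390K`, the helper name `add_converted_column` and the pattern are hypothetical, the result is a double column, and rows that do not match the pattern become NULL.

```python
import re

# Assumed format: optional euro sign, decimal number, K or M suffix.
VALUE_PATTERN = r"^€?(\d+(?:\.\d+)?)([KM])$"


def add_converted_column(df, source="Value", target="converted"):
    """Add a double column computed with built-in Spark functions only.

    The import is deferred so the pattern can be checked without Spark.
    """
    from pyspark.sql import functions as F

    digits = F.regexp_extract(F.col(source), VALUE_PATTERN, 1)
    suffix = F.regexp_extract(F.col(source), VALUE_PATTERN, 2)
    # regexp_extract yields '' when nothing matches, so only cast real digits.
    number = F.when(digits != "", digits.cast("double"))
    multiplier = F.when(suffix == "K", 1_000).when(suffix == "M", 1_000_000)
    # A when() without otherwise() is NULL for unmatched rows, and so is the product.
    return df.withColumn(target, number * multiplier)


# The same pattern can be sanity-checked locally with Python's re module.
sample = re.match(VALUE_PATTERN, "€39.5M")
```

With Spark available, `add_converted_column(df).select("Value", "converted")` would show the numeric column next to the original strings.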
