Add a column with random numbers within a range in PySpark

xxhby3vn · posted 2023-05-16 in Spark

I want to generate a column of random numbers, like this:

df = df.withColumn("random_col", random.randint(100000, 1000000))

The above gives me an error:
AssertionError: col should be Column
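
The error happens because withColumn expects a Column expression, while random.randint returns a plain Python int. A minimal sketch of one workaround, assuming the goal is only to satisfy the type check (note that it writes the same value to every row, because randint runs once on the driver):

from pyspark.sql import functions as F
import random

# lit() wraps the Python int in a Column, which satisfies withColumn,
# but the value is drawn once on the driver, so every row gets the same number
df = df.withColumn("random_col", F.lit(random.randint(100000, 1000000)))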

5anewei6 1#

First, I would make sure you've typed everything correctly...
Try this import: from pyspark.sql.functions import rand
Then try something like this line of code:

df1 = df.withColumn("random_col", (rand() * (1000000 - 100000) + 100000).cast("int"))

You also could check out this resource. It looks like it may be helpful for what you are doing.
Hope this helps!
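
A minimal end-to-end sketch of this approach (the three-row dataframe here is an illustrative placeholder, not from the thread): rand() is uniform on [0.0, 1.0), so scaling by the width of the range and shifting by the lower bound maps it onto [100000, 1000000):

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.getOrCreate()

# Illustrative dataframe; any existing df works the same way
df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

# rand() is uniform on [0.0, 1.0); scale and shift it into [100000, 1000000),
# then cast to int for whole numbers (pass a seed for reproducible draws)
df1 = df.withColumn("random_col", (rand(seed=42) * (1000000 - 100000) + 100000).cast("int"))
df1.show()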

8nuwlpux 2#

Ran into this problem, couldn't find anything specific, and eventually figured it out; hope this helps anyone who's stuck:

# To add a column with values from a range of random values, first create the column in a new Spark dataframe.

# import libraries
import random
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Define the new df schema
# ("id" is nullable here because it only gets filled in afterwards)
schema = StructType(
    [
        StructField("id", StringType(), nullable=True),
        StructField("random_value", IntegerType(), nullable=False),
    ]
)

# create an empty list and fill it with rows of random values
data = []
for i in range(0, 200):  # adjust the row count as you wish
    data.append(
        {
            "random_value": random.randint(500, 10000)  # adjust the range as you wish
        }
    )

# Create the Spark dataframe (assumes an active SparkSession named `spark`)
df = spark.createDataFrame(data, schema)

# Add id ordering
df1 = df.withColumn("id", F.monotonically_increasing_id())
• Then you need to add a matching id column to the other dataframe, join on the corresponding id columns, and append the "random_value" column, as sketched below. For more on creating id columns on pre-existing dataframes and joining, see this great example.
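
A hedged sketch of that join step; existing_df and its contents below are illustrative stand-ins for the pre-existing dataframe, not part of the original answer:

# Illustrative pre-existing dataframe; replace with your own
existing_df = spark.createDataFrame([("a",), ("b",), ("c",)], ["name"])

# Give it an id column built the same way, so the two dataframes can be joined
existing_df = existing_df.withColumn("id", F.monotonically_increasing_id())

# Join on id and append the random_value column
result = existing_df.join(df1.select("id", "random_value"), on="id", how="inner")
result.show()

Note that monotonically_increasing_id() only guarantees unique, increasing ids, not identical ids across dataframes; on multi-partition data, a row_number() over a Window is the more reliable way to get matching consecutive ids on both sides.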
