pyspark 如何将 Dataframe 中的数组分配给变量

p1tboqfb 于 2022-11-28 发布在 Spark

关注(0)|答案(2)|浏览(191)

我需要在dataframe中提取数组字段，并将其分配给一个变量，以便进一步处理。我正在使用collect（）函数，但它不能正常工作。

- 输入 Dataframe ：**

- 如何获取和分配如下变量：**

英语=[A，B，C]
西班牙语=[]

pyspark

来源：https://stackoverflow.com/questions/74538022/how-to-assign-array-in-a-dataframe-to-a-variable

2条答案

按热度按时间

7gcisfzg1#

我提出的最简单的解决方案是使用collect提取数据，并将其显式地分配给预定义的变量，如下所示：

from pyspark.sql.types import StringType, ArrayType, StructType, StructField

schema = StructType([
    StructField("Department", ArrayType(StringType()), True),
    StructField("Language", StringType(), True)
  ])

df = spark.createDataFrame([(["A", "B", "C"], "English"), ([], "Spanish")], schema)

English = df.collect()[0]["Department"]
Spanish = df.collect()[1]["Department"]
print(f"English: {English}, Spanish: {Spanish}")

# English: ['A', 'B', 'C'], Spanish: []

赞(0）回复(0）举报 2022-11-28

aiqt4smr2#

编辑：我完全脑放屁，错过了这是一个PySpark的问题。
下面的代码在convert your PySpark Dataframe to pandas的情况下可能还是有帮助的，这对你的情况来说可能并不像听起来那么荒谬。如果表太大而不能放入一个PandasDataFrame，那么它就太大而不能在一个变量中存储所有数组。你可以先使用.filter()和.select()来缩小它。
旧答案：
最好的方法实际上取决于 Dataframe 的复杂性。

# To recreate your dataframe

df = pd.DataFrame({
    'Department': [['A','B', 'C']],
    'Language': 'English'
})

df.loc[df.Language == 'English']
# Will return all rows where Language is English.  If you only want Department then:

df.loc[df.Language == 'English'].Department
# This will return a list containing your list. If you are always expecting a single match add [0] as in:

df.loc[df.Language == 'English'].Department[0]
#Which will return only your list
# The alternate method below isn't great but might be preferable in some circumstances, also only if you expect a single match from any query.

department_lookup = df[['Language', 'Department']].set_index('Language').to_dict()['Department']

department_lookup['English']
#returns your list

# This will make a dictionary where 'Language' is the key and 'Department' is the value. It is more work to set up and only works for a two-column relationship but you might prefer working with dictionaries depending on the use-case

如果你有数据类型的问题，它可能会处理如何加载DataFrame，而不是你如何访问它。

# If I saved and reload the df as so: 
df.to_csv("the_df.csv")
df = pd.read_csv("the_df.csv")

# Then we would see that the dtype has become a string, as in "[A, B, C]" rather than ["A", "B", "C"]

# We can typically correct this by giving pandas a method for converting the incoming string to list.  This is done with the 'converters' argument, which takes a dictionary where trhe keys are column names and the values are functions, as such:

df = pd.read_csv("the_df.csv", converters = {"Department": lambda x: x.strip("[]").split(", "))

# df['Department'] should have a dtype of list

值得注意的是，lambda函数只有在python将一个python列表转换为一个字符串以存储 Dataframe 时才是可靠的。

赞(0）回复(0）举报 2022-11-28

我来回答

pyspark 如何将 Dataframe 中的数组分配给变量

2条答案

相关问题

热门标签

最新问答