Expanding the range of a given list from a PySpark column full of such lists

Posted by mmvthczy on 2022-11-01 in Spark

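I need to expand a range from a given start number to an end number. For example, given [1,4], I need the output [1,2,3,4]. I have been trying to use the code block below as the logic, but I cannot make it dynamic: when I pass many lists through it, I get an error.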

    # Create an empty list
    My_list = []
    # Value to begin and end with
    start = 10
    print(start)
    end = 20
    print(end)
    # Check if start value is smaller than end value
    if start < end:
        # unpack the result
        My_list.extend(range(start, end))
        # Append the last value
        # My_list.append(end)
    # Print the list
    print(My_list)

Output: 10 20 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
This is what I need! But...
This is what I am trying to do:

    import pandas as pd
    My_list = []
    isarray = []
    pd_df = draft_report.toPandas()
    for index, row in pd_df.iterrows():
        My_list = row[14] #14 is the place of docPage in the df
        start = My_list[1] #reads the 1st element eg: 1 in [1,16]
        print(start)
        end = My_list[3] #reads the last element eg: 16 in [1,16]
        print(end)
        if start < end:
            isarray.extend(range(int(start, end)))
            isarray.append(int(end))
        print(isarray)

Output:

    An error was encountered:
    'str' object cannot be interpreted as an integer
    Traceback (most recent call last):
    TypeError: 'str' object cannot be interpreted as an integer

The data looks like this:

    docPages
    [1,16]
    [17,22]
    [23,24]
    [25,27]

Answer #1 (vm0i2vca)

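Since the source column is StringType(), you first need to convert the string to an array, which can be done with the from_json function. Then use the elements of the resulting array in the sequence function.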

    from pyspark.sql import functions as func

    data_sdf. \
        withColumn('arr',
                   func.sort_array(func.from_json('arr_as_str', 'array<integer>'))
                   ). \
        withColumn('arr_range', func.expr('sequence(arr[0], arr[1], 1)')). \
        show(truncate=False)
    # +----------+--------+-------------------------------------------------------+
    # |arr_as_str|arr     |arr_range                                              |
    # +----------+--------+-------------------------------------------------------+
    # |[1,16]    |[1, 16] |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]|
    # |[17,22]   |[17, 22]|[17, 18, 19, 20, 21, 22]                               |
    # |[23,24]   |[23, 24]|[23, 24]                                               |
    # |[25,27]   |[25, 27]|[25, 26, 27]                                           |
    # +----------+--------+-------------------------------------------------------+
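
For anyone who stays on the pandas side, as the loop in the question does, the same two steps (parse the string, then build an inclusive range) can be sketched in plain Python. This is only an illustration: `expand_range` is a hypothetical helper name, and it assumes each value is either a string like "[1,16]" or a list with exactly two numbers. It also shows where the question's traceback most likely comes from: `int(start, end)` passes `end` as the numeric base, and a string base raises the TypeError.

```python
import json


def expand_range(doc_pages):
    """Expand a two-element page range such as "[1,16]" into [1, 2, ..., 16]."""
    # Hypothetical helper: accepts the raw string or an already parsed list.
    pages = json.loads(doc_pages) if isinstance(doc_pages, str) else list(doc_pages)
    # Sorting mirrors sort_array in the Spark version, so start <= end.
    start, end = sorted(int(p) for p in pages)
    # range() excludes its stop value, so add 1 to keep the last page.
    return list(range(start, end + 1))


# Sample values copied from the question's docPages column.
rows = ["[1,16]", "[17,22]", "[23,24]", "[25,27]"]
expanded = [expand_range(r) for r in rows]

# Reproduce the question's error in isolation: the second argument of int()
# is the base, and a string is not accepted there.
try:
    int("1", "16")
except TypeError as exc:
    error_message = str(exc)
```

Parsing the whole string avoids indexing single characters: on the string "[1,16]", `My_list[3]` is the character "1", not the number 16.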

If the source column is an ArrayType() field, you can use the sequence function directly to create the range.
See the example below.

    data_sdf. \
        withColumn('doc_range', func.expr('sequence(doc_pages[0], doc_pages[1], 1)')). \
        show(truncate=False)
    # +---------+-------------------------------------------------------+
    # |doc_pages|doc_range                                              |
    # +---------+-------------------------------------------------------+
    # |[1, 16]  |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]|
    # |[17, 22] |[17, 18, 19, 20, 21, 22]                               |
    # |[23, 24] |[23, 24]                                               |
    # |[25, 27] |[25, 26, 27]                                           |
    # +---------+-------------------------------------------------------+