I used withColumn with a UDF to derive a new column, then selected two columns and assigned the result to a new DataFrame. But when I run count() on this new DataFrame, it gives me TypeError: 'NoneType' object is not subscriptable, while show() works fine. I want to know the length of the new DataFrame. Here is my code:
# Find all entities with names that are palindromes
# (name reads the same way forward and reverse, e.g. madam):
# print the count and show() the resulting Spark DataFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def is_palindrome(entity_name):
    return entity_name == entity_name[::-1]

spark_udf = udf(is_palindrome, BooleanType())
palindrome_df = cb_sdf.withColumn('is_palindrome', spark_udf('name'))
palindrome_df = palindrome_df.where(palindrome_df['is_palindrome']).select('name', 'is_palindrome')
print(palindrome_df.show())
print(palindrome_df.count())
Here is the output and the error message I get:
+------+-------------+
| name|is_palindrome|
+------+-------------+
| KAYAK| true|
| ooVoo| true|
| 63336| true|
| TipiT| true|
| beweb| true|
| CSC| true|
| CBC| true|
| OQO| true|
| SAS| true|
| e4e| true|
| PHP| true|
| ivi| true|
| ADDA| true|
|izeezi| true|
| siXis| true|
| STATS| true|
| 8x8| true|
| IXI| true|
| GLG| true|
| 2e2| true|
+------+-------------+
only showing top 20 rows
None
---------------------------------------------------------------------------
PythonException Traceback (most recent call last)
<ipython-input-24-7fd424328e85> in <module>()
10 palindrome_df = palindrome_df.where(palindrome_df['is_palindrome']).select('name', 'is_palindrome')
11 print(palindrome_df.show())
---> 12 print(palindrome_df.count())
2 frames
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in deco(*a,**kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
process()
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 596, in process
serializer.dump_stream(out_iter, outfile)
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 132, in dump_stream
for obj in iterator:
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 200, in _batched
for item in iterator:
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 85, in <lambda>
return lambda *a: f(*a)
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
return f(*args,**kwargs)
File "<ipython-input-24-7fd424328e85>", line 7, in is_palindrome
TypeError: 'NoneType' object is not subscriptable
Thanks in advance!
1 Answer
There are probably null values somewhere in the DataFrame, just not in the first 20 rows that show() displays. That is why the error appears when count() evaluates the whole DataFrame but not when show() only materializes 20 rows of it.

To keep null values from crashing the UDF, change it to:
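The answer's code block did not survive extraction here; a minimal null-safe version, reconstructed from the explanation above (treating a null name as "not a palindrome" is an assumption; returning None instead would also be valid), would be:

```python
def is_palindrome(entity_name):
    # Guard against null names: slicing None with [::-1] is what
    # raised the original TypeError inside the Spark worker.
    if entity_name is None:
        return False
    return entity_name == entity_name[::-1]
```

Re-register it exactly as before with `udf(is_palindrome, BooleanType())`; count() will then scan the full DataFrame without hitting the TypeError.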