PySpark logging: printing information at the wrong log level

Posted by yuvru6vn on 2021-05-27 in Spark


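Thanks for your time!
I want to create and print clear summaries of my (large) data to my output while debugging my code, but stop creating and printing those summaries once I'm done, to speed things up. I was advised to use logging, which I have implemented. It works as expected when printing text strings as messages to the output. However, when printing summaries of DataFrames, it seems to ignore the log level: it always creates them and always prints them.
Is logging the right approach, or is there a better way? I could comment out lines of code or use if statements, etc., but it is a large codebase and I know I will need the same checks for further elements added in the future. This seems like exactly what logging should be for.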

from pyspark.sql.functions import col,count
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# assumes an existing SparkSession named `spark`
df = spark.createDataFrame([(1,2),(3,4)],["COLA","COLB"])
print("1")
logger.setLevel(logging.DEBUG)
logger.debug("1 - DEBUG - Print the message and show the table")
logger.debug(df.show())
print("2")
logger.setLevel(logging.INFO)
logger.debug("2 - INFO - Don't print the message or show the table")
logger.debug(df.show())
print("3")
logger.setLevel(logging.INFO)
logger.debug("3 - INFO - Don't print the message or show the collected data")
logger.debug(df.collect())
print("4")
logger.setLevel(logging.DEBUG)
logger.debug("4 - DEBUG - Print the message and the collected data")
logger.debug(df.collect())

Output:

1
DEBUG:__main__:1 - DEBUG - Print the message and show the table
+----+----+
|COLA|COLB|
+----+----+
|   1|   2|
|   3|   4|
+----+----+
DEBUG:__main__:None
2
+----+----+
|COLA|COLB|
+----+----+
|   1|   2|
|   3|   4|
+----+----+
3
4
DEBUG:__main__:4 - DEBUG - Print the message and the collected data
DEBUG:__main__:[Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]

apache-spark pyspark Logging qubole

Source: https://stackoverflow.com/questions/61782919/pyspark-logging-printing-information-at-the-wrong-log-level

1 answer


zour9fqk1#

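If we call df.show() or df.collect(), the action is executed even when it sits inside logger.debug, because Python evaluates the argument before logger.debug is called and before the log level is checked. df.show() also prints the table itself and returns None, which is why the first case logs DEBUG:__main__:None.
If we set the log level to DEBUG, we can also see INFO-level logs.
If we set the log level to INFO, we cannot see DEBUG-level logs.
You can work around the unwanted printing by storing the result of collect()/take(n) in a variable and then using that variable in the log call.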
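To illustrate why the table appears regardless of the level, here is a minimal sketch in plain Python, with no Spark required. `fake_show` is a hypothetical stand-in for `df.show()` (it prints as a side effect and returns `None`), and the logger name `eager_demo` is made up for the example.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("eager_demo")
logger.setLevel(logging.INFO)

calls = []  # records every time the stand-in is actually executed

def fake_show():
    """Hypothetical stand-in for df.show(): prints a table, returns None."""
    calls.append("show")
    print("|COLA|COLB|")
    return None

# The argument is evaluated first, so fake_show() runs and prints
# even though the DEBUG record itself is then discarded.
logger.debug(fake_show())

# Levels act as a threshold: at INFO, DEBUG records are dropped
# while INFO records (and above) pass.
debug_enabled = logger.isEnabledFor(logging.DEBUG)
info_enabled = logger.isEnabledFor(logging.INFO)
```

The log level only decides whether the record is emitted. It has no influence on whether the expression passed as the argument is computed.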

from pyspark.sql.functions import col,count
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# assumes an existing SparkSession named `spark`
df = spark.createDataFrame([(1,2),(3,4)],["COLA","COLB"])
# store the result in a variable; don't use collect() on a huge dataset, use take(n) instead
res = df.collect()
# get at most 10 records from df (this overwrites the collect() result above)
res = df.take(10)
print("1")
# 1
logger.setLevel(logging.DEBUG)
logger.debug("1 - DEBUG - Print the message and show the table")
# DEBUG:__main__:1 - DEBUG - Print the message and show the table
logger.debug(res)
# DEBUG:__main__:[Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]
print("2")
# 2
logger.setLevel(logging.INFO)
logger.debug("2 - INFO - Don't print the message or show the table")
logger.debug(res)  # this won't print because the log level is INFO
logger.info("result: " + str(res))  # this will be printed
# INFO:__main__:result: [Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]

Use .take instead of .collect().
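Storing the result in a variable still runs the Spark action at every log level. If the goal is to skip the work entirely outside of debugging, one option is to guard the call with `Logger.isEnabledFor` from the standard `logging` module. The sketch below is plain Python: `summarize` is a hypothetical stand-in for an expensive call such as `df.take(10)`, and `debug_summary` is an illustrative helper name, not part of PySpark.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("guard_demo")

calls = []  # records every time the expensive stand-in runs

def summarize():
    """Hypothetical stand-in for an expensive action such as df.take(10)."""
    calls.append("summarize")
    return [("COLA", 1), ("COLB", 2)]

def debug_summary(log, make_summary):
    """Build and log the summary only when DEBUG is enabled for this logger."""
    if log.isEnabledFor(logging.DEBUG):
        log.debug("summary: %s", make_summary())

logger.setLevel(logging.INFO)
debug_summary(logger, summarize)   # skipped: summarize() is never called
calls_at_info = len(calls)

logger.setLevel(logging.DEBUG)
debug_summary(logger, summarize)   # runs summarize() and logs the result
calls_at_debug = len(calls)
```

With a real DataFrame this could be called as `debug_summary(logger, lambda: df.take(10))`, so the action only runs when the level is DEBUG. Passing a callable rather than a value is what defers the work.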


Posted 2021-05-27
