HADOOP_CONF_DIR seems to overrule SPARK_CONF_DIR for the log4j configuration

bvjveswy · posted 2023-03-23 in Hadoop
Follow (0) | Answers (1) | Views (166)

Hi, we are running a Spark driver in yarn-client mode, Spark version = Spark 3.2.1.
We have set the following environment variables:

  • HADOOP_CONF_DIR = points to the folder containing all Hadoop configuration files (hdfs-site.xml, hive-site.xml, etc.). It also contains a log4j.properties file.
  • SPARK_CONF_DIR = points to the folder containing the spark-defaults file and the log4j2.properties file.
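
For reference, a minimal sanity check (not part of the original setup, plain Python) that the driver process actually sees both directories and finds the expected logging config in each:

import os

# Hypothetical check: both variables are assumed to be exported for the driver process.
for var, expected in [("HADOOP_CONF_DIR", "log4j.properties"),
                      ("SPARK_CONF_DIR", "log4j2.properties")]:
    conf_dir = os.environ.get(var)
    print(var, "=", conf_dir)
    if conf_dir:
        print("  contains", expected, ":",
              os.path.isfile(os.path.join(conf_dir, expected)))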

Here is the content of the log4j.properties file in the folder referenced by HADOOP_CONF_DIR:

log4j.rootLogger=${hadoop.root.logger}
hadoop.root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n

Here is the content of the log4j2.properties file in the folder referenced by SPARK_CONF_DIR:

# Log files location
property.basePath = ${env:LOG_PATH}
# Set everything to be logged to the console
appender.rolling.type = RollingFile
appender.rolling.name = fileLogger
appender.rolling.fileName= ${basePath}/vdp-ingestion.log
appender.rolling.filePattern= ${basePath}/vdp-ingestion_%d{yyyyMMdd}.log.gz
# log in json-format -> based on LogstashJsonEventLayout
appender.rolling.layout.type = JsonTemplateLayout
appender.rolling.layout.eventTemplateUri = classpath:LogstashJsonEventLayoutV1.json
# overrule message -> by default treated as a string, however we want an object so we can use the native JSON format
# and use the underlying objects in kibana log filters
appender.rolling.layout.eventTemplateAdditionalField[0].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[0].key = message
appender.rolling.layout.eventTemplateAdditionalField[0].value = {"$resolver": "message", "fallbackKey": "message"}
appender.rolling.layout.eventTemplateAdditionalField[0].format = JSON
appender.rolling.layout.eventTemplateAdditionalField[1].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[1].key = pid
appender.rolling.layout.eventTemplateAdditionalField[1].value = {"$resolver": "pattern", "pattern": "%pid"}
appender.rolling.layout.eventTemplateAdditionalField[1].format = JSON
appender.rolling.layout.eventTemplateAdditionalField[2].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[2].key = tid
appender.rolling.layout.eventTemplateAdditionalField[2].value = {"$resolver": "pattern", "pattern": "%tid"}
appender.rolling.layout.eventTemplateAdditionalField[2].format = JSON

appender.rolling.policies.type = Policies
# RollingFileAppender rotation policy
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 10MB
appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.rolling.policies.time.interval = 1
appender.rolling.policies.time.modulate = true
appender.rolling.strategy.type = DefaultRolloverStrategy
appender.rolling.strategy.delete.type = Delete
appender.rolling.strategy.delete.basePath = ${basePath}
appender.rolling.strategy.delete.maxDepth = 10
appender.rolling.strategy.delete.ifLastModified.type = IfLastModified

# Delete all files older than 30 days
appender.rolling.strategy.delete.ifLastModified.age = 30d

rootLogger.level = INFO
rootLogger.appenderRef.rolling.ref = fileLogger

logger.spark.name = org.apache.spark
logger.spark.level = WARN
logger.spark.additivity = false
logger.spark.appenderRef.stdout.ref = fileLogger

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
logger.spark.repl.Main.level = WARN
logger.spark.repl.SparkIMain$exprTyper.level = INFO
logger.spark.repl.SparkILoop$SparkILoopInterpreter.level = INFO

# Settings to quiet third party logs that are too verbose
logger.jetty.name = org.sparkproject.jetty
logger.jetty.level = WARN
logger.jetty.util.component.AbstractLifeCycle.level = ERROR

logger.parquet.name = org.apache.parquet
logger.parquet.level = ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
logger.hadoop.name = org.apache.hadoop
logger.hadoop.level = WARN
logger.hadoop.hive.metastore.RetryingHMSHandler.level = FATAL
logger.hadoop.hive.ql.exec.FunctionRegistry.level = ERROR
logger.spark.sql.level = WARN

When we start the pyspark program it finds the log4j2.properties file, and we can see that all non-root-level logs of all dependencies are captured in JSON.
However, for some reason the settings of the log4j.properties file are applied to the Spark driver logs, and all of them are reported to the console. If we change the level or the format in the log4j.properties file, those settings are applied to the driver log output.
Why does Spark use the Hadoop log4j.properties file instead of the log4j2.properties file? Are we missing a setting?
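
One way to see what the driver JVM actually resolves is to go through the py4j gateway; this is only a sketch, not from the original post, and simply attaches to the active session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the already running session
loader = spark._jvm.Thread.currentThread().getContextClassLoader()

# In yarn-client mode HADOOP_CONF_DIR is typically on the driver classpath, so
# log4j.properties can resolve there while log4j2.properties sits in SPARK_CONF_DIR.
print("log4j.properties  ->", loader.getResource("log4j.properties"))
print("log4j2.properties ->", loader.getResource("log4j2.properties"))
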
We also tried providing the log4j2.properties file to the driver via the extra Java options in spark-defaults:

spark.driver.extraJavaOptions -Djava.net.preferIPv4Stack=true -Djava.security.auth.login.config=conf/jaas_driver.conf -Djava.security.krb5.conf=conf/krb5_driver.conf -Dsun.security.krb5.debug=false -Dlog4j.configurationFile=file:/spark_conf_dir/log4j2.properties

where spark_conf_dir = the folder referenced by SPARK_CONF_DIR.
But this did not work either. For some reason the log4j.properties settings are always applied for the driver; it seems to overrule the settings in the log4j2.properties file with the settings from the log4j.properties file.
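
As an additional hedged check (again just a sketch using the active pyspark session), one can verify whether the extra Java options and the log4j system properties actually reached the driver JVM:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# The option string as the driver sees it (empty if it never arrived).
print(spark.sparkContext.getConf().get("spark.driver.extraJavaOptions", ""))
# The system properties from which log4j 2 and log4j 1 read their config location.
print(spark._jvm.System.getProperty("log4j.configurationFile"))  # log4j 2
print(spark._jvm.System.getProperty("log4j.configuration"))      # log4j 1
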
This is on a virtual machine. If we remove the log4j.properties file from HADOOP_CONF_DIR, the driver reports nothing at all (perhaps the default errors, but currently nothing shows up).
If we build a Docker image with the same program, but with pyspark on top of a base Python image, we do not have this issue: there the log output of the driver and of the dependent Spark packages is delivered to the log file in JSON format.
I would expect that providing -Dlog4j.configurationFile=file:/spark_conf_dir/log4j2.properties in spark.driver.extraJavaOptions would solve this, or that SPARK_CONF_DIR would take precedence over HADOOP_CONF_DIR for the log4j configuration.


uyto3xhc1#

I finally found the root cause of the problem described above and am documenting it here in the hope that it helps anyone who runs into the same issue.
The root cause of the log4j 1 and log4j 2 configurations interfering with each other was a jar file in the directory referenced by SPARK_HOME that was present in one environment but not in the other.
This jar file had log4j 1 logging embedded in it, so it looked up a log4j.properties file on the classpath, found the one in HADOOP_CONF_DIR, and applied those settings to the Spark driver's root logger.
After removing this jar file, all logging goes through log4j 2 with the log4j2.properties file located in SPARK_CONF_DIR.
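
For anyone hitting the same thing, a minimal sketch (assuming SPARK_HOME is set; the offending jar name is not repeated here) of how to locate jars under SPARK_HOME that bundle log4j 1 classes:

import os
import zipfile

spark_home = os.environ["SPARK_HOME"]  # assumed to be set in the environment

# Any jar containing org/apache/log4j/... entries carries log4j 1 classes and may
# pick up a log4j.properties from the classpath, e.g. the one in HADOOP_CONF_DIR.
for root, _dirs, files in os.walk(spark_home):
    for name in files:
        if name.endswith(".jar"):
            path = os.path.join(root, name)
            with zipfile.ZipFile(path) as jar:
                if any(e.startswith("org/apache/log4j/") for e in jar.namelist()):
                    print("log4j 1 classes found in:", path)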
