How do I run a Hive query with subprocess.run()?

Posted by pieyvz9o on 2021-06-27 in Hive

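I'm trying to use the subprocess module to run a Hive query, saving the query output to a file (data.txt) and the logs to another file (log.txt), but I'm having some trouble. I've looked at this gist and at this SO question, but neither gives me quite what I need.
Here is what I'm running: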

import subprocess
query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"

log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")

# note - "hive -e [query]" would normally just print all the results
# to the console after finishing

proc = subprocess.run(["hive" , "-e" '"{}"'.format(query)],
                    stdin=subprocess.PIPE,
                    stdout=data_buff,
                    stderr=log_buff,
                    shell=True)

log_buff.close()
data_buff.close()

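I've also looked at the question about subprocess.run() vs. subprocess.Popen, and I believe I want .run() because I'd like the call to block until the process has finished.
The final output should be a file data.txt containing the tab-delimited query results, and a file log.txt containing all of the logging produced by the Hive job. Any help would be great.
Update:
With the approach above, I currently get the following output:
log.txt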

[ralston@tpsci-gw01-vm tmp]$ cat log.txt
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/y/share/hadoop-2.8.3.0.1802131730/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/y/libexec/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Logging initialized using configuration in file:/home/y/libexec/hive/conf/hive-log4j.properties

data.txt

[ralston@tpsci-gw01-vm tmp]$ cat data.txt
hive> [ralston@tpsci-gw01-vm tmp]$

I can verify that the java/hive processes did run:

[ralston@tpsci-gw01-vm tmp]$ ps -u ralston
  PID TTY          TIME CMD
14096 pts/0    00:00:00 hive
14141 pts/0    00:00:07 java
14259 pts/0    00:00:00 ps
16275 ?        00:00:00 sshd
16276 pts/0    00:00:00 bash

But it looks like the job never finishes, and it doesn't log everything I'd like.

Answer #1 (wmomyfyw)

So I managed to get this working with the following setup:

import subprocess
import time  # needed for the time.time() calls below
query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"

log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")

# Remove shell=True from proc, and add "> outfile.txt" to the command

proc = subprocess.Popen(["hive" , "-e", '"{}"'.format(query), ">", "{}".format(outfile)],
                    stdin=subprocess.PIPE,
                    stdout=data_buff,
                    stderr=log_buff)

# keep track of job runtime and set limit

start, elapsed, finished, limit  = time.time(), 0, False, 60
while not finished:
    try:
        outs, errs = proc.communicate(timeout=10)
        print("job finished")
        finished = True
    except subprocess.TimeoutExpired:
        elapsed = (time.time() - start) / 60.0  # minutes since launch
        if elapsed >= limit:
            print("Job took over 60 mins")
            break 
        print("Comm timed out. Continuing")
        continue

print("done")

log_buff.close()
data_buff.close()

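This produces the output as desired. I knew about process.communicate(), but it hadn't worked for me before. I believe the problem was related to not adding an output file with > ${outfile} to the hive command.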
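One thing that may be worth checking here: when a command is passed as a list without shell=True, no shell parses it, so ">" and the file name are handed to the program as ordinary arguments rather than acting as a redirect, and the extra double quotes around the query arrive as literal characters. The sketch below illustrates this with a small Python child process standing in for hive; ECHO_ARGS and args_seen_by_child are hypothetical names used only for this illustration, and how hive itself treats such extra arguments is not something the sketch tests.

```python
import json
import subprocess
import sys

# Stand-in for hive: a child program that reports the arguments it received.
ECHO_ARGS = "import json, sys; print(json.dumps(sys.argv[1:]))"


def args_seen_by_child(extra_args):
    """Run a child Python process without a shell and return the argv it saw."""
    result = subprocess.run(
        [sys.executable, "-c", ECHO_ARGS, *extra_args],
        stdout=subprocess.PIPE,
        check=True,
    )
    return json.loads(result.stdout.decode())


# Same shape as the Popen call above: quoted query, then ">" and a file name.
seen = args_seen_by_child(["-e", '"select 1;"', ">", "data.txt"])
print(seen)
```

If that holds for the hive call as well, the results in data.txt would be coming from stdout=data_buff rather than from the "> outfile" arguments.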
Feel free to add any details. I've never seen anyone have to loop over proc.communicate(), so I suspect I may be doing something wrong.
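As a possible alternative to looping over communicate(), here is a minimal sketch that uses subprocess.run() with a timeout and no shell. It rests on two observations about the question's code. The Python documentation says that on POSIX, combining shell=True with a list makes only the first item the command, which would explain hive starting interactively and printing "hive>". The missing comma after "-e" also concatenates the flag and the query into a single string. The helper name run_to_files is hypothetical, and the demo uses a Python child process in place of hive so that it runs anywhere.

```python
import os
import subprocess
import sys
import tempfile


def run_to_files(cmd, out_path, log_path, timeout=None):
    """Run cmd (a list, no shell) with stdout -> out_path and stderr -> log_path.

    Blocks until the command exits. If `timeout` seconds pass first,
    subprocess.run kills the child and raises subprocess.TimeoutExpired.
    """
    with open(out_path, "w") as out, open(log_path, "a") as log:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # the child never waits on console input
            stdout=out,
            stderr=log,
            timeout=timeout,
        )
    return proc.returncode


# Hypothetical Hive call (assumes a `hive` executable on PATH):
# run_to_files(["hive", "-e", query], "data.txt", "log.txt", timeout=60 * 60)

# Demo with a stand-in child that writes to both stdout and stderr.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "data.txt")
    log_path = os.path.join(tmp, "log.txt")
    child = "import sys; print('a\\tb'); print('some log line', file=sys.stderr)"
    code = run_to_files([sys.executable, "-c", child], out_path, log_path, timeout=60)
    with open(out_path) as f:
        demo_out = f.read()
    with open(log_path) as f:
        demo_log = f.read()
```

Each list element is passed to the program as one argument, so the query needs no extra quoting, and the timeout argument takes the place of the manual elapsed-time bookkeeping.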
