shell 从并行运行的多个bash脚本中获取退出代码

gab6jxml  于 2022-11-16  发布在  Shell
关注(0)|答案(4)|浏览(159)

我正在并行运行4个bash脚本,所有4个脚本都同时运行:
./script1.sh & ./script2.sh & ./script3.sh & ./script4.sh
我想在其中一个失败时退出。我尝试使用类似退出代码的东西,但是它似乎不适合并行脚本。有解决方案吗?任何bash/python解决方案都是受欢迎的。

lbsnaicq

lbsnaicq1#

TL;DR

parallel --line-buffer --halt now,fail=1 ::: ./script?.sh
echo $?
42

实际答案

在并行运行作业时,我发现考虑GNU Parallel非常有用,因为它可以简化很多方面:

  • 资源分配[专业]计
  • 负载在多个CPU和网络之间分布
  • 记录和输出标记
  • 错误处理-这方面在这里特别重要
  • 计划,重新启动
  • 输入输出文件名派生和重命名
  • 进度报告

因此,我创建了4个虚拟作业script1.shscript4.sh,如下所示:

#!/bin/bash
echo "script1.sh starting..."
sleep 5
echo "script1.sh complete"

script3.sh除外,它在其他之前失败:

#!/bin/bash
echo "script3.sh starting..."
sleep 2
echo "script3.sh dying"
exit 42

因此,以下是并行运行4个作业的默认方式,每个作业的输出都被收集起来,并一个接一个地显示:

parallel ::: ./script*.sh
script3.sh starting...
script3.sh dying
script1.sh starting...
script1.sh complete
script4.sh starting...
script4.sh complete
script2.sh starting...
script2.sh complete

您可以首先看到script3.sh模具,然后首先收集并显示其所有输出,接着是其他模具的分组输出。简单地说,输出按作业分组,并在每个作业完成时显示。
现在让我们再做一次,但只按行缓冲输出,而不是等待作业完成并按作业收集输出:

parallel --line-buffer ::: ./script*.sh 
script1.sh starting...
script2.sh starting...
script3.sh starting...
script4.sh starting...
script3.sh dying
script1.sh complete
script2.sh complete
script4.sh complete

我们可以清楚地看到,script3.sh在其他线程之前死亡和退出,但它们仍然运行到完成。
现在,我们希望GNU Parallel在任何一个作业死亡时立即终止任何正在运行的作业:

parallel --line-buffer --halt now,fail=1 ::: ./script?.sh
script2.sh starting...
script1.sh starting...
script3.sh starting...
script4.sh starting...
script3.sh dying
parallel: This job failed:
./script3.sh

您可以看到script3.sh已死亡,其他作业都没有完成,因为GNU Parallel已将它们杀死。
您还可以获得失败退出状态:

echo $?
42

它比我展示的 * 灵活得多 *。你可以将now更改为soon,而不是杀死其他作业,它不会启动任何新的作业。你可以将fail=1更改为success=50%,这样当一半的作业成功退出时,它就会停止,等等。
您还可以添加--eta--bar来生成进度报告,并在网络上分发作业等等。值得一阅读,在CPU越来越胖(更多内核)而不是越来越高(更多GHz)的今天-有一个很好的PDF可用here

注意:默认情况下,GNU Parallel将保持与CPU内核数一样多的作业并行运行。因此,如果您的内核数少于4个,您可能应该在我的建议答案中加上-j 4,以告诉它即使只有1或2个内核也可以并行运行多达4个作业。

h79rfbju

h79rfbju2#

这里有一个脚本可以帮你完成这件事。我从here借用(并修改)了non_blocking_wait函数。

#!/bin/bash

# Run your scripts here... Following sleep commands as an example
sleep 5 &
sleep 3 &
sleep 3 &

# Here, we get the pid of each running process an put in the array "pids"
pids=( $(jobs -p | tr '\n' ' ') )

echo "pids = ${pids[@]}"

non_blocking_wait()
{
    PID=$1
    if [ ! -d "/proc/$PID" ]; then
        wait $PID
        CODE=$?
    else
        CODE=127
    fi

    echo $CODE
}

while true; do

    # Check if all processes are still running
    n_running=$(jobs -l | grep -c "Running")

    if [ "${n_running}" -ne "3" ]; then
        # At least one processes finished/returned here,
        # check if exited in error
        for pid in ${pids[@]}; do
            ret=$(non_blocking_wait ${pid})
            echo "non_blocking_wait ${pid} ret = ${ret}"
            if [ "${ret}" -ne "0" ] && [ "${ret}" -ne "127" ]; then
                echo "Process ${pid} exited with error ${ret}"
                # Here we can take any desirable action such as
                # killing all children and exiting the program:
                kill $(jobs -p) > /dev/null 2>&1
                exit 1
            fi
        done

        if [ "${n_running}" -eq "0" ]; then
            echo "All processes finished successfully"
            exit 0
        fi
    fi

    sleep 1
done

如果只是运行它,它将在所有进程结束时退出0:

$ ./script.sh 
pids = 17913 17914 17915
non_blocking_wait 17913 ret = 127 
non_blocking_wait 17914 ret = 0 
non_blocking_wait 17915 ret = 0 
non_blocking_wait 17913 ret = 127 
non_blocking_wait 17914 ret = 0 
non_blocking_wait 17915 ret = 0 
non_blocking_wait 17913 ret = 0 
All processes finished successfully

您可以从其中一个sleep命令中移除参数,使其失败,并看到程序立即返回:

$ ./script.sh 
sleep: missing operand
Try 'sleep --help' for more information.
pids = 18005 18006 18007
non_blocking_wait 18005 ret = 127 
non_blocking_wait 18006 ret = 1 
Process 18006 exited with error 1
ctehm74n

ctehm74n3#

一种解决方案是使用子过程:

import subprocess
import time

def do_that(scripts):
    ps = [subprocess.Popen('./'+s, shell=True) for s in scripts]
    while True:
        done = True
        for p in ps:
            rc = p.poll()
            if rc is None:  # Script is still running
                done = False
            elif rc:
                # if rc==0, script success to finish
                # otherwise it failed
                print('This script run failed:', p.args)
                running = set(ps) - {p}
                for i in running:
                    i.terminate()
                    print('Force terminate', i.args)
                return 1
        if done:
            print('All done.')
            return 0

def timeit(func):
    def runner(*args, **kwargs):
        start = time.time()
        res = func(*args, **kwargs)
        end = time.time()
        print(func.__name__, 'cost:', round(end-start,1))
        return res
    return runner

@timeit
def main():
    scripts = ('script1.sh', 'script2.sh')
    do_that(scripts)

if __name__ == '__main__':
    main()
ou6hu8tu

ou6hu8tu4#

wait -n等待下一个程序退出并返回其退出状态。

pids=( )
./script1.sh & pids+=( $! )
./script2.sh & pids+=( $! )
./script3.sh & pids+=( $! )
./script4.sh & pids+=( $! )
for _ in "${pids[@]}"; do
  wait -n || { rc=$?; kill "${pids[@]}"; exit "$rc"; }
done

相关问题