Paddle 在k8s集群中通过service暴露端口,创建分布式任务失败

tez616oj  于 2022-04-21  发布在  Java
关注(0)|答案(2)|浏览(226)
  • 版本、环境信息:

   1)PaddlePaddle版本:0.14
   2)CPU:Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
   4)系统环境:k8s(v1.10.7),镜像信息(hub.baidubce.com/paddlepaddle/paddle: 0.14.0)

  • 训练信息

   1)多记多卡
   2)Operator信息

  • 复现信息:在k8s中先启动4个service,暴露端口6174,然后再启动与之对应的4个pod,其中两个为PSERVER,两个为TRAINER,通过service的clusterIP和端口可以访问相应的pod提供的服务。

在pod中写入相应的环境变量,以第一个PSERVER的pod为例:
PADDLE_TRAINING_ROLE=PSERVER
PADDLE_PSERVER_IPS=10.104.213.246,10.97.10.70
PADDLE_PSERVER_PORT=6174
PADDLE_TRAINERS=2
PADDLE_CURRENT_IP=10.104.213.246

代码为官网demo示例:
import os
import sys, getopt
import paddle
import paddle.fluid as fluid

from visualdl import LogWriter

EPOCH_NUM = 30
BATCH_SIZE = 8

train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.uci_housing.train(), buf_size=500),
batch_size=BATCH_SIZE)

def train(save_dir, log_dir):
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)

logwriter = LogWriter(log_dir, sync_cycle=10)
with logwriter.mode("train") as writer:
    loss_scalar = writer.scalar("loss")

loss = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_loss = fluid.layers.mean(loss)
opt = fluid.optimizer.SGD(learning_rate=0.001)
opt.minimize(avg_loss)

# use cpu

place = fluid.CPUPlace()

# use gpu

# place = fluid.CUDAPlace(0)

feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe = fluid.Executor(place)

# fetch distributed training environment setting

training_role = os.getenv("PADDLE_TRAINING_ROLE", None)
port = os.getenv("PADDLE_PSERVER_PORT", "6174")
pserver_ips = os.getenv("PADDLE_PSERVER_IPS", "")
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
eplist = []
for ip in pserver_ips.split(","):
    eplist.append(':'.join([ip, port]))
pserver_endpoints = ",".join(eplist)
trainers = int(os.getenv("PADDLE_TRAINERS"))
current_endpoint = os.getenv("PADDLE_CURRENT_IP", "") + ":" + port

t = fluid.DistributeTranspiler()
t.transpile(
    trainer_id = trainer_id,
    pservers = pserver_endpoints,
    trainers = trainers)

if training_role == "PSERVER":
    pserver_prog = t.get_pserver_program(current_endpoint)
    startup_prog = t.get_startup_program(current_endpoint, pserver_prog)
    exe.run(startup_prog)
    exe.run(pserver_prog)
elif training_role == "TRAINER":
    trainer_prog = t.get_trainer_program()
    exe.run(fluid.default_startup_program())

    for epoch in range(EPOCH_NUM):
        for batch_id, batch_data in enumerate(train_reader()):
            avg_loss_value, = exe.run(trainer_prog,
                                  feed=feeder.feed(batch_data),
                                  fetch_list=[avg_loss])
            loss_scalar.add_record(batch_id + epoch * 50, avg_loss_value[0])
            if (batch_id + 1) % 10 == 0:
                print("Epoch: {0}, Batch: {1}, loss: {2}".format(
                    epoch, batch_id, avg_loss_value[0]))
    # destory the resource of current trainer node in pserver server node
    if trainer_id == 0:
        fluid.io.save_params(executor=exe, dirname=save_dir,
             main_program=trainer_prog)
    exe.close()
else:
    raise AssertionError("PADDLE_TRAINING_ROLE should be one of [TRAINER, PSERVER]")

opts, args = getopt.getopt(sys.argv[1:], "hs:l:", ["save_dir=", "log_dir="])
save_dir="./output"
log_dir="./log"
for op, value in opts:
if op == "-h":
print 'test.py -s <save_dir>'
sys.exit()
elif op in ("-s", "--save_dir"):
save_dir=value
elif op in ("-l", "--log_dir"):
log_dir=value
train(save_dir, log_dir)

  • 问题描述:运行第一个PSERVER时报错:

block map: {'fc_0.b_0': [(0L, 1L)], 'fc_0.w_0': [(0L, 13L)]}
block map: {'fc_0.w_0@GRAD': [(0L, 13L)], 'fc_0.b_0@GRAD': [(0L, 1L)]}
E0315 02:21:11.156191433 77 server_chttp2.cc:38] {"created":"@1552616471.156142167","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":305,"referenced_errors":[{"created":"@1552616471.156129936","description":"Unable to configure socket","fd":10,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":202,"referenced_errors":[{"created":"@1552616471.156123980","description":"OS Error","errno":99,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":175,"os_error":"Cannot assign requested address","syscall":"bind"}]}]}

Aborted at 1552616471 (unix time) try "date -d @1552616471" if you are using GNU date

PC: @ 0x0 (unknown)

SIGSEGV (@0x50) received by PID 25 (TID 0x7fb41cb84700) from PID 80; stack trace:

@ 0x7fb5816ca390 (unknown)
@ 0x7fb53259bf2e grpc::ServerInterface::RegisteredAsyncRequest::IssueRequest()
@ 0x7fb53255060f paddle::operators::distributed::AsyncGRPCServer::TryToRegisterNewOne()
@ 0x7fb532550f4e paddle::operators::distributed::AsyncGRPCServer::StartServer()
@ 0x7fb53247c364 paddle::operators::RunServer()
@ 0x7fb532481587 std:🧵:_Impl<>::_M_run()
@ 0x7fb54053fc80 (unknown)
@ 0x7fb5816c06ba start_thread
@ 0x7fb5813f641d clone
@ 0x0 (unknown)
Segmentation fault (core dumped)
另外:通过paddle_trainer 的方式在k8s上用此方式启动分布式任务是可以成功的。

6rqinv9w

6rqinv9w1#

看日志端口是被占用了

qrjkbowd

qrjkbowd2#

看日志端口是被占用了

问题解决了,不能使用service的IP,要用pod的IP

相关问题