AWS SageMaker pipeline model endpoint deployment failing

Posted by jexiocij on 2022-11-28 in Docker


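I want to deploy a SageMaker pipeline model with two containers. I am following this documentation: https://sagemaker.readthedocs.io/en/stable/api/inference/pipeline.html.
The first container holds the image preprocessing code and the second holds the model inference code. I have updated the Dockerfiles of both containers to include the following lines: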

# Set a docker label to enable container to use SAGEMAKER_BIND_TO_PORT environment variable if present
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

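I tested the two containers separately by deploying each one as an ordinary single-container endpoint. Both endpoints deployed and worked as expected. But when I try to deploy the pipeline model, the endpoint fails to deploy with the following error: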

UnexpectedStatusException: Error hosting endpoint sagemaker-inference-pipeline-endpoint: Failed.
Reason:  The container-1,container-2 for production variant AllTraffic did not pass the ping health check. 
Please check CloudWatch logs for this endpoint..

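I checked the CloudWatch logs for both containers, and neither shows any error related to the failed health check. Here are the CloudWatch logs for the first container (the second container's logs are the same):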

Starting the inference server with 2 workers.
[2022-11-20 14:50:44 +0000] [15] [INFO] Starting gunicorn 20.1.0
[2022-11-20 14:50:44 +0000] [15] [INFO] Listening at: unix:/tmp/gunicorn.sock (15)
[2022-11-20 14:50:44 +0000] [15] [INFO] Using worker: sync
[2022-11-20 14:50:44 +0000] [18] [INFO] Booting worker with pid: 18
[2022-11-20 14:50:44 +0000] [19] [INFO] Booting worker with pid: 19

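Please note: for testing purposes, I have also updated my code so that it does the following:

The health check always returns success (status 200)
Every input and output content type is "text/plain"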
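
For reference, the two test behaviors described above can be sketched as a plain WSGI callable. This is only an illustration, not the asker's actual code, which is not shown in the post: the echo behavior of /invocations is a placeholder assumption, and the name `app` is chosen only because the gunicorn command in the answer loads `wsgi:app`.

```python
def app(environ, start_response):
    """Minimal WSGI app: GET /ping returns 200, POST /invocations echoes the body."""
    path = environ.get("PATH_INFO", "")
    method = environ.get("REQUEST_METHOD", "GET")

    if path == "/ping" and method == "GET":
        # Health check: always report healthy with an empty body.
        status, body = "200 OK", b""
    elif path == "/invocations" and method == "POST":
        # Placeholder inference: read the request body and echo it back unchanged.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length)
        status = "200 OK"
    else:
        status, body = "404 Not Found", b"{}"

    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

An app like this passes the health check only if the request actually reaches it, so a 200 from /ping inside the container does not rule out a problem with the port the front-end web server listens on.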

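Please help me find what I have missed or where I went wrong. Thanks in advance.
Summary of what I have tried:
1. Tested both containers separately by deploying each one as its own endpoint.
2. Read the documentation and learned that each container has to declare that it supports port binding. Added the following lines to the Dockerfiles:

# Set a docker label to enable container to use SAGEMAKER_BIND_TO_PORT environment variable if present
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

3. Updated the code in both containers so that the health check always passes (status 200).
4. Changed every input and output content type to "text/plain" (so that nothing fails on content type in the container-to-container communication).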


Source: https://stackoverflow.com/questions/74509933/aws-sagemaker-pipeline-model-endpoint-deployment-failing

1 Answer


gab6jxml1#

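Update: I was able to resolve this issue.
The actual problem was that the endpoint could not ping the containers. When an endpoint has multiple containers, each container communicates on a dynamically assigned port, and the endpoint pings each container on that port rather than on the fixed port 8080. So we need custom code that replaces the port value [8080] in the nginx.conf file with the value of the ['SAGEMAKER_BIND_TO_PORT'] environment variable.
The code that does this is adapted from the following SageMaker example: https://github.com/aws/amazon-sagemaker-examples/tree/main/contrib/inference_pipeline_custom_containers/containers
In the serve file, use the **start_server()** function below: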

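# Imports this function relies on (they belong at the top of the serve file).
import os
import signal
import subprocess
from string import Template

# model_server_workers, model_server_timeout and sigterm_handler are assumed to be
# defined elsewhere in the serve file, as in the linked example.

def start_server():
    print('Starting the inference server with {} workers.'.format(model_server_workers))

    # link the log streams to stdout/err so they will be logged to the container logs
    subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
    subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])

    # Use the port SageMaker assigned to this container; fall back to 8080 if it is unset.
    port = os.environ.get("SAGEMAKER_BIND_TO_PORT", "8080")
    print("using port: ", port)
    # Render nginx.conf from the template, filling in the $port placeholder.
    with open("nginx.conf.template") as nginx_template:
        template = Template(nginx_template.read())
    with open("/opt/program/nginx.conf", "w") as nginx_conf:
        nginx_conf.write(template.substitute(port=port))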

    nginx = subprocess.Popen(['nginx', '-c', '/opt/program/nginx.conf'])
    gunicorn = subprocess.Popen(['gunicorn',
                                 '--timeout', str(model_server_timeout),
                                 '-k', 'sync',
                                 '-b', 'unix:/tmp/gunicorn.sock',
                                 '-w', str(model_server_workers),
                                 'wsgi:app'])

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = {nginx.pid, gunicorn.pid}
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    sigterm_handler(nginx.pid, gunicorn.pid)
    print('Inference server exiting')
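
One way to check the port handling locally before deploying is to start the container with SAGEMAKER_BIND_TO_PORT set and then request /ping on that port. The helper below is a minimal sketch using only the standard library; the function name `ping_ok` and the port 9001 mentioned afterwards are hypothetical and chosen for illustration.

```python
import urllib.error
import urllib.request


def ping_ok(host, port, timeout=2.0):
    """Return True if GET http://host:port/ping answers with HTTP 200."""
    url = "http://{}:{}/ping".format(host, port)
    # Bypass any proxy configured in the environment so local checks stay local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status all count as failure.
        return False
```

For example, after starting the container with something like `docker run -e SAGEMAKER_BIND_TO_PORT=9001 -p 9001:9001 <image> serve`, where `<image>` stands for your own image, `ping_ok("localhost", 9001)` should return True if nginx picked up the port from the environment variable.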

Use nginx.conf.template instead of nginx.conf; start_server() renders it into an nginx.conf file with the correct port:

worker_processes 1;
daemon off; # Prevent forking

pid /tmp/nginx.pid;
error_log /var/log/nginx/error.log;

events {
  # defaults
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /var/log/nginx/access.log combined;
  
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen $port deferred;
    client_max_body_size 5m;

    keepalive_timeout 5;
    proxy_read_timeout 1200s;

    location ~ ^/(ping|invocations) {
      proxy_set_header X-Forwarded-For $$proxy_add_x_forwarded_for;
      proxy_set_header Host $$http_host;
      proxy_redirect off;
      proxy_pass http://gunicorn;
    }

    location / {
      return 404 "{}";
    }
  }
}
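
The doubled `$$` in the template matters because Python's `string.Template` treats `$name` as a placeholder and `$$` as an escaped literal `$`. The short sketch below renders a cut-down version of the template to show both effects; the helper name `render_nginx_conf` and the port value 9001 are illustrative assumptions, not part of the original answer.

```python
from string import Template

# A shortened version of the nginx.conf.template shown above.
NGINX_SERVER_TEMPLATE = """\
server {
  listen $port deferred;
  location ~ ^/(ping|invocations) {
    proxy_set_header Host $$http_host;
    proxy_pass http://gunicorn;
  }
}
"""


def render_nginx_conf(template_text, port):
    """Fill in $port; every $$ in the template becomes a single literal $."""
    return Template(template_text).substitute(port=port)


rendered = render_nginx_conf(NGINX_SERVER_TEMPLATE, 9001)
print(rendered)
```

In the rendered text the `listen` directive carries the substituted port, while `$http_host` is left in place as an nginx variable for nginx to resolve at request time.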
