Tengine 2.3.3 core dumps frequently in production

ecbunoof  posted on 2022-11-02  in  Other

Ⅰ. Issue Description

Tengine 2.3.3 core dumps frequently in production.

Ⅱ. Describe what happened

Tengine 2.3.3 worker processes core dump frequently in production.

Ⅲ. Describe what you expected to happen

Tengine runs normally without crashing.

Ⅳ. How to reproduce it (as minimally and precisely as possible)

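Production otherwise runs normally, but dozens of core dump files are generated every day. All the ones I examined crash at the same location.
Analysis of one core dump file follows: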

[root@saas1 coredump]# gdb  /usr/local/nginx/sbin/nginx core.599714 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7

Reading symbols from /usr/local/nginx/sbin/nginx...done.
BFD: Warning: /home/coredump/core.599714 is truncated: expected core file size >= 10044428288, found: 104857600.
[New LWP 599714]
Cannot access memory at address 0x7f0e36636128
Cannot access memory at address 0x7f0e36636120
Failed to read a valid object file image from memory.
Core was generated by `nginx: worker process                                          '.
Program terminated with signal 11, Segmentation fault.

#0  ngx_http_upstream_get_peer (rrp=0x29379a0) at src/http/ngx_http_upstream_round_robin.c:642
642	src/http/ngx_http_upstream_round_robin.c: No such file or directory.
(gdb) bt
#0  ngx_http_upstream_get_peer (rrp=0x29379a0) at src/http/ngx_http_upstream_round_robin.c:642
#1  ngx_http_upstream_get_round_robin_peer (pc=<error reading variable: Cannot access memory at address 0x7fff19ac2ea0>,
    pc@entry=<error reading variable: Cannot access memory at address 0x7fff19ac2ef8>, data=0x29379a0) at src/http/ngx_http_upstream_round_robin.c:532
(gdb) 
(gdb) p rrp
$1 = (ngx_http_upstream_rr_peer_data_t *) 0x29379a0
(gdb) p rrp->peers
$2 = (ngx_http_upstream_rr_peers_t *) 0x38f4730
(gdb) p rrp->peers->peer
$3 = (ngx_http_upstream_rr_peer_t *) 0x2fb0ee0
(gdb) p rrp->peers->peer->next
$4 = (ngx_http_upstream_rr_peer_t *) 0x2fb0df0
(gdb) p rrp->peers->peer->next->next
$5 = (ngx_http_upstream_rr_peer_t *) 0x2fb0d00
(gdb) p rrp->peers->peer->next->next->next
$6 = (ngx_http_upstream_rr_peer_t *) 0x2fb0c10
(gdb) p rrp->peers->peer->next->next->next->next
$7 = (ngx_http_upstream_rr_peer_t *) 0x0
(gdb) 
(gdb) p rrp->peers->peer->down
$8 = 1
(gdb) p rrp->peers->peer->next->down
$9 = 0
(gdb) p rrp->peers->peer->next->next->down
$10 = 1
(gdb) p rrp->peers->peer->next->next->next->down
$11 = 0
(gdb) p rrp->peers->peer->next->next->next->next->down
Cannot access memory at address 0xb0
(gdb)
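
A note on reading the session above: `$3` through `$6` are four valid peers and `$7` is NULL, so the final command asks gdb to read `down` through a NULL `next` pointer. The reported address `0xb0` is then most likely just the offset of `down` inside the peer struct, not necessarily the address that faulted in the worker. A minimal C sketch of that arithmetic, assuming a hypothetical `demo_peer_t` whose padding is chosen only so that `down` sits at offset `0xb0` (the real Tengine struct layout is not shown in this thread):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a peer struct. The padding array is an
 * assumption made only so that `down` lands at offset 0xb0, matching
 * the address gdb printed; it is not the real Tengine layout. */
typedef struct demo_peer_s demo_peer_t;

struct demo_peer_s {
    unsigned char  leading_fields[0xb0];  /* placeholder for earlier members */
    unsigned int   down;
    demo_peer_t   *next;
};

/* Computes the address a read of base->down would touch,
 * without dereferencing base. */
static uintptr_t
demo_down_address(const demo_peer_t *base)
{
    return (uintptr_t) base + offsetof(demo_peer_t, down);
}
```

With a NULL base, `demo_down_address(NULL)` yields `0xb0` on common platforms, which is the shape of the "Cannot access memory at address 0xb0" message.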

We recently introduced and enabled nginx-upsync-module-2.1.3 ( https://github.com/weibocom/nginx-upsync-module ).

An issue has also been filed on their side: weibocom/nginx-upsync-module#300

Ⅵ. Environment:

  • Tengine version (use sbin/nginx -V ):
/usr/local/nginx/sbin/nginx -V
Tengine version: Tengine/2.3.3
nginx version: nginx/1.18.0
built by gcc 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) 
built with OpenSSL 1.1.1m  14 Dec 2021
TLS SNI support enabled
configure arguments: --prefix=/usr/local/nginx --with-http_stub_status_module --with-http_gzip_static_module --with-http_ssl_module --with-http_v2_module --with-openssl=../openssl-1.1.1m --with-pcre=../pcre-8.43/ --with-zlib=../zlib-1.2.11 --with-http_lua_module --with-luajit-lib=/usr/local/lib/ --with-luajit-inc=/usr/local/include/luajit-2.1/ --with-lua-inc=/usr/local/include/luajit-2.1/ --with-lua-lib=/usr/local/lib/ --with-ld-opt=-Wl,-rpath, --add-module=modules/ngx_http_concat_module --add-module=modules/ngx_http_upstream_session_sticky_module --add-module=modules/ngx_http_reqstat_module --add-module=modules/ngx_http_upstream_check_module --add-module=modules/ngx_http_trim_filter_module --add-module=modules/ngx_http_footer_filter_module --add-module=modules/ngx_http_upstream_consistent_hash_module --add-module=modules/ngx_http_upstream_dynamic_module --add-module=modules/ngx_http_user_agent_module --add-module=modules/ngx_http_upstream_dyups_module --add-module=modules/ngx_http_upstream_vnswrr_module --add-module=../nginx-upsync-module-2.1.3
  • OS (e.g. from /etc/os-release): centos 7
  • Kernel (e.g. uname -a ): CentOS Linux release 7.7.1908 (Core)

# uname -an
Linux saas1 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Others:
vjrehmav 1#

upstream conf:

upstream oamain {
    server 127.0.0.1:8080;
    upsync 127.0.0.1:8500/v1/kv/upstreams/oa/ upsync_timeout=6m upsync_interval=500ms upsync_type=consul strong_dependency=off;
    upsync_dump_path /usr/local/nginx/conf/proxy/oa.upstream;
    include /usr/local/nginx/conf/proxy/oa.upstream;  
}

# cat /usr/local/nginx/conf/proxy/oa.upstream
server 192.168.254.54:80 weight=1 max_fails=2 fail_timeout=10s;
gkn4icbw 2#

There is also a second kind of core dump:

BFD: Warning: /home/coredump/core.813557 is truncated: expected core file size >= 9938554880, found: 104857600.
[New LWP 813557]
Cannot access memory at address 0x7f8eb94bf128
Cannot access memory at address 0x7f8eb94bf120
Failed to read a valid object file image from memory.
Core was generated by `nginx: worker process                                          '.
Program terminated with signal 11, Segmentation fault.

#0  ngx_http_upstream_get_peer (rrp=0x714c520) at src/http/ngx_http_upstream_round_robin.c:618
618	src/http/ngx_http_upstream_round_robin.c: No such file or directory.
(gdb) bt
#0  ngx_http_upstream_get_peer (rrp=0x714c520) at src/http/ngx_http_upstream_round_robin.c:618
#1  ngx_http_upstream_get_round_robin_peer (pc=<error reading variable: Cannot access memory at address 0x7fff9c6c8f80>,
    pc@entry=<error reading variable: Cannot access memory at address 0x7fff9c6c8fd8>, data=0x714c520) at src/http/ngx_http_upstream_round_robin.c:532
(gdb) p rrp
$1 = (ngx_http_upstream_rr_peer_data_t *) 0x714c520
(gdb) p rrp->peers->number
Cannot access memory at address 0x714c528
(gdb) p rrp->peers
Cannot access memory at address 0x714c528
(gdb) p rrp->peers->peer
Cannot access memory at address 0x714c528
(gdb) p rrp->peers
Cannot access memory at address 0x714c528
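
Both dumps carry a BFD warning that the core file was cut off at 104857600 bytes (100 MiB) against an expected size of roughly 10 GB, which may be why gdb cannot read `rrp->peers` here. Assuming the truncation comes from a core size limit on the worker (not confirmed in this thread), the limit can be inspected from inside a process with `getrlimit`. A minimal C sketch, where the `demo_` helper names are hypothetical:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Reads the soft and hard core-file size limits of the calling process.
 * Returns 0 on success, -1 if getrlimit() fails. */
static int
demo_core_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        return -1;
    }

    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}

/* Prints one limit, spelling out RLIM_INFINITY as "unlimited". */
static void
demo_print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY) {
        printf("%s: unlimited\n", label);
    } else {
        printf("%s: %llu bytes\n", label, (unsigned long long) value);
    }
}
```

Calling `demo_core_limit(&soft, &hard)` and then `demo_print_limit("soft", soft)` shows whether a 100 MiB cap is in effect. nginx also provides the `worker_rlimit_core` directive for changing this limit for worker processes; a full core would let gdb read the peer structures that are missing from these dumps.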
muk1a3rh 3#

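The function producing the segfault appears to be ngx_http_upstream_get_round_robin_peer.

The complete upstream (back-to-origin) flow is:

  1. When the server entries in upstream{} are parsed, they are stored in us->servers.
  2. init_round_robin builds the peers for the whole upstream, creating a peer for every listening address of every server and linking them together.
  3. Then, in create_round_robin_peer:
    rrp = r->upstream->peer.data;
    rrp->peers = peers;
  4. This associates peers with the request and fills the peer with some upstream_server-level information.
  5. ngx_http_upstream_get_round_robin_peer actually picks a peer, assigns the peer's contents to peer_connection, and initiates ngx_event_connect.

Conclusion:
So the root of your problem lies in step 5: selecting a peer and establishing the connection to the origin server.
After bringing in nginx-upsync-module, was the logic at this point modified, either by add-module or by applying a patch?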
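
Steps 2 to 4 above can be modeled with a small C sketch. All `demo_` types and functions are hypothetical simplifications, not the nginx or Tengine definitions; the point is only that the per-request data stores a pointer to a peer list built elsewhere rather than a copy of it.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct demo_peer_s demo_peer_t;

struct demo_peer_s {
    const char   *addr;
    int           weight;
    demo_peer_t  *next;
};

/* Shared, per-upstream peer list (step 2). */
typedef struct {
    size_t        number;
    demo_peer_t  *peer;
} demo_peers_t;

/* Per-request data (steps 3 and 4): only a pointer to the shared list. */
typedef struct {
    demo_peers_t *peers;
} demo_request_data_t;

/* Step 2: build one peer per configured address and link them together. */
static demo_peers_t *
demo_init_peers(const char *const *addrs, size_t n)
{
    demo_peers_t *peers;
    demo_peer_t  *nodes;
    size_t        i;

    if (n == 0) {
        return NULL;
    }

    peers = malloc(sizeof(*peers));
    nodes = calloc(n, sizeof(*nodes));
    if (peers == NULL || nodes == NULL) {
        free(peers);
        free(nodes);
        return NULL;
    }

    for (i = 0; i < n; i++) {
        nodes[i].addr = addrs[i];
        nodes[i].weight = 1;
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
    }

    peers->number = n;
    peers->peer = nodes;
    return peers;
}

/* Steps 3 and 4: associate the shared list with one request. */
static void
demo_attach(demo_request_data_t *rrp, demo_peers_t *peers)
{
    rrp->peers = peers;
}

/* Releases the list; any request still pointing at it would dangle. */
static void
demo_free_peers(demo_peers_t *peers)
{
    if (peers != NULL) {
        free(peers->peer);
        free(peers);
    }
}
```

In this model the selection in step 5 is only as safe as the list behind `rrp->peers`. Whether anything in this deployment changes that list while requests still hold it is not established in the thread.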

yh2wf1be 4#

https://cloud.tencent.com/developer/article/1778734
I previously wrote an article analyzing the upstream source code; I hope it helps.

iswrvxsc 5#

No patches were applied. I downloaded tengine 2.3.3 and nginx-upsync-module-2.1.3 from their official release pages, then compiled and ran them. The build script is as follows:

tar -xf openssl-1.1.1m.tar.gz 
tar -xf pcre-8.43.tar.gz
tar -xf zlib-1.2.11.tar.gz
tar -xf nginx-upsync-module-2.1.3.tar.gz
tar -xf tengine-2.3.3.tar.gz
cd tengine-2.3.3

./configure --prefix=/usr/local/nginx --with-http_stub_status_module --with-http_gzip_static_module  --with-http_ssl_module --with-http_v2_module --with-openssl=../openssl-1.1.1m --with-pcre=../pcre-8.43/ --with-zlib=../zlib-1.2.11 --with-http_lua_module --with-luajit-lib=/usr/local/lib/ --with-luajit-inc=/usr/local/include/luajit-2.1/ --with-lua-inc=/usr/local/include/luajit-2.1/ --with-lua-lib=/usr/local/lib/ --with-ld-opt=-Wl,-rpath, --add-module=modules/ngx_http_concat_module  --add-module=modules/ngx_http_upstream_session_sticky_module --add-module=modules/ngx_http_reqstat_module  --add-module=modules/ngx_http_upstream_check_module  --add-module=modules/ngx_http_trim_filter_module --add-module=modules/ngx_http_footer_filter_module --add-module=modules/ngx_http_upstream_consistent_hash_module --add-module=modules/ngx_http_upstream_dynamic_module --add-module=modules/ngx_http_user_agent_module --add-module=modules/ngx_http_upstream_dyups_module --add-module=modules/ngx_http_upstream_vnswrr_module --add-module=../nginx-upsync-module-2.1.3

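Also, looking at https://github.com/alibaba/tengine/blob/master/src/http/ngx_http_upstream_round_robin.c#L642 , could the robustness there be improved? At the very least add a check, so that a null pointer does not cause a core dump; it would be enough to continue and skip the peer.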
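
The check suggested above can be sketched as follows. This is a simplified smooth weighted round-robin loop over hypothetical `demo_rr_` types, not the Tengine source; the guards are an illustration of the idea, assuming the failure is a NULL pointer or a list shorter than `number`.

```c
#include <stddef.h>

typedef struct demo_rr_peer_s demo_rr_peer_t;

struct demo_rr_peer_s {
    int              effective_weight;
    int              current_weight;
    unsigned int     down;
    demo_rr_peer_t  *next;
};

typedef struct {
    size_t           number;   /* expected count of peers in the list */
    demo_rr_peer_t  *peer;     /* head of the NULL-terminated list */
} demo_rr_peers_t;

/* Picks the peer with the highest running weight, skipping peers marked
 * down. Returns NULL instead of crashing when the list is missing or
 * every peer is down. */
static demo_rr_peer_t *
demo_get_peer(demo_rr_peers_t *peers)
{
    demo_rr_peer_t *peer, *best = NULL;
    size_t          i;
    int             total = 0;

    if (peers == NULL) {
        return NULL;
    }

    /* Stop on NULL or after `number` nodes, whichever comes first. */
    for (peer = peers->peer, i = 0;
         peer != NULL && i < peers->number;
         peer = peer->next, i++)
    {
        if (peer->down) {
            continue;
        }

        peer->current_weight += peer->effective_weight;
        total += peer->effective_weight;

        if (best == NULL || peer->current_weight > best->current_weight) {
            best = peer;
        }
    }

    if (best == NULL) {
        return NULL;
    }

    best->current_weight -= total;
    return best;
}
```

A guard of this kind only helps with NULL pointers. If the peer memory has been freed or overwritten, the pointer is non-NULL but invalid and the check would not catch it, so such a change may hide the symptom rather than address the cause.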
