go cmd/pprof: pprof 提供了错误的 CPU 分析地址/行

t9eec4r0  于 4个月前  发布在  Go
关注(0)|答案(3)|浏览(42)

使用CPU分析和go tool pprof属性将CPU时间分配给消耗时间的指令之后。这可能导致CPU时间被分配到错误的行号。与此相比,Linux perf命令可以正常工作。下面的示例显示了时间实际上是在for循环中度过的,而不是在昂贵的内存访问期间。
据我所知,这个问题已经存在很长时间了(至少从1.4版本开始)。我惊讶地发现这个问题还没有被记录下来。它也出现在最近的工具链中:

go version devel +d277a36123 Fri Sep 11 02:58:36 2020 +0000 linux/amd64
go version go1.15.2 linux/amd64

测试平台:Linux localhost.localdomain 5.8.4-200.fc32.x86_64 #1 SMP Wed Aug 26 22:28:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
下面是一个演示故障的脚本:

#!/bin/sh

cat > prof_test.go <<EOF
// Demonstrate that CPU profiles refer to the wrong instruction.
package prof

import "testing"

var buf [1024 * 1024 * 1024]byte

func Benchmark(b *testing.B) {
for i := 0; i < b.N; i++ {
// Expensive memory accesses.
buf[(2097169*i)%len(buf)] = 42
}
}
EOF

go mod init prof
go test -c -o prof.test
perf record -- ./prof.test -test.bench . -test.benchtime 3s -test.cpuprofile cpu.prof
echo
echo '=== pprof list shows the "for" loop is expensive, not the memory access. ==='
go tool pprof -list Benchmark prof.test cpu.prof
echo
echo '=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ==='
go tool pprof -disasm Benchmark prof.test cpu.prof
echo
echo '=== Linux perf correctly shows the "MOV" instruction is expensive. ==='
perf annotate --stdio -s prof.Benchmark | cat

输出结果:

$ ./run.sh 
goos: linux
goarch: amd64
pkg: prof
Benchmark-8   	314821470	        11.2 ns/op
PASS
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.995 MB perf.data (25644 samples) ]

=== pprof list shows the "for" loop is expensive, not the memory access. ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark in /home/mark/Prog/golang/src/junk/memperf/prof_test.go
     6.34s      6.34s (flat, cum)   100% of Total
         .          .      4:import "testing"
         .          .      5:
         .          .      6:var buf [1024 * 1024 * 1024]byte
         .          .      7:
         .          .      8:func Benchmark(b *testing.B) {
     5.52s      5.52s      9:    for i := 0; i < b.N; i++ {
         .          .     10:        // Expensive memory accesses.
     820ms      820ms     11:        buf[(2097169*i)%len(buf)] = 42
         .          .     12:    }
         .          .     13:}

=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark
     6.34s      6.34s (flat, cum)   100% of Total
         .          .     509500: SUBQ $0x18, SP                          ;prof_test.go:8
         .          .     509504: MOVQ BP, 0x10(SP)
         .          .     509509: LEAQ 0x10(SP), BP
         .          .     50950e: MOVQ 0x20(SP), DX                       ;prof_test.go:9
         .          .     509513: XORL AX, AX
         .          .     509515: JMP 0x509525
      30ms       30ms     509517: LEAQ prof.buf(SB), BX                   ;prof.Benchmark prof_test.go:11
     540ms      540ms     50951e: MOVB $0x2a, 0(BX)(SI*1)
     5.49s      5.49s     509522: INCQ AX                                 ;prof.Benchmark prof_test.go:9
      30ms       30ms     509525: CMPQ AX, 0x170(DX)
         .          .     50952c: JLE 0x509559                            ;prof_test.go:9
      10ms       10ms     50952e: IMULQ $0x200011, AX, BX                 ;prof.Benchmark prof_test.go:11
      20ms       20ms     509535: MOVQ BX, SI
      80ms       80ms     509538: SARQ $0x3f, BX
         .          .     50953c: SHRQ $0x22, BX                          ;prof_test.go:11
         .          .     509540: ADDQ SI, BX
      30ms       30ms     509543: SARQ $0x1e, BX                          ;prof.Benchmark prof_test.go:11
      90ms       90ms     509547: SHLQ $0x1e, BX
         .          .     50954b: SUBQ BX, SI                             ;prof_test.go:11
      20ms       20ms     50954e: CMPQ $prof.buf+1067102432(SB), SI       ;prof.Benchmark prof_test.go:11
         .          .     509555: JB 0x509517                             ;prof_test.go:11
         .          .     509557: JMP 0x509563
         .          .     509559: MOVQ 0x10(SP), BP                       ;prof_test.go:9
         .          .     50955e: ADDQ $0x18, SP
         .          .     509562: RET
         .          .     509563: MOVQ SI, AX                             ;prof_test.go:11
         .          .     509566: MOVL $prof.buf+1067102432(SB), CX
         .          .     50956b: CALL runtime.panicIndex(SB)

=== Linux perf correctly shows the "MOV" instruction is expensive. ===
 Percent |	Source code & Disassembly of prof.test for cycles:u (24257 samples, percent: local period)
----------------------------------------------------------------------------------------------------------
         :
         :
         :
         :            Disassembly of section .text:
         :
         :            0000000000509500 <prof.Benchmark>:
         :            prof.Benchmark():
         :
         :            import "testing"
         :
         :            var buf [1024 * 1024 * 1024]byte
         :
         :            func Benchmark(b *testing.B) {
    0.00 :   509500: sub    $0x18,%rsp
    0.00 :   509504: mov    %rbp,0x10(%rsp)
    0.00 :   509509: lea    0x10(%rsp),%rbp
         :            for i := 0; i < b.N; i++ {
    0.00 :   50950e: mov    0x20(%rsp),%rdx
    0.00 :   509513: xor    %eax,%eax
    0.00 :   509515: jmp    509525 <prof.Benchmark+0x25>
         :            // Expensive memory accesses.
         :            buf[(2097169*i)%len(buf)] = 42
    1.82 :   509517: lea    0x14ba02(%rip),%rbx        # 654f20 <prof.buf>
   92.32 :   50951e: movb   $0x2a,(%rbx,%rsi,1)
         :            for i := 0; i < b.N; i++ {
    0.08 :   509522: inc    %rax
    0.25 :   509525: cmp    %rax,0x170(%rdx)
    0.00 :   50952c: jle    509559 <prof.Benchmark+0x59>
         :            buf[(2097169*i)%len(buf)] = 42
    0.53 :   50952e: imul   $0x200011,%rax,%rbx
    1.74 :   509535: mov    %rbx,%rsi
    0.02 :   509538: sar    $0x3f,%rbx
    0.11 :   50953c: shr    $0x22,%rbx
    0.46 :   509540: add    %rsi,%rbx
    1.93 :   509543: sar    $0x1e,%rbx
    0.03 :   509547: shl    $0x1e,%rbx
    0.12 :   50954b: sub    %rbx,%rsi
    0.00 :   50954e: cmp    $0x40000000,%rsi
    0.61 :   509555: jb     509517 <prof.Benchmark+0x17>
    0.00 :   509557: jmp    509563 <prof.Benchmark+0x63>
         :            for i := 0; i < b.N; i++ {
    0.00 :   509559: mov    0x10(%rsp),%rbp
    0.00 :   50955e: add    $0x18,%rsp
    0.00 :   509562: retq
         :            buf[(2097169*i)%len(buf)] = 42
    0.00 :   509563: mov    %rsi,%rax
    0.00 :   509566: mov    $0x40000000,%ecx
    0.00 :   50956b: callq  46aae0 <runtime.panicIndex>
         :            ./<autogenerated>:1
    0.00 :   509570: nop

Cc @hyangah@randall77

wswtfjt7

wswtfjt71#

这听起来很可能是由于在 #36821 中的提案。Go的pprof分析使用ITIMER_PROF,它可能与实际昂贵的指令有几个指令的差距。
另一方面,Linux的perf可以使用Intel PEBS(/AMD等效)进行精确采样。
我不认为perf默认启用PEBS(-e cycles:pp是必需的),但即使是标准的硬件计数器也可能比ITIMER_PROF具有更少的差距。

3phpmpom

3phpmpom3#

正如我在#30708中指出的,我们无法纠正Go中的滑移,除非使用精确的硬件计数器,这将使#36821得以实现。
无论该提案的结果如何(并非所有系统都将具有精确的计数器),我们都可以更好地记录关于此限制的信息。

相关问题