使用CPU分析和go tool pprof
属性将CPU时间分配给消耗时间的指令之后。这可能导致CPU时间被分配到错误的行号。与此相比,Linux perf
命令可以正常工作。下面的示例显示了时间实际上是在for
循环中度过的,而不是在昂贵的内存访问期间。
据我所知,这个问题已经存在很长时间了(至少从1.4版本开始)。我惊讶地发现这个问题还没有被记录下来。它也出现在最近的工具链中:
go version devel +d277a36123 Fri Sep 11 02:58:36 2020 +0000 linux/amd64
go version go1.15.2 linux/amd64
测试平台:Linux localhost.localdomain 5.8.4-200.fc32.x86_64 #1 SMP Wed Aug 26 22:28:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
下面是一个演示故障的脚本:
#!/bin/sh
cat > prof_test.go <<EOF
// Demonstrate that CPU profiles refer to the wrong instruction.
package prof
import "testing"
var buf [1024 * 1024 * 1024]byte
func Benchmark(b *testing.B) {
for i := 0; i < b.N; i++ {
// Expensive memory accesses.
buf[(2097169*i)%len(buf)] = 42
}
}
EOF
go mod init prof
go test -c -o prof.test
perf record -- ./prof.test -test.bench . -test.benchtime 3s -test.cpuprofile cpu.prof
echo
echo '=== pprof list shows the "for" loop is expensive, not the memory access. ==='
go tool pprof -list Benchmark prof.test cpu.prof
echo
echo '=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ==='
go tool pprof -disasm Benchmark prof.test cpu.prof
echo
echo '=== Linux perf correctly shows the "MOV" instruction is expensive. ==='
perf annotate --stdio -s prof.Benchmark | cat
输出结果:
$ ./run.sh
goos: linux
goarch: amd64
pkg: prof
Benchmark-8 314821470 11.2 ns/op
PASS
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.995 MB perf.data (25644 samples) ]
=== pprof list shows the "for" loop is expensive, not the memory access. ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark in /home/mark/Prog/golang/src/junk/memperf/prof_test.go
6.34s 6.34s (flat, cum) 100% of Total
. . 4:import "testing"
. . 5:
. . 6:var buf [1024 * 1024 * 1024]byte
. . 7:
. . 8:func Benchmark(b *testing.B) {
5.52s 5.52s 9: for i := 0; i < b.N; i++ {
. . 10: // Expensive memory accesses.
820ms 820ms 11: buf[(2097169*i)%len(buf)] = 42
. . 12: }
. . 13:}
=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark
6.34s 6.34s (flat, cum) 100% of Total
. . 509500: SUBQ $0x18, SP ;prof_test.go:8
. . 509504: MOVQ BP, 0x10(SP)
. . 509509: LEAQ 0x10(SP), BP
. . 50950e: MOVQ 0x20(SP), DX ;prof_test.go:9
. . 509513: XORL AX, AX
. . 509515: JMP 0x509525
30ms 30ms 509517: LEAQ prof.buf(SB), BX ;prof.Benchmark prof_test.go:11
540ms 540ms 50951e: MOVB $0x2a, 0(BX)(SI*1)
5.49s 5.49s 509522: INCQ AX ;prof.Benchmark prof_test.go:9
30ms 30ms 509525: CMPQ AX, 0x170(DX)
. . 50952c: JLE 0x509559 ;prof_test.go:9
10ms 10ms 50952e: IMULQ $0x200011, AX, BX ;prof.Benchmark prof_test.go:11
20ms 20ms 509535: MOVQ BX, SI
80ms 80ms 509538: SARQ $0x3f, BX
. . 50953c: SHRQ $0x22, BX ;prof_test.go:11
. . 509540: ADDQ SI, BX
30ms 30ms 509543: SARQ $0x1e, BX ;prof.Benchmark prof_test.go:11
90ms 90ms 509547: SHLQ $0x1e, BX
. . 50954b: SUBQ BX, SI ;prof_test.go:11
20ms 20ms 50954e: CMPQ $prof.buf+1067102432(SB), SI ;prof.Benchmark prof_test.go:11
. . 509555: JB 0x509517 ;prof_test.go:11
. . 509557: JMP 0x509563
. . 509559: MOVQ 0x10(SP), BP ;prof_test.go:9
. . 50955e: ADDQ $0x18, SP
. . 509562: RET
. . 509563: MOVQ SI, AX ;prof_test.go:11
. . 509566: MOVL $prof.buf+1067102432(SB), CX
. . 50956b: CALL runtime.panicIndex(SB)
=== Linux perf correctly shows the "MOV" instruction is expensive. ===
Percent | Source code & Disassembly of prof.test for cycles:u (24257 samples, percent: local period)
----------------------------------------------------------------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 0000000000509500 <prof.Benchmark>:
: prof.Benchmark():
:
: import "testing"
:
: var buf [1024 * 1024 * 1024]byte
:
: func Benchmark(b *testing.B) {
0.00 : 509500: sub $0x18,%rsp
0.00 : 509504: mov %rbp,0x10(%rsp)
0.00 : 509509: lea 0x10(%rsp),%rbp
: for i := 0; i < b.N; i++ {
0.00 : 50950e: mov 0x20(%rsp),%rdx
0.00 : 509513: xor %eax,%eax
0.00 : 509515: jmp 509525 <prof.Benchmark+0x25>
: // Expensive memory accesses.
: buf[(2097169*i)%len(buf)] = 42
1.82 : 509517: lea 0x14ba02(%rip),%rbx # 654f20 <prof.buf>
92.32 : 50951e: movb $0x2a,(%rbx,%rsi,1)
: for i := 0; i < b.N; i++ {
0.08 : 509522: inc %rax
0.25 : 509525: cmp %rax,0x170(%rdx)
0.00 : 50952c: jle 509559 <prof.Benchmark+0x59>
: buf[(2097169*i)%len(buf)] = 42
0.53 : 50952e: imul $0x200011,%rax,%rbx
1.74 : 509535: mov %rbx,%rsi
0.02 : 509538: sar $0x3f,%rbx
0.11 : 50953c: shr $0x22,%rbx
0.46 : 509540: add %rsi,%rbx
1.93 : 509543: sar $0x1e,%rbx
0.03 : 509547: shl $0x1e,%rbx
0.12 : 50954b: sub %rbx,%rsi
0.00 : 50954e: cmp $0x40000000,%rsi
0.61 : 509555: jb 509517 <prof.Benchmark+0x17>
0.00 : 509557: jmp 509563 <prof.Benchmark+0x63>
: for i := 0; i < b.N; i++ {
0.00 : 509559: mov 0x10(%rsp),%rbp
0.00 : 50955e: add $0x18,%rsp
0.00 : 509562: retq
: buf[(2097169*i)%len(buf)] = 42
0.00 : 509563: mov %rsi,%rax
0.00 : 509566: mov $0x40000000,%ecx
0.00 : 50956b: callq 46aae0 <runtime.panicIndex>
: ./<autogenerated>:1
0.00 : 509570: nop
Cc @hyangah@randall77
3条答案
按热度按时间wswtfjt71#
这听起来很可能是由于在 #36821 中的提案。Go的pprof分析使用ITIMER_PROF,它可能与实际昂贵的指令有几个指令的差距。
另一方面,Linux的perf可以使用Intel PEBS(/AMD等效)进行精确采样。
我不认为perf默认启用PEBS(
-e cycles:pp
是必需的),但即使是标准的硬件计数器也可能比ITIMER_PROF具有更少的差距。des4xlb02#
/cc @rhysh too.
3phpmpom3#
正如我在#30708中指出的,我们无法纠正Go中的滑移,除非使用精确的硬件计数器,这将使#36821得以实现。
无论该提案的结果如何(并非所有系统都将具有精确的计数器),我们都可以更好地记录关于此限制的信息。