debugging WinDbg: Hunting the exception that is crashing a .NET service

4bbkushb  posted on 2022-11-14 in .NET

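We have been having trouble with an x64 .NET service that crashes intermittently on an application server. The service can run for hours, days, or weeks without a problem, and then it crashes without leaving much information.
The service runs in a cluster of two servers (three services per server), and the crash has occurred in any of the services on either server. A replicated environment shows the same behavior, so I have "exhausted" the idea that this is a configuration problem.
The error originally pulled from the application server's event log was: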

Error message from event log on server XXXX

Application: MySvc.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an internal error in the .NET Runtime
at IP 000007FEEFD8CD4C (000007FEEFC70000) with exit code 80131506
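
As an aside, the two numbers in that event-log entry can be unpacked a little. The following is a minimal Python sketch, assuming the standard HRESULT bit layout and assuming the value in parentheses is the load address of the faulting module (most likely clr.dll). The helper names are made up for illustration.

```python
def decode_hresult(hr):
    """Split a 32-bit HRESULT into its standard fields."""
    return {
        "failure": bool((hr >> 31) & 1),   # severity bit: 1 means failure
        "facility": (hr >> 16) & 0x7FF,    # 11-bit facility code
        "code": hr & 0xFFFF,               # 16-bit facility-specific code
    }


def module_offset(ip, base):
    """Offset of the faulting instruction from the (assumed) module base."""
    return ip - base


# Values copied from the event-log entry above.
fields = decode_hresult(0x80131506)
offset = module_offset(0x000007FEEFD8CD4C, 0x000007FEEFC70000)

print(fields)
print(hex(offset))
```

Facility 0x13 is the .NET runtime facility, and 0x80131506 is, as far as I know, COR_E_EXECUTIONENGINE, i.e. a fatal execution engine error rather than an ordinary managed exception. Once symbols resolve, the offset can be looked up in the debugger relative to the module base.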

This does not reveal much detail, and the best pointer I could find online amounted to "cross your fingers"...
Application Crashes With "Internal Error In The .NET Runtime"
http://www.jamesewelch.com/2010/09/30/troubleshooting-internal-error-in-the-net-runtime/
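Finally, after running the AdPlus debugger for a month, we caught a series of failures and some crash dumps. Now that I have the dumps, I am struggling to get anything useful out of them.
I have investigated a few "hang" dumps before with great success, and have read a lot of Tess Ferrandez's blog, but the "crash" dumps I have are proving to be dead ends. Most objects, exceptions, etc. are marked for garbage collection and only the main thread is left. I may be missing something.
I will add the details from !analyze -v as well as the dump logs, which do show the exceptions.
So the real question is: can anyone give me some pointers? The exceptions in the dump logs do not match what I see in the actual dumps.
The log for dump 1 is available at: http://pastebin.com/Eg5YCqww
Dump 1 analysis (I have a symbol problem that I cannot resolve):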

0:000> !analyze -v
***
FAULTING_IP: 
+112c9440
00000000`00000000 ??              ???

EXCEPTION_RECORD:  ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 0000000000000000
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 0

FAULTING_THREAD:  00000000000011f8

PROCESS_NAME:  MySvc.exe

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid

MOD_LIST: <ANALYSIS/>

NTGLOBALFLAG:  0

APPLICATION_VERIFIER_FLAGS:  0

MANAGED_STACK: 
(TransitionMU)
000000000022EBB0 000007FEF40CB1AB System_ServiceProcess_ni!DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr)+0x3b
000000000022EC70 000007FEF40CD20D System_ServiceProcess_ni!System.ServiceProcess.ServiceBase.Run(System.ServiceProcess.ServiceBase[])+0x26d
000000000022EDA0 000007FF00170227 MySvc!Ax.Remoting.MySvc.Main()+0x107
(TransitionUM)

MANAGED_STACK_COMMAND:  _EFN_StackTrace

BUGCHECK_STR:  APPLICATION_FAULT_WRONG_SYMBOLS_FILL_PATTERN_ffffffff

PRIMARY_PROBLEM_CLASS:  WRONG_SYMBOLS_FILL_PATTERN_ffffffff

DEFAULT_BUCKET_ID:  WRONG_SYMBOLS_FILL_PATTERN_ffffffff

LAST_CONTROL_TRANSFER:  from 000007fefd8810ac to 000000007760f6fa

STACK_TEXT:  
00000000`0022e818 000007fe`fd8810ac : 00000000`007541f0 000007fe`f40ce089 00000000`0022e9c0 00000000`00000000 : ntdll!ZwWaitForSingleObject+0xa
00000000`0022e820 000007fe`fe7daffb : 00000000`ffffffff 000007fe`fe7d344c 00000000`00000000 00000000`0000032c : KERNELBASE!WaitForSingleObjectEx+0x79
00000000`0022e8c0 000007fe`fe7d9d61 : 00000000`01d47ff0 00000000`0000032c 00000000`00000000 00000000`00000000 : sechost!ScSendResponseReceiveControls+0x13b
00000000`0022e9b0 000007fe`fe7d9c16 : 00000000`0022eb18 00000000`00000000 00000000`00000000 000007fe`00000000 : sechost!ScDispatcherLoop+0x121
00000000`0022eac0 000007fe`f19017c7 : 00000000`11213890 00000000`01d635c0 00000000`00000000 00000000`00000000 : sechost!StartServiceCtrlDispatcherW+0x14e
00000000`0022eb10 000007fe`f40cb1ab : 00000000`01d63680 00000000`0022ebe8 000007fe`f40a5b50 0000bf6c`4589127e : clr!DoNDirectCall__PatchGetThreadCall+0x7b
00000000`0022ebb0 000007fe`f40cd20d : 00000000`01d63680 00000000`00000000 00000000`01d63698 00000000`00000000 : System_ServiceProcess_ni+0x2b1ab
00000000`0022ec70 000007ff`00170227 : 00000000`10ff1ac8 00000000`10ff1af0 00000000`10ff1af0 00000000`10ff1af0 : System_ServiceProcess_ni+0x2d20d
00000000`0022eda0 000007fe`f196dc54 : 00000000`0022ee80 000007fe`f1904e65 ffffffff`fffffffe 00000000`0022f3a0 : 0x7ff`00170227
00000000`0022ee30 000007fe`f196dd69 : 000007ff`000551f8 00000000`00000001 00000000`00000000 00000000`00000000 : clr!CallDescrWorker+0x84
00000000`0022ee70 000007fe`f196dde5 : 00000000`0022ef88 00000000`00000000 00000000`0022ef90 00000000`0022f168 : clr!CallDescrWorkerWithHandler+0xa9
00000000`0022eef0 000007fe`f1a214c5 : 00000000`00000000 00000000`0022f178 00000000`00000000 00000000`00000000 : clr!MethodDesc::CallDescr+0x2a1
00000000`0022f120 000007fe`f1a215fc : 00000000`000ad7c0 00000000`000ad7c0 00000000`00000000 00000000`00000000 : clr!ClassLoader::RunMain+0x228
00000000`0022f370 000007fe`f1a213b2 : 00000000`0022f970 00000000`00000200 00000000`000b7a80 00000000`00000200 : clr!Assembly::ExecuteMainMethod+0xac
00000000`0022f620 000007fe`f1ac6d66 : 00000000`00000000 00000000`10fd0000 00000000`00000000 00000000`00000000 : clr!SystemDomain::ExecuteMainMethod+0x452
00000000`0022fbd0 000007fe`f1ac6c83 : 00000000`10fd0000 00000000`00000000 00000000`00000000 00000000`00000000 : clr!ExecuteEXE+0x43
00000000`0022fc30 000007fe`f1a2c515 : 00000000`000ad7c0 ffffffff`ffffffff 00000000`00000000 00000000`00000000 : clr!CorExeMainInternal+0xc4
00000000`0022fca0 000007fe`f8973309 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`0022fc88 : clr!CorExeMain+0x15
00000000`0022fce0 000007fe`f8a05b21 : 000007fe`f1a2c500 000007fe`f89732c0 00000000`00000000 00000000`00000000 : mscoreei!CorExeMain+0x41
00000000`0022fd10 00000000`773bf56d : 000007fe`f8970000 00000000`00000000 00000000`00000000 00000000`00000000 : mscoree!CorExeMain_Exported+0x57
00000000`0022fd40 00000000`775f2cc1 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`0022fd70 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

STACK_COMMAND:  ~0s; .ecxr ; kb

FOLLOWUP_IP: 
sechost!ScSendResponseReceiveControls+13b
000007fe`fe7daffb 85c0            test    eax,eax

SYMBOL_STACK_INDEX:  2

SYMBOL_NAME:  sechost!ScSendResponseReceiveControls+13b

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: sechost

IMAGE_NAME:  sechost.dll

DEBUG_FLR_IMAGE_TIMESTAMP:  4a5be05e

FAILURE_BUCKET_ID:  WRONG_SYMBOLS_FILL_PATTERN_ffffffff_80000003_sechost.dll!ScSendResponseReceiveControls

BUCKET_ID:  X64_APPLICATION_FAULT_WRONG_SYMBOLS_FILL_PATTERN_ffffffff_sechost!ScSendResponseReceiveControls+13b

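Update 1 (29 December):

I have reconstructed one of the CLR exceptions from the dump log; the call stack is below. The exception appears to be thrown while calling the database (via ODAC).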

clr!RaiseTheExceptionInternalOnly+0x363
clr!IL_Throw+0x146
gm.a(System.String, System.String, Int32, System.String, XXBase, Int32, XXDataParameter[])
gm.b(XXBase, XXBase, Boolean, Boolean, Boolean, Int32)
gm.b(XXBase, XXBase)
od.a(XXGridQueue, TaskStatus, ProcessResult, Int32, Int32, Int32)
od.b(XXGridQueue)
he.b(XXBaseCollection)
he.a(Boolean ByRef)
XX.MySvc.tmr_Elapsed(System.Object)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)

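I have also reconstructed the call stack of the access violation. The error is raised when the garbage collector runs after a call into the ODAC library.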

(1330.1074): Access violation - code c0000005 (first chance)
FirstChance_av_AccessViolation

clr!WKS::gc_heap::plan_phase+0x5ac
clr!WKS::gc_heap::gc1+0xbb
clr!WKS::gc_heap::garbage_collect+0x276
clr!WKS::GCHeap::GarbageCollectGeneration+0x14e
clr!WKS::gc_heap::try_allocate_more_space+0x25f
clr!WKS::GCHeap::Alloc+0x7e
clr!FastAllocatePrimitiveArray+0xc5
clr!JIT_NewArr1+0x389
System.Decimal.GetBits(System.Decimal)
Oracle.DataAccess.Types.DecimalConv.GetDecimal(IntPtr)
Oracle.DataAccess.Client.OracleDataReader.GetDecimal(Int32)
Oracle.DataAccess.Client.OracleDataReader.GetValue(Int32)
Oracle.DataAccess.Client.OracleDataReader.GetValues(System.Object[])
jr.a(System.Data.IDataReader, Boolean, ku, Boolean, DbTypeEnum, System.Type[])
ls.a(System.Data.IDataReader, Boolean, ku, Boolean, DbTypeEnum, System.Type[])
ba.a(System.String, System.Data.IDataReader, Boolean, ku, Boolean, System.Type[])
...
XX.MySvc.tmr_Elapsed(System.Object)

Possibly a similar issue (based on the new information): http://markmail.org/message/yy3mvbngula4i3mu#query:+page:1+mid:l546gn5sfxtxxm5i+state:results

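Update 2 (23 February):

The ODAC components have been upgraded to the correct version for .NET 4.0 (or rather, the compatible version listed on the Oracle website), but the problem still recurs, very intermittently (every one to two weeks). The affected services are recycled daily.
We got more dumps from the most recent crashes, and these still point to heap corruption (access violations), although they are not full dumps. Creating a full dump actually appears to fail.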

Creating d:\dumps\2xx_Crash_Mode\FULLDUMP_FirstChance_epr_Process_Shut_Down_MySvc.exe__0344.dmp - mini user dump
WriteFullMemory.Memory.Read(0x262c000, 0x1000) failed 0x8007012b, ABORT.
Dump creation failed, Win32 error 0n299
    "Only part of a ReadProcessMemory or WriteProcessMemory request was completed."

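In addition, a custom managed (.NET) library is loaded into the application, and it also appears to raise an exception, although only a "first chance" one that does not seem to bring the process down (my guess is that it may be a contributing factor). It is actually our own library too, so I have been able to verify that it does not call unmanaged code. The error is: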

EXCEPTION_RECORD:  ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 000007fefcffaa7d (KERNELBASE!RaiseException+0x0000000000000039)
ExceptionCode: c0000006 (In-page I/O error)
ExceptionFlags: 00000000
NumberParameters: 3
Parameter[0]: 0000000000000000
Parameter[1]: 000000006d34aca0
Parameter[2]: 00000000c00000c4
Inpage operation failed at 000000006d34aca0, due to I/O error 00000000c00000c4

PROCESS_NAME:  MySvc.exe

ERROR_CODE: (NTSTATUS) 0xc0000006 - The instruction at 0x%p referenced memory at 0x%p. The required data was not placed into memory because of an I/O error status of 0x%x.

EXCEPTION_OBJECT: !pe 1a8106a8
Exception object: 000000001a8106a8
Exception type:   System.Runtime.InteropServices.SEHException
Message:          External component has thrown an exception.
InnerException:   <none>
StackTrace (generated):
SP               IP               Function
000000002C77B980 0000000000000000 ...
000000002C77BA50 000007FF01DCBA51 ...

StackTraceString: <none>
HResult: 80004005

MANAGED_OBJECT: !dumpobj 148306f8
Name:        System.String
MethodTable: 000007feed9a6870
EEClass:     000007feed52ed58
Size:        112(0x70) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      External component has thrown an exception.
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000000000000000  4000103        8         System.Int32  1 instance               43 m_stringLength
0000000000000000  4000104        c          System.Char  1 instance               45 m_firstChar
000007feed9a6870  4000105       10        System.String  0   shared           static Empty
                             >> Domain:Value  00000000002a69f0:NotInit  000000000dd738d0:NotInit  <<

EXCEPTION_MESSAGE:  External component has thrown an exception.

MANAGED_OBJECT_NAME:  System.Runtime.InteropServices.SEHException

MANAGED_STACK_COMMAND:  !pe 1a8106a8

LAST_CONTROL_TRANSFER:  from 000007fef47e8fc1 to 000007fefcffaa7d

ADDITIONAL_DEBUG_TEXT:  Followup set based on attribute [Is_ChosenCrashFollowupThread] from Frame:[0] on thread:[PSEUDO_THREAD] ; Followup set based on attribute [ip_is_call_value_Arch_si] from Frame:[23] on thread:[162c]

FAULTING_THREAD:  ffffffffffffffff

BUGCHECK_STR:  APPLICATION_FAULT__SYSTEM.RUNTIME.INTEROPSERVICES.SEHEXCEPTION_APPLICATION_FAULT_CALL

PRIMARY_PROBLEM_CLASS:  _SYSTEM.RUNTIME.INTEROPSERVICES.SEHEXCEPTION_CALL

DEFAULT_BUCKET_ID:  _SYSTEM.RUNTIME.INTEROPSERVICES.SEHEXCEPTION_CALL

STACK_TEXT:  
00000000`2c77b980 00000000`00000000 ...
00000000`2c77ba50 00000000`ffffffff ...
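
The three parameters of the c0000006 record above carry more detail than the summary line suggests. Here is a rough Python sketch of how they can be read, assuming the documented meaning of the in-page error parameters (access type, faulting address, underlying NTSTATUS) and the standard NTSTATUS bit layout. The function names are illustrative only.

```python
SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}
ACCESS = {0: "read", 1: "write", 8: "execute (DEP)"}


def decode_ntstatus(status):
    """Split a 32-bit NTSTATUS into severity, facility and code."""
    return {
        "severity": SEVERITY[(status >> 30) & 0x3],
        "facility": (status >> 16) & 0xFFF,
        "code": status & 0xFFFF,
    }


def describe_in_page_error(params):
    """Interpret the three parameters of a c0000006 exception record."""
    access, address, status = params
    kind = ACCESS.get(access, "unknown access")
    return f"{kind} of {address:#x} failed with NTSTATUS {status:#010x}"


# Parameter[0..2] copied from the exception record above.
params = (0x0, 0x6D34ACA0, 0xC00000C4)
summary = describe_in_page_error(params)
inner = decode_ntstatus(params[2])
print(summary)
print(inner)
```

Status 0xc00000c4 is, to my knowledge, STATUS_UNEXPECTED_NETWORK_ERROR in ntstatus.h. That would be consistent with the page being read from an image or mapped file that lives on a network share, but it is worth checking against the actual deployment rather than taking as established.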

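Does anyone have any ideas on how to pursue this further in an expedient way? I am keen to get some fuller dumps, but of course I need to find an answer before the next failure!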

d5vmydt9  #1

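The reason for the crash (a breakpoint being hit) indicates heap corruption in the process. Heap corruption failures are reported by the heap management functions by issuing a debug break.
Judging from the logged error, the .NET runtime is not prepared to handle these (I may be wrong, and there may be a better explanation). The usual way to track down heap corruption is to enable (full) page heap, which helps locate the offending component by making the process crash much closer to the point of corruption.
Hunting heap corruption is a real pain, to say the least, but if memory consumption allows it I would go for full page heap, which works best for applications with moderate memory requirements.
Hope that helps.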
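
For reference, full page heap is switched on per executable with gflags.exe from the Debugging Tools for Windows. The sketch below only prints the command lines. It assumes gflags is on the PATH of an elevated prompt and that the image name is MySvc.exe as in the question. The helper name is invented.

```python
def pageheap_commands(image):
    """Command lines for toggling full page heap on one executable image."""
    return {
        "enable": ["gflags", "/p", "/enable", image, "/full"],
        "list": ["gflags", "/p"],  # show images that have page heap enabled
        "disable": ["gflags", "/p", "/disable", image],
    }


# Print the commands for inspection; nothing is executed here.
for name, argv in pageheap_commands("MySvc.exe").items():
    print(f"{name:8} {' '.join(argv)}")
```

The setting is read at process start, so the service has to be restarted after enabling it. Full page heap raises memory use considerably, so it is worth trying on the replicated environment first.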

bkkx9g8r  #2

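There is a bug in the GC of x64 .NET 4.0, and you may be hitting it. Microsoft recommends disabling concurrent GC until they release a hotfix. Alternatively, you can enable server GC, which gives each core its own GC thread. That only works if you have more than one core; otherwise the server GC flag has no effect.
Here is the link to the KB article.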
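
Both switches live in the service's application configuration file. A minimal fragment, assuming the file is MySvc.exe.config next to the executable (the name is inferred from the question, not confirmed):

```xml
<configuration>
  <runtime>
    <!-- Turn off concurrent GC (it is enabled by default) -->
    <gcConcurrent enabled="false"/>
    <!-- Alternative: server GC; ignored on a single-core machine -->
    <!-- <gcServer enabled="true"/> -->
  </runtime>
</configuration>
```

The `clr!WKS::gc_heap` frames in the access violation stack from the question indicate that the process is currently running the workstation GC, so neither switch appears to be set yet. The service needs a restart to pick up the change.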

8wigbo56  #3

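A few things:
1. Make sure you are running the latest version of the CLR.
2. For native heap corruption, pageheap is a good option; for the managed heap you can try GCStress (see "How to turn GCStress on in Windows 7?").
3. To verify heap corruption on the managed heap, you can use VerifyHeap in SOS (https://msdn.microsoft.com/en-us/library/bb190764(v=vs.110).aspx): "VerifyHeap checks the garbage collector heap for signs of corruption and displays any errors found. Heap corruptions can be caused by platform invoke calls that are constructed incorrectly."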
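
Since a month of AdPlus runs leaves several dumps, the same SOS checks can be run over each of them with the command-line debugger. This is a sketch only: it assumes cdb.exe is on the PATH, that the dumps sit in the current directory, and that a matching SOS/DAC can be loaded on the analysis machine. The function name is hypothetical.

```python
from pathlib import Path

SOS_COMMANDS = [
    ".loadby sos clr",  # load the SOS extension next to the .NET 4 runtime
    "!VerifyHeap",      # check the managed heap for signs of corruption
    "!pe",              # print the last managed exception on the current thread
    "q",                # quit so the next dump can be processed
]


def cdb_argv(dump, commands=SOS_COMMANDS):
    """Build a cdb command line that opens a dump and runs the commands."""
    return ["cdb", "-z", str(dump), "-c", "; ".join(commands)]


if __name__ == "__main__":
    # Print the argument lists; pass one to subprocess.run to execute it.
    for dump in sorted(Path(".").glob("*.dmp")):
        print(cdb_argv(dump))
```

VerifyHeap is most meaningful on a full dump, so the mini dumps mentioned in the question may give incomplete results.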
