My name is Curtis Walker from Supermicro Computer Inc. We have been designing GPU servers here for many years using NVIDIA GPUs.
We are currently seeing an issue that we cannot explain when we use the AMBER 16 software on our 8-GPU servers with the Pascal GTX 1080. When we run on one GPU we don't see any issues. When we use all 4 GPUs, we get an IERR that hangs the system within 3 hours of running the software.
The CPU has MSR registers, which I read back using the Intel ITP. I see MMIO accesses to Function 0 (VGA) of the PNY 1080 causing a problem during a read transaction. The second issue involves the ML2, the Mid-Level Cache, which is where other instructions from the CUDA software are coming in; it also reports an error, and again there is an IERR and the system hangs.
This system uses two CPUs. I have noticed that when the four GTX 1080 GPUs are on CPU 0, the IERR does not occur. When I move the four GTX 1080 GPUs to CPU 1, the IERR occurs within 3 hours.
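For anyone trying to reproduce the CPU 0 versus CPU 1 placement, here is a minimal Python sketch that lists which NUMA node each NVIDIA display function is attached to, using the standard Linux sysfs attributes `vendor`, `class` and `numa_node`. The function name `gpu_numa_nodes` and the `sysfs_root` parameter are illustrative, not part of any tool mentioned here, and it assumes the NUMA node number corresponds to the CPU socket, which is typical on a two-socket board but depends on the BIOS settings.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def gpu_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: numa_node} for every NVIDIA display-class function."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}  # not Linux, or sysfs is not mounted
    result = {}
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable: skip this entry
        # Class code 0x03xxxx is "display controller"; this excludes the
        # HDMI audio function (class 0x0403xx) that sits on the same card.
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            result[dev.name] = node
    return result


if __name__ == "__main__":
    for addr, node in gpu_numa_nodes().items():
        print(f"{addr}: NUMA node {node}")
```

A value of -1 in `numa_node` means the kernel has no affinity information for that device.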
Here is one of the CPU crash dumps from when the CPU hangs while using your tool. The very interesting thing is that we use another NVIDIA CUDA tool and we don't see the failure. So I would really like to have some real on-site support to tell me what this software is affecting so that we can fix this issue, please. This is a very hot issue.
The first part is the CPU MSR register dump, which tells me who reported the error, and from what device, to the CPU to make the IERR occur.
This address is the root cause: 0x000088400006. It also randomly moves from one GTX to the next, with another address: 0x000088500006.
Date: 2017-01-11 15:02:20
CPU0 IERR LOGGING REGISTER Here is an IERR. This means it either came from memory, QPI or PCIe.
-First IERR Source ID : Core 4
-Second IERR Source ID: Core 3
CPU0 MCERR LOGGING REGISTER
-First MCERR Source ID : Core/Cbo 4
-Second MCERR Source ID: Core/Cbo 4
CPU1 MCERR LOGGING REGISTER
-First MCERR Source ID : PCU
-Second MCERR Source ID: PCU
CPU0: X10, Broadwell Server
|===========================================================================================|
| MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
|-------------------------------------------------------------------------------------------|
| Core1 MLC | UnCorrected | Processor Context Corrupt | 1 |0x3fff81060ff5| Generic |
|===========================================================================================|
-MSCOD: MLC Watchdog timer (3-strike) Error
-MLC MISC
-Thread: 0
-Way : 0
-SQID : 0x6
-Opcode: 0x07
|===========================================================================================|
| MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
|-------------------------------------------------------------------------------------------|
| Core2 MLC | UnCorrected | Processor Context Corrupt | 1 |0x3fff81060ff5| Generic |
|===========================================================================================|
-MSCOD: MLC Watchdog timer (3-strike) Error
-MLC MISC
-Thread: 0
-Way : 0
-SQID : 0x6
-Opcode: 0x07
|===========================================================================================|
| MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
|-------------------------------------------------------------------------------------------|
| Core3 MLC | UnCorrected | Processor Context Corrupt | 0 |0x3fff81060f94| None |
|===========================================================================================|
-MSCOD: MLC Watchdog timer (3-strike) Error
-MLC MISC
-Thread: 0
-Way : 0
-SQID : 0x6
-Opcode: 0x07
|===========================================================================================|
| MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
|-------------------------------------------------------------------------------------------|
| Core4 MLC | UnCorrected | Processor Context Corrupt | 0 |0x3fff8154c1e5| Generic |
|===========================================================================================|
-MSCOD: MLC Watchdog timer (3-strike) Error
-MLC MISC
-Thread: 0
-Way : 5
-SQID : 0x4
-Opcode: 0x60
|===========================================================================================|
| MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
|-------------------------------------------------------------------------------------------|
| Core5 MLC | UnCorrected | Processor Context Corrupt | 1 |0x3fff81060ff5| Generic |
|===========================================================================================|
-MSCOD: MLC Watchdog timer (3-strike) Error
-MLC MISC
-Thread: 0
-Way : 0
-SQID : 0x6
-Opcode: 0x07
|===========================================================================================|
| MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
|-------------------------------------------------------------------------------------------|
| Core6 MLC | UnCorrected | Processor Context Corrupt | 1 |0x3fff81060ff5| Generic |
|===========================================================================================|
-MSCOD: MLC Watchdog timer (3-strike) Error
-MLC MISC
-Thread: 0
-Way : 0
-SQID : 0x6
-Opcode: 0x07
|===========================================================================================|
| MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
|-------------------------------------------------------------------------------------------|
| Core7 MLC | UnCorrected | Processor Context Corrupt | 1 |0x3fff81060ff5| Generic |
|===========================================================================================|
-MSCOD: MLC Watchdog timer (3-strike) Error
-MLC MISC
-Thread: 0
-Way : 0
-SQID : 0x6
-Opcode: 0x07
|=================================================================|
| MCA Agent | Signal | Type of Error | OverFlow |
|-----------------------------------------------------------------|
| PCU | UnCorrected | Processor Context Corrupt | 1 |
|=================================================================|
-FW Generated Error: Internal Error
-MSEC_UC: No Error
-Machine Check Error Code: None
-Corrected Error Count: 84
-Error Address : 0xc1e5
|===========================================================================================|
| MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
|-------------------------------------------------------------------------------------------|
| CBo/LLC 0 | UnCorrected | Processor Context Corrupt | 1 |0x000000000286|Physical Address |
|===========================================================================================|
-MSCOD: TOR_TIMEOUT
-Request Type : Generic Error
-Transaction Type: Generic
-Level Encoding : Level 2
-Original Request: Lock
-RTID : 0x16
-TORID : 0x00
-COREID : 0x03
-THREADID: 0x0
-WAY : 0x00
|===========================================================================================|
| MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
|-------------------------------------------------------------------------------------------|
| CBo/LLC 1 | UnCorrected | Processor Context Corrupt | 1 |0x000088400006|Physical Address |
|===========================================================================================|
-MSCOD: TOR_TIMEOUT
-Request Type : Generic Error
-Transaction Type: Generic
-Level Encoding : Level 2
-Original Request: Port Out (CFC/CF8 type transaction)
-RTID : 0x25
-TORID : 0x01
-COREID : 0x04
-THREADID: 0x0
-WAY : 0x00
|===========================================================================================|
| MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
|-------------------------------------------------------------------------------------------|
| CBo/LLC 0 | UnCorrected | Processor Context Corrupt | 1 |0x000088500006|Physical Address |
|===========================================================================================|
-MSCOD: TOR_TIMEOUT
-Request Type : Generic Error
-Transaction Type: Generic
-Level Encoding : Level 2
-Original Request: Port In (CFC/CF8 type transaction)
-RTID : 0x35
-TORID : 0x00
-COREID : 0x06
-THREADID: 0x0
-WAY : 0x00
In this part I show the address map from the OS level, taken from /proc/iomem.
When I look up the address 0x000088400006, you can see that it falls in the MMIO config space:
80000000-8fffffff : PCI MMCONFIG 0000 [bus 00-ff]
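The lookup against /proc/iomem can be sketched in Python as follows. The helper name `find_regions` is hypothetical, and the sample text is only the single line quoted above; on a live system you would pass the full contents of /proc/iomem instead (reading real addresses from that file usually requires root).

```python
import re

# Matches lines such as "80000000-8fffffff : PCI MMCONFIG 0000 [bus 00-ff]".
IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.+?)\s*$")


def find_regions(iomem_text, address):
    """Return the names of all /proc/iomem regions that contain `address`."""
    hits = []
    for line in iomem_text.splitlines():
        m = IOMEM_LINE.match(line)
        if not m:
            continue  # blank or unparseable line
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        if start <= address <= end:
            hits.append(m.group(3))
    return hits


sample = "80000000-8fffffff : PCI MMCONFIG 0000 [bus 00-ff]"
for addr in (0x000088400006, 0x000088500006):
    print(hex(addr), find_regions(sample, addr))
```

Both addresses from the crash dump fall inside the PCI MMCONFIG range.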
Then I take the address 0x000088400006 and the address 0x000088500006,
and I find out which lspci device owns that MMIO config space.
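The step from an MMCONFIG address to a PCI device can be sketched in Python using the standard PCIe ECAM layout, in which each function gets 4 KiB of config space and the bus, device and function numbers occupy address bits 27:20, 19:15 and 14:12. This assumes that the addresses logged by the CBo are physical addresses inside the MMCONFIG window at 0x80000000 starting at bus 00, as /proc/iomem reports above. The names `EcamTarget` and `decode_ecam` are illustrative only.

```python
from typing import NamedTuple


class EcamTarget(NamedTuple):
    bus: int
    device: int
    function: int
    offset: int  # byte offset inside the function's 4 KiB config space

    def bdf(self):
        """Format as lspci does: bus:device.function."""
        return f"{self.bus:02x}:{self.device:02x}.{self.function:x}"


def decode_ecam(address, base=0x80000000, size=0x10000000, start_bus=0):
    """Split a physical address inside an ECAM window into bus/dev/fn/offset."""
    rel = address - base
    if not 0 <= rel < size:
        raise ValueError(f"{address:#x} is outside the MMCONFIG window")
    return EcamTarget(
        bus=start_bus + (rel >> 20),     # 1 MiB of config space per bus
        device=(rel >> 15) & 0x1F,       # 32 devices per bus
        function=(rel >> 12) & 0x7,      # 8 functions per device
        offset=rel & 0xFFF,              # 4 KiB per function
    )


for addr in (0x000088400006, 0x000088500006):
    t = decode_ecam(addr)
    print(f"{addr:#014x} -> {t.bdf()} offset {t.offset:#05x}")
```

Under these assumptions the two addresses decode to 84:00.0 and 85:00.0, both at config offset 0x006, which is the Status register in the standard PCI configuration header.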
This is on CPU 0
84:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1) (prog-if 00 [VGA controller])
Subsystem: ZOTAC International (MCO) Ltd. Device 1448
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 40
Region 0: Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at 2ffe0000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at 2fff0000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at e000 [size=128]
Expansion ROM at fb000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [250 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Kernel driver in use: nvidia
84:00.1 Audio device: NVIDIA Corporation Device 10f0 (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. Device 1448
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin B routed to IRQ 76
Region 0: Memory at fb080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
This is one of the IERRs we receive when the AMBER software is running.
It is reported by the MSR registers in the CPU when the system hangs.
ML2 is the Mid-Level Cache, where the software keeps repeated instructions.
The MC3 address is a virtual address from when CUDA is running.
Number of CPU: 2
CPU Model: E5-2600-V4
Number of cores per CPU: 8
Microcode: b00001f
Socket0 LLC_MASK e6d
Socket0 CORE_MASK ff
skt0 MCA_ERR_SRC = 0xf4000000
skt0 IerrLoggingReg = 0xf880b80
skt0 MCerrLoggingReg = 0x5440144
Error Type: ifu Socket: 0 Core: 0 Status: 0
Error Type: ifu Socket: 0 Core: 1 Status: 0
Error Type: ifu Socket: 0 Core: 2 Status: 0
Error Type: ifu Socket: 0 Core: 3 Status: 0
Error Type: ifu Socket: 0 Core: 4 Status: 0
Error Type: ifu Socket: 0 Core: 5 Status: 0
Error Type: ifu Socket: 0 Core: 6 Status: 0
Error Type: ifu Socket: 0 Core: 7 Status: 0
Error Type: dcu Socket: 0 Core: 0 Status: 0
Error Type: dcu Socket: 0 Core: 1 Status: 0
Error Type: dcu Socket: 0 Core: 2 Status: 0
Error Type: dcu Socket: 0 Core: 3 Status: 0
Error Type: dcu Socket: 0 Core: 4 Status: 0
Error Type: dcu Socket: 0 Core: 5 Status: 0
Error Type: dcu Socket: 0 Core: 6 Status: 0
Error Type: dcu Socket: 0 Core: 7 Status: 0
Error Type: dtlb Socket: 0 Core: 0 Status: 0
Error Type: dtlb Socket: 0 Core: 1 Status: 0
Error Type: dtlb Socket: 0 Core: 2 Status: 0
Error Type: dtlb Socket: 0 Core: 3 Status: 0
Error Type: dtlb Socket: 0 Core: 4 Status: 0
Error Type: dtlb Socket: 0 Core: 5 Status: 0
Error Type: dtlb Socket: 0 Core: 6 Status: 0
Error Type: dtlb Socket: 0 Core: 7 Status: 0
Error Type: ml2 Socket: 0 Core: 0 Status: fe00000000800400
Socket0: ml2_MC3_CORE0_CTL= f
Socket0: ml2_MC3_CORE0_STATUS= fe00000000800400
Socket0: ml2_MC3_CORE0_ADDR= ffffffff81060ff5
Socket0: ml2_MC3_CORE0_MISC= ffffffff81060ff5
Socket0: ml2_MC3_CORE0_CTL2= 40000001
Error Type: ml2 Socket: 0 Core: 1 Status: fe00000000800400
Socket0: ml2_MC3_CORE1_CTL= f
Socket0: ml2_MC3_CORE1_STATUS= fe00000000800400
Socket0: ml2_MC3_CORE1_ADDR= ffffffff81060ff5
Socket0: ml2_MC3_CORE1_MISC= ffffffff81060ff5
Socket0: ml2_MC3_CORE1_CTL2= 40000001
Error Type: ml2 Socket: 0 Core: 2 Status: fe00000000800400
Socket0: ml2_MC3_CORE2_CTL= f
Socket0: ml2_MC3_CORE2_STATUS= fe00000000800400
Socket0: ml2_MC3_CORE2_ADDR= ffffffff81060ff5
Socket0: ml2_MC3_CORE2_MISC= ffffffff81060ff5
Socket0: ml2_MC3_CORE2_CTL2= 40000001
Error Type: ml2 Socket: 0 Core: 3 Status: fe00000000800400
Socket0: ml2_MC3_CORE3_CTL= f
Socket0: ml2_MC3_CORE3_STATUS= fe00000000800400
Socket0: ml2_MC3_CORE3_ADDR= ffffffff81060ff5
Socket0: ml2_MC3_CORE3_MISC= ffffffff81060ff5
Socket0: ml2_MC3_CORE3_CTL2= 40000001
Error Type: ml2 Socket: 0 Core: 4 Status: fe00000000800400
Socket0: ml2_MC3_CORE4_CTL= f
Socket0: ml2_MC3_CORE4_STATUS= fe00000000800400
Socket0: ml2_MC3_CORE4_ADDR= ffffffff81060ff5
Socket0: ml2_MC3_CORE4_MISC= ffffffff81060ff5
Socket0: ml2_MC3_CORE4_CTL2= 40000001
Error Type: ml2 Socket: 0 Core: 5 Status: fe00000000800400
Socket0: ml2_MC3_CORE5_CTL= f
Socket0: ml2_MC3_CORE5_STATUS= fe00000000800400
Socket0: ml2_MC3_CORE5_ADDR= ffffffff81060ff5
Socket0: ml2_MC3_CORE5_MISC= ffffffff81060ff5
Socket0: ml2_MC3_CORE5_CTL2= 40000001
Error Type: ml2 Socket: 0 Core: 6 Status: fe00000000800400
Socket0: ml2_MC3_CORE6_CTL= f
Socket0: ml2_MC3_CORE6_STATUS= fe00000000800400
Socket0: ml2_MC3_CORE6_ADDR= ffffffff81060ff5
Socket0: ml2_MC3_CORE6_MISC= ffffffff81060ff5
Socket0: ml2_MC3_CORE6_CTL2= 40000001
Error Type: ml2 Socket: 0 Core: 7 Status: fe00000000800400
Socket0: ml2_MC3_CORE7_CTL= f
Socket0: ml2_MC3_CORE7_STATUS= fe00000000800400
Socket0: ml2_MC3_CORE7_ADDR= ffffffff81060ff5
Socket0: ml2_MC3_CORE7_MISC= ffffffff81060ff5
Socket0: ml2_MC3_CORE7_CTL2= 40000001
Error Type: pcu Socket: 0 Status: b200000000400407
Socket0: pcu_MC4_CTL= 7f
Socket0: pcu_MC4_STATUS= b200000000400407
Socket0: pcu_MC4_ADDR= 0
Socket0: pcu_MC4_MISC= 0
Socket0: pcu_MC4_CTL2= 0
Error Type: qpi0 Socket: 0 Status: 0
Error Type: iio Socket: 0 Status: 0
Error Type: ha0 Socket: 0 Status: 0
Error Type: ha1 Socket: 0 Status: 0
Error Type: imc0 Socket: 0 Status: 0
Error Type: imc1 Socket: 0 Status: 0
Error Type: imc2 Socket: 0 Status: 0
Error Type: imc3 Socket: 0 Status: 0
Error Type: imc4 Socket: 0 Status: 0
Error Type: imc5 Socket: 0 Status: 0
Error Type: imc6 Socket: 0 Status: 0
Error Type: imc7 Socket: 0 Status: 0
Error Type: cbo0 Socket: 0 Status: 0
Error Type: cbo1 Socket: 0 Status: 0
Error Type: cbo2 Socket: 0 Status: 0
Error Type: qpi1 Socket: 0 Status: 0
Error Type: qpi2 Socket: 0 Status: 0
CPU Model: E5-2600-V4
Number of cores per CPU: 8
Microcode: b00001f
Socket1 LLC_MASK e6d
Socket1 CORE_MASK ff
skt1 MCA_ERR_SRC = 0xbc000000
skt1 IerrLoggingReg = 0xf840b88
skt1 MCerrLoggingReg = 0x7880144
Error Type: ifu Socket: 1 Core: 0 Status: badbad
Error Type: ifu Socket: 1 Core: 1 Status: 0
Error Type: ifu Socket: 1 Core: 2 Status: 0
Error Type: ifu Socket: 1 Core: 3 Status: badbad
Error Type: ifu Socket: 1 Core: 4 Status: badbad
Error Type: ifu Socket: 1 Core: 5 Status: badbad
Error Type: ifu Socket: 1 Core: 6 Status: badbad
Error Type: ifu Socket: 1 Core: 7 Status: badbad
Error Type: dcu Socket: 1 Core: 0 Status: badbad
Error Type: dcu Socket: 1 Core: 1 Status: 0
Error Type: dcu Socket: 1 Core: 2 Status: 0
Error Type: dcu Socket: 1 Core: 3 Status: badbad
Error Type: dcu Socket: 1 Core: 4 Status: badbad
Error Type: dcu Socket: 1 Core: 5 Status: badbad
Error Type: dcu Socket: 1 Core: 6 Status: badbad
Error Type: dcu Socket: 1 Core: 7 Status: badbad
Error Type: dtlb Socket: 1 Core: 0 Status: badbad
Error Type: dtlb Socket: 1 Core: 1 Status: 0
Error Type: dtlb Socket: 1 Core: 2 Status: 0
Error Type: dtlb Socket: 1 Core: 3 Status: badbad
Error Type: dtlb Socket: 1 Core: 4 Status: badbad
Error Type: dtlb Socket: 1 Core: 5 Status: badbad
Error Type: dtlb Socket: 1 Core: 6 Status: badbad
Error Type: dtlb Socket: 1 Core: 7 Status: badbad
Error Type: ml2 Socket: 1 Core: 0 Status: badbad
Error Type: ml2 Socket: 1 Core: 1 Status: be00000000800400
Socket1: ml2_MC3_CORE1_CTL= f
Socket1: ml2_MC3_CORE1_STATUS= be00000000800400
Socket1: ml2_MC3_CORE1_ADDR= ffffffff81060f94
Socket1: ml2_MC3_CORE1_MISC= ffffffff81060f94
Socket1: ml2_MC3_CORE1_CTL2= 40000001
Error Type: ml2 Socket: 1 Core: 2 Status: fe00000000800400
Socket1: ml2_MC3_CORE2_CTL= f
Socket1: ml2_MC3_CORE2_STATUS= fe00000000800400
Socket1: ml2_MC3_CORE2_ADDR= ffffffff8154c1d5
Socket1: ml2_MC3_CORE2_MISC= ffffffff8154c1d5
Socket1: ml2_MC3_CORE2_CTL2= 40000001
Error Type: ml2 Socket: 1 Core: 3 Status: badbad
Error Type: ml2 Socket: 1 Core: 4 Status: badbad
Error Type: ml2 Socket: 1 Core: 5 Status: badbad
Error Type: ml2 Socket: 1 Core: 6 Status: badbad
Error Type: ml2 Socket: 1 Core: 7 Status: badbad
Error Type: pcu Socket: 1 Status: be00000000800400
Socket1: pcu_MC4_CTL= 7f
Socket1: pcu_MC4_STATUS= be00000000800400
Socket1: pcu_MC4_ADDR= ffffffff81060f94
Socket1: pcu_MC4_MISC= ffffffff81060f94
Socket1: pcu_MC4_CTL2= 0
Error Type: qpi0 Socket: 1 Status: 0
Error Type: iio Socket: 1 Status: 0
Error Type: ha0 Socket: 1 Status: 0
Error Type: ha1 Socket: 1 Status: 0
Error Type: imc0 Socket: 1 Status: 0
Error Type: imc1 Socket: 1 Status: 0
Error Type: imc2 Socket: 1 Status: 0
Error Type: imc3 Socket: 1 Status: 0
Error Type: imc4 Socket: 1 Status: 0
Error Type: imc5 Socket: 1 Status: 0
Error Type: imc6 Socket: 1 Status: 0
Error Type: imc7 Socket: 1 Status: 0
Error Type: cbo0 Socket: 1 Status: 0
Error Type: cbo1 Socket: 1 Status: fe200000000c110a
Socket1: cbo1_MC18_CTL= 1ffffff
Socket1: cbo1_MC18_STATUS= fe200000000c110a
Socket1: cbo1_MC18_ADDR= 280
Socket1: cbo1_MC18_MISC= 50fff81601580086
Socket1: cbo1_MC18_CTL2= 40000001
Error Type: cbo2 Socket: 1 Status: fe200000000c110a
Socket1: cbo2_MC19_CTL= 1ffffff
Socket1: cbo2_MC19_STATUS= fe200000000c110a
Socket1: cbo2_MC19_ADDR= 88400000
Socket1: cbo2_MC19_MISC= 70ffa81602500086
Socket1: cbo2_MC19_CTL2= 40000001
Error Type: qpi1 Socket: 1 Status: 0
Error Type: qpi2 Socket: 1 Status: 0
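The STATUS values in the dump above can be unpacked with a short Python sketch. It uses only the architectural bit positions of IA32_MCi_STATUS from the Intel SDM (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57, the model-specific code in bits 31:16, and the MCA error code in bits 15:0). The function name `decode_mc_status` is hypothetical, and the meaning of the model-specific codes on this CPU is not decoded here.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM, machine-check chapter).
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC is valid
    "ADDRV": 58,  # MCi_ADDR is valid
    "PCC": 57,    # processor context corrupt
}


def decode_mc_status(status):
    """Return the flags and the two 16-bit error codes of an MCi_STATUS value."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF  # model-specific error code
    fields["MCACOD"] = status & 0xFFFF         # architectural MCA error code
    return fields


# Values copied from the dump above.
for label, value in [
    ("ml2 socket 0", 0xFE00000000800400),
    ("ml2 socket 1 core 1", 0xBE00000000800400),
    ("pcu socket 0", 0xB200000000400407),
    ("cbo1 socket 1", 0xFE200000000C110A),
]:
    d = decode_mc_status(value)
    flags = " ".join(name for name in FLAG_BITS if d[name])
    print(f"{label}: {flags} MSCOD={d['MSCOD']:#06x} MCACOD={d['MCACOD']:#06x}")
```

Every nonzero status here has UC and PCC set, which matches the "UnCorrected / Processor Context Corrupt" rows in the first dump. The ml2 entries carry MCACOD 0x0400, the architectural internal timer error code, which is consistent with the "MLC Watchdog timer (3-strike)" decoding shown earlier.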
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Kernel driver in use: snd_hda_intel
Here is another way to see another IERR CPU crash dump.
Curtis Walker
Manager Hardware Debug Engineer
Technical Services
Supermicro Computer, Inc
Ph.: 408-895-6221
Cell phone: 408-910-2487
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Feb 16 2017 - 14:30:02 PST