Re: [AMBER] IERR CPU hang needing support please

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 17 Feb 2017 11:35:57 -0500

Hi Curtis,

What does 'lspci -d "10b5:*" -vvv | grep ACSCtl', run as root, return?
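
If it helps to read the result, here is a minimal Python sketch that turns the ACSCtl lines into flags. It assumes the usual lspci layout of "Name+" / "Name-" tokens. The sample line is a made-up example, not output from your system, and the helper names (parse_acsctl, redirects_p2p) are purely illustrative. A port with ReqRedir+ or CmpltRedir+ redirects peer-to-peer traffic upstream instead of routing it directly through the switch, which is why this check is worth doing.

```python
import re


def parse_acsctl(lspci_text):
    """Return one {flag_name: enabled} dict per 'ACSCtl:' line in the text."""
    results = []
    for line in lspci_text.splitlines():
        if "ACSCtl:" not in line:
            continue
        flags_part = line.split("ACSCtl:", 1)[1]
        # Each token looks like 'ReqRedir+' (enabled) or 'ReqRedir-' (disabled).
        flags = {name: sign == "+"
                 for name, sign in re.findall(r"(\w+)([+-])", flags_part)}
        results.append(flags)
    return results


def redirects_p2p(flags):
    """True if request or completion redirect is enabled on this port."""
    return flags.get("ReqRedir", False) or flags.get("CmpltRedir", False)


if __name__ == "__main__":
    # Hypothetical sample line, for illustration only.
    sample = ("\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ CmpltRedir+ "
              "UpstreamFwd+ EgressCtrl- DirectTrans-")
    for port_flags in parse_acsctl(sample):
        print(port_flags, "redirects P2P:", redirects_p2p(port_flags))
```

To use it on real data, save the lspci output to a file and pass the file contents to parse_acsctl().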

Also, what happens if you run a 2-GPU job on the GPUs attached to CPU 1? Try all the combinations of CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES=4,5
CUDA_VISIBLE_DEVICES=6,7
CUDA_VISIBLE_DEVICES=4,6
CUDA_VISIBLE_DEVICES=4,7
CUDA_VISIBLE_DEVICES=5,6

This might let you narrow it down to a pair of PCI-E slots. My money, if you have PLX switches present, would be on 4,5 and 6,7 working fine (but 4,7 causing issues). That is, either spanning PLX switches is the issue, or the two GPUs that are physically furthest apart are causing problems.
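
To make sure no pair gets skipped, the settings can be generated rather than typed by hand. This is a minimal Python sketch; it assumes the GPUs on CPU 1 are numbered 4 to 7, as in the examples above, and the function name gpu_pairs is just for illustration.

```python
from itertools import combinations


def gpu_pairs(gpu_ids):
    """Return every two-GPU combination as a CUDA_VISIBLE_DEVICES value."""
    return [",".join(str(gpu) for gpu in pair)
            for pair in combinations(gpu_ids, 2)]


if __name__ == "__main__":
    # Assumption: GPUs 4-7 are the ones attached to CPU 1.
    for value in gpu_pairs([4, 5, 6, 7]):
        print(f"CUDA_VISIBLE_DEVICES={value}")
```

Four GPUs give six pairs in total, so each run of the 2-GPU job can be logged against one of these six settings.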

Assuming swapping CPUs, memory, etc. between banks doesn't cause the error to follow a specific component, then my guess would be something related to P2P communication over PCI-E. What's the logical layout of the motherboard? Do you have PLX switches between the CPUs and the GPUs? I have found some motherboards, particularly the Asus X99-E-WS boards, to sometimes be marginal on the PLX switches. Almost as if things were right on the limit of line length and so a few boards just had timing issues depending on manufacturing tolerances. Have you tried multiple motherboards, and do you always see the same issue? If not then I'd suspect it's just something bad with the CPU 1 PCI-E root complex - a dry solder joint maybe? If it is on multiple motherboards of the same design then it gets more interesting. Is it possible the line lengths are too long on CPU 1 vs CPU 0? AMBER really hammers P2P communication over PCI-E, so if there is a timing issue it will really show it. Although I've seen that issue more when trying to connect multiple 8796 switch boards with PCI-E extension cables.

All the best
Ross

> On Feb 16, 2017, at 17:11, Curtis Walker <curtisw.supermicro.com> wrote:
>
> My name is Curtis Walker from Supermicro Computer Inc. We have designed GPU servers here for many years using NVIDIA GPUs.
> We are currently seeing an issue we cannot explain when we use the AMBER 16 software on our 8-GPU servers with the Pascal 1080. When we run on GPU we don't see any issues. When we use all 4 GPUs, within 3 hours of running the software we get an IERR that hangs the system.
>
> The CPU has MSR registers that I read back using the Intel ITP, and I see MMIO along with Function 0 (VGA) of the PNY 1080 causing an issue during a read transaction. The second issue I see is in the ML2, the Mid-Level Cache, which is where other instructions come in with the CUDA software; it also causes an error, and again an IERR hangs the system.
>
> This system uses two CPUs. I have noticed that when the four 1080 GPUs are on CPU 0 the IERR does not occur. When I move the four 1080 GPUs to CPU 1, the IERR occurs within 3 hours.
>
>
> Here is one of the CPU crash dumps from when the CPU hangs while using your tool. The very interesting thing is that we use another NVIDIA CUDA tool and we don't see the failure. So I really would like to have some real support onside to tell me what this software is affecting, to be able to fix this issue please. This is a very hot issue.
>
>
>
> The first part is the CPU MSR register dump, which tells me which agent and which device reported the error to the CPU and caused the IERR.
>
> The root cause is this address, 0x000088400006; it also randomly moves from one GTX to the next, with another address, 0x000088500006.
>
> Date: 2017-01-11 15:02:20
> CPU0 IERR LOGGING REGISTER Here is an IERR. This means it either came from memory, QPI or PCIe.
> -First IERR Source ID : Core 4
> -Second IERR Source ID: Core 3
>
>
> CPU0 MCERR LOGGING REGISTER
> -First MCERR Source ID : Core/Cbo 4
> -Second MCERR Source ID: Core/Cbo 4
>
> CPU1 MCERR LOGGING REGISTER
> -First MCERR Source ID : PCU
> -Second MCERR Source ID: PCU
>
> CPU0: X10, Broadwell Server
> |===========================================================================================|
> | MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
> |-------------------------------------------------------------------------------------------|
> | Core1 MLC | UnCorrected | Processor Context Corrupt | 1 |0x3fff81060ff5| Generic |
> |===========================================================================================|
> -MSCOD: MLC Watchdog timer (3-strike) Error
> -MLC MISC
> -Thread: 0
> -Way : 0
> -SQID : 0x6
> -Opcode: 0x07
>
>
> |===========================================================================================|
> | MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
> |-------------------------------------------------------------------------------------------|
> | Core2 MLC | UnCorrected | Processor Context Corrupt | 1 |0x3fff81060ff5| Generic |
> |===========================================================================================|
> -MSCOD: MLC Watchdog timer (3-strike) Error
> -MLC MISC
> -Thread: 0
> -Way : 0
> -SQID : 0x6
> -Opcode: 0x07
>
>
> |===========================================================================================|
> | MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
> |-------------------------------------------------------------------------------------------|
> | Core3 MLC | UnCorrected | Processor Context Corrupt | 0 |0x3fff81060f94| None |
> |===========================================================================================|
> -MSCOD: MLC Watchdog timer (3-strike) Error
> -MLC MISC
> -Thread: 0
> -Way : 0
> -SQID : 0x6
> -Opcode: 0x07
>
>
> |===========================================================================================|
> | MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
> |-------------------------------------------------------------------------------------------|
> | Core4 MLC | UnCorrected | Processor Context Corrupt | 0 |0x3fff8154c1e5| Generic |
> |===========================================================================================|
> -MSCOD: MLC Watchdog timer (3-strike) Error
> -MLC MISC
> -Thread: 0
> -Way : 5
> -SQID : 0x4
> -Opcode: 0x60
>
>
> |===========================================================================================|
> | MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
> |-------------------------------------------------------------------------------------------|
> | Core5 MLC | UnCorrected | Processor Context Corrupt | 1 |0x3fff81060ff5| Generic |
> |===========================================================================================|
> -MSCOD: MLC Watchdog timer (3-strike) Error
> -MLC MISC
> -Thread: 0
> -Way : 0
> -SQID : 0x6
> -Opcode: 0x07
>
>
> |===========================================================================================|
> | MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
> |-------------------------------------------------------------------------------------------|
> | Core6 MLC | UnCorrected | Processor Context Corrupt | 1 |0x3fff81060ff5| Generic |
> |===========================================================================================|
> -MSCOD: MLC Watchdog timer (3-strike) Error
> -MLC MISC
> -Thread: 0
> -Way : 0
> -SQID : 0x6
> -Opcode: 0x07
>
>
> |===========================================================================================|
> | MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
> |-------------------------------------------------------------------------------------------|
> | Core7 MLC | UnCorrected | Processor Context Corrupt | 1 |0x3fff81060ff5| Generic |
> |===========================================================================================|
> -MSCOD: MLC Watchdog timer (3-strike) Error
> -MLC MISC
> -Thread: 0
> -Way : 0
> -SQID : 0x6
> -Opcode: 0x07
>
>
> |=================================================================|
> | MCA Agent | Signal | Type of Error | OverFlow |
> |-----------------------------------------------------------------|
> | PCU | UnCorrected | Processor Context Corrupt | 1 |
> |=================================================================|
> -FW Generated Error: Internal Error
> -MSEC_UC: No Error
> -Machine Check Error Code: None
> -Corrected Error Count: 84
> -Error Address : 0xc1e5
>
>
> |===========================================================================================|
> | MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
> |-------------------------------------------------------------------------------------------|
> | CBo/LLC 0 | UnCorrected | Processor Context Corrupt | 1 |0x000000000286|Physical Address |
> |===========================================================================================|
> -MSCOD: TOR_TIMEOUT
> -Request Type : Generic Error
> -Transaction Type: Generic
> -Level Encoding : Level 2
> -Original Request: Lock
> -RTID : 0x16
> -TORID : 0x00
> -COREID : 0x03
> -THREADID: 0x0
> -WAY : 0x00
>
>
> |===========================================================================================|
> | MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
> |-------------------------------------------------------------------------------------------|
> | CBo/LLC 1 | UnCorrected | Processor Context Corrupt | 1 |0x000088400006|Physical Address |
> |===========================================================================================|
> -MSCOD: TOR_TIMEOUT
> -Request Type : Generic Error
> -Transaction Type: Generic
> -Level Encoding : Level 2
> -Original Request: Port Out (CFC/CF8 type transaction)
> -RTID : 0x25
> -TORID : 0x01
> -COREID : 0x04
> -THREADID: 0x0
> -WAY : 0x00
>
> |===========================================================================================|
> | MCA Agent | Signal | Type of Error |OVF| Address | Address Mode |
> |-------------------------------------------------------------------------------------------|
> | CBo/LLC 0 | UnCorrected | Processor Context Corrupt | 1 |0x000088500006|Physical Address |
> |===========================================================================================|
> -MSCOD: TOR_TIMEOUT
> -Request Type : Generic Error
> -Transaction Type: Generic
> -Level Encoding : Level 2
> -Original Request: Port In (CFC/CF8 type transaction)
> -RTID : 0x35
> -TORID : 0x00
> -COREID : 0x06
> -THREADID: 0x0
> -WAY : 0x00
>
>
> In this part I show the addresses from the OS level, taken from /proc/iomem.
> When I look up this address, 0x000088400006, you can see here that it is MMIO config space.
>
>
> 80000000-8fffffff : PCI MMCONFIG 0000 [bus 00-ff]
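
The lookup described above (which /proc/iomem region contains a given physical address) can be sketched in a few lines of Python. This is only an illustration: the function names parse_iomem and owners are made up here, and the sample text is the single MMCONFIG line quoted above rather than a full /proc/iomem listing.

```python
def parse_iomem(text):
    """Parse /proc/iomem-style lines into (start, end, name) tuples."""
    regions = []
    for line in text.splitlines():
        if " : " not in line:
            continue
        span, name = line.strip().split(" : ", 1)
        start, end = (int(part, 16) for part in span.split("-"))
        regions.append((start, end, name))
    return regions


def owners(regions, addr):
    """Return the names of all regions whose range contains addr."""
    return [name for start, end, name in regions if start <= addr <= end]


if __name__ == "__main__":
    sample = "80000000-8fffffff : PCI MMCONFIG 0000 [bus 00-ff]"
    regions = parse_iomem(sample)
    for addr in (0x88400006, 0x88500006):
        print(hex(addr), "->", owners(regions, addr))
```

Both failing addresses fall inside the PCI MMCONFIG window, which matches the statement that they are config-space accesses.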
>
> Then I take this address 0x000088400006 and address 0x000088500006
> and find out which lspci device owns that MMIO space.
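
For reference, the mapping from an MMCONFIG address to an lspci device can be worked out directly, because the PCI Express ECAM layout puts the bus number in offset bits 27:20, the device in bits 19:15, the function in bits 14:12 and the register offset in bits 11:0. The Python sketch below assumes the MMCONFIG base of 0x80000000 taken from the /proc/iomem line above; the function names are illustrative only.

```python
def decode_ecam(addr, base=0x80000000):
    """Split an MMCONFIG (ECAM) address into bus, device, function, register."""
    offset = addr - base
    if not 0 <= offset < (256 << 20):
        raise ValueError("address is outside a 256-bus MMCONFIG window")
    bus = (offset >> 20) & 0xFF       # bits 27:20
    device = (offset >> 15) & 0x1F    # bits 19:15
    function = (offset >> 12) & 0x7   # bits 14:12
    register = offset & 0xFFF         # bits 11:0
    return bus, device, function, register


def format_bdf(bus, device, function):
    """Format bus/device/function the way lspci prints it, e.g. '84:00.0'."""
    return f"{bus:02x}:{device:02x}.{function:x}"


if __name__ == "__main__":
    for addr in (0x88400006, 0x88500006):
        bus, device, function, register = decode_ecam(addr)
        print(hex(addr), "->", format_bdf(bus, device, function),
              "register", hex(register))
```

With that base, 0x88400006 decodes to 84:00.0 and 0x88500006 to 85:00.0, both at register offset 0x06, which is the Status register in the standard PCI configuration header.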
>
> This is on CPU 0
> 84:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1) (prog-if 00 [VGA controller])
> Subsystem: ZOTAC International (MCO) Ltd. Device 1448
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 32 bytes
> Interrupt: pin A routed to IRQ 40
> Region 0: Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
> Region 1: Memory at 2ffe0000000 (64-bit, prefetchable) [size=256M]
> Region 3: Memory at 2fff0000000 (64-bit, prefetchable) [size=32M]
> Region 5: I/O ports at e000 [size=128]
> Expansion ROM at fb000000 [disabled] [size=512K]
> Capabilities: [60] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> Address: 0000000000000000 Data: 0000
> Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 256 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
> ClockPM+ Surprise- LLActRep- BwNot-
> LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
> EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
> Capabilities: [100 v1] Virtual Channel
> Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
> Arb: Fixed- WRR32- WRR64- WRR128-
> Ctrl: ArbSelect=Fixed
> Status: InProgress-
> VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> Status: NegoPending- InProgress-
> Capabilities: [250 v1] Latency Tolerance Reporting
> Max snoop latency: 0ns
> Max no snoop latency: 0ns
> Capabilities: [128 v1] Power Budgeting <?>
> Capabilities: [420 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
> Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> Capabilities: [900 v1] #19
> Kernel driver in use: nvidia
>
> 84:00.1 Audio device: NVIDIA Corporation Device 10f0 (rev a1)
> Subsystem: ZOTAC International (MCO) Ltd. Device 1448
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 32 bytes
> Interrupt: pin B routed to IRQ 76
> Region 0: Memory at fb080000 (32-bit, non-prefetchable) [size=16K]
> Capabilities: [60] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> Address: 0000000000000000 Data: 0000
> Capabilities: [78] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 256 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
> ClockPM+ Surprise- LLActRep- BwNot-
> LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> Capabilities: [100 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> This is one of the IERR we receive when the AMBER software is running
>
> This is reported by the MSR registers in the CPU when the system hangs.
> ML2 is Mid-Level Cache where the software keeps repeated instructions.
> MC3 is a virtual address when the cuda is running.
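
The STATUS values in the dump that follows can be unpacked by hand. The Python sketch below uses the architectural IA32_MCi_STATUS layout as I understand it from the Intel SDM: VAL in bit 63, OVER in bit 62, UC in bit 61, EN in bit 60, MISCV in bit 59, ADDRV in bit 58, PCC in bit 57, the model-specific code (MSCOD) in bits 31:16 and the architectural code (MCACOD) in bits 15:0. Treat it as a reading aid rather than a full decoder: the meaning of each MSCOD is model specific and is not decoded here, and the function name is made up.

```python
FLAG_BITS = {
    "VAL": 63,    # register contains valid error information
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MISC register valid
    "ADDRV": 58,  # ADDR register valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(value):
    """Split a 64-bit IA32_MCi_STATUS value into flags, MSCOD and MCACOD."""
    decoded = {name: bool((value >> bit) & 1) for name, bit in FLAG_BITS.items()}
    decoded["MSCOD"] = (value >> 16) & 0xFFFF
    decoded["MCACOD"] = value & 0xFFFF
    return decoded


if __name__ == "__main__":
    for raw in ("fe00000000800400", "be00000000800400", "b200000000400407"):
        fields = decode_mci_status(int(raw, 16))
        print(raw, fields)
```

For fe00000000800400 this gives OVER, UC and PCC all set with MSCOD 0x0080 and MCACOD 0x0400. MCACOD 0x0400 is the architectural "internal timer error" code, which fits the watchdog timer (3-strike) entries in the earlier table. be00000000800400 differs only in that OVER is clear.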
>
> Number of CPU: 2
> CPU Model: E5-2600-V4
> Number of cores per CPU: 8
> Microcode: b00001f
> Socket0 LLC_MASK e6d
> Socket0 CORE_MASK ff
> skt0 MCA_ERR_SRC = 0xf4000000
> skt0 IerrLoggingReg = 0xf880b80
> skt0 MCerrLoggingReg = 0x5440144
> Error Type: ifu Socket: 0 Core: 0 Status: 0
> Error Type: ifu Socket: 0 Core: 1 Status: 0
> Error Type: ifu Socket: 0 Core: 2 Status: 0
> Error Type: ifu Socket: 0 Core: 3 Status: 0
> Error Type: ifu Socket: 0 Core: 4 Status: 0
> Error Type: ifu Socket: 0 Core: 5 Status: 0
> Error Type: ifu Socket: 0 Core: 6 Status: 0
> Error Type: ifu Socket: 0 Core: 7 Status: 0
> Error Type: dcu Socket: 0 Core: 0 Status: 0
> Error Type: dcu Socket: 0 Core: 1 Status: 0
> Error Type: dcu Socket: 0 Core: 2 Status: 0
> Error Type: dcu Socket: 0 Core: 3 Status: 0
> Error Type: dcu Socket: 0 Core: 4 Status: 0
> Error Type: dcu Socket: 0 Core: 5 Status: 0
> Error Type: dcu Socket: 0 Core: 6 Status: 0
> Error Type: dcu Socket: 0 Core: 7 Status: 0
> Error Type: dtlb Socket: 0 Core: 0 Status: 0
> Error Type: dtlb Socket: 0 Core: 1 Status: 0
> Error Type: dtlb Socket: 0 Core: 2 Status: 0
> Error Type: dtlb Socket: 0 Core: 3 Status: 0
> Error Type: dtlb Socket: 0 Core: 4 Status: 0
> Error Type: dtlb Socket: 0 Core: 5 Status: 0
> Error Type: dtlb Socket: 0 Core: 6 Status: 0
> Error Type: dtlb Socket: 0 Core: 7 Status: 0
> Error Type: ml2 Socket: 0 Core: 0 Status: fe00000000800400
> Socket0: ml2_MC3_CORE0_CTL= f
> Socket0: ml2_MC3_CORE0_STATUS= fe00000000800400
> Socket0: ml2_MC3_CORE0_ADDR= ffffffff81060ff5
> Socket0: ml2_MC3_CORE0_MISC= ffffffff81060ff5
> Socket0: ml2_MC3_CORE0_CTL2= 40000001
> Error Type: ml2 Socket: 0 Core: 1 Status: fe00000000800400
> Socket0: ml2_MC3_CORE1_CTL= f
> Socket0: ml2_MC3_CORE1_STATUS= fe00000000800400
> Socket0: ml2_MC3_CORE1_ADDR= ffffffff81060ff5
> Socket0: ml2_MC3_CORE1_MISC= ffffffff81060ff5
> Socket0: ml2_MC3_CORE1_CTL2= 40000001
> Error Type: ml2 Socket: 0 Core: 2 Status: fe00000000800400
> Socket0: ml2_MC3_CORE2_CTL= f
> Socket0: ml2_MC3_CORE2_STATUS= fe00000000800400
> Socket0: ml2_MC3_CORE2_ADDR= ffffffff81060ff5
> Socket0: ml2_MC3_CORE2_MISC= ffffffff81060ff5
> Socket0: ml2_MC3_CORE2_CTL2= 40000001
> Error Type: ml2 Socket: 0 Core: 3 Status: fe00000000800400
> Socket0: ml2_MC3_CORE3_CTL= f
> Socket0: ml2_MC3_CORE3_STATUS= fe00000000800400
> Socket0: ml2_MC3_CORE3_ADDR= ffffffff81060ff5
> Socket0: ml2_MC3_CORE3_MISC= ffffffff81060ff5
> Socket0: ml2_MC3_CORE3_CTL2= 40000001
> Error Type: ml2 Socket: 0 Core: 4 Status: fe00000000800400
> Socket0: ml2_MC3_CORE4_CTL= f
> Socket0: ml2_MC3_CORE4_STATUS= fe00000000800400
> Socket0: ml2_MC3_CORE4_ADDR= ffffffff81060ff5
> Socket0: ml2_MC3_CORE4_MISC= ffffffff81060ff5
> Socket0: ml2_MC3_CORE4_CTL2= 40000001
> Error Type: ml2 Socket: 0 Core: 5 Status: fe00000000800400
> Socket0: ml2_MC3_CORE5_CTL= f
> Socket0: ml2_MC3_CORE5_STATUS= fe00000000800400
> Socket0: ml2_MC3_CORE5_ADDR= ffffffff81060ff5
> Socket0: ml2_MC3_CORE5_MISC= ffffffff81060ff5
> Socket0: ml2_MC3_CORE5_CTL2= 40000001
> Error Type: ml2 Socket: 0 Core: 6 Status: fe00000000800400
> Socket0: ml2_MC3_CORE6_CTL= f
> Socket0: ml2_MC3_CORE6_STATUS= fe00000000800400
> Socket0: ml2_MC3_CORE6_ADDR= ffffffff81060ff5
> Socket0: ml2_MC3_CORE6_MISC= ffffffff81060ff5
> Socket0: ml2_MC3_CORE6_CTL2= 40000001
> Error Type: ml2 Socket: 0 Core: 7 Status: fe00000000800400
> Socket0: ml2_MC3_CORE7_CTL= f
> Socket0: ml2_MC3_CORE7_STATUS= fe00000000800400
> Socket0: ml2_MC3_CORE7_ADDR= ffffffff81060ff5
> Socket0: ml2_MC3_CORE7_MISC= ffffffff81060ff5
> Socket0: ml2_MC3_CORE7_CTL2= 40000001
> Error Type: pcu Socket: 0 Status: b200000000400407
> Socket0: pcu_MC4_CTL= 7f
> Socket0: pcu_MC4_STATUS= b200000000400407
> Socket0: pcu_MC4_ADDR= 0
> Socket0: pcu_MC4_MISC= 0
> Socket0: pcu_MC4_CTL2= 0
> Error Type: qpi0 Socket: 0 Status: 0
> Error Type: iio Socket: 0 Status: 0
> Error Type: ha0 Socket: 0 Status: 0
> Error Type: ha1 Socket: 0 Status: 0
> Error Type: imc0 Socket: 0 Status: 0
> Error Type: imc1 Socket: 0 Status: 0
> Error Type: imc2 Socket: 0 Status: 0
> Error Type: imc3 Socket: 0 Status: 0
> Error Type: imc4 Socket: 0 Status: 0
> Error Type: imc5 Socket: 0 Status: 0
> Error Type: imc6 Socket: 0 Status: 0
> Error Type: imc7 Socket: 0 Status: 0
> Error Type: cbo0 Socket: 0 Status: 0
> Error Type: cbo1 Socket: 0 Status: 0
> Error Type: cbo2 Socket: 0 Status: 0
> Error Type: qpi1 Socket: 0 Status: 0
> Error Type: qpi2 Socket: 0 Status: 0
> CPU Model: E5-2600-V4
> Number of cores per CPU: 8
> Microcode: b00001f
> Socket1 LLC_MASK e6d
> Socket1 CORE_MASK ff
> skt1 MCA_ERR_SRC = 0xbc000000
> skt1 IerrLoggingReg = 0xf840b88
> skt1 MCerrLoggingReg = 0x7880144
> Error Type: ifu Socket: 1 Core: 0 Status: badbad
> Error Type: ifu Socket: 1 Core: 1 Status: 0
> Error Type: ifu Socket: 1 Core: 2 Status: 0
> Error Type: ifu Socket: 1 Core: 3 Status: badbad
> Error Type: ifu Socket: 1 Core: 4 Status: badbad
> Error Type: ifu Socket: 1 Core: 5 Status: badbad
> Error Type: ifu Socket: 1 Core: 6 Status: badbad
> Error Type: ifu Socket: 1 Core: 7 Status: badbad
> Error Type: dcu Socket: 1 Core: 0 Status: badbad
> Error Type: dcu Socket: 1 Core: 1 Status: 0
> Error Type: dcu Socket: 1 Core: 2 Status: 0
> Error Type: dcu Socket: 1 Core: 3 Status: badbad
> Error Type: dcu Socket: 1 Core: 4 Status: badbad
> Error Type: dcu Socket: 1 Core: 5 Status: badbad
> Error Type: dcu Socket: 1 Core: 6 Status: badbad
> Error Type: dcu Socket: 1 Core: 7 Status: badbad
> Error Type: dtlb Socket: 1 Core: 0 Status: badbad
> Error Type: dtlb Socket: 1 Core: 1 Status: 0
> Error Type: dtlb Socket: 1 Core: 2 Status: 0
> Error Type: dtlb Socket: 1 Core: 3 Status: badbad
> Error Type: dtlb Socket: 1 Core: 4 Status: badbad
> Error Type: dtlb Socket: 1 Core: 5 Status: badbad
> Error Type: dtlb Socket: 1 Core: 6 Status: badbad
> Error Type: dtlb Socket: 1 Core: 7 Status: badbad
> Error Type: ml2 Socket: 1 Core: 0 Status: badbad
> Error Type: ml2 Socket: 1 Core: 1 Status: be00000000800400
> Socket1: ml2_MC3_CORE1_CTL= f
> Socket1: ml2_MC3_CORE1_STATUS= be00000000800400
> Socket1: ml2_MC3_CORE1_ADDR= ffffffff81060f94
> Socket1: ml2_MC3_CORE1_MISC= ffffffff81060f94
> Socket1: ml2_MC3_CORE1_CTL2= 40000001
> Error Type: ml2 Socket: 1 Core: 2 Status: fe00000000800400
> Socket1: ml2_MC3_CORE2_CTL= f
> Socket1: ml2_MC3_CORE2_STATUS= fe00000000800400
> Socket1: ml2_MC3_CORE2_ADDR= ffffffff8154c1d5
> Socket1: ml2_MC3_CORE2_MISC= ffffffff8154c1d5
> Socket1: ml2_MC3_CORE2_CTL2= 40000001
> Error Type: ml2 Socket: 1 Core: 3 Status: badbad
> Error Type: ml2 Socket: 1 Core: 4 Status: badbad
> Error Type: ml2 Socket: 1 Core: 5 Status: badbad
> Error Type: ml2 Socket: 1 Core: 6 Status: badbad
> Error Type: ml2 Socket: 1 Core: 7 Status: badbad
> Error Type: pcu Socket: 1 Status: be00000000800400
> Socket1: pcu_MC4_CTL= 7f
> Socket1: pcu_MC4_STATUS= be00000000800400
> Socket1: pcu_MC4_ADDR= ffffffff81060f94
> Socket1: pcu_MC4_MISC= ffffffff81060f94
> Socket1: pcu_MC4_CTL2= 0
> Error Type: qpi0 Socket: 1 Status: 0
> Error Type: iio Socket: 1 Status: 0
> Error Type: ha0 Socket: 1 Status: 0
> Error Type: ha1 Socket: 1 Status: 0
> Error Type: imc0 Socket: 1 Status: 0
> Error Type: imc1 Socket: 1 Status: 0
> Error Type: imc2 Socket: 1 Status: 0
> Error Type: imc3 Socket: 1 Status: 0
> Error Type: imc4 Socket: 1 Status: 0
> Error Type: imc5 Socket: 1 Status: 0
> Error Type: imc6 Socket: 1 Status: 0
> Error Type: imc7 Socket: 1 Status: 0
> Error Type: cbo0 Socket: 1 Status: 0
> Error Type: cbo1 Socket: 1 Status: fe200000000c110a
> Socket1: cbo1_MC18_CTL= 1ffffff
> Socket1: cbo1_MC18_STATUS= fe200000000c110a
> Socket1: cbo1_MC18_ADDR= 280
> Socket1: cbo1_MC18_MISC= 50fff81601580086
> Socket1: cbo1_MC18_CTL2= 40000001
> Error Type: cbo2 Socket: 1 Status: fe200000000c110a
> Socket1: cbo2_MC19_CTL= 1ffffff
> Socket1: cbo2_MC19_STATUS= fe200000000c110a
> Socket1: cbo2_MC19_ADDR= 88400000
> Socket1: cbo2_MC19_MISC= 70ffa81602500086
> Socket1: cbo2_MC19_CTL2= 40000001
> Error Type: qpi1 Socket: 1 Status: 0
> Error Type: qpi2 Socket: 1 Status: 0
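
Most of the banks in the dump above report a status of 0, so it can help to filter the listing down to the entries that actually logged something. This is a small Python sketch under the assumption that every entry follows the "Error Type: ... Socket: ... [Core: ...] Status: ..." pattern shown above. The value "badbad" is kept as a string because the dump does not say what it means; my guess is that the register could not be read, but that is an assumption.

```python
import re

LINE_RE = re.compile(
    r"Error Type:\s*(\w+)\s+Socket:\s*(\d+)(?:\s+Core:\s*(\d+))?\s+Status:\s*(\w+)"
)


def nonzero_banks(dump_text):
    """Return (bank, socket, core, status) for entries whose status is not 0."""
    hits = []
    for line in dump_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        bank, socket, core, status = match.groups()
        if status == "0":
            continue
        # Uncore banks (pcu, cbo, ...) have no Core field, so core stays None.
        hits.append((bank, int(socket),
                     None if core is None else int(core), status))
    return hits


if __name__ == "__main__":
    sample = """\
> Error Type: ifu Socket: 1 Core: 1 Status: 0
> Error Type: ml2 Socket: 1 Core: 1 Status: be00000000800400
> Error Type: cbo2 Socket: 1 Status: fe200000000c110a
"""
    for entry in nonzero_banks(sample):
        print(entry)
```

Run over the full dump, this leaves only the ml2, pcu and cbo entries plus the "badbad" placeholders, which makes the difference between socket 0 and socket 1 easier to see.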
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
> Kernel driver in use: snd_hda_intel
>
>
> Here is another way to see another IERR CPU crash dump.
>
>
> Curtis Walker
> Manager Hardware Debug Engineer
> Technical Services
> Supermicro Computer, Inc
> Ph.: 408-895-6221
> Cell phone: 408-910-2487
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


Received on Fri Feb 17 2017 - 09:00:03 PST