These days I’m installing and configuring OSC 4.1 on Solaris 11.1 for a customer.
The environment is:
2x T4-2 servers (internal SAS disks on HW RAID)
1x 2540 storage array
2x Brocade switches (fabric configuration)
Oracle Solaris Cluster 4.1
After the installation on the first node, I let scinstall reboot it, and the node got a kernel panic on its first boot in cluster mode!
The error was (short extract):
SunOS Release 5.11 Version 11.1 64-bit
Copyright (c) 1983, 2012, Oracle and/or its affiliates. All rights reserved.
Sep 13 14:35:45 Cluster.CCR: rcm script SUNW,vdevices.pl: do_register
Sep 13 14:35:45 Cluster.CCR: rcm script SUNW,vdevices.pl: do_register: 0 devices
Booting in cluster mode
NOTICE: CMM: Node system02 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node system02: attempting to join cluster.
NOTICE: CMM: Cluster has reached quorum.
NOTICE: CMM: Node system02 (nodeid = 1) is up; new incarnation number = 1379075759.
NOTICE: CMM: Cluster members: system02.
NOTICE: CMM: node reconfiguration #1 completed.
NOTICE: CMM: Node system02: joined cluster.
system02 console login: obtaining access to all attached disks
Sep 13 14:36:10 system02 scsi: WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas1):
Sep 13 14:36:10 system02 MPTSAS Firmware Fault, code: 2665
SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
EVENT-TIME: 0x523306ba.0x2887e262 (0x2960cb3960)
PLATFORM: sun4v, CSN: -, HOSTNAME: system02
SOURCE: SunOS, REV: 5.11 11.1
DESC: Errors have been detected that require a reboot to ensure system
integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.
AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
IMPACT: The system will sync files, save a crash dump if needed, and reboot
REC-ACTION: Save the error summary below in case telemetry cannot be saved
panic[cpu53]/thread=2a103edbc60: Fatal error has occured in: PCIe fabric.(0x1)(0x101)
000002a103edb6a0 px:px_err_panic+1c4 (106ea800, 1, 101, 7bfb8800, 1, 106e8988)
%l0-3: 000002a103edb750 00001000acbc8400 00000000106ea800 000000000000005f
%l4-7: 0000000000000000 0000000010508000 ffffffffffffffff 0000000000000000
000002a103edb7b0 px:px_err_fabric_intr+1ac (1000acbc6000, 1, f00, 1, 101, 400084f28b8)
%l0-3: 0000000000000f00 000000007bfb8770 0000000000000000 0000000000000f00
%l4-7: 0000000000000001 000000007bfb8400 0000000000000001 00001000acbc9858
000002a103edb930 px:px_msiq_intr+208 (1000ac7cae38, 0, 9, 1000acbcecc8, 1, 2)
%l0-3: 0000000000000000 00000000215e0000 00000400084efc28 00001000acbc6000
%l4-7: 00001000acbcee88 00001000acbbd2e0 00000400084f28b8 0000000000000030
syncing file systems… done
I was very surprised, because I had done:
1) a Solaris update via the Oracle repository (almost 2.5 GB, with the latest SRU)
2) the OSC cluster installation via the Oracle repository
3) a review of the Oracle checklist
For this issue we opened a case with Oracle, which suggested replacing the motherboard.
We saw the alarm on the ILOM and from the SP as well, so we too thought it was a real hardware problem.
A technician replaced the motherboard, and we still had the same issue!!
We then saw that booting outside the cluster worked, so we became suspicious of the cluster itself.
With the beadm utility I ran a lot of tests and experiments; none of them worked.
I called an expert colleague, Daniele Traettino, and together we performed a new installation with scinstall.
We disabled global fencing, and the first boot worked, without any MPT SAS warning or panic!
I had not yet thought of disabling global fencing, because in my experience I had never faced any issue with it during a first installation.
We then saw that the issue was related to fencing.
I then ran several tests:
1) Global fencing enabled, then reboot – got a new kernel panic
2) Fencing disabled on the local disks, global fencing enabled – it worked
3) Fencing enabled on the local disks with different policies (pathcount, for example) – got a new kernel panic
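For reference, the tests above can be reproduced with the cluster-level and per-device fencing commands; this is a sketch under the assumption of OSC 4.x syntax, and d1 is a placeholder DID, not one of my real devices:

```shell
# Show the current cluster-wide fencing policy (prefer3 is the OSC 4.x default)
cluster show -t global | grep -i global_fencing

# Disable fencing cluster-wide (what finally let the node boot)
cluster set -p global_fencing=nofencing

# Per-device overrides used in the tests: follow the global policy,
# or force a specific policy such as pathcount (d1 is an example DID)
cldev set -p default_fencing=global d1
cldev set -p default_fencing=pathcount d1
```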
So the workaround is to disable fencing on the local SAS disks. I used this command:
cldev set -p default_fencing=nofencing DID_of_local_SAS_disk
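To find which DIDs map to the internal disks, cldev can list each DID instance with its physical device path; a minimal sketch (d1 and d2 are example DIDs standing in for your local SAS disks):

```shell
# Map DID instances to their full device paths to identify the internal SAS disks
cldev list -v

# Disable fencing on each local SAS disk (d1/d2 are example DIDs)
cldev set -p default_fencing=nofencing d1
cldev set -p default_fencing=nofencing d2

# Verify the per-device fencing policy took effect
cldev show d1 | grep -i default_fencing
```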
I think this can happen on other systems with internal SAS disks.
I hope for a fix from Oracle soon 🙂