September 23, 2013

Oracle Solaris Cluster 4.1: fencing bug on T4-2 platform

Posted in Sun Cluster at 10:12 am by alessiodini

I am currently installing and configuring Oracle Solaris Cluster (OSC) 4.1 on Solaris 11.1 for a customer.
The environment is:

2x T4-2 servers (internal SAS disks on hardware RAID)
1x 2540 storage array
2x Brocade switches (fabric configuration)
Solaris 11.1
Oracle Solaris Cluster 4.1

After installing the first node, I let scinstall reboot it, and the node got a kernel panic at its first boot in cluster mode!

The error was (short extract):

SunOS Release 5.11 Version 11.1 64-bit
Copyright (c) 1983, 2012, Oracle and/or its affiliates. All rights reserved.
Sep 13 14:35:45 Cluster.CCR: rcm script SUNW,vdevices.pl: do_register

Sep 13 14:35:45 Cluster.CCR: rcm script SUNW,vdevices.pl: do_register: 0 devices

Hostname: system02
Booting in cluster mode
NOTICE: CMM: Node system02 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node system02: attempting to join cluster.
NOTICE: CMM: Cluster has reached quorum.
NOTICE: CMM: Node system02 (nodeid = 1) is up; new incarnation number = 1379075759.
NOTICE: CMM: Cluster members: system02.
NOTICE: CMM: node reconfiguration #1 completed.
NOTICE: CMM: Node system02: joined cluster.

system02 console login: obtaining access to all attached disks
Sep 13 14:36:10 system02 scsi: WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas1):

Sep 13 14:36:10 system02 MPTSAS Firmware Fault, code: 2665

SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major

EVENT-TIME: 0x523306ba.0x2887e262 (0x2960cb3960)

PLATFORM: sun4v, CSN: -, HOSTNAME: system02

SOURCE: SunOS, REV: 5.11 11.1

DESC: Errors have been detected that require a reboot to ensure system

integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.

AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry

IMPACT: The system will sync files, save a crash dump if needed, and reboot

REC-ACTION: Save the error summary below in case telemetry cannot be saved

panic[cpu53]/thread=2a103edbc60: Fatal error has occured in: PCIe fabric.(0x1)(0x101)

000002a103edb6a0 px:px_err_panic+1c4 (106ea800, 1, 101, 7bfb8800, 1, 106e8988)

%l0-3: 000002a103edb750 00001000acbc8400 00000000106ea800 000000000000005f

%l4-7: 0000000000000000 0000000010508000 ffffffffffffffff 0000000000000000

000002a103edb7b0 px:px_err_fabric_intr+1ac (1000acbc6000, 1, f00, 1, 101, 400084f28b8)

%l0-3: 0000000000000f00 000000007bfb8770 0000000000000000 0000000000000f00

%l4-7: 0000000000000001 000000007bfb8400 0000000000000001 00001000acbc9858

000002a103edb930 px:px_msiq_intr+208 (1000ac7cae38, 0, 9, 1000acbcecc8, 1, 2)

%l0-3: 0000000000000000 00000000215e0000 00000400084efc28 00001000acbc6000

%l4-7: 00001000acbcee88 00001000acbbd2e0 00000400084f28b8 0000000000000030

syncing file systems… done

I was very surprised, because I had:

1) Updated Solaris from the Oracle repository (almost 2.5 GB with the latest SRU)
2) Installed OSC from the Oracle repository
3) Reviewed the Oracle checklist

We opened a case with Oracle for this issue, and support suggested replacing the motherboard.
We also saw the alarm on ILOM, from the SP, so we too thought it was a real hardware problem.
A technician replaced the motherboard and we still had the same issue!!
We noticed that booting outside the cluster worked, so we began to suspect the cluster.

With the beadm utility I ran a lot of tests and experiments; none of them worked.
I called an expert colleague, Daniele Traettino, and together we did a new installation with scinstall.
We disabled global fencing and the first boot worked, without any MPT SAS warning or panic!
I had not thought of disabling global fencing before, because in my experience I had never faced any issue with it during a first installation!

We then saw that the issue was related to fencing.
Afterwards I ran several tests:

1) Enable global fencing and reboot – got a new kernel panic
2) Disable fencing on the local disks and enable global fencing – it worked
3) Enable fencing on the local disks with different policies (pathcount, for example) – got a new kernel panic
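
For reference, the three tests above roughly correspond to the commands sketched here. This is only an illustration, not a record of what was typed: the DID name d1 and the prefer3 global policy are assumptions (the post does not say which global policy was active), and the small run wrapper is a hypothetical helper that just prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Set the cluster-wide fencing policy (tests 1 and 2).
set_global_fencing() {
    run cluster set -p global_fencing="$1"
}

# Set the fencing policy of a single DID device (tests 2 and 3).
set_device_fencing() {
    run cldev set -p default_fencing="$1" "$2"
}

# d1 is an assumed DID name for a local SAS disk; prefer3 is an assumed policy.
set_global_fencing prefer3        # test 1: global fencing enabled
set_device_fencing nofencing d1   # test 2: local disk excluded from fencing
set_device_fencing pathcount d1   # test 3: local disk fenced with another policy
```

A reboot of the node followed each change in the tests, since the panic only showed up at boot in cluster mode.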

So the solution is to disable fencing on the local SAS disks. I used this command:

cldev set -p default_fencing=nofencing DID_of_local_SAS_disk
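
With more than one local disk (one per node, for example), the same command can be applied in a loop and then checked with cldev show. The sketch below assumes the local disks are the DID devices d1 and d2, which are placeholder names; the real ones have to be taken from the cluster's own DID listing. The run wrapper is a hypothetical helper that only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Disable fencing on every DID device given as an argument,
# then display the device so its default_fencing value can be checked.
disable_local_fencing() {
    for did in "$@"; do
        run cldev set -p default_fencing=nofencing "$did"
        run cldev show "$did"
    done
}

# d1 and d2 are assumed DID names of the internal SAS disks.
disable_local_fencing d1 d2
```

Running it as is only prints the four commands; setting DRY_RUN=0 on a cluster node executes them.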

I think this can happen on other systems with internal SAS disks.
I hope Oracle releases a fix soon 🙂

2 Comments »

  1. stiv said,

    Hi, that’s very useful. I’ve faced the same issues with SF T4-4 S11.1 / SC4.1. In my case, I was obliged to reinstall the whole infra with S11.0 / SC4.0. But thanks for documenting this solution. I’m going to test it again in my env…

    • alessiodini said,

      Thank you for visiting my blog, Stive!
      I had a lot of issues during this task; Oracle asked my company what kinds of bugs we ran into during the delivery.
      Let me know about your installation.

      Alessio

