September 23, 2013

Oracle Solaris Cluster 4.1: fencing bug on the T4-2 platform

Posted in Sun Cluster at 10:12 am by alessiodini


These days I’m installing and configuring OSC 4.1 on Solaris 11.1 for a customer.
The environment is:

2x T4-2 servers ( internal SAS disks in hardware RAID )
1x storage 2540
2x Brocade switches ( fabric configuration )
Solaris 11.1
Oracle Solaris Cluster 4.1

After the installation on the first node, I let scinstall reboot it, and the node got a kernel panic at its first boot in cluster mode!

The error was ( short extract ):

SunOS Release 5.11 Version 11.1 64-bit
Copyright (c) 1983, 2012, Oracle and/or its affiliates. All rights reserved.
Sep 13 14:35:45 Cluster.CCR: rcm script SUNW,vdevices.pl: do_register

Sep 13 14:35:45 Cluster.CCR: rcm script SUNW,vdevices.pl: do_register: 0 devices

Hostname: system02
Booting in cluster mode
NOTICE: CMM: Node system02 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node system02: attempting to join cluster.
NOTICE: CMM: Cluster has reached quorum.
NOTICE: CMM: Node system02 (nodeid = 1) is up; new incarnation number = 1379075759.
NOTICE: CMM: Cluster members: system02.
NOTICE: CMM: node reconfiguration #1 completed.
NOTICE: CMM: Node system02: joined cluster.

system02 console login: obtaining access to all attached disks
Sep 13 14:36:10 system02 scsi: WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas1):

Sep 13 14:36:10 system02 MPTSAS Firmware Fault, code: 2665

SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major

EVENT-TIME: 0x523306ba.0x2887e262 (0x2960cb3960)

PLATFORM: sun4v, CSN: -, HOSTNAME: system02

SOURCE: SunOS, REV: 5.11 11.1

DESC: Errors have been detected that require a reboot to ensure system

integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.

AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry

IMPACT: The system will sync files, save a crash dump if needed, and reboot

REC-ACTION: Save the error summary below in case telemetry cannot be saved

panic[cpu53]/thread=2a103edbc60: Fatal error has occured in: PCIe fabric.(0x1)(0x101)

000002a103edb6a0 px:px_err_panic+1c4 (106ea800, 1, 101, 7bfb8800, 1, 106e8988)

%l0-3: 000002a103edb750 00001000acbc8400 00000000106ea800 000000000000005f

%l4-7: 0000000000000000 0000000010508000 ffffffffffffffff 0000000000000000

000002a103edb7b0 px:px_err_fabric_intr+1ac (1000acbc6000, 1, f00, 1, 101, 400084f28b8)

%l0-3: 0000000000000f00 000000007bfb8770 0000000000000000 0000000000000f00

%l4-7: 0000000000000001 000000007bfb8400 0000000000000001 00001000acbc9858

000002a103edb930 px:px_msiq_intr+208 (1000ac7cae38, 0, 9, 1000acbcecc8, 1, 2)

%l0-3: 0000000000000000 00000000215e0000 00000400084efc28 00001000acbc6000

%l4-7: 00001000acbcee88 00001000acbbd2e0 00000400084f28b8 0000000000000030

syncing file systems… done

I was very surprised because I did:

1) Solaris update via the Oracle repository ( almost 2.5 GB with the latest SRU )
2) OSC cluster installation via the Oracle repository ( both steps sketched just below )
3) Review of the Oracle checklist
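
For reference, a minimal sketch of steps 1 and 2 on Solaris 11.1, assuming the support repositories are already configured as publishers ( ha-cluster-full is the group package name as I remember it, so verify it against your repository first ):

# pkg update
# pkg install ha-cluster-full

The first command brings Solaris 11.1 up to the latest SRU, the second installs the full Oracle Solaris Cluster 4.1 package group.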

For this issue we opened a case with Oracle, and they suggested replacing the motherboard.
We also saw the alarm on the ILOM / service processor, so we too thought it was a real hardware problem.
A technician replaced the motherboard and we still had the same issue!!
We saw that booting out of cluster mode worked, so we began to suspect the cluster itself.
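
For completeness, booting a node outside of cluster mode ( which is how we verified that the panic only happened in cluster mode ) is done on SPARC from the OpenBoot prompt, or with the equivalent reboot arguments:

ok boot -x
# reboot -- -x

The -x flag brings the node up as a standalone Solaris system without joining the cluster.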

With the beadm utility I made a lot of tests and experiments; none of them worked.
I called an expert colleague, Daniele Traettino, and together we made a new installation with scinstall.
We disabled global fencing and the first boot worked, without any MPT SAS warning or panic!
I had not thought of disabling global fencing before, because in my experience I had never faced any issue with it during a first installation!

We then saw that the issue was related to fencing.
Afterwards I made several tests:

1) Enabled global fencing and rebooted ( the commands behind these tests are sketched below ) – got a new kernel panic
2) Disabled fencing on the local disks and enabled global fencing – it worked
3) Enabled fencing on the local disks with different policies ( pathcount for example ) – got a new kernel panic
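
These tests boil down to toggling the global and per-disk fencing properties; a rough sketch, with the property values as I remember them from OSC 4.1 ( verify them on your release ) and dX standing in for a DID reported by cldevice:

# cluster set -p global_fencing=pathcount
# cldevice set -p default_fencing=nofencing dX
# cldevice set -p default_fencing=pathcount dX

The first command changes the cluster-wide policy ( tests 1 and 2 ), the other two set the per-disk policy on the local disks ( tests 2 and 3 ).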

So the solution is to disable fencing on local SAS disks. I used this command:

cldev set -p default_fencing=nofencing DID_of_local_SAS_disk
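
To pick out which DIDs actually map to the internal SAS disks, and to verify the change afterwards, something like this should work ( d1 is a placeholder for whatever your own cldevice output reports ):

# cldevice list -v
# cldevice show d1 | grep -i fencing
# cldevice status

cldevice list -v prints the mapping between DID instances and the physical device paths, so the internal disks are easy to spot before changing their default_fencing property.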

I think this can happen on other systems with internal SAS disks.
I hope Oracle releases a fix soon 🙂


December 13, 2012

Sun Cluster 3.2: resources migration between two clusters

Posted in Sun Cluster at 11:28 am by alessiodini


Last week I was involved in this task.
Instead of the classic scsnapshot method, I used XML configuration files for the migration!

I wrote a new document in my library; you can see it here

I suggest using XML for every migration 😉
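
For the curious, the XML approach boils down to the export / create options of the object commands; a rough sketch from memory ( the resource group name and the file path are placeholders, and the exact options should be checked against clconfiguration(5CL) and the clresourcegroup / clresource man pages on your release ):

# clresourcegroup export -o /var/tmp/app-rg.xml app-rg
# clresourcegroup create -i /var/tmp/app-rg.xml +
# clresource create -a -i /var/tmp/app-rg.xml +

The export runs on the source cluster; the two create commands run on the destination cluster and rebuild the resource group and its resources from the XML file.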

October 15, 2012

Oracle Solaris Cluster 3.3 installation on VMware – step by step

Posted in Sun Cluster at 10:21 am by alessiodini


Finally I completed the procedure!
It was a long job; you can download it here 🙂
I read it many times, but I may have made some mistakes since I don’t speak fluent English. Let me know about any corrections, either to the language or to the tasks!

Have fun with the tutorial !

December 7, 2011

Oracle Solaris Cluster 4.0 is out!

Posted in Sun Cluster at 6:37 pm by alessiodini


Looking around on the Internet, I found that Oracle has released Oracle Solaris Cluster 4.0.
I’m curious about the new features; I’m excited !!

Here I’m reading about how to install and configure it on Oracle Solaris 11

I hope to play with both soon!!

September 29, 2011

Sun Cluster : troubleshooting Failback policies

Posted in Sun Cluster at 2:08 pm by alessiodini


Today I worked on a two-node 3.1u3 cluster.
During an activity I saw this error in the messages file:

Mismatch between the Failback policies for the resource group system-rg (True) and global service system-dg (False) detected.

After a conference call the customer asked me to set the resource group failback to False.

Reading the rg_properties man page, I found:

Failback (boolean)
A Boolean value that indicates whether to recalculate
the set of nodes where the group is online when the
cluster membership changes. A recalculation can cause
the RGM to bring the group offline on less preferred
nodes and online on more preferred nodes.

Default
False

Tunable
Any time

OK. “Any time” means that I can modify this property without taking the resource group offline.

I tried the command:

root@system1 # scrgadm -c -g system-rg -y Failback=False
system1 – Mismatch between the Failback policies for the resource group system-rg (True) and global service system1-dg (False) detected.

VALIDATE on resource system1-storage, resource group system1-rg, exited with non-zero exit status.
Validation of resource system1-storage in resource group system1-rg on node system1 failed.

mmm….

After some tests I finally changed it, doing the following in order ( the exact commands are sketched after the list ):

1) set the disk group failback policy to True ( at this point both the RG and the DG had the value True )
2) set the resource group failback policy to False ( at this point the RG had False and the DG had True )
3) set the disk group failback policy to False ( at this point both the RG and the DG had the value False )
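
A rough sketch of that sequence with the 3.1 commands ( the scconf device group syntax is from memory, so double-check it against the scconf man page ):

# scconf -c -D name=system1-dg,failback=enabled
# scrgadm -c -g system1-rg -y Failback=False
# scconf -c -D name=system1-dg,failback=disabled

Once the two policies matched, the validation stopped complaining and each step went through.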

I checked what I did with the scrgadm/scconf commands:

root@system1 # scrgadm -pvv -g system1-rg | head | grep Failback
(system1-rg) Res Group Failback: False

root@system1 # scconf -pv | grep system1-dg | grep failback
(system1-dg) Device group failback enabled: no

Nice!

September 9, 2011

Sun Cluster 3.0 & NAFO

Posted in Sun Cluster at 12:06 pm by alessiodini


Yesterday I worked on NAFO after a very long time 😉
My goal was to configure a single adapter in a new group.

I tried with the command:

# pnmset -c nafo1 -o create qfe3
nafo1: Must have exactly one configured adapter for the group

mmm

After a few checks I saw that the qfe3 interface was plumbed and configured, but it did not have the “UP” flag in the ifconfig output.
So I did:

# ifconfig qfe3 up
# pnmset -c nafo1 -o create qfe3

# pnmstat -l
group adapters status fo_time act_adp
nafo0 qfe0:qfe2 OK NEVER qfe0
nafo1 qfe3 OK NEVER qfe3

August 31, 2011

Sun Cluster 3.3 & LDOM

Posted in Sun Cluster at 12:29 pm by alessiodini


These days I was thinking about writing an LDOM agent.
My purpose was to have fun with LDOMs and Sun Cluster from the control domain. Even without that hardware ( T1000, T2000, etc. ) I was very curious about how to do it.
But looking on the web I found that Oracle has already developed this functionality, sigh 😦
So, starting from Sun Cluster 3.3 it is possible to manage LDOMs from the primary ( control ) domain and switch them as resources.
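
From what I read, the agent ships as the SUNW.ldom resource type; a rough sketch of how a guest domain would be put under cluster control ( the group / resource names are placeholders and the Domain_name property is how I remember it, so check the HA for Oracle VM Server for SPARC documentation ):

# clresourcetype register SUNW.ldom
# clresourcegroup create ldom-rg
# clresource create -g ldom-rg -t SUNW.ldom -p Domain_name=guest01 ldom-rs
# clresourcegroup manage ldom-rg
# clresourcegroup online ldom-rg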

July 15, 2011

Sun Cluster & HA-Oracle: failed to start listener

Posted in Sun Cluster at 10:15 am by alessiodini


Yesterday I was configuring HA-Oracle at a customer site.
When I tried to start the listener I saw these errors:

Jul 14 09:43:33 system SC[SUNWscor.oracle_listener.start]:system-rg:system-listener: [ID 847065 user.error] Failed to start listener SYSTEM
Jul 14 09:43:33 system Cluster.RGM.rgmd: [ID 878162 daemon.error] Method failed on resource in resource group , exit code , time used: 7% of timeout

Looking for more detailed messages, I saw these logs:

Jul 14 09:43:29 SC[SUNWscor.oracle_listener.start]:system-rg:system-listener: Starting listener SYSTEM.

LSNRCTL for Solaris: Version 8.1.7.4.0 – Production on 14-JUL-2011 09:43:30

(c) Copyright 1998 Oracle Corporation. All rights reserved.

TNS-01106: Listener using listener name LISTENER has already been started
Jul 14 09:43:33 SC[SUNWscor.oracle_listener.start]:system-rg:system-listener: Failed to start listener SYSTEM.

TNS-01106: Listener using listener name LISTENER has already been started

I did some checks and did not find any active listener process.
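
For reference, the checks were along these lines ( the listener name is a placeholder, use the one from your listener.ora ):

# ps -ef | grep tnslsnr
# su - oracle -c "lsnrctl status LISTENER"

The first looks for a running listener process, the second asks the listener itself for its status.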

This is what I thought: “!?!?”

I asked the DBA to verify the messages, and he told me that his process was not active.

Anyway, he tried to stop the listener manually and the command worked.

I thought again “!?!?”

After this I was able to start up the listener within the cluster.

May 20, 2011

Sun Cluster 3.2 experiment: boot two nodes without quorum device

Posted in Sun Cluster at 2:51 pm by alessiodini


Recently I did some experiments on my VMware toy cluster, where I have:

– two cluster nodes ( Solaris 10 – Sun Cluster 3.2u3 )
– a quorum server on a third system

I tried to boot both nodes with the quorum server offline. In order, I saw:

1) node1 tries to boot and waits for another node in order to form the cluster ( this is OK, because the requirement to form/maintain a cluster is to have more than 50% of the possible votes )

2) node2 joins the cluster ( at this moment the cluster has 2 votes, more than 50% ). After a few seconds both nodes panic because they lost the quorum device. This was unexpected for me. ( how I checked the vote counts is sketched below )
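
For reference, this is roughly how the vote counts and quorum status can be inspected on a 3.2 cluster ( either the old or the new command set works ):

# scstat -q
# clquorum status
# clquorum show

With two nodes plus a quorum server there are 3 configured votes and at least 2 are needed, which is exactly what makes the panic with both nodes up so puzzling.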

Thinking about this: a couple of days ago I worked at a customer with a similar scenario. They physically moved 12 systems from one location to another. We had Sun Cluster 3.0, Solaris 8 and VxVM 3.2. One of the three clusters booted up without any quorum device, and the other two clusters were not able to do this.

My questions here are:

1) Why did both nodes panic during the boot on my laptop? I think this is to prevent something like amnesia.
2) Why, 2 days ago, did I see an older Sun Cluster boot up correctly in the same scenario?

>_<

May 13, 2011

Sun Cluster 3.3 and /etc/reboot

Posted in Sun Cluster at 10:50 am by alessiodini


Today I’m studying the Sun Cluster 3.3 documentation and I read this:

8 Reboot all nodes into cluster mode.
# sync;sync;sync;/etc/reboot

I was very surprised to see the command “/etc/reboot”.
I logged into a 3.3 cluster and I found this file! I found that it is a link to the halt command … mmm.
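
Checking this on a node is straightforward:

# ls -l /etc/reboot
# ls -l /usr/sbin/reboot /usr/sbin/halt

The first command shows what /etc/reboot resolves to, the second shows the usual locations of the reboot and halt binaries for comparison.
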
So my questions are:

1) Why /etc/reboot ?
2) Does Sun Cluster 3.3 create this link, or not?

I want to know 😛
