Monday, January 10, 2011

LMON: terminated RAC node 2 instance due to error 29740, Node 2 is evicted from the cluster

Problem Description:

The database is a 3-node Oracle RAC cluster on Linux. LMON terminated the node 2 instance due to error ORA-29740, and node 2 was evicted from the cluster with the same error.


Info from the alert log of instance 2 (node 2)
====================================================

Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lmon_15344.trc:
ORA-29740: evicted by member 0, group incarnation 30
Thu Jan 6 08:51:38 2011
LMON: terminating instance due to error 29740
Thu Jan 6 08:51:38 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lms1_15399.trc:
ORA-29740: evicted by member , group incarnation
Thu Jan 6 08:51:38 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lms5_15426.trc:
ORA-29740: evicted by member , group incarnation
Thu Jan 6 08:51:38 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lms8_15438.trc:
ORA-29740: evicted by member , group incarnation
Thu Jan 6 08:51:38 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lms4_15422.trc:
ORA-29740: evicted by member , group incarnation
Thu Jan 6 08:51:38 2011


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Info from the alert log of instance 1 (node 1)
====================================================
alert_DBFAP1.log
~~~~~~~~~~~~~~~~
...
Thu Jan 6 08:43:30 2011
IPC Send timeout detected.Sender: ospid 18934
Receiver: inst 2 binc 366295975 ospid 15344
Thu Jan 6 08:43:55 2011
Communications reconfiguration: instance_number 2
Thu Jan 6 08:43:56 2011
Trace dumping is performing id=[cdmp_20110106084355]
Thu Jan 6 08:47:27 2011
Evicting instance 2 from cluster
Thu Jan 6 08:47:51 2011
Waiting for instances to leave:
2
Thu Jan 6 08:48:11 2011
Waiting for instances to leave:
....

where 18934 is the instance 1 LMON process
and 15344 is the instance 2 LMON process


************************************************************************



alert_DBFAP3.log
~~~~~~~~~~~~~~~
...
Thu Jan 6 08:47:27 2011
Evicting instance 2 from cluster
Thu Jan 6 08:47:51 2011
Waiting for instances to leave:
2
Thu Jan 6 08:48:11 2011
Waiting for instances to leave:
2
Thu Jan 6 08:48:31 2011
Waiting for instances to leave:
2

**************************************************************************



The trace files were also uploaded for further investigation:



dbfap2_lmon_15344.trc:
----------------------

*** 2011-01-06 01:43:45.239
SKGXP_TRACE_FLUSH: output truncated 11 traces lost
SKGXP_TRACE_FLUSH: start
SKGXPIWAIT: keepalive_reset elapsed 676348 ts 25100078 last ping 24423730 check 600000
. . .
kjfmrcvrchk: receiver LMS[5] check timed out (1294238295.1294238625.0).
kjfmrcvrchk: Dumping callstack of lms5
Submitting asynchronized dump request [20]
kjfmrcvrchk: receiver LMS[9] check timed out (1294238295.1294238625.0).
kjfmrcvrchk: Dumping callstack of lms9
Submitting asynchronized dump request [20]
*** 2011-01-06 08:46:52.715
Received ORADEBUG command 'dump errorstack 1' from process Unix process pid: 15310, image:
*** 2011-01-06 08:46:52.745
ksedst ksedmp ksdxfdmp ksdxcb sspuser __funlockfile __pread_nocancel skgfqio
ksfdread1 ksfdread kccrbp kccgrd kjxgrgetresults kjxgrimember kjxggpoll kjfmact kjfdact kjfcln
ksbrdp opirip opidrv sou2o opimai_real main

*** 2011-01-06 08:51:33.863
ksedst ksedmp ksdxfdmp ksdxcb sspuser __funlockfile __pwrite_nocancel skgfqio
ksfdgo ksfdbio kccwbp kccprd kccpcp kjxgrgetresults kjxgrimember kjxggpoll kjfmact kjfdact kjfcln
ksbrdp opirip opidrv sou2o opimai_real main
*** 2011-01-06 08:51:34.540
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 28 0.
*** 2011-01-06 08:51:34.541
kjxgrupdtcp: Updating ckpt progress rec, flgs 0x04
kjxgrgtld: current system load 6522 (hiload)

****************************


Observation: kjxgrgtld reports current system load 6522 (hiload)

The system appears to have been under very high load, which might be the reason for the eviction.


=============================================================

dbfap1_lmon_18934.trc:
---------------------
*** 2011-01-06 08:43:30.072
IPC Send timeout detected.Sender: ospid 18934
Receiver: inst 2 binc 366295975 ospid 15344
SKGXPCTX: 0x0x66af610 ctx
. . .
*** 2011-01-06 08:43:57.050
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 28 0.
*** 2011-01-06 08:43:57.051
kjxgrupdtcp: Updating ckpt progress rec, flgs 0x04
kjxgrupdmemthrd: Mem 1, thrd 2, vers 366295975, stat 0x0007 err 0x0002
kjxgrgtld: current system load 48 (normal)


+++++++++++++++++++++++++++++++++++++++++++++++++++

Observation: kjxgrgtld reports current system load 48 (normal)

So the system load is healthy; there is no problem on node 1 and node 3.




After going through the alert logs and trace files, we noticed the following points:


At about 'Thu Jan 6 08:43:30 2011' the instance 2 LMON process became unresponsive, and hence the instance was evicted from the cluster. The load on node 2 appears to have been well above normal; due to that high load, LMON may not have been able to respond in a timely manner.

Now we need to look at the OS statistics on node 2 during that time to further analyze the system load. If OS Watcher is already installed, get the logs that cover the time period 'Thu Jan 6 08:43:30 2011'.
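If OS Watcher is in place, the files covering the incident window can be pulled from its archive. A minimal sketch, assuming a default archive location and the usual per-hour `<host>_vmstat_YY.MM.DD.HHMM.dat` file naming (both the path and the name pattern are assumptions to adjust for your install):

```shell
#!/bin/sh
# Sketch: build a glob that matches OS Watcher vmstat files for one hour.
# OSW_ARCHIVE and the file-name pattern below are assumptions.
OSW_ARCHIVE=${OSW_ARCHIVE:-/u01/app/oswatcher/archive}

osw_glob() {
    # usage: osw_glob YY MM DD HH -> pattern for that hour's vmstat files
    echo "${OSW_ARCHIVE}/oswvmstat/*_vmstat_$1.$2.$3.$4*.dat"
}

# Files covering 'Thu Jan 6 08:43:30 2011':
osw_glob 11 01 06 08
# Then copy the matching vmstat/top/mpstat files off the node, e.g.
#   cp $(osw_glob 11 01 06 08) /tmp/osw_evidence/
```

The same pattern works for the oswtop and oswmpstat subdirectories, which show per-process and per-CPU detail for the same window.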


=============================================================


After investigation, Oracle suggested disabling DRM with the following parameters:


--------------------------------------------------------------------------------
_gc_undo_affinity= FALSE
_gc_affinity_time=0

Parameter to disable diag stack dumps:
--------------------------------------------------------------------------------
ksb_disable_diagpid = TRUE
=============================================================
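As a hedged sketch of how such parameters are typically applied in an spfile-managed RAC database (underscore parameters should only ever be set under Oracle Support's direction, and a rolling restart of the instances is needed for these to take effect):

```sql
-- Assumption: spfile-managed database; set on all instances.
ALTER SYSTEM SET "_gc_undo_affinity" = FALSE SCOPE = SPFILE SID = '*';
ALTER SYSTEM SET "_gc_affinity_time" = 0     SCOPE = SPFILE SID = '*';
-- Restart the instances (rolling) for the change to take effect.
```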



After applying these changes, we did not see the LMON issue for a day, but after one day the behavior reproduced.






Further investigation from the alert logs and trace files:


node1_alert_DBFAP1.log
======================
Mon Nov 8 17:46:44 2010
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 bond2 172.16.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 bond0 144.131.84.0 configured from OCR for use as a public interface
...
Cluster communication is configured to use the following interface(s) for this instance
172.16.0.7
Mon Nov 8 17:46:44 2010
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Mon Nov 8 17:46:44 2010
WARNING:Oracle instance running on a system with low open file descriptor
limit. Tune your system to increase this limit to avoid
severe performance degradation.
...
LMON started with pid=5, OS id=18934
...
Fri Jan 7 16:31:50 2011
IPC Send timeout detected.Sender: ospid 18934
Receiver: inst 2 binc 393203158 ospid 5629
Communications reconfiguration: instance_number 2
Fri Jan 7 16:32:00 2011
Trace dumping is performing id=[cdmp_20110107163200]
Fri Jan 7 16:33:47 2011
Evicting instance 2 from cluster
Fri Jan 7 16:33:50 2011
Trace dumping is performing id=[cdmp_20110107163350]
Fri Jan 7 16:33:56 2011
Reconfiguration started (old inc 64, new inc 68)
List of nodes:
0 2
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 not valid according to instance 2
=============================================================


Observation: the alert log warns that the Oracle instance is running on a system with a low open file descriptor limit, and advises tuning the system to increase this limit to avoid severe performance degradation.
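The effective limit is easy to verify from the oracle user's shell. A minimal sketch (the 65536 value is the commonly recommended hard limit for 10gR2 installs; verify the exact values against the installation guide for your platform):

```shell
#!/bin/sh
# Show the open-file-descriptor limits the current shell (and any Oracle
# instance started from it) is running with.
current_soft=$(ulimit -Sn)
current_hard=$(ulimit -Hn)
echo "soft nofile limit: $current_soft"
echo "hard nofile limit: $current_hard"

# To raise the limit persistently, add lines like these to
# /etc/security/limits.conf and re-login as the oracle user:
#   oracle  soft  nofile  65536
#   oracle  hard  nofile  65536
```

The instance inherits the limit from the shell (or init script) that starts it, so the warning disappears only after the instance is restarted under the raised limit.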




+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


node2_alert_DBFAP2.log
======================
Fri Jan 7 16:31:50 2011
IPC Send timeout detected. Receiver ospid 5629
Fri Jan 7 16:31:58 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lmon_5629.trc:
Fri Jan 7 16:32:00 2011
Trace dumping is performing id=[cdmp_20110107163200]
Fri Jan 7 16:33:50 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lmon_5629.trc:
ORA-29740: evicted by member 2, group incarnation 66
Fri Jan 7 16:33:50 2011
LMON: terminating instance due to error 29740
...
Fri Jan 7 16:33:50 2011
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_diag_5625.trc
Fri Jan 7 16:33:50 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_pmon_5623.trc:
ORA-29740: evicted by member , group incarnation
Fri Jan 7 16:33:50 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_ckpt_5705.trc:
ORA-29740: evicted by member , group incarnation
Fri Jan 7 16:33:51 2011
Shutting down instance (abort)
...
Fri Jan 7 16:33:59 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 bond2 172.16.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 bond0 144.131.84.0 configured from OCR for use as a public interface
...
Cluster communication is configured to use the following interface(s) for this instance
172.16.0.8
Fri Jan 7 16:33:59 2011
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Fri Jan 7 16:33:59 2011
WARNING:Oracle instance running on a system with low open file descriptor
limit. Tune your system to increase this limit to avoid
severe performance degradation.
=========






dbfap2_lmon_5629.trc
~~~~~~~~~~~~~~~~~~~~
...
*** 2011-01-07 16:32:07.029
kjxgrnetchk: Sending comm check req to 0
*** 2011-01-07 16:32:07.029
kjxgrnetchk: Sending comm check req to 2
kjxgrrcfgchk: prev pstate 5 mapsz 512
kjxgrrcfgchk: new bmp: 0 1 2
kjxgrrcfgchk: work bmp: 0 1 2
kjxgrrcfgchk: rr bmp: 0 1 2
*** 2011-01-07 16:32:07.029
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 64 0.
*** 2011-01-07 16:32:07.030
kjxgrupdtcp: Updating ckpt progress rec, flgs 0x04
kjxgrupdmemthrd: Mem 2, thrd 3, vers 534163215, stat 0x0037 err 0x0000
kjxgrgtld: current system load 2324 (normal)
kjxgrs0h: Reconfig rsn 3, flgs 0x04, st -344944946, bitmap: 0 1 2
*** 2011-01-07 16:32:07.050
Name Service frozen
.......



Final Conclusion:


1. The LMON trace file indicates that another reconfiguration had to be started due to a loss of communication.

This could be related to the known DRM issue in pre-10.2.0.4 databases.

We would like to suggest:
a) applying the latest patchset (10.2.0.5) to this database as soon as possible, or
b) disabling DRM as suggested in the previous action plan.

2. There is no direct relationship between DRM and the Resource Manager used in the database, so it should be safe to disable DRM.
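Before and after disabling DRM, its activity can be sanity-checked from the GCS mastership view. A hedged sketch (V$GCSPFMASTER_INFO exists in 10gR2, but verify the column names against your release):

```sql
-- Objects whose master instance has moved more than once suggest
-- active dynamic remastering (DRM).
SELECT object_id, current_master, previous_master, remaster_cnt
  FROM v$gcspfmaster_info
 WHERE remaster_cnt > 1
 ORDER BY remaster_cnt DESC;
```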


We therefore suspect a product defect in 10.2.0.3.
