Tuesday, January 25, 2011

10g Background Processes

10g new background processes
With 10g many new background processes were introduced.
This note highlights those.

MMAN

Memory Manager (MMAN) coordinates the sizing of the different memory components within the SGA. MMAN keeps track of the size of each component and allocates/de-allocates memory based on usage. This process is used only when Automatic Shared Memory Management (ASMM) is enabled.
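
A minimal sketch of how this looks in practice (assuming a 10g instance started from an spfile, with SGA_MAX_SIZE large enough; the 1G value is only an example): setting SGA_TARGET to a non-zero value enables ASMM, after which MMAN resizes the auto-tuned components.

SQL> alter system set sga_target = 1G scope=both;
-- components whose size parameters are left at 0 are then auto-tuned by MMAN
SQL> select component, current_size/1024/1024 size_mb from v$sga_dynamic_components;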


RVWR
The Recovery Writer (RVWR) process is responsible for writing flashback logs, which store pre-images of data blocks.
These logs can be used to flash the database back to a past point in time. This proves very handy in situations of logical data corruption, such as an accidental drop/truncate of a table or the commit of an unwanted transaction.
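
A quick sketch of how RVWR comes into play (assuming a flash recovery area is already configured): flashback logging is enabled with the database mounted, and RVWR starts once flashback is turned on.

SQL> shutdown immediate;
SQL> startup mount;
SQL> alter database flashback on;
SQL> alter database open;
SQL> select flashback_on from v$database;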


CTWR
The Change Tracking Writer (CTWR) process was introduced in 10g to track changed blocks and assist fast RMAN incremental backups.
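
For example (the tracking file path below is only an illustration), block change tracking can be enabled and verified as follows; CTWR appears once tracking is active.

SQL> alter database enable block change tracking using file '/u01/oradata/rman_change_track.f';
SQL> select status, filename from v$block_change_tracking;
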
MMNL
The Memory Monitor Light (MMNL) process, introduced in 10g, is responsible for writing out the statistics buffer to disk as needed. It works with the Automatic Workload Repository (AWR).



MMON
The Memory Monitor (MMON) process, introduced in 10g, is responsible for gathering statistics for the Automatic Workload Repository (AWR). This helps Oracle perform automatic problem detection and self-tuning.
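
A quick way to see the AWR snapshot settings that MMON works against, and (values below are examples only) how they can be changed with DBMS_WORKLOAD_REPOSITORY:

SQL> select snap_interval, retention from dba_hist_wr_control;
-- example only: 30-minute snapshots kept for 30 days (43200 minutes)
SQL> exec dbms_workload_repository.modify_snapshot_settings(interval => 30, retention => 43200);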


M000
M000 processes are MMON background slave processes, spawned as needed to carry out work on behalf of MMON.


RBAL
RBAL (Rebalance master) is the ASM-related process that coordinates rebalancing of disk resources controlled by ASM.

ARBx
These processes are managed by the RBAL process and are used to do the actual rebalancing of ASM controlled disk resources. The number of ARBx processes invoked is directly influenced by the asm_power_limit parameter.
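
As a sketch (run in the ASM instance; the disk group name DATA is an example), the rebalance power can also be set per operation, and V$ASM_OPERATION shows the ARBx work in progress:

SQL> alter diskgroup data rebalance power 4;
SQL> select group_number, operation, state, power, est_minutes from v$asm_operation;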

ASMB
The ASMB process is used to provide information to and from the Cluster Synchronization Services used by ASM to manage the disk resources. It is also used to update statistics and provide a heartbeat mechanism.

11g:

The following processes were added in 11g as new background processes.

1 dbrm - Database Resource Manager process
2 dia0 - Diagnosability process
3 fbda - Flashback Data Archiver process
4 vktm - Virtual Keeper of Time process
5 smco - Space Management Coordinator process
6 w000 - Space management slave process (spawned by smco)

NOTE : The above six are mandatory processes.

In all, 11g has 56 new processes added, which can be queried using:

select name,description from V$bgprocess;
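
To list only the background processes that are actually running in the instance (a commonly used variant; PADDR is 00 for processes that are defined but not started):

select name, description from v$bgprocess where paddr <> '00' order by name;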

10g background processes:

ASM and Cluster Synchronization Services – CSS



ASM was designed to work with single instance as well as with RAC 10g clusters.
ASM, even in a single-instance configuration, requires that Cluster Synchronization Services (CSS) is installed and started before ASM becomes available.

In a single instance, CSS maintains synchronization between the ASM and database instances. CSS, which is a component of Oracle’s Cluster Ready Services (CRS), is automatically installed on every node that runs Oracle Database 10g ASM and starts up automatically on server boot-up.

In RAC 10g environments, the full Oracle Clusterware (CRS) stack is installed on every RAC node.

If CSS does not start automatically on reboot, or if there is a hostname configuration change, the following commands can be used to reconfigure CSS on that host. Note that these commands need to be run from the ORACLE_HOME where CSS is running.

1. $ORACLE_HOME/bin/localconfig delete

2. $ORACLE_HOME/bin/localconfig add

Restore database from lost control file

Restoring the database after decommission of database:

I came across this scenario when I had to restore and recover a database after it had been decommissioned.


In our environment we have enabled the controlfile autobackup as below ...

RMAN> CONFIGURE CONTROLFILE AUTOBACKUP on;



When you back up the database using RMAN, the control file will be backed up as well; however, you need to be aware of where the control file autobackup is stored. You can find this by observing the output of the backup command.

For example, this is an extract from the RMAN output with autobackup enabled after issuing the BACKUP DATABASE command; in the last few lines of the RMAN output you will see something like the following:

Starting Control File Autobackup at 30-MAR-10

piece handle=/oracle/oracle/product/10.2.0/db_1/dbs/c-2012553556-20100330-00

Now you can restore the control file by connecting to the database in NOMOUNT mode:

SQL> startup nomount;

ORACLE instance started...

[oracle@ebilldev test]$ rman

Recovery Manager: Release 10.2.0.4.0 - Production on Tue Mar 30 17:01:44 2010
Copyright (c) 1982, 2007, Oracle. All rights reserved.
RMAN> set dbid=2012553556
executing command: SET DBID

Then you need to issue the following command so that RMAN looks in the right location:

RMAN> set controlfile autobackup format for device type
disk to '/oracle/oracle/product/10.2.0/db_1/dbs/%F';
RMAN> restore controlfile from autobackup;
Starting restore at 30-MAR-10
using channel ORA_DISK_1
channel ORA_DISK_1: looking for autobackup on day: 20100330
channel ORA_DISK_1: autobackup found: /oracle/oracle/product/10.2.0/db_1/dbs/c-2012553556-20100330-00
channel ORA_DISK_1: control file restore from autobackup complete
output filename=/recover/test/oraData/control01.ctl
output filename=/recover/test/oraData/control02.ctl
output filename=/recover/test/oraData/control03.ctl
Finished restore at 30-MAR-10

Then mount, recover, and open the database with RESETLOGS:

RMAN> alter database mount;
database mounted
released channel: ORA_DISK_1
RMAN> recover database;
Starting recover at 30-MAR-10
configuration for DISK channel 2 is ignored
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=47 devtype=DISK
starting media recovery
archive log thread 1 sequence 48 is already on disk as file /recover/test/redo/redo03.log
archive log filename=/recover/test/redo/redo03.log thread=1 sequence=48
media recovery complete, elapsed time: 00:00:00
Finished recover at 30-MAR-10
RMAN> alter database open resetlogs;
database opened

Monday, January 24, 2011

Oracle Notification Services : ==>>


ONS allows users to send SMS messages, e-mails, voice notifications, and fax messages in an easy-to-access manner.

Oracle Clusterware uses ONS to send notifications about the state of the database instances to mid-tier
applications that use this information for load balancing and for fast failure detection.


ONS is a daemon process that communicates with ONS daemons on other nodes; the daemons inform
each other of the current state of the database components on the database server.


For example, if a listener, node, or service goes down, a down event is triggered by the EVMD process, which is then sent by the local ONS daemon to the ONS daemon processes on other nodes, including all clients and application servers participating in the network. Only nodes or client machines that have the ONS daemon running and have registered with each other will receive such notifications. Once the ONS on a client machine receives the notification, the application (if using an Oracle-provided API) will determine, based on the notification, which nodes and instances have had a state change and will handle new connection requests appropriately. ONS informs the application of state changes, allowing the application to respond proactively instead of in the traditional reactive manner.

ONS configuration ==>>

ONS is installed and configured as part of the Oracle Clusterware installation.
Execution of the root.sh file on Unix and Linux-based systems during the Oracle Clusterware installation creates and starts ONS on all nodes participating in the cluster. This can be verified using the crs_stat utility provided by Oracle.



Configuration of ONS involves registering all nodes and servers that will communicate with the ONS daemon on the database server.
During Oracle Clusterware installation, all nodes participating in the cluster are automatically registered with the ONS.
Subsequently, during restart of the clusterware, ONS will register all nodes with the respective ONS processes
on other nodes in the cluster.

To add additional members or nodes that should receive notifications, the hostname or IP address of the node should be added to the ons.config file. The configuration file is located in the $ORACLE_HOME/opmn/conf directory and has the following format:

[oracle@oradb4 oracle]$ more $ORACLE_HOME/opmn/conf/ons.config
localport=6101
remoteport=6201
loglevel=3
useocr=on
nodes=oradb4.sumsky.net:6101,oradb2.sumsky.net:6201,
oradb1.sumsky.net:6201,oradb3.sumsky.net:6201,
onsclient1.sumsky.net:6200,onsclient2.sumsky.net:6200

The localport is the port that ONS binds to on the local host interface to talk to local clients.
The remoteport is the port that ONS binds to on all interfaces to talk to other ONS daemons.

The loglevel indicates the amount of logging that should be generated. Oracle supports logging levels
from 1 through 9. ONS logs are generated in the $ORACLE_HOME/opmn/logs directory on the respective instances.

The useocr parameter (valid values are on/off) indicates whether ONS should use the OCR to determine which instances and
nodes are participating in the cluster.

The nodes listed in the nodes line are all nodes in the network that will need to receive or send event notifications.


ONS logging ==>>

ONS events can be tracked via logs on both the server side and the client
side.

ONS logs are written to the $ORACLE_HOME/opmn/logs directory.

The default logging level is set to three. Depending on the level of tracking
desired, this can be changed by modifying the ons.config file located in
the $ORACLE_HOME/opmn/conf directory discussed earlier.

Logging at level eight provides event information received by the ONS on the client
machines.

The following extract from the ONS log file illustrates the various stages
of the SRV1 HA service as it transitions from a DOWN state to an UP state:

05/06/18 17:41:11 [7] Connection 25,192.168.2.30,6200 Message content
length is 94
05/06/18 17:41:11 [7] Connection 25,192.168.2.30,6200 Body using 94
of 94 reUse
05/06/18 17:41:11 [8] Connection 25,192.168.2.30,6200 body:
VERSION=1.0 service=SRV1 instance=SSKY1 database=SSKYDB host=oradb1
status=down reason=failure
05/06/18 17:41:11 [8] Worker Thread 120 checking receive queue
05/06/18 17:41:11 [8] Worker Thread 120 sending event 115 to servers
05/06/18 17:41:11 [8] Event 115 route:
3232236062,6200;3232236072,6200;3232236135,6200
05/06/18 17:41:20 [7] Connection 25,192.168.2.30,6200 Message content
length is 104
05/06/18 17:41:20 [7] Connection 25,192.168.2.30,6200 Body using 104
of 104 reUse
05/06/18 17:41:20 [8] Connection 25,192.168.2.30,6200 body:
VERSION=1.0 service=SRV1 instance=SSKY1 database=SSKYDB host=oradb1
status=not_restarting reason=UNKNOWN
05/06/18 17:41:20 [8] Worker Thread 120 checking receive queue
05/06/18 17:41:20 [8] Worker Thread 120 sending event 125 to servers
05/06/18 17:41:20 [8] Event 125 route:
3232236062,6200;3232236072,6200;3232236135,6200
05/06/18 18:22:30 [9] Worker Thread 2 sending body [135:128]:
connection 6,10.1.2.168,6200
VERSION=1.0 service=SRV1 instance=SSKY2 database=SSKYDB host=oradb2
status=up card=2 reason=user
05/06/18 18:22:30 [7] Worker Thread 128 checking client send queues
05/06/18 18:22:30 [8] Worker queuing event 135 (at head): connection
10,10.1.2.177,6200
05/06/18 18:22:30 [8] Worker Thread 124 checking receive queue
05/06/18 18:22:30 [7] Worker Thread 124 checking server send queues
05/06/18 18:22:30 [8] Worker Thread 124 processing send queue:
connection 10,10.1.2.177,6200
05/06/18 18:22:30 [9] Worker Thread 124 sending header [2:135]:
connection 10,10.1.2.177,6200


This extract from the ONS log file illustrates three notifications received
from the ONS server node oradb1 containing instance SSKY1 and application
service SRV1.

The three notifications received at different times indicate
various stages of the service SRV1.


The first message indicates a notification regarding the failure of SRV1 on instance SSKY1.

"VERSION=1.0 service=SRV1 instance=SSKY1 database=SSKYDB host=oradb1
status=down reason=failure"



The second message indicates a notification regarding a restart attempt of service SRV1
on the same node oradb1. This restart notification also indicates that the
instance and node are healthy, or else it would not attempt to restart on the
same node.

"VERSION=1.0 service=SRV1 instance=SSKY1 database=SSKYDB host=oradb1
status=not_restarting reason=UNKNOWN "


The third message is an UP event notification from the server to
the client indicating that the service has started on node oradb2 (instead of
its original node).

"VERSION=1.0 service=SRV1 instance=SSKY2 database=SSKYDB host=oradb2
status=up card=2 reason=user"

Once this message is received, the application can resume
connections using the service SRV1. This illustrates that the service SRV1 has
relocated from node oradb1 to oradb2.









===>>>>>>>>>>>>

FAN ( Fast application notification ):

When state changes occur on a cluster, node, or instance in a RAC environment, an event is triggered by the Event Manager and propagated by ONS to the client machines.
Such events that communicate state changes are termed FAN events and have a predefined structure.

FAN is a new feature introduced in Oracle Database 10g RAC to proactively notify applications regarding the status of
the cluster and any configuration changes that take place.

FAN uses the Oracle Notification Services (ONS) for the actual notification of the event to its other ONS clients.


ONS provides and supports several callable interfaces that can be used by different applications to take advantage of the HA
solutions offered in Oracle Database 10g RAC.


Oracle supports two types of events:

1. Service events. Service events are application events and contain state changes that will only affect clients that use the service.
Normally, such events indicate database, instance-level, and application service failures.


2. System events. System events are more global and represent events such as node and communication failures. Such events affect all
services supported on the specific system (e.g., cluster membership changes, such as a node leaving or joining the cluster).




====>>>>

Cluster Ready Services daemon (CRSD):

CRSD maintains configuration profiles as well as resource status in the OCR.
CRSD creates a dedicated process called RACGIMON that monitors the health of the database instances and the ASM instance.




===>>>

Creating group and user ===>>

An operating system group needs to be created that will be associated with the Oracle Central Inventory (oraInventory).
The oraInventory contains a registry of the Oracle home directories of all Oracle products installed on the host.



====>>>>


RAC Healthy Instances May Die With Error ORA-29702 When Other RAC Instances Are Hung [ID 789196.1]


RAC instances (ASM and RDBMS) may crash with a reconfiguration timeout when other instances hang; for example, some instances get hung due to OS or other issues during the reconfiguration, which prevents the reconfiguration phase from completing.

The alert log of the crashed RAC instance will show:

Thu Feb 12 19:47:12 2009
Reconfiguration started (old inc 5, new inc 6)
List of nodes:
0 2 3 4
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
* dead instance detected - domain 2 invalid = TRUE
* dead instance detected - domain 3 invalid = TRUE
Communication channels reestablished
Thu Feb 12 20:03:56 2009
NOTE: database cdfm1p3:cdfm1p failed during msg 19, reply 2
Thu Feb 12 20:04:52 2009
Error: KGXGN polling error (15)
Thu Feb 12 20:04:52 2009
Errors in file /opt/oracle/admin/+ASM/bdump/+asm3_lmon_4479.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Thu Feb 12 20:04:56 2009
Dump system state for local instance only
System State dumped to trace file /opt/oracle/admin/+ASM/bdump/+asm3_diag_4475.
trc
Thu Feb 12 20:04:57 2009
Trace dumping is performing id=[cdmp_20090212200457]
Thu Feb 12 20:04:58 2009
Instance terminated by LMON, pid = 4479


The LMON tracefile of the crashed instances will show :

kjxgmpoll: terminate the CGS reconfig.
Error: KGXGN polling error (15)
error 29702 detected in background process
ORA-29702: error occurred in Cluster Group Service operation
ksuitm: waiting up to [5] seconds before killing DIAG




Note that the ocssd.log may not contain any obvious information about the instance crash.

Instances can get hung during reconfiguration. When this occurs, the reconfiguration process (LMON) gets stuck as well, and healthy instances start to die once the CGS reconfiguration timeout is reached.






====>>>

Like normal database instances, an ASM instance has the usual background processes such as SMON, PMON, DBWR, CKPT and LGWR.
In addition, the ASM instance also has the following background processes:


RBAL - Rebalance master: It opens all the device files as part of disk discovery and coordinates the ARB processes for rebalance activity.

ARBx - Actual Rebalancer: They perform the actual rebalancing activities. The number of ARBx processes depends on the ASM_POWER_LIMIT init parameter.

ASMB - ASM Bridge: This process is used to provide information to and from the Cluster Synchronization Service (CSS) used by ASM to manage the disk resources. It is also used to update statistics and provide a heartbeat mechanism.

Friday, January 21, 2011

How To Recover From A Drop/Truncate/Delete Table Done On Primary Using Flashback On A Standby

Goal =>

In this article we will be discussing the steps needed to recover
from a drop/truncate/delete table done on primary using flashback on a standby database.

Recovering the table using flashback on the standby eliminates downtime on the primary database.


Solution ===>>>>>



Prerequisites: ++++


- The standby database must be in flashback database mode.

- The time at which the drop/truncate/delete happened must be within the db_flashback_retention_target, and all the required flashback and archive logs must be available (the queries after this list can be used to verify this).
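
A quick sketch of the checks (run on the standby database):

SQL> select flashback_on from v$database;
SQL> select oldest_flashback_scn,
            to_char(oldest_flashback_time,'DD-MON-YYYY HH24:MI:SS') oldest_flashback_time
     from v$flashback_database_log;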




Steps: ===>>>

1) Stop the managed recovery process.

SQL > recover managed standby database cancel;


2) In order to recover the lost data, use the flashback database feature to flash the standby back to a time just before the drop/truncate/delete happened.

Example
-------
SQL > shutdown immediate;

SQL > startup mount;

SQL > flashback database to timestamp to_date('29-MAY-2009 14:30:00','DD-MON-YYYY HH24:MI:SS');


3) Open the standby database in Read Only mode.

SQL> alter database open read only;


4) Check for the required data availability.

5) Take an export of the required table.

6) Import the table on the primary database.

7) Shutdown the standby database, mount and then roll forward the standby database to current time using the recover command.

SQL > shutdown immediate
SQL > startup mount
SQL > recover standby database;


8) Once all the archives are applied manually, you can start the MRP process to apply the archives automatically as and when they are shipped.

SQL > recover managed standby database disconnect from session;

Logical Standby Databases - Support for DSS-xxxx Errors

What to do when you receive a DSS-00001: SQL Apply is not running:

Confirm that the logical standby is not applying changes and that there is actually an issue.
Check the SqlApply Events to determine the latest error and determine if the SqlApply was stopped manually, or has failed due to an error.

If it was stopped manually, caused by a database restart, or some other non-error reason, restart the SQL Apply and confirm that Apply is catching up.

If it failed due to an error, combine the information from the SqlApply Events and the database alert log and take appropriate action:

All schema-related errors need to be sent to application support, such as:
ORA-01403 "no data found"
ORA-00955 "name is already used by an existing object"
Any errors relating to database parameters, memory issues, ORA-600 errors and datafile errors should be resolved by the DBA.

Do not skip any transactions without being 100% sure that there will be no impact. For example, if for some reason there is a DDL statement to drop an object that does not exist (although this should not occur), no impact would be caused by skipping the drop statement for a non-existent object.

What to do when you receive a DSS-00002: Logical standby guard status set to None
Nothing needs to be done immediately. This is really a warning that the guard status is not set correctly. This can always be resolved during business hours.
What to do when you receive a DSS-00003: Logical standby apply latency greater than .....
Check that the SqlApply is still running
Double check if the SqlApply is up to date. It can catch up quickly sometimes. Run this check a number of times to determine if the SqlApply is progressing, and if it is catching up or falling behind further.

Check what the SqlApply processes are doing. This will show whether they are waiting for an archive log, processing a transaction, or paging LCRs out to disk.

If SqlApply is waiting for an archivelog, check the archive log destinations for space issues and that the destinations on the primary and standby are not in error.

If the SqlApply has been restarted and a log that is still required had been deleted, there are cases when the log will not ship as a FAL request and needs to be copied manually.

If you find a specific apply process looks to be hung on a single transaction, obtain the SID for that SqlApply process and perform normal dba investigation as you would for any troublesome session.



What to do when you receive a DSS-00004: Archive log shipping contains .... gap(s) between sequences ....

Check that the archive log destinations are correct and not in error. Check that the FAL parameters (SHOW PARAMETER FAL) are correct; gaps should then be resolved automatically.

All the queries required to perform these checks are in the section below.
Answers and queries to help monitor the Logical standby databases.

The formatting for these queries assumes that SQL*Plus is used and that a 132-character-wide display is available.

Is this database a Logical standby?

Run the following SQL against the database:

SELECT DATABASE_ROLE FROM V$DATABASE;

Logical standby databases will show "LOGICAL STANDBY"


Is the Logical standby applying changes?

Run the following SQL against the Logical standby database:

SET LINESIZE 132
COLUMN REALTIME_APPLY FORMAT A15
COLUMN STATE FORMAT A60
SELECT REALTIME_APPLY, STATE FROM V$LOGSTDBY_STATE;

If the value of STATE is "NULL" or "SQL APPLY NOT ON" then the Sql Apply is not running.

The value of REALTIME_APPLY should be Y to allow for real time apply from the standby redo logs.





Is the SQL Apply up to date?

Run the following SQL against the Logical standby database:

SELECT
TO_CHAR(LATEST_TIME,'yyyy/mm/dd hh24:mi:ss') "LATEST_TIME",
TO_CHAR(APPLIED_TIME,'yyyy/mm/dd hh24:mi:ss') "APPLIED_TIME",
APPLIED_SCN, LATEST_SCN
FROM V$LOGSTDBY_PROGRESS;

If the LATEST_TIME and the APPLIED_TIME are close, then the SQL Apply is working fine.

A NULL value in the APPLIED_TIME may indicate that the SQL Apply is not running.
If the LATEST_TIME is not close to the current real time, there may be an issue with receiving the archive logs from the primary.


Which archive log sequences are at what stage for logical standby?

Run the following SQL against the Logical standby database:

SELECT 'RESTART' "TYPE", P.RESTART_SCN "SCN", TO_CHAR(P.RESTART_TIME,'yyyy/mm/dd hh24:mi:ss') "TIME", L.SEQUENCE# "SEQ#"
FROM V$LOGSTDBY_PROGRESS P,DBA_LOGSTDBY_LOG L WHERE P.RESTART_SCN >= L.FIRST_CHANGE# and P.RESTART_SCN < L.NEXT_CHANGE#
UNION
SELECT 'RESTART', P.RESTART_SCN, TO_CHAR(P.RESTART_TIME,'yyyy/mm/dd hh24:mi:ss'), L.SEQUENCE#
FROM V$LOGSTDBY_PROGRESS P, V$STANDBY_LOG L WHERE P.RESTART_SCN >= L.FIRST_CHANGE# and P.LATEST_SCN <=L.LAST_CHANGE#
UNION
SELECT 'APPLIED', P.APPLIED_SCN, TO_CHAR(P.APPLIED_TIME,'yyyy/mm/dd hh24:mi:ss'), L.SEQUENCE#
FROM V$LOGSTDBY_PROGRESS P,DBA_LOGSTDBY_LOG L WHERE P.APPLIED_SCN >= L.FIRST_CHANGE# and P.APPLIED_SCN < L.NEXT_CHANGE#
UNION
SELECT 'APPLIED', P.APPLIED_SCN, TO_CHAR(P.APPLIED_TIME,'yyyy/mm/dd hh24:mi:ss'), L.SEQUENCE#
FROM V$LOGSTDBY_PROGRESS P, V$STANDBY_LOG L WHERE P.APPLIED_SCN >= L.FIRST_CHANGE# and P.LATEST_SCN <=L.LAST_CHANGE#
UNION
SELECT 'MINING', P.MINING_SCN, TO_CHAR(P.MINING_TIME,'yyyy/mm/dd hh24:mi:ss'), L.SEQUENCE#
FROM V$LOGSTDBY_PROGRESS P, DBA_LOGSTDBY_LOG L WHERE P.MINING_SCN >= L.FIRST_CHANGE# and P.MINING_SCN < L.NEXT_CHANGE#
UNION
SELECT 'MINING', P.MINING_SCN, TO_CHAR(P.MINING_TIME,'yyyy/mm/dd hh24:mi:ss'), L.SEQUENCE#
FROM V$LOGSTDBY_PROGRESS P, V$STANDBY_LOG L WHERE P.MINING_SCN >= L.FIRST_CHANGE# and P.LATEST_SCN <=L.LAST_CHANGE#
UNION
SELECT 'SHIPPED', P.LATEST_SCN, TO_CHAR(P.LATEST_TIME,'yyyy/mm/dd hh24:mi:ss'), L.SEQUENCE#
FROM V$LOGSTDBY_PROGRESS P, DBA_LOGSTDBY_LOG L WHERE P.LATEST_SCN >= L.FIRST_CHANGE# and P.LATEST_SCN < L.NEXT_CHANGE#
UNION
SELECT 'SHIPPED', P.LATEST_SCN, TO_CHAR(P.LATEST_TIME,'yyyy/mm/dd hh24:mi:ss'), L.SEQUENCE#
FROM V$LOGSTDBY_PROGRESS P, V$STANDBY_LOG L WHERE P.LATEST_SCN >= L.FIRST_CHANGE# and P.LATEST_SCN <=L.LAST_CHANGE#;


An important piece of information is the SEQ# associated with RESTART. Although archive logs may have been mined and applied, the archive logs back to the RESTART SEQ# are required in the event that the SQL Apply is restarted.



********

What major Sql Apply events have occurred?

Run the following SQL against the Logical standby database:

SET LINESIZE 200
SET LONG 400
SET PAGESIZE 999
column EVENT_TIME FORMAT A20
column STATUS FORMAT A50
column EVENT FORMAT A100
SELECT TO_CHAR(EVENT_TIME,'YYYY/MM/DD HH24:MI:SS') "EVENT_TIME", STATUS, EVENT FROM DBA_LOGSTDBY_EVENTS ORDER BY EVENT_TIME;


By default, only the last 100 events are kept; you may be lucky enough to find older events in the alert.log file, but this will not always show all the information.
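
If 100 events is not enough, the number of events retained in DBA_LOGSTDBY_EVENTS can be raised with the SQL Apply parameter MAX_EVENTS_RECORDED (the value below is only an example; SQL Apply may need to be stopped before changing apply parameters):

SQL> exec dbms_logstdby.apply_set('MAX_EVENTS_RECORDED', 1000);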


What major Dataguard events have occurred?
Run the following SQL against the Logical standby database:
SET PAGESIZE 999
SET LINESIZE 132
column TIME FORMAT A19
column ERROR FORMAT 999999
column DEST FORMAT 9999
column MESSAGE FORMAT A90
SELECT TO_CHAR(TIMESTAMP,'yyyy/mm/dd hh24:mi:ss') "TIME", ERROR_CODE "ERROR", DEST_ID "DEST", MESSAGE
FROM V$DATAGUARD_STATUS
WHERE timestamp > TRUNC(sysdate+6/24)
ORDER by timestamp DESC;


Which objects or statements are being skipped by Sql Apply?

Run the following SQL against the Logical standby database:

set linesize 132
set pagesize 999
column STATEMENT_OPT FORMAT A30
column OWNER FORMAT A20
column NAME FORMAT A30
column PROC FORMAT A30
SELECT STATEMENT_OPT, OWNER, NAME, PROC FROM DBA_LOGSTDBY_SKIP;

To check if there are specific transactions scheduled to be skipped, run the following SQL against the Logical standby database:

SELECT * FROM DBA_LOGSTDBY_SKIP_TRANSACTION;
What are the running Sql Apply processes doing?
Run the following SQL against the Logical standby database:
set linesize 132
COLUMN SID FORMAT 99999
COLUMN SERIAL# FORMAT 9999999
COLUMN LOGSTDBY_ID FORMAT 99999
COLUMN SPID FORMAT A10
COLUMN TYPE FORMAT A12
COLUMN STATUS_CODE FORMAT 99999
COLUMN STATUS FORMAT A50
COLUMN HIGH_SCN FORMAT 9999999999
SELECT SID, SERIAL#, LOGSTDBY_ID, SPID, TYPE, STATUS_CODE, HIGH_SCN, STATUS
FROM v$logstdby_process;


The STATUS column is very descriptive of what each Sql Apply process is doing.

Where are the archive logs going and are there any archiving issues?

Run the following SQL against either the logical standby or primary database:
set linesize 150
column DID FORMAT 999
column STATUS FORMAT A10
column DESTINATION FORMAT A30
column ARCHIVER FORMAT A4
column VALID_TYPE FORMAT A15
column VALID_ROLE FORMAT A12
column VALID_NOW FORMAT A16
column RECOVERY_MODE FORMAT A30
column ERROR FORMAT A40
SELECT DEST_ID "DID", STATUS, DESTINATION, ARCHIVER, VALID_NOW, VALID_TYPE, VALID_ROLE, ERROR FROM V$ARCHIVE_DEST
WHERE STATUS <> 'INACTIVE';


What are the current Sql Apply settings and statistics?

Run the following SQL against either the logical standby or primary database:
SELECT NAME, VALUE FROM V$LOGSTDBY_STATS;

There is a lot of useful information contained in this view.

How do I tell if there are not enough Apply processes?

Run the following SQL against either the logical standby or primary database:

COLUMN TRAN_APPLIED FORMAT A10
COLUMN TRAN_READY FORMAT A10
COLUMN NUM_APPLIERS FORMAT A10
COLUMN RATIO FORMAT 9999999999
SELECT TA.TRAN_APPLIED, TR.TRAN_READY, AP.NUM_APPLIERS, (TR.TRAN_READY - TA.TRAN_APPLIED) / AP.NUM_APPLIERS "RATIO"
FROM
(SELECT VALUE "TRAN_READY" FROM V$LOGSTDBY_STATS WHERE NAME ='transactions ready') TR,
(SELECT VALUE "TRAN_APPLIED" FROM V$LOGSTDBY_STATS WHERE NAME ='transactions applied') TA,
(SELECT VALUE "NUM_APPLIERS" FROM V$LOGSTDBY_STATS WHERE NAME ='number of appliers') AP;

The RATIO column should return a value of 0 or greater.

A value of between 0 and 1 means there are idle apply processes and there is no gain by adding additional apply processes.

A value of between 1 and 2 means there are queued transactions waiting to be applied and there may be minimal gain by adding additional apply processes.

A value greater than 2 shows contention on the apply processes and indicates that additional apply processes need to be added.

This check needs to be run over a period of time, as a single execution only gives a point-in-time measurement.
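
If the ratio indicates contention, the number of apply processes can be increased with the APPLY_SERVERS apply parameter (the value 8 below is only an example; SQL Apply usually has to be stopped before the change and restarted afterwards):

SQL> alter database stop logical standby apply;
SQL> exec dbms_logstdby.apply_set('APPLY_SERVERS', 8);
SQL> alter database start logical standby apply immediate;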


How do I skip a specific transaction?

1. Obtain the XIDUSN, XIDSLT, AND XIDSQN values for the failing transaction from the DBA_LOGSTDBY_EVENTS table. The following SQL may be of help:

SET LINESIZE 200
SET LONG 400
SET PAGESIZE 999
column EVENT_TIME FORMAT A20
column STATUS FORMAT A50
column CURRENT_SCN FORMAT 999999999999999
column COMMIT_SCN FORMAT 999999999999999
column XIDUSN FORMAT 999999
column XIDSLT FORMAT 999999
column XIDSQN FORMAT 999999
SELECT TO_CHAR(EVENT_TIME,'YYYY/MM/DD HH24:MI:SS') "EVENT_TIME", STATUS, CURRENT_SCN, COMMIT_SCN, XIDUSN, XIDSLT, XIDSQN
FROM DBA_LOGSTDBY_EVENTS
WHERE EVENT_TIME > SYSDATE-1
and status like 'ORA-00955%'
ORDER BY EVENT_TIME;


2. Execute the skipped transaction manually with any changes as needed

3. Skip the failed DDL statement by using the DBMS_LOGSTDBY.SKIP_TRANSACTION procedure, with the values from step 1.

exec dbms_logstdby.skip_transaction(<XIDUSN>, <XIDSLT>, <XIDSQN>);

The following statement may be helpful:

SELECT 'exec dbms_logstdby.skip_transaction( '||XIDUSN||','|| XIDSLT||','|| XIDSQN||');' FROM DBA_LOGSTDBY_EVENTS where ...


4. Restart Sql Apply.
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;

How do I stop and start the SqlApply processes?
Run the following SQL against the logical standby to start realtime SqlApply:
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;


Run the following SQL against the logical standby to start realtime SqlApply if the SqlApply failed with an error, and you are 100% certain that the transaction is safe to skip :

ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE SKIP FAILED TRANSACTION;

Run the following SQL against the logical standby to stop SqlApply:
ALTER DATABASE STOP LOGICAL STANDBY APPLY;

Wednesday, January 19, 2011

Global cache services (LMSn)

Global cache services (LMSn) are processes that, when spawned by Oracle,
copy blocks directly from the holding instance’s buffer cache and send a
read-consistent copy of the block to the foreground process on the requesting
instance.

LMS also performs a rollback on any uncommitted transactions
for any blocks that are being requested for consistent read by the remote
instance.The number of LMS processes running is driven by the parameter

GCS_SERVER_PROCESSES.

Oracle supports up to 36 LMS processes (0–9 and a–z).


If the parameter is not defined, Oracle will start two LMS processes, which is the default value of GCS_SERVER_PROCESSES.
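
To check the current setting and, as a hedged sketch, to change it (the value 4 is only an example; the parameter is static, so the change requires an instance restart):

SQL> show parameter gcs_server_processes
SQL> alter system set gcs_server_processes = 4 scope=spfile sid='*';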

FAL_CLIENT and FAL_SERVER Parameters

FAL (Fetch Archive Log) ===>>

Under certain circumstances, either because of network failures or a
busy source environment, a gap builds up between the target sending the
information and the destination receiving the changes and applying them.
Since the MRP/LSP process has no direct communication link with the primary
database to obtain a copy of the missing archive files, such gaps or
archive gaps are resolved by the fetch archive log (FAL) client and server,
identified by the initialization parameters FAL_CLIENT and FAL_SERVER.



FAL_SERVER specifies the FAL (fetch archive log) server for a standby database.
The value is an Oracle Net service name, which is assumed to be configured properly on the standby database system to point to the desired FAL server.

FAL_CLIENT specifies the FAL (fetch archive log) client name that is used by the FAL service, configured through the FAL_SERVER parameter, to refer to the FAL client. The value is an Oracle Net service name, which is assumed to be configured properly on the FAL server system to point to the FAL client (standby database). Given the dependency of FAL_CLIENT on FAL_SERVER, the two parameters should be configured or changed at the same time.




FAL_CLIENT and FAL_SERVER are initialization parameters used to configure log gap detection and resolution at the standby database side of a physical database configuration. This functionality is provided by log apply services
and is used by the physical standby database to manage the detection and resolution of archived redo logs.


FAL_CLIENT and FAL_SERVER only need to be defined in the initialization parameter file for the standby database(s). It is possible, however, to define these two parameters in the initialization parameter file for the primary database to reduce the amount of work that would be needed if the primary database were required to transition its role.

FAL_CLIENT specifies the TNS network services name for the standby database (which is sent to the FAL server process by log apply services) that should be used by the FAL server process to connect to the standby database.
The syntax would be:

FAL_CLIENT=

FAL_SERVER specifies the TNS network service name that the standby database should use to connect to the FAL server process.
The syntax would be:

FAL_SERVER=

Taken from http://www.idevelopment.info/data/Oracle/DBA_tips/Data_Guard_9i/DG_75.shtml#Using%20FAL_CLIENT%20and%20FAL_SERVER

In other words, where my primary database is AFINI and the standby is STBYAF:

Then for STBYAF:

fal_client = stbyaf

fal_server = afini

And for AFINI:

fal_client = afini

fal_server = stbyaf

The key is to always set these parameters as if the database were acting as the standby.
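
A minimal sketch using the service names above (AFINI and STBYAF are this example's TNS aliases); both parameters are dynamically modifiable, so on the standby you could set them with:

SQL> alter system set fal_server = 'AFINI'  scope=both;
SQL> alter system set fal_client = 'STBYAF' scope=both;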

Monday, January 10, 2011

LMON: terminated RAC node 2 instance due to error 29740, Node 2 is evicted from the cluster

Problem Description:

The database is a 3-node Oracle RAC cluster on a Linux environment. LMON terminated the node 2 instance due to error 29740, and node 2 was evicted from the cluster with the same error.


Info from Alert log from instance 2 (node 2)
====================================================

Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lmon_15344.trc:
ORA-29740: evicted by member 0, group incarnation 30
Thu Jan 6 08:51:38 2011
LMON: terminating instance due to error 29740
Thu Jan 6 08:51:38 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lms1_15399.trc:
ORA-29740: evicted by member , group incarnation
Thu Jan 6 08:51:38 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lms5_15426.trc:
ORA-29740: evicted by member , group incarnation
Thu Jan 6 08:51:38 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lms8_15438.trc:
ORA-29740: evicted by member , group incarnation
Thu Jan 6 08:51:38 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lms4_15422.trc:
ORA-29740: evicted by member , group incarnation
Thu Jan 6 08:51:38 2011


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Info from Alert log from instance 1 (node 1)
====================================================
alert_DBFAP1.log
~~~~~~~~~~~~~~~~
...
Thu Jan 6 08:43:30 2011
IPC Send timeout detected.Sender: ospid 18934
Receiver: inst 2 binc 366295975 ospid 15344
Thu Jan 6 08:43:55 2011
Communications reconfiguration: instance_number 2
Thu Jan 6 08:43:56 2011
Trace dumping is performing id=[cdmp_20110106084355]
Thu Jan 6 08:47:27 2011
Evicting instance 2 from cluster
Thu Jan 6 08:47:51 2011
Waiting for instances to leave:
2
Thu Jan 6 08:48:11 2011
Waiting for instances to leave:
....

where 18934 is the instance 1 LMON process
15344 is the instance 2 LMON process


************************************************************************



alert_DBFAP3.log
~~~~~~~~~~~~~~~
...
Thu Jan 6 08:47:27 2011
Evicting instance 2 from cluster
Thu Jan 6 08:47:51 2011
Waiting for instances to leave:
2
Thu Jan 6 08:48:11 2011
Waiting for instances to leave:
2
Thu Jan 6 08:48:31 2011
Waiting for instances to leave:
2

**************************************************************************



Uploaded the trace file also for further investigation ===>>>>>>



dbfap2_lmon_15344.trc:
----------------------

*** 2011-01-06 01:43:45.239
SKGXP_TRACE_FLUSH: output truncated 11 traces lost
SKGXP_TRACE_FLUSH: start
SKGXPIWAIT: keepalive_reset elapsed 676348 ts 25100078 last ping 24423730 check 600000
. . .
kjfmrcvrchk: receiver LMS[5] check timed out (1294238295.1294238625.0).
kjfmrcvrchk: Dumping callstack of lms5
Submitting asynchronized dump request [20]
kjfmrcvrchk: receiver LMS[9] check timed out (1294238295.1294238625.0).
kjfmrcvrchk: Dumping callstack of lms9
Submitting asynchronized dump request [20]
*** 2011-01-06 08:46:52.715
Received ORADEBUG command 'dump errorstack 1' from process Unix process pid: 15310, image:
*** 2011-01-06 08:46:52.745
ksedst ksedmp ksdxfdmp ksdxcb sspuser __funlockfile __pread_nocancel skgfqio
ksfdread1 ksfdread kccrbp kccgrd kjxgrgetresults kjxgrimember kjxggpoll kjfmact kjfdact kjfcln
ksbrdp opirip opidrv sou2o opimai_real main

*** 2011-01-06 08:51:33.863
ksedst ksedmp ksdxfdmp ksdxcb sspuser __funlockfile __pwrite_nocancel skgfqio
ksfdgo ksfdbio kccwbp kccprd kccpcp kjxgrgetresults kjxgrimember kjxggpoll kjfmact kjfdact kjfcln
ksbrdp opirip opidrv sou2o opimai_real main
*** 2011-01-06 08:51:34.540
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 28 0.
*** 2011-01-06 08:51:34.541
kjxgrupdtcp: Updating ckpt progress rec, flgs 0x04
kjxgrgtld: current system load 6522 (hiload)

****************************


Observation : kjxgrgtld: current system load 6522 (hiload)

The system seems to be under high load, so this might be the reason.


==>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<====================================

dbfap1_lmon_18934.trc:
---------------------
*** 2011-01-06 08:43:30.072
IPC Send timeout detected.Sender: ospid 18934
Receiver: inst 2 binc 366295975 ospid 15344
SKGXPCTX: 0x0x66af610 ctx
. . .
*** 2011-01-06 08:43:57.050
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 28 0.
*** 2011-01-06 08:43:57.051
kjxgrupdtcp: Updating ckpt progress rec, flgs 0x04
kjxgrupdmemthrd: Mem 1, thrd 2, vers 366295975, stat 0x0007 err 0x0002
kjxgrgtld: current system load 48 (normal)


+++++++++++++++++++++++++++++++++++++++++++++++++++

Observation :kjxgrgtld: current system load 48 (normal)

So the system load is healthy; there is no problem on node 1 and node 3.




After going through the alert logs and trace files, we noticed the following points:


At about 'Thu Jan 6 08:43:30 2011' the instance 2 LMON went unresponsive, hence the instance was evicted from the cluster. It looks like the load on the instance 2 node was well above normal; due to the high load on node 2, LMON may not have been able to respond in a timely manner.

Now we need to look at the OS statistics on node 2 during that time to further analyze the system load. If OS Watcher is already installed, get the logs that cover the time period 'Thu Jan 6 08:43:30 2011'.


===>>>>>>>>>>>>>>>>>


After investigation, Oracle suggested disabling DRM with the parameters below:


--------------------------------------------------------------------------------
_gc_undo_affinity= FALSE
_gc_affinity_time=0

gdp Strack parameter :
--------------------------------------------------------------------------------
ksb_disable_diagpid = TRUE
=============================================================



We noticed that the LMON issue did not occur for a day, but after one day the behavior reproduced again.






Investigation from alert log and trace file ===>>>


node1_alert_DBFAP1.log
======================
Mon Nov 8 17:46:44 2010
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 bond2 172.16.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 bond0 144.131.84.0 configured from OCR for use as a public interface
...
Cluster communication is configured to use the following interface(s) for this instance
172.16.0.7
Mon Nov 8 17:46:44 2010
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Mon Nov 8 17:46:44 2010
WARNING:Oracle instance running on a system with low open file descriptor
limit. Tune your system to increase this limit to avoid
severe performance degradation.
...
LMON started with pid=5, OS id=18934
...
Fri Jan 7 16:31:50 2011
IPC Send timeout detected.Sender: ospid 18934
Receiver: inst 2 binc 393203158 ospid 5629
Communications reconfiguration: instance_number 2
Fri Jan 7 16:32:00 2011
Trace dumping is performing id=[cdmp_20110107163200]
Fri Jan 7 16:33:47 2011
Evicting instance 2 from cluster
Fri Jan 7 16:33:50 2011
Trace dumping is performing id=[cdmp_20110107163350]
Fri Jan 7 16:33:56 2011
Reconfiguration started (old inc 64, new inc 68)
List of nodes:
0 2
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 not valid according to instance 2
=============================================================


Observation: Oracle instance running on a system with a low open file descriptor limit. Tune your system to increase this limit to avoid severe performance degradation.




+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


node2_alert_DBFAP2.log
======================
Fri Jan 7 16:31:50 2011
IPC Send timeout detected. Receiver ospid 5629
Fri Jan 7 16:31:58 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lmon_5629.trc:
Fri Jan 7 16:32:00 2011
Trace dumping is performing id=[cdmp_20110107163200]
Fri Jan 7 16:33:50 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_lmon_5629.trc:
ORA-29740: evicted by member 2, group incarnation 66
Fri Jan 7 16:33:50 2011
LMON: terminating instance due to error 29740
...
Fri Jan 7 16:33:50 2011
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_diag_5625.trc
Fri Jan 7 16:33:50 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_pmon_5623.trc:
ORA-29740: evicted by member , group incarnation
Fri Jan 7 16:33:50 2011
Errors in file /u01/app/oracle/admin/DBFAP/bdump/dbfap2_ckpt_5705.trc:
ORA-29740: evicted by member , group incarnation
Fri Jan 7 16:33:51 2011
Shutting down instance (abort)
...
Fri Jan 7 16:33:59 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 bond2 172.16.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 bond0 144.131.84.0 configured from OCR for use as a public interface
...
Cluster communication is configured to use the following interface(s) for this instance
172.16.0.8
Fri Jan 7 16:33:59 2011
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Fri Jan 7 16:33:59 2011
WARNING:Oracle instance running on a system with low open file descriptor
limit. Tune your system to increase this limit to avoid
severe performance degradation.
=========






dbfap2_lmon_5629.trc
~~~~~~~~~~~~~~~~~~~~
...
*** 2011-01-07 16:32:07.029
kjxgrnetchk: Sending comm check req to 0
*** 2011-01-07 16:32:07.029
kjxgrnetchk: Sending comm check req to 2
kjxgrrcfgchk: prev pstate 5 mapsz 512
kjxgrrcfgchk: new bmp: 0 1 2
kjxgrrcfgchk: work bmp: 0 1 2
kjxgrrcfgchk: rr bmp: 0 1 2
*** 2011-01-07 16:32:07.029
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 64 0.
*** 2011-01-07 16:32:07.030
kjxgrupdtcp: Updating ckpt progress rec, flgs 0x04
kjxgrupdmemthrd: Mem 2, thrd 3, vers 534163215, stat 0x0037 err 0x0000
kjxgrgtld: current system load 2324 (normal)
kjxgrs0h: Reconfig rsn 3, flgs 0x04, st -344944946, bitmap: 0 1 2
*** 2011-01-07 16:32:07.050
Name Service frozen
.......



Final Conclusion =====>>


1. The LMON trace file indicates that another reconfiguration had to be started due to a loss in communication.

This could be related to the DRM issue for the pre-10.2.0.4 database.

We would like to suggest:
a) apply the latest patchset (10.2.0.5) as soon as possible to this DB
or
b) disable the DRM as suggested in the previous action plan.

2. There is no direct relationship between DRM and the Resource Manager used in the database; therefore, it should be OK to disable DRM.


So we suspect this is a product defect in 10.2.0.3.

Sunday, January 9, 2011

SHUTDOWN IMMEDIATE HANGS AND GIVES ORA-600 [3708] ...

Database shutdown fails with the following:

Symptoms : ===>>>>

Errors in file /logisys/oracle/admin/xbt/bdump/xbt_lgwr_1560598.trc:
Thu Jun 19 20:16:25 2008
Errors in file /logisys/oracle/admin/xbt/udump/xbt_ora_1544374.trc:
ORA-00600: internal error code, arguments: [3708], [910], [], [], [], [], [], []
Thu Jun 19 20:16:26 2008
CLOSE: Error 600 during database close
Thu Jun 19 20:16:26 2008
SMON: enabling cache recovery
SMON: enabling tx recovery
Thu Jun 19 20:16:26 2008
ORA-600 signalled during: ALTER DATABASE CLOSE NORMAL...

Call stack from ORA-600 trace file:
kcttsc<- kcfcld <- dbsclose <- adbdrv <- opiexe
<- opiosq0 <- kpooprx <- kpoal8 <- opiodr <- ttcpip
<- opitsk <- opiino <- opiodr <- opidrv <- sou2o
<- opimai_real <- main

Call stack from LGWR file:
ksdxfdmp <- ksdxcb <- sspuser <- 000044BC <- skgpwwait
<- ksliwat <- kslwaitns <- kskthbwt <- kslwait <- kcrrwstp
<- kcrrsas <- kcrrxmp <- kcrrstpaa <- kctcls <- kcttha
<- ksbabs <- ksbrdp <- opirip <- opidrv <- sou2o
<- opimai_real <- main


***************************************************


Cause : ===>>>>>>>>>>>>>

This problem is due to

Bug 6512622 SHUTDOWN IMMEDIATE HANGS AND GIVES ORA-600 [3708]
Details:
SHUTDOWN IMMEDIATE may hang or signal ORA-600 [3708]



*****************

Solution : ===>>>


Please check the LGWR stack in the trace file to confirm the bug.

There is no workaround for this bug. To solve this problem:

1. Apply the 10.2.0.5 patchset where the bug is fixed.

OR

2. Apply one off Patch 6512622 if available on My Oracle Support for your platform and Oracle Version.
OR

3. Upgrade to 11.1.0.6 where the bug is fixed.

Friday, January 7, 2011

Oracle Clusterware processes for 10g on Unix and Linux

What are Oracle Clusterware processes for 10g on Unix and Linux



Cluster Synchronization Services (ocssd) — Manages cluster node membership and runs as the oracle user; failure of this process results in cluster restart.

Cluster Synchronization Services(OCSSD):

OCSSD is part of RAC and of single-instance installations with ASM
Provides access to node membership, group services and basic cluster locking
Integrates with vendor clusterware, when present
Can also run without integration with vendor clusterware
Runs as the oracle user
Failure exit causes a machine reboot
Prevents data corruption in the event of a split brain.


Shared storage is also required for a voting (or quorum) disk, which is used to determine the nodes that are currently available within the cluster. The voting disk is used by the OCSSD to detect when nodes join and leave the cluster and is therefore also known as the Cluster Synchronization Services (CSS) voting disk.

==> Log files stores:

Cluster Synchronization Services (CSS) Log Files: You can find the CSS information that OCSSD generates in log files in the following locations:

<CRS Home>/css/log/ocssd<number>.log
<CRS Home>/css/init/<node_name>.log



==>>>>

Cluster Ready Services (crsd) — The crs process manages cluster resources (which could be a database, an instance, a service, a Listener, a virtual IP (VIP) address, an application process, and so on) based on the resource's configuration information that is stored in the OCR. This includes start, stop, monitor and failover operations. This process runs as the root user


Functional =>>

CRSD: Engine for HA operation
Manages (start/stop/respawn) application resources
Maintains configuration profiles in the OCR (Oracle Cluster Registry)
Stores current known state in the OCR
Runs as root
Is restarted automatically on failure.


Logfiles store : ==>>

Cluster Ready Services Log Files Cluster Ready Services (CRS) has daemon processes that generate log information. Log files for the CRS daemon (crsd) can be found in the following directories:

CRS Home/crs/init
CRS Home/crs/node name.log



Event Manager daemon (evmd) — A background process that publishes events that CRS creates.

Process Monitor Daemon (OPROCD) — This process monitors the cluster and provides I/O fencing. OPROCD performs its check, stops running, and if the wake-up is beyond the expected time, OPROCD resets the processor and reboots the node. An OPROCD failure results in Oracle Clusterware restarting the node. On Linux platforms this fencing function is provided by the hangcheck-timer module.

RACG (racgmain, racgimon) —Extends clusterware to support Oracle-specific requirements and complex resources. Runs server callout scripts when FAN events occur.


==>>> RAC Specific background processes

What are Oracle database background processes specific to RAC

•LMS—Global Cache Service Process

•LMD—Global Enqueue Service Daemon

•LMON—Global Enqueue Service Monitor

•LCK0—Instance Enqueue Process

To ensure that each Oracle RAC database instance obtains the block that it needs to satisfy a query or transaction, Oracle RAC instances use two processes, the Global Cache Service (GCS) and the Global Enqueue Service (GES). The GCS and GES maintain records of the statuses of each data file and each cached block using a Global Resource Directory (GRD). The GRD contents are distributed across all of the active instances.


Global Cache Service (GCS) :

In a RAC database each instance has its own database buffer cache, which is located in the SGA on the local node. However, all instances share the same set of datafiles.

It is therefore possible that one or more instances might attempt to read and/or update the same block at the same time.

So access to the data blocks across the cluster must be managed in order to guarantee only one instance can modify the block at a time. In addition, any changes must be made visible to all other instances immediately once the transaction is committed. This is managed by the GCS, which coordinates requests for data access between the instances of the cluster.
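
One way to get a feel for the amount of Cache Fusion traffic the GCS is handling is to look at the global cache statistics on each instance (statistic names as they appear in 10g):

SQL> select inst_id, name, value
       from gv$sysstat
      where name in ('gc cr blocks received', 'gc current blocks received')
      order by inst_id, name;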


LMSn ====>

The Global Cache Service background processes (LMSn) manage requests for data access between the nodes of the cluster.

Each block is assigned to a specific instance using the same hash algorithm that is used for global resources.

The instance managing the block is known as the resource master. When an instance requires access to a specific block, a request is sent to an LMS process on the resource master requesting access to the block.

The LMS process can build a read-consistent image of the block and return it to the requesting instance, or it can forward the request to the instance currently holding the block.


The LMS processes coordinate block updates, allowing only one instance at a time to make changes to a block and ensuring that those changes are made to the most recent version of the block.

The LMS process on the resource master is responsible for maintaining a record of the current status of the block, including whether it has been updated.
In Oracle 9.0.1 and Oracle 9.2 there can be up to 10 LMSn background processes (LMS0 to LMS9) per instance; in Oracle 10.1 there can be up to 20 LMSn background processes (LMS0 to LMS9, LMSa to LMSj) per instance; in Oracle 10.2 there can be up to 36 LMSn background processes (LMS0 to LMS9, LMSa to LMSz).

The number of required LMSn processes varies depending on the amount of messaging between the nodes in the cluster.






Global Enqueue Service (GES) :

In a RAC database, the GES is responsible for inter-instance resource coordination. The GES manages all non-Cache Fusion inter-instance resource operations.

It tracks the status of all Oracle enqueue mechanisms for resources that are accessed by more than one instance.

Oracle uses GES to manage concurrency for resources operating on transactions, tables, and other structures within a RAC environment.



LMON ==>>>

In a single-instance database, access to database resources is controlled using enqueues that ensure that only one session has access to a resource at a time and that other sessions wait on a first in, first out (FIFO) queue until the resource becomes free. In a single-instance database, all locks are local to the instance.

In a RAC database there are global resources, including locks and enqueues that need to be visible to all instances. For example, the database mount lock that is used to control which instances can concurrently mount the database is a global enqueue, as are library cache locks, which are used to signal changes in object definitions that might invalidate objects currently in the library cache.

The Global Enqueue Service Monitor (LMON) background process is responsible for managing global enqueues and resources. It also manages the Global Enqueue Service Daemon (LMD) processes and their associated memory areas. LMON is similar to PMON in that it also manages instance and process expirations and performs recovery processing on global enqueues.

In Oracle 10.1 and below there is only one lock monitor background process.


LMDn ==>>>

The current status of each global enqueue is maintained in a memory structure in the SGA of one of the instances.

For each global resource, three lists of locks are held, indicating which instances are granted, converting, and waiting for the lock.

The LMD background process is responsible for managing requests for global enqueues and updating the status of the enqueues as requests are granted.

Each global resource is assigned to a specific instance using a hash algorithm. When an instance requests a lock, the LMD process of the local instance sends a request to the LMD process of the remote instance managing the resource. If the resource is available, then the remote LMD process updates the enqueue status and notifies the local LMD process.


If the enqueue is currently in use by another instance, the remote LMD process will queue the request until the resource becomes available. It will then update the enqueue status and inform the local LMD process that the lock is available.
The LMD processes also detect and resolve deadlocks that may occur if two or more instances attempt to access two or more enqueues concurrently.

In Oracle 10.1 and below there is only one lock monitor daemon background process named LMD0.


LCK0 ==>>

The instance enqueue background process (LCK0) is part of GES. It manages requests for resources other than data blocks—for example, library and row cache objects. LCK processes handle all resource transfers not requiring Cache Fusion. It also handles cross-instance call operations. In Oracle 9.0.1 there could be up to ten LCK processes (LCK0 to LCK9). In Oracle 9.2 and Oracle 10.1 and 10.2 there is only one LCK process (LCK0).



DIAG ===>>

The DIAG background process captures diagnostic information when either a process or the entire instance fails. This information is written to a subdirectory within the directory specified by the BACKGROUND_DUMP_DEST initialization parameter. The files generated by this process can be forwarded to Oracle Support for further analysis.
There is one DIAG background process per instance. It should not be disabled or removed. In the event that the DIAG background process itself fails, it can be automatically restarted by other background processes.









What are Oracle Clusterware Components

Voting Disk — Oracle RAC uses the voting disk to manage cluster membership by way of a health check and arbitrates cluster ownership among the instances in case of network failures. The voting disk must reside on shared disk.

Oracle Cluster Registry (OCR) — Maintains cluster configuration information as well as configuration information about any cluster database within the cluster. The OCR must reside on shared disk that is accessible by all of the nodes in your cluster

How do you troubleshoot node reboot

Please check metalink ...

Note 265769.1 Troubleshooting CRS Reboots
Note 559365.1 Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware node evictions.

How do you backup the OCR

There is an automatic backup mechanism for the OCR. The default location is: $ORA_CRS_HOME/cdata/<clustername>/

To display backups:
#ocrconfig -showbackup
To restore a backup:
#ocrconfig -restore <backup_file_name>

With Oracle RAC 10g Release 2 or later, you can also use the export command:
#ocrconfig -export <file_name> -s online
and use the -import option to restore the contents back.
With Oracle RAC 11g Release 1, you can take a manual backup of the OCR with the command:
# ocrconfig -manualbackup

How do you backup voting disk

#dd if=voting_disk_name of=backup_file_name

How do I identify the voting disk location

#crsctl query css votedisk

How do I identify the OCR file location

Check /var/opt/oracle/ocr.loc or /etc/ocr.loc (depending on the platform)
or
#ocrcheck

Is ssh required for normal Oracle RAC operation ?

"ssh" are not required for normal Oracle RAC operation. However "ssh" should be enabled for Oracle RAC and patchset installation.

What is SCAN?

Single Client Access Name (SCAN) is a new Oracle Real Application Clusters (RAC) 11g Release 2 feature that provides a single name for clients to access an Oracle Database running in a cluster. The benefit is that clients using SCAN do not need to change their configuration if you add or remove nodes in the cluster.
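For illustration only, a client tnsnames.ora entry using SCAN might look like the sketch below (rac-scan.example.com and the service name orcl are made-up placeholder values):

ORCL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = orcl)
    )
  )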


What is the purpose of Private Interconnect ?

Clusterware uses the private interconnect for cluster synchronization (network heartbeat) and daemon communication between the clustered nodes. This communication is based on the TCP protocol.
RAC uses the interconnect for cache fusion (UDP) and inter-process communication (TCP). Cache Fusion is the remote memory mapping of Oracle buffers, shared between the caches of participating nodes in the cluster.

Why do we have a Virtual IP (VIP) in Oracle RAC?

Without using VIPs or FAN, clients connected to a node that died will often wait for a TCP timeout period (which can be up to 10 min) before getting an error. As a result, you don't really have a good HA solution without using VIPs.
When a node fails, the VIP associated with it is automatically failed over to some other node and new node re-arps the world indicating a new MAC address for the IP. Subsequent packets sent to the VIP go to the new node, which will send error RST packets back to the clients. This results in the clients getting errors immediately.

What do you do if you see GC CR BLOCK LOST in top 5 Timed Events in AWR Report?

This is most likely due to a fault in the interconnect network.
Check netstat -s output.
If you see "fragments dropped" or "packet reassemblies failed", work with your system administrator to find and fix the network fault.
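The lost-block statistics can also be checked from inside the database (a quick sketch; the statistic names are as used in 10gR2, and in some releases the statistic is reported simply as 'gc blocks lost'):

select inst_id, name, value
from   gv$sysstat
where  name in ('gc cr blocks lost', 'gc current blocks lost');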

How many nodes are supported in a RAC Database?

10g Release 2 supports 100 nodes in a cluster using Oracle Clusterware, and 100 instances in a RAC database.

Srvctl cannot start instance, I get the following error PRKP-1001 CRS-0215, however sqlplus can start it on both nodes? How do you identify the problem?

Set the environment variable SRVM_TRACE to true and start the instance with srvctl again. You will then get a detailed error stack.
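For example (a quick sketch; the database and instance names are placeholders):

$ export SRVM_TRACE=true
$ srvctl start instance -d <db_name> -i <instance_name>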



What is the purpose of the ONS daemon ==>>

Oracle Notification Service (ONS) is used by Oracle Clusterware to propagate messages both within the RAC cluster and to clients and application-tier systems. ONS uses a publish-and-subscribe method to generate and deliver event messages to both local and remote consumers.


ONS is automatically installed as a node application on each node in the cluster. In Oracle 10.1 and above, it is configured as part of the Oracle Clusterware installation process. ONS daemons run locally, sending and receiving messages from ONS daemons on other nodes in the cluster. The daemons are started automatically by Oracle Clusterware during the reboot process.


ONS provides the foundation for Fast Application Notification (FAN), which in turn provides the basis for Fast Connection Failover (FCF).


The Oracle Notification Service (ONS) daemon is a daemon started by the CRS clusterware as part of the nodeapps. There is one ONS daemon started per clustered node.
The ONS daemon receives a subset of published clusterware events via the local evmd and racgimon clusterware daemons and forwards those events to application subscribers and to the local listeners.
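To check whether the local ONS daemon is up and responding, the onsctl utility can be used, for example:

$ onsctl ping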


Fast Connection Failover (FCF):

Fast Connection Failover (FCF) was introduced in Oracle 10.1 and relies on the ONS infrastructure.

It works with integrated connection pools in application servers and clients and is used to prevent new connections being directed to failed nodes or instances.

When a failure occurs, the application is immediately notified of the change in cluster configuration by ONS, and the connection pool can react by directing new connections to surviving instances. This behavior is performed internally by the connection pool and is transparent to both the developer and the application.
Oracle clients that provide FCF include Java Database Connectivity (JDBC), Oracle Call Interface (OCI), and Oracle Data Provider for .NET (ODP.NET).





FAN ==> Fast Application Notification (FAN), introduced in Oracle 10.1, allows databases, listeners, application servers, and clients to receive rapid notification of database events, such as the starting and stopping of the database, instances, or services.

This allows the application to respond in a timely fashion to the event. It may be possible for a well-written application to reconnect to another instance without the end user ever being aware that the event has occurred.

a. The FAN (Fast Application Notification) feature, allowing applications to respond to database state changes.

b. The 10gR2 Load Balancing Advisory, the feature that permits load balancing across the different RAC nodes depending on the load on each node. The RDBMS MMON process creates an advisory for the distribution of work every 30 seconds and forwards it via racgimon and ONS to the listeners and applications.
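As an illustration (a minimal sketch; MYSERVICE is a made-up service name and this assumes a 10gR2 or later database), the advisory goal can be set per service with the DBMS_SERVICE package:

begin
  dbms_service.modify_service(
    service_name => 'MYSERVICE',
    goal         => dbms_service.goal_service_time,
    clb_goal     => dbms_service.clb_goal_short);
end;
/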

Checkpoint Tuning:

Checkpoint Tuning:

Determining the time to recover from an instance failure is a necessary component of meeting required service level agreements. For example, if service levels dictate that when a node fails, instance recovery time can be no more than 3 minutes, FAST_START_MTTR_TARGET should be set to 180.
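For example (assuming an spfile is in use so that SCOPE=BOTH is valid):

alter system set fast_start_mttr_target=180 scope=both;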


Fast-start checkpointing refers to the periodic writes by the database writer (DBWn) processes for the purpose of writing changed data blocks from the Oracle buffer cache to disk and advancing the thread-checkpoint. Setting the database parameter FAST_START_MTTR_TARGET to a value greater than zero enables the fast-start checkpointing feature.


Fast-start checkpointing should always be enabled for the following reasons:

It reduces the time required for cache recovery, and makes instance recovery time-bounded and predictable. This is accomplished by limiting the number of dirty buffers (data blocks which have changes in memory that still need to be written to disk) and the number of redo records (changes in the database) generated between the most recent redo record and the last checkpoint.


Fast-start checkpointing eliminates bulk writes and the corresponding I/O spikes that traditionally occur with interval-based checkpoints, providing a smoother, more consistent I/O pattern that is more predictable and easier to manage.

If the system is not already near or at its maximum I/O capacity, fast-start checkpointing will have a negligible impact on performance. Although fast-start checkpointing results in increased write activity, there is little reduction in database throughput, provided the system has sufficient I/O capacity.



Check-Pointing

Check-pointing is an important Oracle activity which records the highest system change number (SCN) such that all data blocks with changes at or below that SCN are known to be written out to the data files. If there is a failure and then subsequent cache recovery, only the redo records containing changes at SCNs higher than the checkpoint need to be applied during recovery.
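For illustration, the SCN of the most recent completed checkpoint can be queried from V$DATABASE:

select checkpoint_change# from v$database;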



As we are aware, instance and crash recovery occur in two steps - cache recovery followed by transaction recovery. During the cache recovery phase, also known as the rolling forward stage, Oracle applies all committed and uncommitted changes in the redo log files to the affected data blocks. The work required for cache recovery processing is proportional to the rate of change to the database and the time between checkpoints.



Mean time to recover (MTTR)

Fast-start recovery can greatly reduce the mean time to recover (MTTR), with minimal effects on online application performance. Oracle continuously estimates the recovery time and automatically adjusts the check-pointing rate to meet the target recovery time.



With 10g, the Oracle database can now self-tune check-pointing to achieve good recovery times with low impact on normal throughput. You no longer have to set any checkpoint-related parameters.


This method reduces the time required for cache recovery and makes the recovery bounded and predictable by limiting the number of dirty buffers and the number of redo records generated between the most recent redo record and the last checkpoint. Administrators specify a target (bounded) time to complete the cache recovery phase of recovery with the FAST_START_MTTR_TARGET initialization parameter, and Oracle automatically varies the incremental checkpoint writes to meet that target.



The TARGET_MTTR field of V$INSTANCE_RECOVERY contains the MTTR target in effect. The ESTIMATED_MTTR field of V$INSTANCE_RECOVERY contains the estimated MTTR should a crash happen right away.

Enable MTTR advisory

Enabling the MTTR advisory involves setting two parameters:

STATISTICS_LEVEL = TYPICAL
FAST_START_MTTR_TARGET > 0

Estimate the value for FAST_START_MTTR_TARGET as follows:

SELECT TARGET_MTTR,
ESTIMATED_MTTR,
CKPT_BLOCK_WRITES
FROM V$INSTANCE_RECOVERY;

TARGET_MTTR ESTIMATED_MTTR CKPT_BLOCK_WRITES
----------- -------------- -----------------
214 12 269880

FAST_START_MTTR_TARGET = 214;



Whenever you set FAST_START_MTTR_TARGET to a nonzero value, then set the following parameters to 0.

LOG_CHECKPOINT_TIMEOUT = 0
LOG_CHECKPOINT_INTERVAL = 0
FAST_START_IO_TARGET = 0

Disable MTTR advisory

FAST_START_MTTR_TARGET = 0
LOG_CHECKPOINT_INTERVAL = 200000

How to gather statistics on data dictionary objects in Oracle 10g

How to gather statistics on data dictionary objects in Oracle 10g.


Before Oracle Database 10g, Oracle explicitly recommended not to gather statistics on data dictionary objects.


As of Oracle Database 10g, Oracle explicitly recommends gathering statistics on data dictionary objects.

As you might know, there is an automatically created SCHEDULER JOB in every 10g database which runs every night and checks for objects which either have no statistics at all or whose statistics have become STALE (which means that at least 10% of the rows have changed).

This job is called GATHER_STATS_JOB and belongs to the auto task job class. It uses a program which in turn calls a procedure from the built-in package DBMS_STATS, which does the statistics collection.

This feature only works if the initialization parameter STATISTICS_LEVEL is set to at least TYPICAL (which is the DEFAULT in 10g), and it utilizes the TABLE MONITORING feature. TABLE MONITORING is enabled for all tables in 10g by DEFAULT. One question which pops up frequently in my seminars is: "Does this job also collect statistics on the data dictionary objects as well?" The answer is not 42 but "YES, it does!", and here is the proof:
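A quick way to confirm that the job exists and is enabled (a minimal check against the 10g scheduler views):

select job_name, enabled, state
from   dba_scheduler_jobs
where  job_name = 'GATHER_STATS_JOB';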


-- First, let us check whether dbms_stats.gather_database_stats collects statistics for the data dictionary:

SQL> select count(*) from tab$;

  COUNT(*)
----------
      1227

SQL> create table t2 (col1 number);
Table created.

SQL> select count(*) from tab$;

  COUNT(*)
----------
      1228

SQL> select NUM_ROWS from dba_tables where table_name='TAB$';

  NUM_ROWS
----------
      1213

SQL> exec dbms_stats.gather_database_stats;
PL/SQL procedure successfully completed.

SQL> select NUM_ROWS from dba_tables where table_name='TAB$';

  NUM_ROWS
----------
      1228

-- It does! Now let us see whether the job does as well:

SQL> create table t3 (col1 number);
Table created.

SQL> create table t4 (col1 number);
Table created.

SQL> select NUM_ROWS from dba_tables where table_name='TAB$';

  NUM_ROWS
----------
      1228

-- GATHER_STATS_JOB run manually from Database Control.

SQL> select NUM_ROWS from dba_tables where table_name='TAB$';

  NUM_ROWS
----------
      1230

-- It also does! Even though less than 0.1% of the rows had changed, the job still gathered statistics.

So when should we gather statistics for the data dictionary manually?

Oracle recommends collecting them when a significant number of changes has been applied to the data dictionary, such as dropping significant numbers of partitions and creating new ones, or dropping and creating many tables and indexes.

But do this only if there is a significant number of changes and you cannot wait for the next automatically scheduled job run.
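A manual gather can then be done with DBMS_STATS; the dictionary-specific procedure is available from 10g onwards:

SQL> exec dbms_stats.gather_dictionary_stats;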

Why do you run orainstRoot and ROOT.SH once you finalize the Installation

Why do you run orainstRoot and ROOT.SH once you finalize the Installation?


orainstRoot.sh needs to be run to change the permissions of the central inventory (oraInventory) to 770 and its group ownership to dba.



root.sh (located in the ORACLE_HOME) needs to be run to create the oratab file in /etc/oratab (or /var/opt/oracle/oratab on Solaris) and to copy dbhome, oraenv and coraenv to /usr/local/bin.



orainstRoot.sh

[root@oracle11g ~]# /u01/app/oraInventory/orainstRoot.sh

Changing permissions of /u01/app/oraInventory to 770.

Changing groupname of /u01/app/oraInventory to dba.

The execution of the script is complete





root.sh

[root@oracle11g ~]# /u01/app/oracle/product/11.1.0/db_1/root.sh

Running Oracle 11g root.sh script...

The following environment variables are set as:

ORACLE_OWNER= oracle

ORACLE_HOME= /u01/app/oracle/product/11.1.0/db_1

Enter the full pathname of the local bin directory: [/usr/local/bin]:

Copying dbhome to /usr/local/bin ...

Copying oraenv to /usr/local/bin ...

Copying coraenv to /usr/local/bin ...

Creating /etc/oratab file...

Entries will be added to the /etc/oratab file as needed by

Database Configuration Assistant when a database is created

Finished running generic part of root.sh script.

Now product-specific root actions will be performed.

Finished product-specific root actions.

Delete, Drop and Truncate Difference

The first difference between DELETE, DROP and TRUNCATE statements is the nature of the statements themselves. DELETE is a DML (Data Manipulation Language) statement. DROP and TRUNCATE are DDL (Data Definition Language) statements.


A DML action can be rolled back if the data has not been committed, whereas DDL actions cannot be rolled back. However, with the FLASHBACK features in later releases, the effects of some DDL statements (for example, an accidental DROP TABLE) can be undone.


Delete: A DELETE statement removes data from a table. It can have a WHERE clause, in which case it deletes only the records that satisfy the condition.

If the WHERE clause is omitted, the DELETE statement removes all the records in the table (a short example follows the list below).

Actions of a DELETE statement:

* Deletes the rows from the table in the current tablespace

* Records the deleted rows in the undo tablespace

* Executes all the before/after statement-level and row-level triggers

* Updates the indexes (makes the index empty if the WHERE clause is omitted)

* Performs constraint checks before deleting the rows
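A small sketch of the rollback difference, using a made-up table t1:

SQL> delete from t1 where col1 = 10;   -- DML: can be rolled back
SQL> rollback;                         -- the deleted rows are restored
SQL> truncate table t1;                -- DDL: implicit commit, cannot be rolled back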

HOW TO APPLY ORACLE PATCH (OPATCH)

HOW TO APPLY ORACLE PATCH (OPATCH)
====================================================



To apply a patch with OPatch, both the database and the listener must be down, because OPatch will update your current ORACLE_HOME with the patch.

In a single-instance environment this means an outage; avoiding downtime is not possible.

For a RAC installation a rolling patch is possible, because there are separate Oracle Homes on the nodes, each with its own instance running from that ORACLE_HOME, so the nodes can be patched one at a time.


use this command:

opatch napply -skip_subset -skip_duplicate -local -oh $ORACLE_HOME



Using the -local parameter and -oh $ORACLE_HOME means that this patch session will apply the patch only to the currently sourced ORACLE_HOME.



steps before applying patch:
----------------------------

1) check the database status.
wch_db.sql
-----------
select name,
open_mode,
database_name,
created,
log_mode,
platform_name
from v$database;



2) Check for invalid objects.

user_inv.sql
============
SELECT owner,
COUNT(*)
FROM dba_objects
WHERE status = 'INVALID'
GROUP BY owner;


count_inv.sql
-------------
select count(*)
from dba_objects
WHERE status ='INVALID';


3) Take a backup of the invalid objects:

create table bk_inv_ as select * from dba_objects
where status='INVALID';



4) Check the OPatch version using
opatch -v
If the OPatch version is not compatible, check the readme file,
download the latest version and uncompress it
in $ORACLE_HOME.



5) Check whether the oraInst.loc file points to the inventory used by your current $ORACLE_HOME.
cat /etc/oraInst.loc

inventory_loc=/u01/app/oracle/10.2.0/GlobalOraInv
inst_group=dba



If your server has more than one $ORACLE_HOME, comment out the entries for the other homes and
keep the entry for the current $ORACLE_HOME uncommented.

The inventory must point to the current $ORACLE_HOME which is being patched.



6) check free space on $ORACLE_HOME
df -h $ORACLE_HOME



7) Check the required build utilities, as per the readme file, for example:
which ld
which ar
which make



8) unzip the patch
unzip -d /loc_2_unzip p.zip


9) Go the patch directory
cd /loc_2_unzip/patch_number


10) Bring down the listener.
cd $ORACLE_HOME/bin
lsnrctl stop


11) Bring down the database
Shutdown immediate.


12) Add OPatch to the PATH:
export PATH=$PATH:$HOME:$ORACLE_HOME/OPatch:/bin

13) Start the patch

skip_duplicate
Skips patches to be applied that are duplicates of other patches installed in the Oracle home. Two patches are duplicates if they fix the same set of bugs.


skip_subset
Skips patches to be applied that are subsets of other patches installed in the Oracle home. One patch is a subset of another patch if the former fixes a subset of bugs fixed by the latter.



opatch napply -skip_subset -skip_duplicate

For a RAC database, the database as a whole can remain up because it has more than one instance:
bring down one instance and its listener, apply the patch, and open the instance again,
then do the same on the other node(s).
This way the database stays available and users do not face an outage.



To apply OPatch in a RAC instance (Can we use -skip_subset and -skip_duplicate with opatch napply only? And is it right that if we are installing a patch for the first time on an Oracle software installation we will generally use opatch apply? Say I have applied the October CPU patch and I am applying the November patch: should I just use opatch apply for the October patch and opatch napply -skip_subset -skip_duplicate for the November patch?)

///////////***************************************************


The readme states whether you should use opatch apply or opatch napply.

I am not sure whether the November CPU patch is a bundle of multiple patches or a single patch. If it is a bundle of multiple patches, then you should use opatch napply, irrespective of whether you applied any patch on your software previously. The choice between apply and napply does not depend on the patches previously applied to your database; it depends only on the current OPatch session. If you need to install just one patch, use opatch apply. If you want to install multiple patches in a single session, use opatch napply.


And regarding -skip_subset and -skip_duplicate, you are correct: they should be used only with opatch napply.



For more info use below command
opatch apply -help
opatch napply -help

**********************************************************

opatch napply -skip_subset -skip_duplicate -local -oh $ORACLE_HOME

Using the -local parameter and
-oh $ORACLE_HOME means that this patch session will apply the patch to the current ORACLE_HOME only.

--------------------------------------------------------
. All-Node Patch

. Shutdown all Oracle instances on all nodes
. Apply the patch to all nodes
. Bring all nodes up

. Minimum downtime

. Shutdown the Oracle instance on node 1
. Apply the patch to the Oracle instance on node 1
. Shutdown the Oracle instance on node 2
. Apply the patch to the Oracle instance on node 2
. Shutdown the Oracle instance on node 3
. At this point, instances on nodes 1 and 2 can be brought up
. Apply the patch to the Oracle instance on node 3
. Startup the Oracle instance on node 3

. Rolling patch (no downtime)

. Shutdown the Oracle instance on node 1
. Apply the patch to the Oracle instance on node 1
. Start the Oracle instance on node 1
. Shutdown the Oracle instance on node 2
. Apply the patch to the Oracle instance on node 2
. Start the Oracle instance on node 2
. Shutdown the Oracle instance on node 3
. Apply the patch to the Oracle instance on node 3
. Start the Oracle instance on node 3
-------------------------------------------------------------


14) Once patch installation is complete, perform the post-patching steps.

a) Start up the instance
startup
b) Load the modified SQL files into the database.
@$ORACLE_HOME/rdbms/admin/catbundle.sql cpu apply

Check the logs generated:
catbundle_CPU__APPLY_.log
catbundle_CPU__GENERATE_.log


c) Recompiling Views in the Database
shutdown immediate
startup upgrade
@$ORACLE_HOME/cpu/view_recompile/view_recompile_jan2008cpu.sql
shutdown immediate
startup

If it is a RAC instance:
shutdown
startup nomount
alter system set cluster_database=false scope=spfile;
shutdown
startup upgrade
@?/cpu/view_recompile/view_recompile_jan2008cpu.sql
shutdown
startup
alter system set cluster_database=true scope=spfile;


restart the database.
cd $CRS_HOME/bin
srvctl start database -d
15) If any invalid objects were reported, run the utlrp.sql script as follows



user_inv.sql
============
SELECT owner,
COUNT(*)
FROM dba_objects
WHERE status = 'INVALID'
GROUP BY owner;

count_inv.sql
-------------
select count(*)
from dba_objects
WHERE status ='INVALID';

If any new invalid objects are seen, take a backup of the invalid objects again and compile them.

create table bk_inv_ as select * from dba_objects
where status='INVALID';

@?/rdbms/admin/utlrp.sql --- to compile the invalid objects.


16) Confirm that the patch has been applied successfully at the database level as well.


post_patch.sql
--------------
col action_time for a40
col action for a15
col namespace for a15
col version for a15
col comments for a40
set pages 1000
set lines 170
select * from registry$history ;



*************************************************************



Sample output is given below:


cd $HOME/DBA/patches
unzip p6239052_10203_SOLARIS64.zip
cd 6239052
export OBJECT_MODE=32_64
export PATH=$ORACLE_HOME/OPatch:$PATH
opatch apply

Invoking OPatch 10.2.0.3.0

Oracle interim Patch Installer version 10.2.0.3.0
Copyright (c) 2005, Oracle Corporation. All rights reserved..

Oracle Home : /u01/app/oracle/product/db10.2.0
Central Inventory : /u01/app/oracle/oraInventory
from : /var/opt/oracle/oraInst.loc
OPatch version : 10.2.0.3.0
OUI version : 10.2.0.3.0
OUI location : /u01/app/oracle/product/db10.2.0/oui
Log file location : /u01/app/oracle/product/db10.2.0/cfgtoollogs/opatch/opatch2007-11-20_02-17-48AM.log

ApplySession applying interim patch '6239052' to OH '/u01/app/oracle/product/db10.2.0'
Invoking fuser to check for active processes.
Invoking fuser on "/u01/app/oracle/product/db10.2.0/bin/oracle"

OPatch detected the node list and the local node from the inventory.
OPatch will patch the local system then propagate the patch to the remote nodes.

This node is part of an Oracle Real Application Cluster.
Remote nodes: 'node2' 'node3'
Local node: 'node1'
Please shutdown Oracle instances running out of this ORACLE_HOME on the local system.
(Oracle Home = '/u01/app/oracle/product/db10.2.0')

Is the local system ready for patching?

Do you want to proceed? [y|n]
y
User Responded with: Y
Backing up files and inventory (not for auto-rollback) for the Oracle Home
Backing up files affected by the patch '6239052' for restore. This might take a while...
Backing up files affected by the patch '6239052' for rollback. This might take a while...

Patching component oracle.rdbms, 10.2.0.3.0...
Updating archive file "/u01/app/oracle/product/db10.2.0/lib/libserver10.a"
with "lib/libserver10.a/kjbm.o"
ApplySession adding interim patch '6239052' to inventory

Verifying the update...
Inventory check OK: Patch ID 6239052 is registered in Oracle Home inventory with proper meta-data.
Files check OK: Files from Patch ID 6239052 are present in Oracle Home.
Running make for target ioracle

The local system has been patched. You can restart Oracle instances on it.

Patching in rolling mode.

Remaining nodes to be patched:
'node2' 'node3'
What is the next node to be patched?
node2
You have selected 'node2' from 'node2' 'node3'

The node 'node2' will be patched next.

Please shutdown Oracle instances running out of this ORACLE_HOME on 'node2'.
(Oracle Home = '/u01/app/oracle/product/db10.2.0')
Is the node ready for patching?

Do you want to proceed? [y|n]
y
User Responded with: Y
Updating nodes 'node2'
Apply-related files are:
FP = "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_files.txt"
DP = "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_dirs.txt"
MP = "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/make_cmds.txt"

Instantiating the
file "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_files.txt.instantiated"
by replacing $ORACLE_HOME in "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_files.txt"
with actual path.
Propagating files to remote nodes...
Instantiating the file
"/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_dirs.txt.instantiated" by replacing
$ORACLE_HOME in "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_dirs.txt" with actual path.
Propagating directories to remote nodes...
Instantiating the file
"/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/make_cmds.txt.instantiated" by replacing
$ORACLE_HOME in "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/make_cmds.txt" with actual path.
Running command on remote node 'node2': cd /u01/app/oracle/product/db10.2.0/rdbms/lib;
/usr/ccs/bin/make -f ins_rdbms.mk
ioracle ORACLE_HOME=/u01/app/oracle/product/db10.2.0 || echo REMOTE_MAKE_FAILED::>&2

The node 'node2' has been patched. You can restart Oracle instances on it.

The node 'node3' will be patched next.

Please shutdown Oracle instances running out of this ORACLE_HOME on 'node3'.
(Oracle Home = '/u01/app/oracle/product/db10.2.0')
Is the node ready for patching?

Do you want to proceed? [y|n]
y
User Responded with: Y
Updating nodes 'node3'
Apply-related files are:
FP = "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_files.txt"
DP = "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_dirs.txt"
MP = "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/make_cmds.txt"

Instantiating the file
"/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_files.txt.instantiated" by replacing
$ORACLE_HOME in "/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_files.txt" with actual path.
Propagating files to remote nodes...
Instantiating the file
"/u01/app/oracle/product/db10.2.0/.patch_storage/.../rac/copy_dirs.txt.instantiated" by replacing
$ORACLE_HOME in "/u01/app/oracle/product/db10.2.0/.patch_storage/....txt" with actual path.
Propagating directories to remote nodes...
Instantiating the file
"/u01/app/oracle/product/db10.2.0/..../rac/make_cmds.txt.instantiated" by replacing
$ORACLE_HOME in "/u01/app/oracle/product/db10.2.0/.patch_storage/....txt" with actual path.
Running command on remote node 'node3': cd /u01/app/oracle/product/db10.2.0/rdbms/lib;
/usr/ccs/bin/make -f ins_rdbms.mk
ioracle ORACLE_HOME=/u01/app/oracle/product/db10.2.0 || echo REMOTE_MAKE_FAILED::>&2

The node 'node3' has been patched. You can restart Oracle instances on it.

There were relinks on remote nodes.
Remember to check the binary size and timestamp on the nodes 'node2' 'node3' .
The following make commands were invoked on remote nodes:
'cd /u01/app/oracle/product/db10.2.0/rdbms/lib; /usr/ccs/bin/make -f ins_rdbms.mk
ioracle $ORACLE_HOME=/u01/app/oracle/product/db10.2.0'

OPatch succeeded.




***********************************




Ensure the new patch appears in the inventory by running the following command:

opatch lsinventory

Invoking OPatch 10.2.0.3.0

Oracle interim Patch Installer version 10.2.0.3.0
Copyright (c) 2005, Oracle Corporation. All rights reserved..

Oracle Home : /u01/app/oracle/product/db10.2.0
Central Inventory : /u01/app/oracle/oraInventory
from : /var/opt/oracle/oraInst.loc
OPatch version : 10.2.0.3.0
OUI version : 10.2.0.3.0
OUI location : /u01/app/oracle/product/db10.2.0/oui
Log file location : /u01/app/oracle/product/db10.2.0/cfgtoollogs/opatch/....log

Lsinventory Output file location : /u01/app/oracle/product/db10.2.0/....txt

--------------------------------------------------------------------------------
Installed Top-level Products (2):

Oracle Database 10g 10.2.0.1.0
Oracle Database 10g Release 2 Patch Set 2 10.2.0.3.0
There are 2 products installed in this Oracle Home.

Interim patches (5) :

Patch 6239052 : applied on Tue Nov 20 02:18:58 GMT 2007
Created on 31 Oct 2007, 08:01:04 hrs PST8PDT
Bugs fixed:
6239052

Patch 4655998 : applied on Fri May 04 10:23:04 BST 2007
Created on 13 Apr 2007, 00:31:22 hrs PST8PDT
Bugs fixed:
4655998

Patch 5749953 : applied on Fri Apr 20 16:13:34 BST 2007
Created on 15 Feb 2007, 02:45:16 hrs US/Pacific
Bugs fixed:
5749953

Patch 5556081 : applied on Mon Apr 16 10:18:22 BST 2007
Created on 9 Nov 2006, 22:20:50 hrs PST8PDT
Bugs fixed:
5556081

Patch 5557962 : applied on Mon Apr 16 10:17:33 BST 2007
Created on 9 Nov 2006, 23:23:06 hrs PST8PDT
Bugs fixed:
4269423, 5557962, 5528974

Rac system comprising of multiple nodes
Local node = node1
Remote node = node2
Remote node = node3

--------------------------------------------------------------------------------

OPatch succeeded




<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>



TROUBLESHOOTING : >>>>

* If the OPatch is NOT applied to the primary node in the cluster then the following error output will be produced:
o ACTION: Restart the patch process on the primary node.

Invoking OPatch 10.2.0.3.0

Oracle interim Patch Installer version 10.2.0.3.0
Copyright (c) 2005, Oracle Corporation. All rights reserved..

Oracle Home : /u01/app/oracle/product/db10.2.0
Central Inventory : /u01/app/oracle/oraInventory
from : /var/opt/oracle/oraInst.loc
OPatch version : 10.2.0.3.0
OUI version : 10.2.0.3.0
OUI location : /u01/app/oracle/product/db10.2.0/oui
Log file location : /u01/app/oracle/product/db10.2.0/cfgtoollogs/opatch/opatch2007-11-20_02-16-36AM.log

ApplySession applying interim patch '6239052' to OH '/u01/app/oracle/product/db10.2.0'
OPatchSession cannot load inventory for the given Oracle Home /u01/app/oracle/product/db10.2.0. Possible
causes are:
No read or write permission to ORACLE_HOME/.patch_storage
Central Inventory is locked by another OUI instance
No read permission to Central Inventory
The lock file exists in ORACLE_HOME/.patch_storage
The Oracle Home does not exist in Central Inventory

ApplySession failed: ApplySession failed to prepare the system. No detail available.
System intact, OPatch will not attempt to restore the system

OPatch failed with error code 73

* If the crs_stat -t output doesn't show all items in an ONLINE state after a suitable amount of time, e.g. 15 mins, the cluster may require restarting completely.
o ACTION: On each node, ONE NODE AT A TIME, use the following commands:

su -

sync;sync;reboot