Monday, January 24, 2011

Oracle Notification Services : ==>>

ONS allows users to send SMS messages, e-mails, voice notifications, and fax messages in an easy-to-access manner.

Oracle Clusterware uses ONS to send notifications about the state of the database instances to midtier
applications that use this information for load-balancing and for fast failure detection.


ONS is a daemon process that communicates with the ONS daemons on the other nodes; these daemons inform each other of the current state of the database components on their respective database servers.


For example, if a listener, node, or service is down, a down event is triggered by the EVMD process, which is then sent by the local ONS daemon to the ONS daemon processes on the other nodes, including all clients and application servers participating in the network. Only nodes or client machines that have the ONS daemon running and have registered with each other will receive such a notification.

Once the ONS on a client machine receives this notification, the application (if using an Oracle-provided API) will determine, based on the notification, which nodes and instances have had a state change and will appropriately handle a new connection request. ONS informs the application of state changes, allowing the application to respond proactively instead of in the traditional reactive manner.

ONS configuration ==>>

ONS is installed and configured as part of the Oracle Clusterware installation.
Execution of the root.sh file on Unix- and Linux-based systems during the Oracle Clusterware installation will create and start ONS on all nodes participating in the cluster. This can be verified using the crs_stat utility provided by Oracle, as shown below.
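
For example (a sketch; the exact output columns vary by version), the Clusterware resources can be listed and the ONS entries picked out. In 10g the ONS resources typically appear as ora.<nodename>.ons, one per node, with STATE=ONLINE when the daemons are up:

[oracle@oradb4 oracle]$ $ORA_CRS_HOME/bin/crs_stat -t | grep ons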



Configuration of ONS involves registering all nodes and servers that will communicate with the ONS daemon on the database server.
During Oracle Clusterware installation, all nodes participating in the cluster are automatically registered with ONS.
Subsequently, during a restart of the clusterware, ONS registers all nodes with the respective ONS processes on the other nodes in the cluster.

To add additional members or nodes that should receive notifications, the hostname or IP address of the node should be added to the ons.config file. The configuration file is located in the $ORACLE_HOME/opmn/conf directory and has the following format:

[oracle@oradb4 oracle]$ more $ORACLE_HOME/opmn/conf/ons.config
localport=6101
remoteport=6201
loglevel=3
useocr=on
nodes=oradb4.sumsky.net:6101,oradb2.sumsky.net:6201,
oradb1.sumsky.net:6201,oradb3.sumsky.net:6201,
onsclient1.sumsky.net:6200,onsclient2.sumsky.net:6200

The localport is the port that ONS binds to on the local host interface to talk to local clients.
The remoteport is the port that ONS binds to on all interfaces to talk to other ONS daemons.

The loglevel indicates the amount of logging that should be generated. Oracle supports logging levels
from 1 through 9. ONS logs are generated in the $ORACLE_HOME/opmn/logs directory on the respective instances.

The useocr parameter (valid values are on/off) indicates whether ONS should use the OCR to determine which instances and
nodes are participating in the cluster.

The nodes listed in the nodes line are all nodes in the network that will need to receive or send event notifications.
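
As a sketch, after adding nodes to ons.config, the ONS daemon can be stopped and restarted with the onsctl utility (shipped under $ORACLE_HOME/opmn/bin) so that the new nodes list is read:

[oracle@oradb4 oracle]$ onsctl ping    # check whether the local ONS daemon is running
[oracle@oradb4 oracle]$ onsctl stop    # stop the daemon
[oracle@oradb4 oracle]$ onsctl start   # start it again; ons.config is reread at startup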


ONS logging ==>>

ONS events can be tracked via logs on both the server side and the client
side.

ONS logs are written to the $ORACLE_HOME/opmn/logs directory.

The default logging level is 3. Depending on the level of tracking desired, this can be changed by modifying the loglevel parameter in the ons.config file located in the $ORACLE_HOME/opmn/conf directory, discussed earlier.

Logging at level eight provides event information received by the ONS on the client
machines.
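
For example (a sketch; the edit can be made with any editor), to capture client-side event details at level eight:

[oracle@onsclient1 oracle]$ vi $ORACLE_HOME/opmn/conf/ons.config   # change loglevel=3 to loglevel=8
[oracle@onsclient1 oracle]$ onsctl stop
[oracle@onsclient1 oracle]$ onsctl start                           # the new logging level takes effect on restart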

The following extract from the ONS log file illustrates the various stages
of the SRV1 HA service as it transitions from a DOWN state to an UP state:

05/06/18 17:41:11 [7] Connection 25,192.168.2.30,6200 Message content
length is 94
05/06/18 17:41:11 [7] Connection 25,192.168.2.30,6200 Body using 94
of 94 reUse
05/06/18 17:41:11 [8] Connection 25,192.168.2.30,6200 body:
VERSION=1.0 service=SRV1 instance=SSKY1 database=SSKYDB host=oradb1
status=down reason=failure
05/06/18 17:41:11 [8] Worker Thread 120 checking receive queue
05/06/18 17:41:11 [8] Worker Thread 120 sending event 115 to servers
05/06/18 17:41:11 [8] Event 115 route:
3232236062,6200;3232236072,6200;3232236135,6200
05/06/18 17:41:20 [7] Connection 25,192.168.2.30,6200 Message content
length is 104
05/06/18 17:41:20 [7] Connection 25,192.168.2.30,6200 Body using 104
of 104 reUse
05/06/18 17:41:20 [8] Connection 25,192.168.2.30,6200 body:
VERSION=1.0 service=SRV1 instance=SSKY1 database=SSKYDB host=oradb1
status=not_restarting reason=UNKNOWN
05/06/18 17:41:20 [8] Worker Thread 120 checking receive queue
05/06/18 17:41:20 [8] Worker Thread 120 sending event 125 to servers
05/06/18 17:41:20 [8] Event 125 route:
3232236062,6200;3232236072,6200;3232236135,6200
05/06/18 18:22:30 [9] Worker Thread 2 sending body [135:128]:
connection 6,10.1.2.168,6200
VERSION=1.0 service=SRV1 instance=SSKY2 database=SSKYDB host=oradb2
status=up card=2 reason=user
05/06/18 18:22:30 [7] Worker Thread 128 checking client send queues
05/06/18 18:22:30 [8] Worker queuing event 135 (at head): connection
10,10.1.2.177,6200
05/06/18 18:22:30 [8] Worker Thread 124 checking receive queue
05/06/18 18:22:30 [7] Worker Thread 124 checking server send queues
05/06/18 18:22:30 [8] Worker Thread 124 processing send queue:
connection 10,10.1.2.177,6200
05/06/18 18:22:30 [9] Worker Thread 124 sending header [2:135]:
connection 10,10.1.2.177,6200


This extract from the ONS log file illustrates three notifications received from the ONS server node oradb1, which hosts instance SSKY1 and the application service SRV1.

The three notifications, received at different times, indicate the various stages of the service SRV1.


The first message indicates a notification regarding the failure of SRV1 on instance SSKY1.

"VERSION=1.0 service=SRV1 instance=SSKY1 database=SSKYDB host=oradb1
status=down reason=failure"



The second message indicates that service SRV1 is not restarting on its original node oradb1 (status=not_restarting, with the reason reported as UNKNOWN); the failed service could not simply be restarted in place, so a failover to another node is to be expected.

"VERSION=1.0 service=SRV1 instance=SSKY1 database=SSKYDB host=oradb1
status=not_restarting reason=UNKNOWN"


The third message is an UP event notification from the server to the client, indicating that the service has started on node oradb2 (instead of its original node); card=2 reports the cardinality of the service.

"VERSION=1.0 service=SRV1 instance=SSKY2 database=SSKYDB host=oradb2
status=up card=2 reason=user"

Once this message is received, the application can resume connections using the service SRV1. This illustrates that the service SRV1 has relocated from node oradb1 to oradb2.

===>>>>>>>>>>>>

FAN (Fast Application Notification) :

When state changes occur on a cluster, node, or instance in a RAC environment, an event is triggered by the Event Manager and propagated by ONS to the client machines.
Such events that communicate state changes are termed FAN events and have a predefined structure.
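
Drawing only on the fields visible in the ONS log extract earlier, a service-event payload is a set of name=value pairs of this form (other statuses and reasons exist beyond those seen above):

VERSION=1.0            payload version
service=SRV1           application service the event refers to
instance=SSKY1         instance on which the state change occurred
database=SSKYDB        database name
host=oradb1            node name
status=down            up, down, or not_restarting in the examples above
reason=failure         failure or user in the examples above
card=2                 service cardinality, seen on the UP event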

FAN is a new feature introduced in Oracle Database 10g RAC to proactively notify applications regarding the status of
the cluster and any configuration changes that take place.

FAN uses the Oracle Notification Services (ONS) for the actual notification of the event to its other ONS clients.


ONS provides and supports several callable interfaces that can be used by different applications to take advantage of the HA
solutions offered in Oracle Database 10g RAC.
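
For example, a middle-tier machine such as onsclient1.sumsky.net (registered in the server-side nodes list earlier) runs its own ONS daemon with a minimal ons.config. The sketch below mirrors the ports used in this article; useocr is omitted because a client machine has no OCR:

localport=6100
remoteport=6200
loglevel=3
nodes=oradb1.sumsky.net:6201,oradb2.sumsky.net:6201,
oradb3.sumsky.net:6201,oradb4.sumsky.net:6201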


Oracle supports two types of events:

1. Service events. Service events are application events and contain state changes that will only affect the clients that use the service. Normally, such events only indicate database, instance-level, and application service failures.


2. System events. System events are more global and represent events such as node and communication failures. Such events affect all services supported on the specific system (e.g., cluster membership changes, such as a node leaving or joining the cluster). A minimal callout sketch that reacts to these events follows.
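
One way to consume both event types on the database server itself is a FAN server-side callout: an executable script placed in the Oracle Clusterware callout directory ($ORA_CRS_HOME/racg/usrco in 10g) is invoked with the event payload as its arguments. A minimal sketch (the script name and log path are arbitrary):

#!/bin/sh
# fan_logger.sh -- hypothetical callout; Clusterware passes the FAN
# name=value pairs (service=..., status=..., etc.) as arguments.
LOG=/tmp/fan_events.log
echo "`date`: $*" >> $LOG                # record every event with a timestamp
for ARG in "$@"
do
  case "$ARG" in
    status=down|status=not_restarting)
      echo "`date`: attention required: $*" >> $LOG ;;   # flag failures
  esac
done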




====>>>>

Cluster Ready Services daemon (CRSD) :

CRSD maintains the configuration profiles as well as the resource status in the OCR.
CRSD creates a dedicated process called RACGIMON that monitors the health of the database instances and the ASM instances.
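
For example, the profile that CRSD maintains in the OCR for a given resource can be dumped with crs_stat -p; the database resource name below (ora.SSKYDB.db, following the usual ora.<dbname>.db convention with the database from the earlier examples) is an assumption:

[oracle@oradb4 oracle]$ $ORA_CRS_HOME/bin/crs_stat -p ora.SSKYDB.db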




===>>>

Creating group and user ===>>

An operating system group needs to be created that will be associated with the Oracle Central Inventory (oraInventory). The oraInventory contains a registry of the Oracle home directories from all Oracle products.
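
A minimal sketch of the usual commands, run as root (the oinstall/dba/oracle names follow the common convention; the numeric IDs are arbitrary):

# groupadd -g 501 oinstall                    # inventory group, associated with oraInventory
# groupadd -g 502 dba                         # OSDBA group
# useradd -u 501 -g oinstall -G dba oracle    # Oracle software owner
# passwd oracle                               # set the new user's password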



====>>>>


RAC Healthy Instances May Die With Error ORA-29702 When Other RAC Instances Are Hung [ID 789196.1]


RAC instances (ASM and RDBMS) may crash when a reconfiguration times out; for example, some instances get hung due to OS or other issues during the reconfiguration, which prevents the reconfiguration phase from completing.

The alert log of the crashed RAC instance will show:

Thu Feb 12 19:47:12 2009
Reconfiguration started (old inc 5, new inc 6)
List of nodes:
0 2 3 4
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
* dead instance detected - domain 2 invalid = TRUE
* dead instance detected - domain 3 invalid = TRUE
Communication channels reestablished
Thu Feb 12 20:03:56 2009
NOTE: database cdfm1p3:cdfm1p failed during msg 19, reply 2
Thu Feb 12 20:04:52 2009
Error: KGXGN polling error (15)
Thu Feb 12 20:04:52 2009
Errors in file /opt/oracle/admin/+ASM/bdump/+asm3_lmon_4479.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Thu Feb 12 20:04:56 2009
Dump system state for local instance only
System State dumped to trace file /opt/oracle/admin/+ASM/bdump/+asm3_diag_4475.
trc
Thu Feb 12 20:04:57 2009
Trace dumping is performing id=[cdmp_20090212200457]
Thu Feb 12 20:04:58 2009
Instance terminated by LMON, pid = 4479


The LMON trace files of the crashed instances will show:

kjxgmpoll: terminate the CGS reconfig.
Error: KGXGN polling error (15)
error 29702 detected in background process
ORA-29702: error occurred in Cluster Group Service operation
ksuitm: waiting up to [5] seconds before killing DIAG




Note that the ocssd.log may not contain any obvious information about the instance crash.

Instances can get hung during reconfiguration. When this occurs, the reconfiguration process (LMON) will get stuck as well, and healthy instances will start to die upon the CGS reconfiguration timeout.
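
As a sketch for correlating the timeline during diagnosis (the alert log path assumes the default 10g layout and the +ASM3 instance from the excerpt above):

$ grep -i "Reconfiguration started" /opt/oracle/admin/+ASM/bdump/alert_+ASM3.log
$ grep -i reconfig $ORA_CRS_HOME/log/`hostname`/cssd/ocssd.log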






====>>>

Like normal database instances, an ASM instance too has the usual background processes such as SMON, PMON, DBWR, CKPT, and LGWR.
In addition to those, the ASM instance also has the following background processes:


RBAL - Rebalance master: It opens all the device files as part of disk discovery and coordinates the ARBx processes for rebalance activity.

ARBx - Actual Rebalancer: These perform the actual rebalancing activities. The number of ARBx processes depends on the ASM_POWER_LIMIT init parameter (see the sketch after this list).

ASMB - ASM Bridge: This process is used to provide information to and from the Cluster Synchronization Service (CSS) used by ASM to manage the disk resources. It is also used to update statistics and provide a heartbeat mechanism.
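
A quick sketch of checking and overriding the rebalance power from the ASM instance (the diskgroup name DATA is an assumption; in 10g the ASM instance is administered as SYSDBA):

$ export ORACLE_SID=+ASM1
$ sqlplus / as sysdba
SQL> show parameter asm_power_limit           -- default degree of rebalance parallelism
SQL> alter diskgroup DATA rebalance power 8;  -- override for this rebalance only
SQL> select * from v$asm_operation;           -- monitor the ARBx rebalance progress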
