
Oracle ASM, RAC, Data Guard

ASM (Automatic Storage Management)

ASM is Oracle's recommended storage management solution. Oracle ASM uses disk groups to store data files. A disk group consists of multiple disks, and for each ASM disk group a level of redundancy is defined: normal (two-way mirroring), high (three-way mirroring), or external (no ASM mirroring). When a file is created within ASM, it is automatically striped across all disks allocated to the disk group. The performance is comparable to the performance of raw devices. ASM allows disk management to be done using SQL statements (such as CREATE, ALTER, and DROP), Enterprise Manager, or the command line.

ASM runs as its own instance (much like a normal DB instance), with its own processes.
# ps -ef | grep asm <--shows what asm uses (it has pmon, smon...)

ASM requires a special type of Oracle instance to provide the interface between a traditional Oracle instance and the storage elements presented to AIX. The Oracle ASM instance mounts disk groups to make ASM files available to database instances. An Oracle ASM instance is built on the same technology as an Oracle Database instance. The ASM software component is shipped with the Grid Infrastructure software.
The storage objects most commonly mapped to ASM disks are AIX raw hdisks and AIX raw logical volumes. The disks or logical volumes are represented by special files in the /dev directory:
- Raw hdisks as /dev/rhdisknn or
- Raw logical volume as /dev/ASMDataLVnn

To properly present those devices to the Oracle ASM instance, they must be owned by the Oracle user (chown oracle.dba /dev/rhdisknn) and the associated file permission must be 660 (chmod 660 /dev/rhdisknn). Raw hdisks cannot belong to any AIX volume group and should not have a PVID defined. One or more raw logical volumes presented to the ASM instance could be created on the hdisks belonging to the AIX volume group.
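
The ownership and mode requirements above can be checked with a small helper. This is only a sketch: asm_dev_ok is a hypothetical function name, and it assumes the standard "ls -l" column layout and the oracle/dba owner and group used in this post.

```shell
# asm_dev_ok: hypothetical helper. Reads one "ls -l" line on stdin and
# succeeds only if the entry is owned by oracle:dba with mode 660.
# Assumes the standard "ls -l" columns: mode, links, owner, group, ...
asm_dev_ok() {
    awk 'NR == 1 {
             mode = substr($1, 2, 9)      # drop the file-type character
             if (mode == "rw-rw----" && $3 == "oracle" && $4 == "dba") ok = 1
         }
         END { exit(ok ? 0 : 1) }'
}

# Usage on a real system (device name is an example):
#   ls -l /dev/rhdisk5 | asm_dev_ok || echo "rhdisk5 needs chown/chmod"
```

Parsing "ls -l" instead of calling stat keeps the helper independent of platform-specific stat options.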

For systems that do not use external redundancy, ASM provides its own internal redundancy mechanism and additional high availability by way of failure groups. A failure group, which is a subset of a diskgroup, by definition is a collection of disks that can become unavailable due to a failure of one of its associated components; e.g., controllers or entire arrays.  Thus, disks in two separate failure groups (for a given diskgroup) must not share a common failure component.

In a diskgroup usually 2 failgroups are defined, for mirroring purposes inside ASM. On the OS side it looks like this:
oradata-PCBDE-rz2-50GB-1 <--Failgroup1
oradata-PCBDE-rz3-50GB-1 <--Failgroup2

In this case storage extension is possible only by 2 disks at a time (ideally from 2 separate storage boxes), and all the disks in a disk group should have the same size. When you have 2 disks in a failgroup and you create a 50GB tablespace, ASM will stripe it across the disks (25GB on each disk). When you add 2 more disks, ASM starts rebalancing the data, so you end up with 12.5GB on each of the 4 disks.
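
The striping arithmetic above can be reproduced with a one-line calculation. This is a sketch only; per_disk_gb is a hypothetical helper, and it assumes perfectly even striping, which is what the example in the text describes.

```shell
# per_disk_gb: hypothetical helper. With even striping each disk holds
# total_size / number_of_disks (sizes in GB).
per_disk_gb() {
    awk -v size="$1" -v disks="$2" 'BEGIN { printf "%g\n", size / disks }'
}

per_disk_gb 50 2    # 2 disks: prints 25
per_disk_gb 50 4    # after adding 2 more disks and rebalancing: prints 12.5
```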

If hdisks are not part of an AIX volume group, their PVIDs can be set or cleared using the chdev command:
# chdev -l hdiskn -a pv=yes     <--sets a PVID
# chdev -l hdiskn -a pv=clear   <--clears the PVID

PVIDs are physically stored in the first 4k block of the hdisk, which happens to be where Oracle stores the ASM, OCR and/or Voting disk header. For ASM managed disks hdisk numbering is not important. Some Oracle installation documentation recommends temporarily setting PVIDs during the install process (this is not the preferred method). Assigning or clearing a PVID on an existing ASM managed disk will overwrite the ASM header, making data unrecoverable without the use of KFED (See Metalink Note #353761.1)

AIX 5.3 TL07 (and later) has a specific set of Oracle ASM related enhancements. The "mkvg" and "extendvg" commands now check for the presence of an ASM header before writing PVID information to an hdisk. The command fails and returns an error message if an ASM header signature is detected:
0516-1339 /usr/sbin/mkvg: Physical volume contains some 3rd party volume group.
0516-1397 /usr/sbin/mkvg: The physical volume hdisk3, will not be added to the volume group.
0516-862 /usr/sbin/mkvg: Unable to create volume group.

The force option (-f) will not work for an hdisk with an ASM header signature. If an hdisk formerly used by ASM needs to be used for another purpose, the ASM header area can be cleared using the AIX "dd" command:
# dd if=/dev/zero of=/dev/rhdisk3 bs=4096 count=10
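
Before overwriting a disk it can help to confirm whether it still carries an ASM header. The sketch below is not taken from Oracle or IBM documentation: has_asm_header is a hypothetical helper, and it assumes that the header holds the marker "ORCLDISK" at byte offset 32, which matches commonly published kfed output but should be verified on your own system.

```shell
# has_asm_header: hypothetical helper. Succeeds if the 8 bytes at offset 32
# of the given device or file read "ORCLDISK".
# Assumption: offset and marker come from commonly published kfed dumps
# (kfdhdb.driver.provstr), not from anything stated in this post.
has_asm_header() {
    sig=$(dd if="$1" bs=1 skip=32 count=8 2>/dev/null | tr -d '\000')
    [ "$sig" = "ORCLDISK" ]
}

# Usage on a real system (device name is an example; a block device is
# used because byte-sized reads may not work on the raw rhdisk device):
#   has_asm_header /dev/hdisk3 && echo "hdisk3 still carries an ASM header"
```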

The chdev utility, when used with pv=yes or pv=clear, does not check for an ASM signature before setting or clearing the PVID area.
AIX 6.1 TL06 and AIX 7.1 introduced a rendev command that can be used for permanent renaming of the AIX hdisks.

ASM devices have a header which contains an asm id. To extract, do:
# dd if=/dev/$disk bs=1 skip=72 count=32 2>/dev/null
These ids can be used to map the old and the new devices and therefore create new asm device files which point to the correct, new disks.
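
To build such a mapping, the dd call can be wrapped in a small function and run over a list of devices. This is a minimal sketch: asm_disk_id is a hypothetical name, and the offset and length are simply the values from the dd command above.

```shell
# asm_disk_id: hypothetical wrapper around the dd call above. Prints the
# 32-byte id field at offset 72 with the NUL padding stripped.
asm_disk_id() {
    dd if="$1" bs=1 skip=72 count=32 2>/dev/null | tr -d '\000'
    echo
}

# Usage on a real system: print "device id" pairs before and after the
# change, then compare the two lists (device names are examples):
#   for d in /dev/hdisk4 /dev/hdisk5; do echo "$d $(asm_disk_id "$d")"; done
```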

An ASM disk has no PVID and so looks unassigned. An AIX admin can therefore mistakenly think the disk is free and add it to a volume group, thus destroying data. Use rendev to rename ASM hdisks to something more obviously ASM, e.g. hdiskasm5, and if necessary update the Oracle ASM device scan path. lkdev can also be used as an extra level of protection. The "lkdev" command locks the disk to prevent the device from being inadvertently altered by a system administrator at a later time. Once the device is locked, any attempt to modify the device attributes (chdev, chpath) or to remove the device or one of its paths (rmdev, rmpath) will be denied. The ASM header name can also be added as a comment when using lkdev, to make it even more obvious.

# rendev -l hdisk4 -n hdiskASMd01
# lkdev -l hdiskASMd01 -a -c OracleASM    <--(-a: lock the device, -c: comment text)
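
Disks that look unassigned can be listed from lspv output before anyone reuses them. The filter below is a sketch: free_looking_disks is a hypothetical name, and it assumes the usual lspv layout (disk name, PVID, volume group), where unassigned disks show "none" and "None".

```shell
# free_looking_disks: hypothetical filter. Reads lspv output on stdin and
# prints disks with neither a PVID nor a volume group. These may be free,
# or they may be ASM disks, so check each one before using it.
free_looking_disks() {
    awk '$2 == "none" && $3 == "None" { print $1 }'
}

# Usage on a real system:
#   lspv | free_looking_disks
```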

mknod (old)
If rendev is not available, device files are created in /dev using "mknod /dev/asm_disk_name c maj min" to have the same major and minor number as the disk device to be used. The Oracle DBA will use these device names created with mknod.
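
The major and minor numbers can be read from "ls -l" and turned into the matching mknod command. This is a sketch under assumptions: mknod_cmd is a hypothetical helper, it assumes "ls -l" prints the device numbers as "major, minor" in columns 5 and 6, and it only prints the command so it can be reviewed before running it.

```shell
# mknod_cmd: hypothetical helper. Reads one "ls -l" line of a character
# device on stdin and prints the mknod command for the new name given as $1.
mknod_cmd() {
    awk -v name="$1" 'NR == 1 {
        maj = $5
        sub(/,/, "", maj)                 # "17," -> "17"
        print "mknod", name, "c", maj, $6
    }'
}

# Usage on a real system (names are examples):
#   ls -l /dev/rhdisk4 | mknod_cmd /dev/asm_disk1
```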


Oracle Clusterware 

Starting with Oracle Database 11g Release 2, Oracle has packaged Oracle Clusterware, Automatic Storage Management and the listener as a single package called "Oracle Grid Infrastructure".

Oracle Clusterware provides basic clustering services at the operating system level; it is the technology that transforms a server farm into a cluster. In theory, Oracle Clusterware can also provide clustering services to other (non-Oracle) applications.

With Oracle Clusterware you can provide a cold failover cluster to protect an Oracle instance from a system or server failure. The basic function of a cold failover cluster is to monitor a database instance running on a server, and if a failure is detected, to restart the instance on a spare server in the cluster. Network addresses are failed over to the backup node. Clients on the network experience a period of lockout while the failover takes place and are then served by the other database instance once the instance has started.

It consists of these components:
crsd (Cluster Ready Services):
It manages resources (start/stop of services, failovers...), it requires public and private interfaces and the Virtual IP (VIP) and it runs as root. Failure of the CRS daemon can cause node failure and it automatically reboots nodes to avoid data corruption because of the possible communication failure between the nodes.

ocssd (Oracle Cluster Synchronization Services): 
It provides synchronization between the nodes, manages locking, and runs as the oracle user. Failure of ocssd causes the machine to reboot to avoid a split-brain situation. It is also required in a single-instance configuration if ASM is used.

evmd (Event Management Logger):
The Event Management daemon spawns a permanent child process called "evmlogger" and generates the events when things happen. It will restart automatically on failures, and if evmd process fails, it does not halt the instance. Evmd runs as "oracle" user.

oprocd (Process Monitor Daemon):
It provides the I/O fencing solution for Oracle Clusterware. (Fencing is isolating a node when it is malfunctioning.) It is the process monitor for Oracle Clusterware. It runs as "root", and failure of the oprocd process causes the node to restart. (The log file is in /etc/oracle/oprocd.)

Important components at storage side:
-OCR (Oracle Cluster Repository/Registry)
Any resource that is going to be managed by Oracle Clusterware needs to be registered as a CRS resource, and CRS then stores the resource definitions in the OCR.
It is a repository of the cluster, which is a file (disk) in ASM (ocr-rz4-256MB-1).
crs_stat <--this will show what OCR consists of
ocrcheck <--shows ocr disks

-Voting disk
It is a file (disk) in ASM that manages node membership. It is needed to establish the necessary quorum (ora_vot1_raw_256m). 3 disks are needed; ideally every disk comes from a different storage box. If you don't have 3 storage boxes, create the disks on 2 boxes and NFS-mount a third voting disk on the RAC nodes.
crsctl query css votedisk       <-- shows vote disks
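
The reason for using 3 disks rather than 2 follows from the quorum rule: a node has to reach a strict majority of the voting disks. The arithmetic can be sketched as follows (votedisk_tolerance is a hypothetical helper, not an Oracle tool):

```shell
# votedisk_tolerance: hypothetical helper. With a strict-majority rule,
# N voting disks tolerate (N - 1) / 2 lost disks (integer division).
votedisk_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}

votedisk_tolerance 2    # prints 0: losing either disk breaks quorum
votedisk_tolerance 3    # prints 1: one disk (or storage box) may fail
```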

vote disk movement:
create a new voting disk device then: dd if=/dev/<old device> of=/dev/<new device> bs=4096

Oracle Clusterware provides seamless integration with Oracle Real Application Clusters (Oracle RAC) and Oracle Data Guard. (A RAC environment uses shared storage, whereas in a Data Guard setup each node has its own separate storage.)

Checking CRS network topology:
# /ora_u02/app/oracle/product/crs/bin/oifcfg getif -global
en7  global  public
en11  global  cluster_interconnect
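
To pick the interconnect interface out of this output in a script, a small filter is enough. This is a sketch: interconnect_ifs is a hypothetical name, and it only relies on the interface being the first column and the role being the last one, as in the listing above.

```shell
# interconnect_ifs: hypothetical filter. Reads "oifcfg getif" output on
# stdin and prints the interfaces marked as cluster_interconnect.
interconnect_ifs() {
    awk '$NF == "cluster_interconnect" { print $1 }'
}

# Usage on a real system (path as in the example above):
#   /ora_u02/app/oracle/product/crs/bin/oifcfg getif -global | interconnect_ifs
```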


RAC (Real Application Clusters)

RAC is based on Oracle Clusterware and in a RAC environment, two or more computers (each with an instance) concurrently access a single database. This allows an application or user to connect to either computer and have access to the data. It combines the processing power of multiple interconnected computers to provide system redundancy and scalability. Unlike the cold cluster model where one node is completely idle, all instances and nodes can be active to scale your application.

ASM with RAC:

With the release of 12cR1 & 12cR2 Oracle no longer supports the use of raw logical volumes with the DB and RAC (see My Oracle Support note “Announcement of De-Support of using RAW devices in Oracle Database Version 12.1” (Doc ID 578455.1)). Oracle continues to support the coexistence of PowerHA with Oracle clusterware.

If using a file system for your Oracle Database 12c RAC data files (rather than ASM), you’ll need to use a cluster file system. Oracle ACFS allows file system access by all members in a cluster at the same time. That requirement precludes JFS and JFS2 from being used for Oracle Database 12c RAC data files. The IBM Spectrum Scale is an Oracle RAC 12c certified cluster file system.

Finding out the nodes of RAC (olsnodes):
(As oracle user "crs_stat -t" should work as well)
# /u02/app/oracle/product/10.2/crs/bin/olsnodes

In Oracle RAC versions prior to 11.2, when a node gets rebooted due to scheduling problems, the process that initiates the reboot is oprocd. When the oprocd process reboots the node there should be only one entry in errpt (SYSTEM SHUTDOWN BY USER). There should not be a 'SYSDUMP' entry, since oprocd does not initiate a sysdump. A 'SYSDUMP' entry is an indication that other problems may be the root cause of node reboots.

In Oracle RAC 11g Release 2, severe operating system scheduling issues are detected by the Oracle cssdagent and cssmonitor processes and the node is rebooted.

Files to check if oprocd or css... rebooted the node:
before 11: /etc/oracle/oprocd/<node>.oprocd.lgl.<time stamp> . 
11GR2: /etc/oracle/lastgasp/cssagent_<node>.lgl, /etc/oracle/lastgasp/cssmonit_<node>.lgl 

The ocssd.log file on the other node (not the node that was rebooted) may contain entries like these:
# tail -200 /pscon_u01/app/oracle/product/crs/log/aix12/cssd/ocssd.log
[    CSSD]2010-05-18 01:13:53.446 [4114] >WARNING: clssnmPollingThread: node aix11 (1) at 90% heartbeat fatal, eviction in 1.047 seconds
[    CSSD]2010-05-18 01:13:54.439 [4114] >WARNING: clssnmPollingThread: node aix11 (1) at 90% heartbeat fatal, eviction in 0.054 seconds
[    CSSD]2010-05-18 01:13:54.493 [4114] >TRACE:   clssnmPollingThread: Eviction started for node aix11 (1), flags 0x040f, state 3, wt4c 0
[    CSSD]2010-05-18 01:13:54.551 [2829] >TRACE:   clssnmDiscHelper: aix11, node(1) connection failed, con (1112cb1f0), probe(0)
[    CSSD]2010-05-18 01:13:54.551 [2829] >TRACE:   clssnmDeactivateNode: node 1 (aix11) left cluster
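
In a long ocssd.log the evicted nodes can be extracted with a short filter. This is a sketch based only on the message format shown above; evicted_nodes is a hypothetical name, and other Clusterware versions may word the message differently.

```shell
# evicted_nodes: hypothetical filter. Reads ocssd.log text on stdin and
# prints each node name that appears in an "Eviction started" message.
evicted_nodes() {
    sed -n 's/.*Eviction started for node \([^ ,]*\).*/\1/p' | sort -u
}

# Usage on a real system (path as in the example above):
#   evicted_nodes < /pscon_u01/app/oracle/product/crs/log/aix12/cssd/ocssd.log
```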

Oracle RAC clusterware has strict timeout requirements for VIP address failover in case of a public network failure. When DNS servers are unreachable due to a public network failure, DNS name resolution calls such as getaddrinfo may hang for the default AIX query timeout duration of 5 minutes. Name resolution calls made by Oracle processes can thus delay the VIP failover. To reduce such delays, the DNS query timeout can be reduced to 1 minute, by adding the following options line in /etc/resolv.conf for all RAC cluster nodes:
"options timeout:1"

No reboot is necessary to activate this change. If you need even faster VIP failover the timeout can be further reduced to a value of 0; provided your network infrastructure (network and DNS servers) has the speed to serve name queries within a few (5-6) seconds. If you use a value of 0 for timeout and your DNS or network is slow to respond, DNS name lookups will start to fail prematurely.
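
Whether the option is in place on every node can be checked with a small helper. This is a sketch: resolv_timeout is a hypothetical name, and it only looks for a timeout: entry on an options line of the file it is given.

```shell
# resolv_timeout: hypothetical helper. Prints the timeout value set on an
# "options" line of the given resolv.conf, or "unset" if there is none.
resolv_timeout() {
    awk '$1 == "options" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^timeout:/) { t = $i; sub(/^timeout:/, "", t) }
         }
         END { print (t == "" ? "unset" : t) }' "$1"
}

# Usage on a real system:
#   resolv_timeout /etc/resolv.conf
```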


Oracle RAC IPs

-At least 2 NICs will be needed and /etc/hosts should contain private, public, and virtual IP addresses
-Configure them with Public and Private IPs (ifconfig will show these)
-DNS registration for: Public, VIP
(The virtual IPs do not have to be configured with ifconfig, because VIPCA takes care of it.)

Public IP: (server IP address from OS side)
- DNS registrations + IP configuration for AIX (as usual)
- servers in cluster should be in same subnet

Virtual IP: (VIP is used by Oracle for RAC failover)
-same subnet as Public IP
-DNS registration needed (not needed to be configured during installation, RAC will take care of them)
-same interface name on each node (like en2)

Private IP: (for RAC heartbeat)
-separate interface from public IP,
-same interface name on each node (like en1)
-separate network from public IP (something like 192.168...)
-no DNS registration

SCAN IP: (Single Client Access Name, managed by Oracle, so users can use only 1 name to reach cluster)
(SCAN works by replacing a hostname or IP list with virtual IP addresses (VIP))
- DNS registration: single DNS domain name that resolves to all of the IP addresses in your RAC cluster (one for each node)
- not needed to be configured during install, RAC will do it
- in /etc/hosts, looks something like this: myscan.mydomain.com IN A IN A IN A

en0:     aix-sd31 <--Public (DNS)
en0:      aix-sd31-vip        <--Virtual IP (DNS)
en0: RACD001.domain.com <--SCAN IP 1 (DNS)
en0: RACD001.domain.com <--SCAN IP 2 (DNS)
en0: RACD001.domain.com <--SCAN IP 3 (DNS)
en1: aix-sd31-priv      <--Private IP
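
A quick check that /etc/hosts contains the public, virtual, and private names of a node could look like this. It is a sketch only: missing_rac_names is a hypothetical helper, and the -vip and -priv suffixes are simply the naming used in the listing above, which may differ at your site.

```shell
# missing_rac_names: hypothetical helper. $1 is a hosts file, $2 the public
# node name. Prints every expected name (node, node-vip, node-priv) that
# does not appear as a host name on a non-comment line.
missing_rac_names() {
    hosts=$1
    node=$2
    for n in "$node" "$node-vip" "$node-priv"; do
        awk -v n="$n" '$1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                       END { exit(found ? 0 : 1) }' "$hosts" || echo "$n"
    done
}

# Usage on a real system:
#   missing_rac_names /etc/hosts aix-sd31
```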



Data Guard 

A Data Guard configuration consists of the primary database that contains the original data and any copy of that data in separate databases (on different servers) that are kept in synch with the primary. In 11gR2 it can consist of up to 30 databases, in any combination of RAC, non-RAC, physical, logical, or snapshot.

In this setup it can be used for failover for the primary database or the copies of the production data can be used in read-only mode for reporting purposes etc.

Transitions from one database role to another are called switchovers (planned events) or failovers (unplanned events), where Data Guard can actually execute all of the tasks of the transition with just a few commands.

Data Guard broker is itself a background Oracle monitor process (DMON) that provides a complex set of role management services governing all of the databases in a configuration.  This broker controls the redo transport and is accountable for transmitting defect-free archive logs from any possible archive location. The Log Apply Services within Data Guard are responsible for maintaining the synchronization of transactions between the primary and standbys.

Data Guard does not use shared storage, so it is most applicable to DR scenarios.

Finding out Data Guard primary (prod) or standby (shadow) node:

# ps -ef | grep mrp
  orap02 5496874       1   2   Jan 28      - 291:26 ora_mrp0_P02 <--if you see this process then it is the standby (mrp: media recovery process)
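
The same check can be scripted. This is a sketch: has_mrp is a hypothetical helper that looks for a process name of the form ora_mrpN_SID, as in the output above. A missing MRP process does not prove the node is the primary, since managed recovery may simply be stopped on a standby.

```shell
# has_mrp: hypothetical filter. Reads "ps -ef" output on stdin and succeeds
# if a managed recovery process (ora_mrp<digit>_<SID>) is listed.
# The bracket expression keeps the pattern from matching the grep command
# line itself.
has_mrp() {
    grep -q 'ora_mrp[0-9]_'
}

# Usage on a real system:
#   ps -ef | has_mrp && echo "standby (MRP running)"
```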

