Friday, February 7, 2014

Grid Infrastructure 12.1.0.1.0 Cluster Health Monitor - A Deconstruction

While digging around the Grid Infrastructure logs, I came across a new 12c feature called Cluster Health Monitor (CHM). I knew the MGMTDB database was good for something when I opted to install it, even though it is not required.

From Oracle’s Documentation

“The Cluster Health Monitor (CHM) detects and analyzes operating system and cluster resource-related degradation and failures. CHM stores real-time operating system metrics in the Oracle Grid Infrastructure Management Repository that you can use for later triage with the help of My Oracle Support should you have cluster issues."

CHM consists of three components (see below) and collects and stores data on the cluster's overall health for later review.

System Monitor Service (osysmond)

Every node in the cluster runs this process; it is responsible for real-time monitoring and metric collection at the operating system level.

Cluster Logger Service (ologgerd)

This process retrieves the data collected by osysmond on each node and writes it to the repository.

Grid Infrastructure Management Repository

An Oracle database instance which stores the data collected by osysmond. It runs on only a single (hub) node in the cluster and, by design, fails over to another node should the node it is on become unavailable. Interestingly enough, the data files for the instance are located in the same disk group as the OCR and voting files. The Oracle documentation does not give any specific sizing guidance, but the oclumon utility is responsible for managing retention of the stored data.
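
If you want to see which hub node is currently hosting the repository instance, a quick check is the srvctl status mgmtdb command (shown here as a sketch, with its output omitted):

# Show which node the Grid Infrastructure Management Repository database is running on
/u01/app/12.1.0.1/grid/bin/srvctl status mgmtdb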

Let’s take a look at the root-owned processes associated with a Grid Infrastructure setup.


[root@flex1 ~]# ps -ef | grep root | grep grid
root 2210 1 1 21:07 ? 00:00:57 /u01/app/12.1.0.1/grid/bin/ohasd.bin reboot
root 2516 1 0 21:07 ? 00:00:06 /u01/app/12.1.0.1/grid/bin/orarootagent.bin
root 2729 1 0 21:07 ? 00:00:02 /u01/app/12.1.0.1/grid/bin/cssdmonitor
root 2743 1 0 21:07 ? 00:00:02 /u01/app/12.1.0.1/grid/bin/cssdagent
root 4809 1 0 21:08 ? 00:00:20 /u01/app/12.1.0.1/grid/bin/octssd.bin reboot
root 5699 1 1 21:08 ? 00:01:04 /u01/app/12.1.0.1/grid/bin/osysmond.bin
root 5705 1 0 21:08 ? 00:00:39 /u01/app/12.1.0.1/grid/bin/crsd.bin reboot
root 5969 1 0 21:08 ? 00:00:22 /u01/app/12.1.0.1/grid/bin/orarootagent.bin
root 19405 1 0 22:10 ? 00:00:01 /u01/app/12.1.0.1/grid/bin/gnsd.bin -trace-level 1 -ip-address 192.168.78.244 -startup-endpoint ipc://GNS_flex1.muscle_5969_a598858b344350d1
root 20713 1 0 22:13 ? 00:00:01 /u01/app/12.1.0.1/grid/bin/ologgerd -M -d /u01/app/12.1.0.1/grid/crf/db/flex1

Diagnostics Collection

The most convenient method to query the data in the CHM repository is the oclumon utility.

To collect diagnostic information, preferably from all nodes in the cluster, you can run the diagcollection.pl script located in $GRID_HOME/bin. The script has options to collect either all CRS daemon process logs or only specific ones, as shown in the example after the help output below.

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/diagcollection.pl --help
Production Copyright 2004, 2010, Oracle.  All rights reserved

Cluster Ready Services (CRS) diagnostic collection tool

diagcollection
    --collect  
             [--crs] For collecting crs diagnostic information 
             [--adr] For collecting diagnostic information for ADR; specify ADR location
             [--chmos] For collecting Cluster Health Monitor (OS) data
             [--acfs] Unix only. For collecting ACFS diagnostic information 
             [--all] Default.For collecting all diagnostic information. 
             [--core] UNIX only. Package core files with CRS data 
             [--afterdate] UNIX only. Collects archives from the specified date. Specify in mm/dd/yyyy format
             [--aftertime] Supported with -adr option. Collects archives after the specified time. Specify in YYYYMMDDHHMISS24 format
             [--beforetime] Supported with -adr option. Collects archives before the specified date. Specify in YYYYMMDDHHMISS24 format
             [--crshome] Argument that specifies the CRS Home location 
             [--incidenttime] Collects Cluster Health Monitor (OS) data from the specified time.  Specify in MM/DD/YYYYHH24:MM:SS format
                  If not specified, Cluster Health Monitor (OS) data generated in the past 24 hours are collected
             [--incidentduration] Collects Cluster Health Monitor (OS) data for the duration after the specified time.  Specify in HH:MM format.
                 If not specified, all Cluster Health Monitor (OS) data after incidenttime are collected 
             NOTE: 
             1. You can also do the following 
                diagcollection.pl --collect --crs --crshome 
     --clean        cleans up the diagnosability
                    information gathered by this script
     --coreanalyze  UNIX only. Extracts information from core files
                    and stores it in a text file
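
For example, to collect only the CRS daemon logs (per the NOTE above), something along these lines should do it, using the same grid home as the rest of this post:

# Collect only CRS diagnostic information, explicitly pointing at the Grid home
/u01/app/12.1.0.1/grid/bin/diagcollection.pl --collect --crs --crshome /u01/app/12.1.0.1/grid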

1. First off, we need to find out which node the OLOGGERD service is currently running on.

[root@flex1 bin]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get master

Master = flex1

2. Good, it happens to run on the same node I am currently on. Next, we can invoke the diagcollection.pl script to collect the data in the repository.

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/diagcollection.pl --collect
Production Copyright 2004, 2010, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
The following CRS diagnostic archives will be created in the local directory.
crsData_flex1_20140206_2335.tar.gz -> logs,traces and cores from CRS home. Note: core files will be packaged only with the --core option. 
ocrData_flex1_20140206_2335.tar.gz -> ocrdump, ocrcheck etc 
coreData_flex1_20140206_2335.tar.gz -> contents of CRS core files in text format

osData_flex1_20140206_2335.tar.gz -> logs from Operating System
Collecting crs data
/bin/tar: log/flex1/cssd/ocssd.log: file changed as we read it
Collecting OCR data 
Collecting information from core files
No corefiles found 
The following diagnostic archives will be created in the local directory.
acfsData_flex1_20140206_2335.tar.gz -> logs from acfs log.
Collecting acfs data
Collecting OS logs
Collecting sysconfig data

3. The script generates a few tarballs and a text file.

[root@flex1 tmp]# ls -lhtr
total 25M
-rw-r--r-- 1 root   root      25M Feb  6 23:36 crsData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root   root      57K Feb  6 23:37 ocrData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root   root      927 Feb  6 23:37 acfsData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root   root     329K Feb  6 23:37 osData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root   root      31K Feb  6 23:37 sysconfig_flex1_20140206_2335.txt

4. You can limit the data that is collected by using the date options; a CHM-only variant is sketched after this example.

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/diagcollection.pl --collect --afterdate 02/04/2014
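
Going by the --incidenttime and --incidentduration options in the help output above, a CHM-only collection for a specific window would look something like the following (a sketch, not captured from this system):

# Collect only Cluster Health Monitor (OS) data for a six-hour window
# starting at 02/06/2014 00:00:00; the formats follow the --help output above
/u01/app/12.1.0.1/grid/bin/diagcollection.pl --collect --chmos --incidenttime 02/06/201400:00:00 --incidentduration 06:00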

5. I was curious, so I untarred the crsData_flex1_20140206_2335.tar.gz file and found that it contains logs from the following locations under the $GRID_HOME directory (see the tar sketch after the list).

install/
log/flex1/
log/flex1/crfmond/
log/flex1/mdnsd/
log/flex1/gpnpd/
log/flex1/gipcd/
log/flex1/cvu/cvutrc/
log/flex1/cvu/cvulog/
log/flex1/racg/
log/flex1/crflogd/
log/flex1/cssd/
log/flex1/ohasd/
log/flex1/acfs/kernel/
log/flex1/ctssd/
log/flex1/gnsd/
log/flex1/crsd/
log/flex1/client/
log/flex1/agent/ohasd/oracssdmonitor_root/
log/flex1/agent/crsd/oraagent_oracle/
log/flex1/evmd/
cfgtoollogs/
cfgtoollogs/cfgfw/
cfgtoollogs/crsconfig/
cfgtoollogs/oui/
cfgtoollogs/mgmtca/
oc4j/j2ee/home/log/
oc4j/j2ee/home/log/wsmgmt/auditing/
oc4j/j2ee/home/log/wsmgmt/logging/
oc4j/j2ee/home/log/oc4j/
oc4j/j2ee/home/log/dbwlm/auditing/
oc4j/j2ee/home/log/dbwlm/logging/
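
You can get the same listing without extracting the archive by using a plain tar listing, assuming the archive is still sitting in /tmp as above:

# List the contents of the CRS data archive without extracting it
tar -tzf /tmp/crsData_flex1_20140206_2335.tar.gz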

OCLUMON

Now that we have dispensed with the logs, let’s see what this fancy OCLUMON can do.

1. First off, we need to set the logging level for the daemon we’d like to monitor.

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon debug log osysmond CRFMOND:3

2. Next, query the current node metrics with the dumpnodeview verb.

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon dumpnodeview -n flex1

----------------------------------------

Node: flex1 Clock: '14-02-07 00.02.08' SerialNo:2081 

----------------------------------------

SYSTEM:

#pcpus: 1 #vcpus: 1 cpuht: N chipname: Intel(R) cpu: 26.27 cpuq: 2 physmemfree: 141996 physmemtotal: 4055420 mcache: 2017648 swapfree: 3751724 swaptotal: 4063228 hugepagetotal: 0 hugepagefree: 0 hugepagesize: 2048 ior: 105 iow: 248 ios: 55 swpin: 0 swpout: 0 pgin: 105 pgout: 182 netr: 47.876 netw: 24.977 procs: 302 rtprocs: 12 #fds: 24800 #sysfdlimit: 6815744 #disks: 9 #nics: 3 nicErrors: 0

TOP CONSUMERS:

topcpu: 'apx_vktm_+apx1(7240) 3.40' topprivmem: 'java(19943) 138464' topshm: 'ora_mman_sport(14493) 223920' topfd: 'ocssd.bin(2778) 341' topthread: 'console-kit-dae(1973) 64' 



----------------------------------------

Node: flex1 Clock: '14-02-07 00.02.13' SerialNo:2082 

----------------------------------------

SYSTEM:

#pcpus: 1 #vcpus: 1 cpuht: N chipname: Intel(R) cpu: 46.13 cpuq: 17 physmemfree: 110400 physmemtotal: 4055420 mcache: 2027560 swapfree: 3751708 swaptotal: 4063228 hugepagetotal: 0 hugepagefree: 0 hugepagesize: 2048 ior: 7714 iow: 210 ios: 399 swpin: 0 swpout: 6 pgin: 7212 pgout: 190 netr: 19.810 netw: 17.537 procs: 303 rtprocs: 12 #fds: 24960 #sysfdlimit: 6815744 #disks: 9 #nics: 3 nicErrors: 0


TOP CONSUMERS:

topcpu: 'apx_vktm_+apx1(7240) 3.00' topprivmem: 'java(19943) 138464' topshm: 'ora_mman_sport(14493) 223920' topfd: 'ocssd.bin(2778) 341' topthread: 'console-kit-dae(1973) 64' 

This regularly dumps output similar to the “top” command in Linux. As with the diagcollection.pl script, oclumon dumpnodeview also accepts time and duration parameters, as sketched below.
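
For instance, based on the documented dumpnodeview options (again a sketch, not captured from this system):

# Dump node views for all nodes covering the last 10 minutes
/u01/app/12.1.0.1/grid/bin/oclumon dumpnodeview -allnodes -last "00:10:00"

# Dump node views for flex1 between two specific timestamps
/u01/app/12.1.0.1/grid/bin/oclumon dumpnodeview -n flex1 -s "2014-02-07 00:00:00" -e "2014-02-07 00:30:00"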

3. The data (in the MGMTDB instance) is stored in the CHM schema; a connection sketch follows the query output below.

SQL> select table_name from dba_tables where owner = 'CHM';

TABLE_NAME
--------------------------------------------------------------------------------
CHMOS_SYSTEM_SAMPLE_INT_TBL
CHMOS_SYSTEM_CONFIG_INT_TBL
CHMOS_SYSTEM_PERIODIC_INT_TBL
CHMOS_SYSTEM_MGMTDB_CONFIG_TBL
CHMOS_CPU_INT_TBL
CHMOS_PROCESS_INT_TBL
CHMOS_DEVICE_INT_TBL
CHMOS_NIC_INT_TBL
CHMOS_FILESYSTEM_INT_TBL
CHMOS_ASM_CONFIG


10 rows selected.
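
If you want to poke at the schema yourself, this is roughly how to connect on the node hosting the repository; it assumes the default 12.1 instance name of -MGMTDB, the grid home used throughout this post, and an OS user with SYSDBA privileges (a sketch, output omitted):

# Point the environment at the grid home and the repository instance
export ORACLE_HOME=/u01/app/12.1.0.1/grid
export ORACLE_SID=-MGMTDB

# Count the OS-level samples collected so far (table name taken from the listing above)
$ORACLE_HOME/bin/sqlplus -S / as sysdba <<'EOF'
select count(*) from CHM.CHMOS_SYSTEM_SAMPLE_INT_TBL;
EOF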

4. As mentioned earlier, you can also manage the repository retention period from oclumon.

4.1 To find out the current settings, we can use the -get option.

4.1.1 Find the repository size. Note that the value is reported in seconds of retention rather than bytes, which is consistent with the changerepossize output later in this post.

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get repsize

CHM Repository Size = 136320

4.1.2 Find the repository data file location

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get reppath

CHM Repository Path = +DATA/_MGMTDB/DATAFILE/sysmgmtdata.260.835192031

4.1.3 Find the master and logger nodes

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get master

Master = flex1
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get alllogger

Loggers = flex1,
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get mylogger

Logger = flex1

4.2 To set parameters, follow some of the examples below.

4.2.1 The changeretentiontime option is merely an indicator of whether the underlying tablespace can accommodate the desired retention of collected data. The value is in seconds, consistent with the changerepossize output below.

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -repos changeretentiontime 1000

The Cluster Health Monitor repository can support the desired retention for 2 hosts

4.2.2 Change the repository’s tablespace size (in MB). This also changes the retention period.

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -repos changerepossize 6000
The Cluster Health Monitor repository was successfully resized.The new retention is 399240 seconds.

The alert.log for -MGMTDB shows a simple ALTER TABLESPACE command.

Fri Feb 07 00:28:11 2014
ALTER TABLESPACE SYSMGMTDATA RESIZE 6000 M
Completed: ALTER TABLESPACE SYSMGMTDATA RESIZE 6000 M

The retention reported by repsize reflects the increase as well.

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get repsize

CHM Repository Size = 399240

5. And last, but not least, the version check!

[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon version

Cluster Health Monitor (OS), Version 12.1.0.1.0 - Production Copyright 2007, 2013 Oracle. All rights reserved.

Well, I hope this has been an insightful post on the new CHM feature in the 12c release of Grid Infrastructure. If anything, diagcollection.pl will be a nice replacement for the RDA output that Support might request. I haven't had to troubleshoot any clusterware issues on 12c yet, but I plan to break this environment and use the oclumon utility to debug the processes at a later date.

Cheers!
