Friday, February 7, 2014
Grid Infrastructure 12.1.0.1.0 Cluster Health Monitor - A Deconstruction
While digging around Grid Infrastructure logs, I came across the 12c incarnation of the Cluster Health Monitor (CHM), which now stores its data in the Grid Infrastructure Management Repository - I knew the MGMTDB database was good for something when I opted to install it, even though it is not required.
From Oracle’s Documentation
“The Cluster Health Monitor (CHM) detects and analyzes operating system and cluster resource-related degradation and failures. CHM stores real-time operating system metrics in the Oracle Grid Infrastructure Management Repository that you can use for later triage with the help of My Oracle Support should you have cluster issues.”
Consisting of three components (see below), CHM collects and stores data about the cluster’s overall health for later review.
System Services Monitor (osysmond)
This process runs on every node in the cluster and is responsible for real-time monitoring and metric collection at the operating system level.
Cluster Logger Service (ologgerd)
This process receives the data collected by osysmond on each node and writes it to the repository.
Grid Infrastructure Management Repository
An Oracle instance which stores the data collected by osysmond. It runs on only a single (hub) node in the cluster and, by design, fails over to another node should the one it’s on become unavailable. Interestingly enough, the data files for the instance are located in the same disk group as the OCR and Voting files. The Oracle documentation does not talk about any specific sizing, but the oclumon utility is responsible for managing the retention of the stored data.
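If you want to see for yourself which hub node is currently hosting the repository instance, a quick check with srvctl should do it. This is a minimal sketch; the GI home path matches this environment, and the output will obviously differ on yours.

# Run from the Grid Infrastructure home as root or the grid owner
/u01/app/12.1.0.1/grid/bin/srvctl status mgmtdb
/u01/app/12.1.0.1/grid/bin/srvctl config mgmtdb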
Let’s take a look at the Grid Infrastructure processes running as root - both osysmond.bin and ologgerd appear in the list.
[root@flex1 ~]# ps -ef | grep root | grep grid
root 2210 1 1 21:07 ? 00:00:57 /u01/app/12.1.0.1/grid/bin/ohasd.bin reboot
root 2516 1 0 21:07 ? 00:00:06 /u01/app/12.1.0.1/grid/bin/orarootagent.bin
root 2729 1 0 21:07 ? 00:00:02 /u01/app/12.1.0.1/grid/bin/cssdmonitor
root 2743 1 0 21:07 ? 00:00:02 /u01/app/12.1.0.1/grid/bin/cssdagent
root 4809 1 0 21:08 ? 00:00:20 /u01/app/12.1.0.1/grid/bin/octssd.bin reboot
root 5699 1 1 21:08 ? 00:01:04 /u01/app/12.1.0.1/grid/bin/osysmond.bin
root 5705 1 0 21:08 ? 00:00:39 /u01/app/12.1.0.1/grid/bin/crsd.bin reboot
root 5969 1 0 21:08 ? 00:00:22 /u01/app/12.1.0.1/grid/bin/orarootagent.bin
root 19405 1 0 22:10 ? 00:00:01 /u01/app/12.1.0.1/grid/bin/gnsd.bin -trace-level 1 -ip-address 192.168.78.244 -startup-endpoint ipc://GNS_flex1.muscle_5969_a598858b344350d1
root 20713 1 0 22:13 ? 00:00:01 /u01/app/12.1.0.1/grid/bin/ologgerd -M -d /u01/app/12.1.0.1/grid/crf/db/flex1
Diagnostics Collection
The most convenient method to query the data in the CHM repository is by executing the oclumon utility.
To collect diagnostic information, preferably from all nodes in the cluster, you can run the diagcollection.pl script located in $GRID_HOME/bin. The script has options to collect either all of the CRS daemon process logs or just specific ones.
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/diagcollection.pl --help
Production Copyright 2004, 2010, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
diagcollection --collect
     [--crs] For collecting crs diagnostic information
     [--adr] For collecting diagnostic information for ADR; specify ADR location
     [--chmos] For collecting Cluster Health Monitor (OS) data
     [--acfs] Unix only. For collecting ACFS diagnostic information
     [--all] Default.For collecting all diagnostic information.
     [--core] UNIX only. Package core files with CRS data
     [--afterdate] UNIX only. Collects archives from the specified date. Specify in mm/dd/yyyy format
     [--aftertime] Supported with -adr option. Collects archives after the specified time. Specify in YYYYMMDDHHMISS24 format
     [--beforetime] Supported with -adr option. Collects archives before the specified date. Specify in YYYYMMDDHHMISS24 format
     [--crshome] Argument that specifies the CRS Home location
     [--incidenttime] Collects Cluster Health Monitor (OS) data from the specified time. Specify in MM/DD/YYYYHH24:MM:SS format
         If not specified, Cluster Health Monitor (OS) data generated in the past 24 hours are collected
     [--incidentduration] Collects Cluster Health Monitor (OS) data for the duration after the specified time. Specify in HH:MM format.
         If not specified, all Cluster Health Monitor (OS) data after incidenttime are collected
NOTE:
1. You can also do the following
   diagcollection.pl --collect --crs --crshome
--clean        cleans up the diagnosability information gathered by this script
--coreanalyze  UNIX only. Extracts information from core files and stores it in a text file
1. First off, we need to find out which node the OLOGGERD service is currently running on.
[root@flex1 bin]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get master
Master = flex1
2. Good, it happens to run on the same node I am currently on. Next, we can invoke the diagcollection.pl script to collect the data in the repository.
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/diagcollection.pl --collect
Production Copyright 2004, 2010, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
The following CRS diagnostic archives will be created in the local directory.
crsData_flex1_20140206_2335.tar.gz -> logs,traces and cores from CRS home. Note: core files will be packaged only with the --core option.
ocrData_flex1_20140206_2335.tar.gz -> ocrdump, ocrcheck etc
coreData_flex1_20140206_2335.tar.gz -> contents of CRS core files in text format
osData_flex1_20140206_2335.tar.gz -> logs from Operating System
Collecting crs data
/bin/tar: log/flex1/cssd/ocssd.log: file changed as we read it
Collecting OCR data
Collecting information from core files
No corefiles found
The following diagnostic archives will be created in the local directory.
acfsData_flex1_20140206_2335.tar.gz -> logs from acfs log.
Collecting acfs data
Collecting OS logs
Collecting sysconfig data
3. It generates a few tarballs and a text file.
[root@flex1 tmp]# ls -lhtr
total 25M
-rw-r--r-- 1 root root  25M Feb  6 23:36 crsData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root root  57K Feb  6 23:37 ocrData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root root  927 Feb  6 23:37 acfsData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root root 329K Feb  6 23:37 osData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root root  31K Feb  6 23:37 sysconfig_flex1_20140206_2335.txt
4. You can limit the data that is collected by using the date options.
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/diagcollection.pl --collect --afterdate 02/04/2014
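Similarly, the --chmos, --incidenttime and --incidentduration options from the help output above can be combined to pull only the Cluster Health Monitor data for a specific window. Something along these lines should work (a sketch only; the timestamp is made up and follows the MM/DD/YYYYHH24:MM:SS format shown in the help):

# Collect one hour of CHM (OS) data starting at 23:00 on Feb 6th (illustrative values only)
/u01/app/12.1.0.1/grid/bin/diagcollection.pl --collect --chmos --incidenttime 02/06/201423:00:00 --incidentduration 01:00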
5. I was curious, so I untar’d the crsData_flex1_20140206_2335.tar.gz file and found that it contains logs from the following locations under the $GRID_HOME directory.
install/
log/flex1/
log/flex1/crfmond/
log/flex1/mdnsd/
log/flex1/gpnpd/
log/flex1/gipcd/
log/flex1/cvu/cvutrc/
log/flex1/cvu/cvulog/
log/flex1/racg/
log/flex1/crflogd/
log/flex1/cssd/
log/flex1/ohasd/
log/flex1/acfs/kernel/
log/flex1/ctssd/
log/flex1/gnsd/
log/flex1/crsd/
log/flex1/client/
log/flex1/agent/ohasd/oracssdmonitor_root/
log/flex1/agent/crsd/oraagent_oracle/
log/flex1/evmd/
cfgtoollogs/
cfgtoollogs/cfgfw/
cfgtoollogs/crsconfig/
cfgtoollogs/oui/
cfgtoollogs/mgmtca/
oc4j/j2ee/home/log/
oc4j/j2ee/home/log/wsmgmt/auditing/
oc4j/j2ee/home/log/wsmgmt/logging/
oc4j/j2ee/home/log/oc4j/
oc4j/j2ee/home/log/dbwlm/auditing/
oc4j/j2ee/home/log/dbwlm/logging/
OCLUMON
Now that we have dispensed with the logs, let’s see what this fancy OCLUMON can do.
1. First off, we need to set the logging level for the daemon we’d like to monitor.
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon debug log osysmond CRFMOND:3
2. Next, run oclumon with the dumpnodeview option.
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon dumpnodeview -n flex1

----------------------------------------
Node: flex1 Clock: '14-02-07 00.02.08' SerialNo:2081
----------------------------------------

SYSTEM:
#pcpus: 1 #vcpus: 1 cpuht: N chipname: Intel(R) cpu: 26.27 cpuq: 2 physmemfree: 141996 physmemtotal: 4055420 mcache: 2017648 swapfree: 3751724 swaptotal: 4063228 hugepagetotal: 0 hugepagefree: 0 hugepagesize: 2048 ior: 105 iow: 248 ios: 55 swpin: 0 swpout: 0 pgin: 105 pgout: 182 netr: 47.876 netw: 24.977 procs: 302 rtprocs: 12 #fds: 24800 #sysfdlimit: 6815744 #disks: 9 #nics: 3 nicErrors: 0

TOP CONSUMERS:
topcpu: 'apx_vktm_+apx1(7240) 3.40' topprivmem: 'java(19943) 138464' topshm: 'ora_mman_sport(14493) 223920' topfd: 'ocssd.bin(2778) 341' topthread: 'console-kit-dae(1973) 64'

----------------------------------------
Node: flex1 Clock: '14-02-07 00.02.13' SerialNo:2082
----------------------------------------

SYSTEM:
#pcpus: 1 #vcpus: 1 cpuht: N chipname: Intel(R) cpu: 46.13 cpuq: 17 physmemfree: 110400 physmemtotal: 4055420 mcache: 2027560 swapfree: 3751708 swaptotal: 4063228 hugepagetotal: 0 hugepagefree: 0 hugepagesize: 2048 ior: 7714 iow: 210 ios: 399 swpin: 0 swpout: 6 pgin: 7212 pgout: 190 netr: 19.810 netw: 17.537 procs: 303 rtprocs: 12 #fds: 24960 #sysfdlimit: 6815744 #disks: 9 #nics: 3 nicErrors: 0

TOP CONSUMERS:
topcpu: 'apx_vktm_+apx1(7240) 3.00' topprivmem: 'java(19943) 138464' topshm: 'ora_mman_sport(14493) 223920' topfd: 'ocssd.bin(2778) 341' topthread: 'console-kit-dae(1973) 64'
This regularly dumps output similar to the Linux “top” command. As with the diagcollection.pl script, oclumon accepts time and duration parameters as well - for example, the -last and -s/-e options sketched below.
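These are sketches rather than commands I ran on this cluster; double-check the exact syntax with oclumon dumpnodeview -h on your version.

# Last 15 minutes of samples for node flex1 (duration format is HH:MM:SS)
/u01/app/12.1.0.1/grid/bin/oclumon dumpnodeview -n flex1 -last "00:15:00"

# A fixed window across all nodes (timestamps are "YYYY-MM-DD HH24:MI:SS")
/u01/app/12.1.0.1/grid/bin/oclumon dumpnodeview -allnodes -s "2014-02-07 00:00:00" -e "2014-02-07 00:30:00"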
3. The data (in the MGMTDB instance) is stored in the CHM schema.
SQL> select table_name from dba_tables where owner = 'CHM';

TABLE_NAME
--------------------------------------------------------------------------------
CHMOS_SYSTEM_SAMPLE_INT_TBL
CHMOS_SYSTEM_CONFIG_INT_TBL
CHMOS_SYSTEM_PERIODIC_INT_TBL
CHMOS_SYSTEM_MGMTDB_CONFIG_TBL
CHMOS_CPU_INT_TBL
CHMOS_PROCESS_INT_TBL
CHMOS_DEVICE_INT_TBL
CHMOS_NIC_INT_TBL
CHMOS_FILESYSTEM_INT_TBL
CHMOS_ASM_CONFIG

10 rows selected.
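If you want to poke at the repository yourself, you can connect to the management database directly on the node where it runs. A minimal sketch, assuming the 12.1.0.1 instance name of -MGMTDB and this host’s grid home (adjust both for your environment, and run it as the Grid Infrastructure owner):

# Set the environment for the management database and count the OS samples collected so far
export ORACLE_HOME=/u01/app/12.1.0.1/grid
export ORACLE_SID=-MGMTDB
$ORACLE_HOME/bin/sqlplus -s / as sysdba <<'EOF'
-- table name taken from the listing above
select count(*) from CHM.CHMOS_SYSTEM_SAMPLE_INT_TBL;
EOF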
4. As mentioned earlier, you can also manage the repository retention period from oclumon.
4.1 To find out the current settings, we can use the -get option.
4.1.1 Find the repository size. Despite the name, the value reported appears to be the retention in seconds rather than a size in bytes (compare it with the retention reported after the resize in 4.2.2).
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get repsize
CHM Repository Size = 136320
4.1.2 Find the repository data file location
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get reppath
CHM Repository Path = +DATA/_MGMTDB/DATAFILE/sysmgmtdata.260.835192031
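To double-check that the datafile really does live in the same disk group as the OCR and Voting files, asmcmd can list it directly. A sketch only - the ASM instance name (+ASM1) is an assumption for this host:

# Run as the grid owner with the ASM environment set (SID assumed to be +ASM1 here)
export ORACLE_HOME=/u01/app/12.1.0.1/grid
export ORACLE_SID=+ASM1
$ORACLE_HOME/bin/asmcmd ls -l +DATA/_MGMTDB/DATAFILE/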
4.1.3 Find the master and logger nodes
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get master
Master = flex1
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get alllogger
Loggers = flex1,
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get mylogger
Logger = flex1
4.2 To set parameters, follow some of the examples below.
4.2.1 The changeretentiontime setting is essentially a check of whether the underlying tablespace can accommodate the collected data for the requested period. The value appears to be in seconds.
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -repos changeretentiontime 1000
The Cluster Health Monitor repository can support the desired retention for 2 hosts
4.2.2 Change the repository’s tablespace size (in MB). This also changes the retention period.
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -repos changerepossize 6000
The Cluster Health Monitor repository was successfully resized.The new retention is 399240 seconds.
The alert.log for -MGMTDB shows a simple ALTER TABLESPACE command.
Fri Feb 07 00:28:11 2014
ALTER TABLESPACE SYSMGMTDATA RESIZE 6000 M
Completed: ALTER TABLESPACE SYSMGMTDATA RESIZE 6000 M
The repsize reported by oclumon now matches the new retention as well.
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon manage -get repsize
CHM Repository Size = 399240
5. And last, but not least, the version check!
[root@flex1 tmp]# /u01/app/12.1.0.1/grid/bin/oclumon version
Cluster Health Monitor (OS), Version 12.1.0.1.0 - Production
Copyright 2007, 2013 Oracle. All rights reserved.
Well, I hope this has been an insightful post on the CHM feature in the 12c release of Grid Infrastructure. If anything, diagcollection.pl will be a nice replacement for the RDA output that Support might request. I haven’t had to troubleshoot any clusterware issues on 12c yet, but I plan to break this environment and use the oclumon utility to debug the processes at a later date.
Cheers!