As a distributed system, Oracle NoSQL Database is composed of several software components, each of which exposes unique metrics that can be monitored, interpreted, and used to understand the general health, performance, and operational capability of the NoSQL Database cluster.
This section focuses on best practices for monitoring the Oracle NoSQL software components. While there are several software dependencies for the Oracle NoSQL Database itself (for example, Java virtual machine, operating system, NTP), this section focuses solely on the NoSQL components.
There are three basic mechanisms for monitoring the health of the NoSQL Database:
System Log File Monitoring – Oracle NoSQL Database uses the java.util.logging package to write all trace, informational, and error messages to the log files for each component of the store. These files can be parsed using the log-file probing mechanisms supported by leading system management solutions.
System Monitoring Agents – Oracle NoSQL Database publishes MIBs for integration with SNMP based monitoring solutions as well as JMX Management Beans for integration with JMX based monitoring solutions.
Application Monitoring – A good proxy for the "health" of the NoSQL Database rests with application-level metrics. Metrics such as average and 90th-percentile response time, average and 90th-percentile throughput, and the average number of timeout exceptions encountered from NoSQL API calls are all potential indicators that something may be wrong with a component in the NoSQL cluster. In fact, sampling these metrics and looking for deviations from their mean values can be the best way to detect that something may be wrong with your environment.
The following sections discuss details of each of these monitoring techniques and illustrate how each of them can be utilized to detect failures in NoSQL Database components.
The Oracle NoSQL Database is composed of the following components, and each component produces log files that can be monitored:
Replication Nodes – Service read and write requests from API calls. Replication nodes for a particular shard are laid out on different storage nodes (physical servers) by the topology manager, so the log files for the nodes in each shard are spread across multiple machines.
Storage Node Agents – Manage the replication nodes that are running on each storage node. The SNA maintains its own log regarding the state of each replication node it is managing. You can think of the SNA log as a high level log of the replication node activity on a particular storage node.
Admin Nodes – Administrative nodes handle the execution of commands from the administrative command line interface. Long running plans are also staged from the administrative nodes. Administrative nodes also maintain a consolidated log of all the other logs in the Oracle NoSQL cluster.
All of the above-mentioned log files can be found in the KVROOT/kvstore/log directory on the machine where the component is running. The following command can be used to find the machines that are running the components of the cluster:
java -jar kvstore.jar ping -host <any machine in the cluster> -port <the port number used to initialize the KVStore>
Each storage node (snXX) is listed in the output of the ping command, along with a list of replication nodes (rgXX-rnXX) running on the host listed in the ping output. XX denotes the unique number assigned to that component by NoSQL Database. For replication nodes, rg denotes the shard number and stands for replication group, while rn denotes the replication node number within that shard.
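As a sketch of how the ping output can be consumed by a monitoring script, the fragment below flags any component whose reported status is not RUNNING. The output excerpt shown is a simplified, hypothetical approximation, not the exact format; verify the patterns against the ping output of your own store before relying on them.

```shell
# Hypothetical, simplified excerpt of "ping" output, for illustration only;
# real output comes from running the ping command shown above.
ping_output='Storage Node [sn1] on host1:5000    Status: RUNNING
  Rep Node [rg1-rn1]    Status: RUNNING
Storage Node [sn2] on host2:5000    Status: RUNNING
  Rep Node [rg1-rn2]    Status: UNREACHABLE'

# Keep only the status lines, then flag any component that is not RUNNING.
problems=$(printf '%s\n' "$ping_output" | grep 'Status:' | grep -v 'RUNNING')
printf '%s\n' "$problems"
```

A monitoring agent could run such a check periodically and raise an alert whenever the filtered result is non-empty.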
Admin Nodes – Identifying the nodes in the cluster that are running administrative services is a bit more challenging. To identify these nodes, run ps axww on every host in the cluster and grep the output for kvstore.jar and -class Admin.
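A minimal sketch of that filter is shown below. The sample ps line is fabricated for illustration; in practice the same test would be applied to ps axww output gathered from each host (for example, over ssh).

```shell
# is_admin_process: succeeds when a ps line shows a kvstore.jar process
# started with "-class Admin" (i.e., a host running an admin service).
is_admin_process() {
  printf '%s\n' "$1" | grep 'kvstore.jar' | grep -q -- '-class Admin'
}

# Fabricated ps axww line, for illustration only:
sample='12345 ?  Sl  0:42 java -jar kvstore.jar start -root /var/kvroot -class Admin'
if is_admin_process "$sample"; then
  echo "admin service found"
fi
```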
Oracle NoSQL Database maintains a single consolidated log of every node in the cluster, which can be found on any of the nodes running an administrative service. While this is a convenient single place to monitor for errors, it is not guaranteed to be complete. The consolidated view is aggregated by receiving log messages over the network, and transient network failures, packet loss, and high network utilization can cause it to be out of date or to have missing entries. Therefore, we recommend monitoring each host in the cluster, as well as each type of log file on each host.
Generally speaking, any log message with a level of SEVERE should be considered a potentially critical event and worthy of generating a systems-management notification. Later sections of this document illustrate how to correlate specific SEVERE exceptions with hardware component failures.
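As a sketch of such a check, the fragment below scans every log file under KVROOT/kvstore/log for SEVERE entries. The store root, log file name, and log contents here are fabricated so the example is self-contained; in practice KVROOT points at the real store root on each host.

```shell
# Fabricated store root and log entries so the example is self-contained;
# the file name pattern is illustrative only.
KVROOT=$(mktemp -d)
mkdir -p "$KVROOT/kvstore/log"
cat > "$KVROOT/kvstore/log/rg1-rn1_0.log" <<'EOF'
2024-01-01 10:00:00 INFO   Replication node started
2024-01-01 10:05:00 SEVERE Environment write failure
EOF

# Report every SEVERE line with its file name; each hit is a candidate
# for a systems-management notification.
grep -H 'SEVERE' "$KVROOT"/kvstore/log/*.log
```

Running the same scan on every host covers each type of log file, rather than relying on the consolidated log alone.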