Node monitoring scenarios
The sections below describe some common scenarios that you may observe when monitoring a node using node metrics data.
One important reason to monitor a node is to decide when to fail over.
Risk of OutOfMemoryError
The HeapMemoryUsage attribute of the java.lang:type=Memory MBean contains a MemoryUsage object that represents a snapshot of heap memory usage. The value of the used variable in this object indicates the amount of memory currently used, and the value of the max variable indicates the maximum amount of memory that can be used for memory management.
If the ratio of used to max is repeatedly over 0.85, this could indicate a risk of an OutOfMemoryError.
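The check described above can be sketched in Java as follows. This is a minimal illustration, not part of Corda: the class and method names are hypothetical, and it samples the local JVM once, whereas a real monitor would read the same attribute from the node over a JMX connection and evaluate it repeatedly.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; the names are not part of Corda.
class HeapUsageCheck {
    // Threshold taken from the text above.
    static final double THRESHOLD = 0.85;

    // Returns used / max, or -1.0 when no maximum is defined
    // (MemoryUsage.getMax() returns -1 in that case).
    static double usageRatio(long used, long max) {
        if (max <= 0) {
            return -1.0;
        }
        return (double) used / (double) max;
    }

    // True when a single sample is above the threshold.
    static boolean isAtRisk(long used, long max) {
        return usageRatio(used, max) > THRESHOLD;
    }

    public static void main(String[] args) {
        // Snapshot of the heap of the JVM running this code.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        double ratio = usageRatio(heap.getUsed(), heap.getMax());
        System.out.printf("heap used/max = %.2f, above threshold: %b%n",
                ratio, isAtRisk(heap.getUsed(), heap.getMax()));
    }
}
```

Because the text refers to the ratio being over the threshold repeatedly, a caller would normally raise an alert only after several consecutive samples return true.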
High CPU usage
The SystemCpuLoad attribute of the java.lang:type=OperatingSystem MBean is a value of type double that indicates the “recent CPU usage” for the entire system. The maximum value of this attribute is 1, which corresponds to 100% CPU usage.
If the SystemCpuLoad value is repeatedly close to 1 (for example, over 0.9), this means that the overall system CPU usage is consistently high.
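One way to express "repeatedly close to 1" is to require every sample in a short window to exceed the threshold. The sketch below assumes this interpretation; the class name, the window of five samples, and the one-second interval are illustrative choices, and the attribute is read from the local platform MBean server rather than from a remote node.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper for illustration; the names are not part of Corda.
class CpuLoadCheck {
    // True only if there is at least one sample and every sample exceeds the threshold.
    // NaN and negative values (load not available) never count as high.
    static boolean isConsistentlyHigh(double[] samples, double threshold) {
        if (samples.length == 0) {
            return false;
        }
        for (double sample : samples) {
            if (!(sample > threshold)) {
                return false;
            }
        }
        return true;
    }

    // Reads SystemCpuLoad from the local JVM; returns NaN if it cannot be read.
    static double readSystemCpuLoad() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName os = new ObjectName("java.lang:type=OperatingSystem");
            Object value = server.getAttribute(os, "SystemCpuLoad");
            return (value instanceof Number) ? ((Number) value).doubleValue() : Double.NaN;
        } catch (JMException e) {
            return Double.NaN;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        double[] samples = new double[5]; // window size is an assumption
        for (int i = 0; i < samples.length; i++) {
            samples[i] = readSystemCpuLoad();
            Thread.sleep(1000); // sampling interval is an assumption
        }
        System.out.println("CPU consistently high: " + isConsistentlyHigh(samples, 0.9));
    }
}
```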
High flow error rate
The net.corda:name=StartedPerMinute,type=Flows and net.corda:name=ErrorPerMinute,type=Flows metrics are collected using meters. A meter measures the rate of events over time - for example, “flows per second”. In addition to the average rate, meters also track 1-minute, 5-minute, and 15-minute moving averages. The value of the oneMinuteRate property of each of these metrics indicates, respectively, the rate of flows started and the rate of flows that failed with an error during the past minute.
A high flow error rate is indicated when the “flows failed with an error” rate for the past minute reaches a significant percentage of the “flows started” rate - for example, over 10%.
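The comparison of the two one-minute rates can be sketched as a small function. The names below are hypothetical, the two arguments stand for the oneMinuteRate values of the metrics named above, and treating "no flows started" as "no alert" is an assumption made here, not something the metrics themselves prescribe.

```java
// Hypothetical helper for illustration; the names are not part of Corda.
class FlowErrorRateCheck {
    // startedRate and errorRate are the one-minute rates of the two metrics.
    // maxErrorFraction is the alert threshold, for example 0.10 for 10%.
    static boolean isErrorRateHigh(double startedRate, double errorRate, double maxErrorFraction) {
        if (startedRate <= 0.0) {
            // Assumption: with no flows started there is no meaningful ratio, so do not alert.
            return false;
        }
        return errorRate / startedRate > maxErrorFraction;
    }

    public static void main(String[] args) {
        // Example values only; real values would come from the node's metrics.
        System.out.println(isErrorRateHigh(100.0, 11.0, 0.10)); // 11% of started flows failed
        System.out.println(isErrorRateHigh(100.0, 5.0, 0.10));  // 5% of started flows failed
    }
}
```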
Network parameter update proposed and not accepted
If the value of the boolean metric net.corda:name=UpdateProposed,type=NetworkParameter is true, this can indicate that a network parameter update has been proposed but not yet accepted.
Processing messages takes too long
The net.corda:type=P2P,name=ReceiveDuration metric is a histogram that measures the latency between the node receiving a P2P message and delivering it to the state machine. The properties of this metric can be combined to detect a delay in message processing.
For example, if you assume that enough messages have been received during the past minute to make a decision (at least three per second), you can flag an error if 25% of the messages took significantly longer (at least 50% longer) than the average message processing duration. The example below shows how this scenario would look using the properties of the metric:
oneMinuteRate > 3.0, 75thPercentile() > mean * 1.5
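The two conditions above can be combined into a single predicate. The sketch below is illustrative: the class, method, and constant names are hypothetical, and the three arguments are assumed to have already been read from the metric in the same time unit.

```java
// Hypothetical helper for illustration; the names are not part of Corda.
class DurationCheck {
    static final double MIN_RATE_PER_SECOND = 3.0; // minimum event rate needed to make a decision
    static final double SLOWDOWN_FACTOR = 1.5;     // "at least 50% longer" than the mean

    // True when enough events were seen and the 75th percentile is well above the mean,
    // meaning that a quarter of the events took significantly longer than average.
    static boolean isTooSlow(double oneMinuteRate, double percentile75, double mean) {
        return oneMinuteRate > MIN_RATE_PER_SECOND && percentile75 > mean * SLOWDOWN_FACTOR;
    }

    public static void main(String[] args) {
        // Example values only; real values would come from the node's metrics.
        System.out.println(isTooSlow(5.0, 160.0, 100.0)); // busy and slow
        System.out.println(isTooSlow(2.0, 160.0, 100.0)); // too few events to decide
    }
}
```

The same predicate applies to any duration metric that exposes these three properties, so it can be reused for the transaction scenarios described in the following sections.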
Committing transactions takes too long
The net.corda:name=Actions.CommitTransaction,type=Flows metric is a histogram that indicates the time taken to execute the CommitTransaction action. You can combine the properties of this metric to detect if the execution of this action takes an unexpectedly long time.
For example, if you assume that enough actions have been executed during the past minute to make a decision (at least three per second), you can flag an error if 25% of the actions took significantly longer (at least 50% longer) to execute than the average duration of the CommitTransaction action. The example below shows how this scenario would look using the properties of the metric:
oneMinuteRate > 3.0, 75thPercentile() > mean * 1.5
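To evaluate a condition like this, a monitoring tool first has to read the metric's properties over JMX. The sketch below shows one way to do that with the standard javax.management API. The class and method names are hypothetical, and the attribute names passed in main (OneMinuteRate, 75thPercentile, Mean) are assumptions about how the properties appear over JMX, so check the names your node actually exposes with a JMX browser. For simplicity it queries the local platform MBean server; against a real node you would pass a connection to the node's JMX endpoint instead.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.OptionalDouble;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper for illustration; the names are not part of Corda.
class MetricReader {
    // Reads one numeric attribute; returns empty if the MBean or attribute
    // is missing, is not numeric, or the connection fails.
    static OptionalDouble readNumber(MBeanServerConnection conn, String objectName, String attribute) {
        try {
            Object value = conn.getAttribute(new ObjectName(objectName), attribute);
            return (value instanceof Number)
                    ? OptionalDouble.of(((Number) value).doubleValue())
                    : OptionalDouble.empty();
        } catch (JMException | IOException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        // Local MBean server used as a stand-in for a connection to the node.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        String metric = "net.corda:name=Actions.CommitTransaction,type=Flows";
        // Attribute names below are assumptions; verify them against your node.
        for (String attribute : new String[] {"OneMinuteRate", "75thPercentile", "Mean"}) {
            OptionalDouble value = readNumber(conn, metric, attribute);
            System.out.println(attribute + " = "
                    + (value.isPresent() ? String.valueOf(value.getAsDouble()) : "not available"));
        }
    }
}
```

Returning an empty value instead of throwing lets the caller skip the check when the metric is not registered, rather than treating a missing metric as an alert.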
Signing transactions takes too long
The net.corda:name=SignDuration,type=Transaction metric is a histogram that indicates the duration of signing a transaction.
You can combine the properties of this metric to detect if signing a transaction takes an unexpectedly long time.
For example, if you assume that enough transactions have been signed during the past minute to make a decision (at least three per second), you can flag an error if 25% of the transactions took significantly longer (at least 50% longer) to sign than the average time it takes to sign a transaction. The example below shows how this scenario would look using the properties of the metric:
oneMinuteRate > 3.0, 75thPercentile() > mean * 1.5
Signing events
The total number of signing events on the node can be found by looking at the totalCounts metric.