Monitoring Notary Latency
To measure notary latency, combine three metrics:
P2P.ReceiveDuration
- The time between receiving a message and queueing a flow.
Flows.StartupQueueTime
- The time the flow spends in the queue.
FlowDuration.(Non)ValidatingFLow
- The time the flow spends running.
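As a rough illustration of combining the three metrics, the sketch below adds them up from a single snapshot. The snapshot format (a mapping from metric name to a mean value in milliseconds) and the function name are assumptions made for this example; how the values are scraped from the node is not shown. Adding mean values is reasonable, but adding percentiles in the same way only gives an approximation.

```python
from typing import Mapping

# Metric names come from the list above; the snapshot format is hypothetical.
LATENCY_COMPONENTS = (
    "P2P.ReceiveDuration",
    "Flows.StartupQueueTime",
    "FlowDuration.(Non)ValidatingFLow",
)


def notary_latency_ms(snapshot: Mapping[str, float]) -> float:
    """Sum the three latency components, all expressed in milliseconds."""
    missing = [name for name in LATENCY_COMPONENTS if name not in snapshot]
    if missing:
        # Refuse to report a partial total, which would understate latency.
        raise KeyError("missing metrics: " + ", ".join(missing))
    return sum(snapshot[name] for name in LATENCY_COMPONENTS)


if __name__ == "__main__":
    example = {
        "P2P.ReceiveDuration": 3.0,
        "Flows.StartupQueueTime": 7.0,
        "FlowDuration.(Non)ValidatingFLow": 40.0,
    }
    print(notary_latency_ms(example))
```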
When latency elements are counted
P2P.ReceiveDuration measures the latency for a consumer to deliver the message to the state machine. At the end of the delivery, a flow state machine is created. The Flows.StartupQueueTime timer starts after the flow state machine is created, and ends when FlowDuration starts. The FlowDuration timer starts immediately before the flow state machine calls the flow’s call() function.
Using FlowDuration alone to measure latency does not account for the time until a flow is queued or the time the flow spends in the queue.
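The timer boundaries described in this section can be modelled with four timestamps. The sketch below is only an illustration: the class, its field names, and the sample timestamps are hypothetical, and the values are milliseconds on an arbitrary clock. It shows that any single timer covers only part of the interval between receiving the message and the flow finishing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowTimeline:
    """Hypothetical timestamps (ms) for one notarisation request."""
    message_received: int        # consumer picks up the message
    state_machine_created: int   # delivery ends, flow state machine exists
    call_started: int            # just before the flow's call() is invoked
    call_finished: int           # call() returns

    def receive_duration(self) -> int:
        # Corresponds to P2P.ReceiveDuration.
        return self.state_machine_created - self.message_received

    def startup_queue_time(self) -> int:
        # Corresponds to Flows.StartupQueueTime.
        return self.call_started - self.state_machine_created

    def flow_duration(self) -> int:
        # Corresponds to FlowDuration.
        return self.call_finished - self.call_started

    def total_latency(self) -> int:
        return (self.receive_duration()
                + self.startup_queue_time()
                + self.flow_duration())


if __name__ == "__main__":
    timeline = FlowTimeline(0, 3, 10, 50)
    print(timeline.flow_duration(), timeline.total_latency())
```

With these sample values, the flow timer accounts for 40 ms of a 50 ms total, so 10 ms of receive and queue time would go unnoticed if only that timer were watched.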
Additional performance indicators that can be monitored
If notary workers are performing badly, the latency of the entire cluster can be affected. There are several metrics that you can monitor to alert you to problems within your notary cluster:
Flows.Actions.PersistCheckpoint
- This measures checkpoint latency and should remain stable. If it increases, there is likely a problem with the notary worker database.
Flows.Actions.CommitTransaction
- This metric can vary, but a sudden spike can indicate problems with the notary worker database.
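One possible way to check that a latency metric such as Flows.Actions.PersistCheckpoint "remains stable" is to compare recent samples against an earlier baseline. The sketch below assumes the samples have already been collected as a list of millisecond values; the function name, window sizes, and the 1.5x threshold are arbitrary choices for illustration, not recommended values.

```python
from statistics import mean
from typing import Sequence


def latency_has_drifted(
    samples_ms: Sequence[float],
    baseline_size: int = 5,
    recent_size: int = 3,
    max_ratio: float = 1.5,
) -> bool:
    """Return True if the recent mean exceeds the baseline mean by max_ratio.

    The first baseline_size samples form the baseline and the last
    recent_size samples form the recent window. All sizes and the ratio
    are illustrative assumptions.
    """
    if len(samples_ms) < baseline_size + recent_size:
        return False  # not enough data to compare two separate windows
    baseline = mean(samples_ms[:baseline_size])
    recent = mean(samples_ms[-recent_size:])
    return baseline > 0 and recent > baseline * max_ratio


if __name__ == "__main__":
    steady = [10, 10, 10, 10, 10, 11, 10, 10]
    rising = [10, 10, 10, 10, 10, 20, 22, 25]
    print(latency_has_drifted(steady), latency_has_drifted(rising))
```

Comparing window means rather than single samples keeps one slow checkpoint from raising an alert, while a sustained increase still shows up.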
For cluster-specific monitoring:
UniquenessProvider.Rollback
- This counter increases every time a transaction fails and is rolled back. An increase may indicate attempted double-spends, race conditions between workers, or database errors. This metric should remain stable.
UniquenessProvider.BatchSignLatency
- This metric shows how long it takes to generate a batch signature. This includes building a Merkle tree, so the time is expected to vary with the number of events to be signed. This metric should remain stable.
P2P.SendQueueSize
- This metric should be monitored to ensure the outbound bandwidth is sufficient. If the outbound bandwidth is not sufficient, the cluster may fail.
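Because UniquenessProvider.Rollback is a counter, what matters is how quickly it grows rather than its absolute value. The sketch below derives per-interval increments from periodic readings and flags any interval above a limit. The function names, the polling approach, the limit, and the handling of a counter that restarts from zero are all assumptions made for this example.

```python
from typing import List, Sequence


def counter_increments(readings: Sequence[int]) -> List[int]:
    """Convert cumulative counter readings into per-interval increments."""
    increments = []
    for earlier, later in zip(readings, readings[1:]):
        if later >= earlier:
            increments.append(later - earlier)
        else:
            # Assumption: a lower reading means the counter restarted from
            # zero, so everything counted so far happened in this interval.
            increments.append(later)
    return increments


def rollback_rate_exceeded(readings: Sequence[int], max_per_interval: int) -> bool:
    """Return True if any interval saw more rollbacks than max_per_interval."""
    return any(delta > max_per_interval for delta in counter_increments(readings))


if __name__ == "__main__":
    quiet = [100, 100, 101, 101]
    noisy = [100, 100, 130]
    print(counter_increments(quiet))
    print(rollback_rate_exceeded(quiet, 5), rollback_rate_exceeded(noisy, 5))
```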