ElasticSearch monitoring

Description

A database that stores transcripts and information about agent conversations in the React application.

Critical service

No

How to ensure high availability

Launch at least 3 instances of the application in master and data mode and connect them into a cluster in accordance with the installation instructions (see the sketch below).
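
A hypothetical illustration of starting one such node with both roles enabled is shown below; the cluster name, node name and binary path are placeholders, the authoritative values and the discovery settings come from the installation instructions, and on ElasticSearch 7.9+ the two boolean settings are replaced by node.roles=master,data:

bin/elasticsearch \
  -E cluster.name=transcripts-cluster \
  -E node.name=es-node-1 \
  -E node.master=true \
  -E node.data=true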

Effects of failure of all instances

Current transcripts are not saved to the ElasticSearch database. Ongoing transcripts are queued (in RabbitMQ, see Disaster Recovery below).

Effects of a single instance failure

None. The default ElasticSearch cluster configuration allows it to keep working while a single node is unavailable, because replicas of the data are kept on the other nodes. After a restart, the node rejoins the cluster and synchronizes its data automatically. The cluster automatically tries to balance the number of replicas across the nodes, and once the synchronization/data transfer has completed correctly, disk space is freed up automatically.
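
The distribution of shards and replicas across the nodes can be inspected with the _cat APIs, for example (a sketch assuming the default port 9200 on localhost):

curl -s 'http://localhost:9200/_cat/allocation?v'
curl -s 'http://localhost:9200/_cat/shards?v'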

Monitoring

The ElasticSearch cluster exposes an HTTP endpoint for assessing its health:

GET _cluster/health
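
From the shell, the same endpoint can be queried with curl, for example (assuming the default port 9200 on localhost; ?pretty only formats the output):

curl -s 'http://localhost:9200/_cluster/health?pretty'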

Sample response

{
  "cluster_name" : "testcluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50.0
}

The cluster state can be assessed from the value of the "status" property. The possible values and their interpretation are listed below.

Status   Meaning
green    0 (OK)   - Works fine.
yellow   1 (WARN) - Something may be wrong; needs further attention.
red      2 (CRIT) - Serious issue within the cluster.
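
This mapping can be turned into a simple probe for a monitoring system. Below is a minimal sketch, assuming curl is available and the node listens on localhost:9200; it is an illustration, not part of the product:

#!/bin/sh
# Read the "status" field from the cluster health response.
STATUS=$(curl -s 'http://localhost:9200/_cluster/health' \
  | sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p')

# Map the status to the exit codes from the table above.
case "$STATUS" in
  green)  echo "OK - cluster status green";       exit 0 ;;
  yellow) echo "WARN - cluster status yellow";    exit 1 ;;
  red)    echo "CRIT - cluster status red";       exit 2 ;;
  *)      echo "CRIT - no response from cluster"; exit 2 ;;
esac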

The default port for ElasticSearch is 9200/TCP. It can also be used to check whether the database is still running.
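
A basic liveness check of that port could look like this (the hostname is a placeholder):

nc -z -w 5 es-node-1.example.com 9200 && echo "port 9200 open" || echo "port 9200 closed"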

Disaster Recovery

After the instance is restarted, the cluster synchronizes itself automatically; no additional administrator action is required.
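
The progress of this synchronization can be watched with the _cat/recovery API, for example (assuming the default port 9200 on localhost; active_only limits the output to recoveries that are still running):

curl -s 'http://localhost:9200/_cat/recovery?active_only=true&v'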

In the event of a longer failure, failed write attempts may be recorded in the RabbitMQ queue. In such a situation, the only solution is to manually move all messages from the failed queue to the upload queue. To do this, open the RabbitMQ management panel, find the queue statements.priority.failed (which should contain the messages that could not be saved), then select "Move messages" and specify the destination queue: statements.priority.upload

Alternatively, the above can be done from the CLI (after logging into the server where the RabbitMQ service is running):

rabbitmqctl set_parameter shovel my-dynamic-shovel \
  '{"src-queue": "statements.priority.failed", "src-delete-after": "queue-length", "src-uri": "amqp://", "dest-queue": "statements.priority.upload", "dest-uri": "amqp://"}'

The mentions should then be processed and saved to ES gradually; it is best to watch the queue status with the command below:

watch -n 0.1 'rabbitmqctl list_queues -q name messages | egrep "statements.priority.failed|transcription.metadata.update|statements.priority.upload"'

At the same time, you can watch the logs of the slim-uploader application to confirm that the mentions are being processed correctly, e.g.:

08-17 15:52:47,220 INFO UploaderStreamWorker-4 p.c.u.SlimUploaderStream:60 - Message processed
08-17 15:52:47,166 INFO UploaderStreamWorker-4 p.c.u.SlimUploaderStream:60 - Message processed
08-17 15:52:47,166 INFO uploader-akka.actor.default-dispatcher-12 p.c.d.e.i.ElasticWriterImpl:103 - Saved 1 statements in 14ms velocity: 71.429 st/s and 0 already existed in IS
08-17 15:52:47,166 INFO uploader-akka.actor.default-dispatcher-12 p.c.d.e.i.ElasticWriterImpl:228 - Finished saving 1 batch. Updates: 0, Creates: 0, Upserts: 1, Failed statements: 0
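
To follow these logs live, something along these lines can be used, assuming slim-uploader runs as a systemd service (adjust to how the application is actually deployed and where it writes its logs):

journalctl -u slim-uploader -f | grep -E 'SlimUploaderStream|ElasticWriterImpl'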

Once all mentions have been correctly saved in ES, the mentions queued in transcription.metadata.update should also start being processed one after another, and the bot-integration logs should no longer contain the errors seen before; instead, they should contain information about the metadata being saved correctly.

