Business logic monitoring

⚠️
Performance monitoring
Please read the section on performance monitoring and troubleshooting for a more efficient way to make sure your bots are working smoothly.

Method #1 - Monitoring errors in Analytics database

Query the analytics database to find sessions in which errors were reported:

SELECT * FROM session_log WHERE errors > 0;

Query the database to find messages that included errors:

SELECT * FROM message_event WHERE "error" IS NOT NULL;

The message_event.error column contains JSON objects with error messages and context details. Depending on the architecture of your bots, errors can be handled by the error handling context. In that case you can assume they are expected and the bot logic handled them: for example, a call to an external REST API failed and the bot gave a valid response saying that it cannot help at the moment.

☠️ Nevertheless, some errors may not be handled. In that case it is 100% certain that the conversation was broken: the user was either presented with a generic error or the call was disconnected.

It is advised to monitor both cases separately with the following SQL queries:

SELECT *
FROM message_event
WHERE (error ->> 'handled')::boolean = FALSE;

SELECT *
FROM message_event
WHERE (error ->> 'handled')::boolean = TRUE;
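If the rows are already loaded into a script, the same split can be done client-side. The sketch below is only an illustration: it assumes each value of message_event.error arrives as a JSON string with a "handled" key (as the queries above imply), and the "message" key used in the usage note is hypothetical. Unlike the SQL queries, which match neither case when the key is missing, this sketch conservatively counts such errors as unhandled.

```python
import json
from typing import Dict, Iterable, List, Optional, Tuple


def split_by_handled(
    error_payloads: Iterable[Optional[str]],
) -> Tuple[List[Dict], List[Dict]]:
    """Split raw message_event.error values into (handled, unhandled) lists."""
    handled: List[Dict] = []
    unhandled: List[Dict] = []
    for raw in error_payloads:
        if raw is None:
            # Rows without an error are skipped, like "error" IS NOT NULL.
            continue
        error = json.loads(raw)
        # Accept a JSON boolean true or the string "true".
        # Assumption: a missing "handled" key counts as unhandled.
        is_handled = str(error.get("handled")).lower() == "true"
        (handled if is_handled else unhandled).append(error)
    return handled, unhandled
```

For example, passing `['{"handled": true, "message": "API down"}', None, '{"handled": false}']` yields one handled and one unhandled error.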

Filtering and visualization

In both cases, you can filter the data by time, or visualize it over time, using either the session_log.start_time column or the message_event.time column.
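As a rough illustration of preparing such data for a chart, the sketch below groups error timestamps into hourly buckets. It assumes the values of message_event.time (or session_log.start_time) have already been read into Python datetime objects; the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def errors_per_hour(timestamps: Iterable[datetime]) -> Dict[datetime, int]:
    """Count error events per hour, keyed by the start of each hour."""
    counts = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    # Sort by hour so the result can be plotted directly as a time series.
    return dict(sorted(counts.items()))
```

Most dashboard tools can do this grouping themselves; the sketch only shows the shape of the aggregation.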

📘
Alerting
You should configure alerts that trigger when there is a spike in any kind of error. Examine your usual error volume to set a proper threshold for anomalies. In general, the number of unhandled errors in a production environment should be zero.
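One possible way to express such an alert rule is sketched below. It is an assumption-laden example, not a built-in feature: the function name, the spike factor of 3 and the use of a simple mean as the baseline are all hypothetical choices that you should tune to your own volume.

```python
from statistics import mean
from typing import Sequence


def should_alert(
    unhandled: int,
    handled: int,
    handled_history: Sequence[int],
    spike_factor: float = 3.0,  # hypothetical default, tune per deployment
) -> bool:
    """Decide whether the current time window should raise an alert."""
    # On production, unhandled errors are expected to be zero.
    if unhandled > 0:
        return True
    if not handled_history:
        # No baseline yet, so a spike in handled errors cannot be judged.
        return False
    # Use a floor of 1 so a near-zero baseline does not alert on noise.
    baseline = max(mean(handled_history), 1.0)
    return handled > baseline * spike_factor
```

With a history of `[4, 5, 6]` handled errors per window, 5 handled errors would not alert, while 20 would.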

Troubleshooting

If you see errors in the database, the easiest way to troubleshoot them is to log in to the admin UI and navigate to the Analytics >> Errors tab. From there, you can click on a specific error to navigate to the block in the flow where the error occurred.

Method #2 - Monitoring logs for error messages

The system logs messages with severity ERROR or FATAL when it encounters an actionable problem. In most cases, such errors do affect end users, so it is fair to assume that the system is healthy if no application logs messages with either severity.

Please note that creating alerts based on application logs requires a centralized repository of logs from all applications and may trigger false positive alerts.
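The severity rule above can be sketched as a small filter over log records pulled from the central repository. The record shape is an assumption: each record is taken to be a mapping with "app" and "level" keys, which may not match the labels used by your log stack.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# Severities that indicate an actionable problem.
ALERT_LEVELS = {"ERROR", "FATAL"}


def count_alerting_logs(records: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count ERROR and FATAL log records per application."""
    counts: Counter = Counter()
    for record in records:
        if record.get("level", "").upper() in ALERT_LEVELS:
            # Records without an "app" key are grouped under "unknown".
            counts[record.get("app", "unknown")] += 1
    return dict(counts)
```

An empty result means no application logged a message with either severity in the inspected window.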

In most cases, errors will be stored and can be detected using method #1, so monitoring based on log analysis is advised for the following applications/pods:

gateway
webchat
thread-coordinator
channels-connector
cron-orchestrator
redis
RabbitMQ
PostgreSQL*
charon*
pytia*
crocotta*
gall*
qpido*
philotes*
platform-api*

* - Optional components, not present on all deployments.

Troubleshooting

In most cases, error messages should be self-explanatory. To diagnose known error messages, please refer to the logs troubleshooting page.

Method #3 - Generic Kubernetes monitoring

In case of hardware, network or resource issues, pods may restart or be removed from the Kubernetes cluster. This may affect end users, especially if there is no pod replication.

⚠️ Please make sure that you store and monitor cluster events as described on the following page. If you are not using the Grafana+Loki stack provided by SentiOne, you need to configure event storage yourself, because Kubernetes does not retain cluster events long-term by default.

Method #4 - Monitoring the HTTP responses from the gateway service (legacy)

The main business logic can be monitored through a single endpoint exposed by the gateway service. The endpoint requires the following input parameters: the text and the ID of the project.

When the system is stable, the following conditions are met:

There are no error logs in the gateway application (namespace = "external", "app" = "gateway", level = ERROR)
Application responds with 200 HTTP status code when valid requests are sent
Application responds within set timeouts

The conditions above should be monitored with external services.
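The three conditions can be combined into a single check, as in the sketch below. It deliberately leaves out the HTTP call itself, since the endpoint address is deployment-specific. The ProbeResult type, its field names and the 5-second default timeout are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status_code: int         # HTTP status returned by the gateway
    elapsed_seconds: float   # time the gateway took to respond
    gateway_error_logs: int  # ERROR log lines from the gateway in the window


def is_stable(result: ProbeResult, timeout_seconds: float = 5.0) -> bool:
    """Return True only when all three stability conditions hold."""
    return (
        result.gateway_error_logs == 0
        and result.status_code == 200
        and result.elapsed_seconds <= timeout_seconds
    )
```

An external monitoring service would fill ProbeResult from its own probe and log query, then alert whenever is_stable returns False.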

📘
You can check stability by periodically sending the same text, such as "Good Morning", to the gateway service. Note that a valid project ID must be sent along with the message, and that the response may change depending on the configuration of the selected project.
