Business logic monitoring
Performance monitoring
Please read the section on performance monitoring and troubleshooting for a more efficient way to make sure your bots are working smoothly.
Method #1 - Monitoring errors in Analytics database
Query the analytics database to find sessions in which errors were reported:
SELECT * FROM session_log WHERE errors > 0;
Query the database to find messages that included errors:
SELECT * FROM message_event WHERE "error" IS NOT NULL;
The message_event.error column contains JSON objects with error messages and context details. Depending on the architecture of your bots, errors may be handled by the error handling context. In that case you can assume they were expected and that the bot logic dealt with them, e.g. a call to an external REST API failed and the bot responded that it cannot help at the moment.
☠️ Nevertheless, some errors may not be handled. In that case it is certain that the conversation was broken: the user was either presented with a generic error or the call was disconnected.
It is advised to monitor both cases separately with the following SQL queries (unhandled errors first, then handled errors):
SELECT *
FROM message_event
WHERE (error ->> 'handled')::boolean = FALSE;
SELECT *
FROM message_event
WHERE (error ->> 'handled')::boolean = TRUE;
Filtering and visualization
In both cases, you can filter or visualize the data over time by using either the session_log.start_time column or the message_event.time column.
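As a minimal sketch of time-based grouping, the function below buckets error rows by hour and splits them by the handled flag. It assumes the rows were already fetched as (time, error) pairs, with time as a datetime and error as the JSON text from message_event.error; the function name and the row shape are illustrative, not part of the platform.

```python
import json
from collections import defaultdict
from datetime import datetime


def hourly_error_counts(rows):
    """Count handled and unhandled errors per hour.

    rows: iterable of (time, error) pairs, where time is a datetime and
    error is the JSON text from message_event.error (or None).
    The row shape is an assumption made for this sketch.
    """
    counts = defaultdict(lambda: {"handled": 0, "unhandled": 0})
    for time, error in rows:
        if error is None:
            continue  # message without an error
        handled = json.loads(error).get("handled")
        if not isinstance(handled, bool):
            continue  # no boolean flag; ignored in this sketch
        # Truncate the timestamp to the start of its hour.
        hour = time.replace(minute=0, second=0, microsecond=0)
        counts[hour]["handled" if handled else "unhandled"] += 1
    return dict(sorted(counts.items()))
```

The resulting dictionary can be passed to any charting tool; a shorter or longer bucket only requires changing the truncation step.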
Alerting
You should configure alerts that trigger when there is a spike in any kind of error. Examine your usual error volume to set a proper threshold for anomalies. In general, the number of unhandled errors in a production environment should be zero.
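The alerting rule described above could be sketched as follows: any unhandled error raises an alert, and handled errors raise one only when they exceed a multiple of their usual volume. The function name, the history input and the spike_factor default are assumptions for illustration; tune them to the volume you observe.

```python
from statistics import mean


def alert_reasons(unhandled, handled, handled_history, spike_factor=3.0):
    """Return a list of alert reasons for the current time window.

    unhandled, handled: error counts in the current window.
    handled_history: handled-error counts from earlier, comparable windows.
    spike_factor: hypothetical multiplier over the historical mean.
    """
    reasons = []
    # Unhandled errors should be zero in production, so any count alerts.
    if unhandled > 0:
        reasons.append(f"{unhandled} unhandled error(s)")
    if handled_history:
        baseline = mean(handled_history)
        # Keep the threshold at 1 or above so a quiet history
        # does not alert on a single handled error.
        threshold = max(baseline * spike_factor, 1.0)
        if handled > threshold:
            reasons.append(
                f"{handled} handled errors exceed threshold {threshold:.1f}"
            )
    return reasons
```

An empty list means the window looks normal; a scheduler or monitoring tool would call this once per window and forward any reasons to the alerting channel.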
Troubleshooting
If you see errors in the database, the easiest way to troubleshoot them is to log in to the admin UI and navigate to the Analytics >> Errors tab. From there, you can click on a specific error to jump to the block in the flow where the error occurred.
Method #2 - Monitoring logs for error messages
The system logs messages with severity ERROR or FATAL when it encounters an actionable problem. In most cases such errors do affect end users, so it is fair to assume that the system is healthy if no application logs messages with this severity.
Please note that creating alerts based on application logs requires a centralized repository of logs from all applications and may trigger false positive alerts.
In most cases, errors are stored and can be detected using method #1, so monitoring based on log analysis is advised for the following applications/pods:
- gateway
- webchat
- thread-coordinator
- channels-connector
- cron-orchestrator
- redis
- RabbitMQ
- PostgreSQL*
- charon*
- pytia*
- crocotta*
- gall*
- qpido*
- philotes*
- platform-api*
* - Optional components, not present on all deployments.
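A minimal sketch of log-based detection for the applications listed above. It assumes each log line is a JSON object with "app" and "level" keys; that format is hypothetical, so adapt the parsing to whatever your centralized log repository actually returns. Optional components are left out of the default set here.

```python
import json
from collections import Counter

# Mandatory applications from the list above; extend with optional ones
# (for example "charon") if they exist in your deployment.
WATCHED_APPS = frozenset({
    "gateway", "webchat", "thread-coordinator",
    "channels-connector", "cron-orchestrator",
})
ALERT_LEVELS = frozenset({"ERROR", "FATAL"})


def count_actionable(lines, watched=WATCHED_APPS):
    """Count ERROR/FATAL log records per watched application.

    Assumes one JSON object per line with "app" and "level" keys
    (a hypothetical format chosen for this sketch).
    """
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        app = record.get("app")
        level = str(record.get("level", "")).upper()
        if isinstance(app, str) and app in watched and level in ALERT_LEVELS:
            counts[app] += 1
    return counts
```

A non-empty result for any application is a candidate for an alert, subject to the false-positive caveat mentioned earlier.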
Troubleshooting
In most cases error messages should be self-explanatory. For diagnosis of known error messages, please refer to the logs troubleshooting page.
Method #3 - Generic Kubernetes monitoring
In case of hardware, network or resource issues, pods may restart or be removed from the Kubernetes cluster. This may affect end users, especially if there is no pod replication.
⚠️ Please make sure that you store and monitor cluster events as described on the following page. If you are not using the Grafana+Loki stack provided by SentiOne, you need to configure event storage yourself, because Kubernetes does not persist cluster events long term by default.
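If no event storage is in place yet, a quick manual check is still possible. The sketch below filters Warning events out of the JSON produced by `kubectl get events -A -o json`; it relies only on standard Event fields (type, reason, message, involvedObject). The function name is illustrative, and this is a stopgap rather than a replacement for stored events.

```python
import json


def warning_events(events_json):
    """Extract Warning events from `kubectl get events -A -o json` output."""
    items = json.loads(events_json).get("items", [])
    found = []
    for event in items:
        if event.get("type") != "Warning":
            continue  # Normal events are not interesting here
        obj = event.get("involvedObject", {})
        found.append({
            "namespace": obj.get("namespace"),
            "kind": obj.get("kind"),
            "name": obj.get("name"),
            "reason": event.get("reason"),
            "message": event.get("message"),
        })
    return found
```

Typical usage is to save the kubectl output to a file, read it as text, and pass it to warning_events to list pods that were killed, evicted or failed to start.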
Method #4 - Monitoring the HTTP responses from the gateway service (legacy)
The main business logic can be monitored through a single endpoint exposed by the gateway service. The endpoint requires the following input parameters: the message text and the ID of the project.
When the system is stable, the following conditions are met:
- There are no error logs in gateway application (namespace = "external", "app" = "gateway", level = ERROR)
- Application responds with 200 HTTP status code when valid requests are sent
- Application responds within set timeouts
The conditions above should be monitored with external services.
You can check stability by periodically sending the same text, such as "Good Morning", to the gateway service. Note that a valid project ID must be sent along with the message, and that the response may change depending on the configuration of the selected project.
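A periodic probe along these lines might look like the sketch below. The endpoint URL, the use of POST with a JSON body, and the payload field names ("text", "projectId") are all assumptions, since the exact request format is not given here; check your gateway's API before using it. Only the health rule itself (HTTP 200 within the timeout) comes from the conditions listed above.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(status_code, elapsed_seconds, timeout_seconds):
    """Healthy means HTTP 200 received within the allowed time."""
    return status_code == 200 and elapsed_seconds <= timeout_seconds


def probe_gateway(url, project_id, text="Good Morning", timeout_seconds=5.0):
    """Send one test message and report whether the gateway looks healthy.

    url and the payload field names are hypothetical placeholders.
    """
    payload = json.dumps({"text": text, "projectId": project_id}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    started = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # non-2xx response from the server
    except OSError:
        status = None  # connection failure or timeout
    elapsed = time.monotonic() - started
    return is_healthy(status, elapsed, timeout_seconds)
```

Because the response body depends on the project configuration, this sketch checks only the status code and latency, not the returned text.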