Components monitoring

SentiOne Automate monitoring is based on checking the status of each system component. Each component is checked to verify that it is ready to provide its services. The check is repeated at a specified time interval and tests whether the component responds correctly.
For these checks, basic Kubernetes functions are used:

  • readiness probes - If a readiness probe test fails (the expected result is missing), Kubernetes treats the component as not ready and retries the test after a specified time. Only when the result is correct is the component considered ready to serve traffic.
  • liveness probes - Periodic operations performed on a component that determine whether it is working properly. If the test fails, Kubernetes is informed that the application needs to be restarted to ensure continuity of operation of the entire IT system.

In the SentiOne Automate system, two types of tests are used for components (in this case: pods):

  • HTTP request - A simple GET request is sent to a specific endpoint configured for the application. The response from that endpoint is then interpreted: codes greater than or equal to 200 and less than 400 are interpreted as success, and any other code means that the component is not working properly.
  • TCP probe - A check that verifies whether the application has opened the specified TCP port. If the port is open, the check succeeds; otherwise the component is considered down.
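
The success rule for the HTTP request check can be sketched as a small shell function. This is only an illustration of the interpretation described above, not the code that Kubernetes itself runs; the helper name `probe_outcome` is made up for this example.

```shell
#!/bin/sh
# Illustrative only: mirrors the HTTP probe rule described above.
# probe_outcome is a hypothetical helper, not part of Kubernetes or SentiOne Automate.
probe_outcome() {
  probe_code="$1"
  # Anything that is not a plain number (e.g. an empty result) counts as a failure.
  case "$probe_code" in
    ''|*[!0-9]*) echo "failure"; return 0 ;;
  esac
  # Success means: 200 <= code < 400.
  if [ "$probe_code" -ge 200 ] && [ "$probe_code" -lt 400 ]; then
    echo "success"
  else
    echo "failure"
  fi
}

# Local demonstration with sample status codes (no network access needed).
for sample in 200 302 404 503; do
  echo "HTTP $sample -> $(probe_outcome "$sample")"
done
```

With this rule, 200 and 302 count as success, while 404 and 503 count as failure.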

Pod monitoring

Each SentiOne Automate system component exposes an endpoint to which an HTTP GET request should be sent; the response indicates the state of the application. The sections below list all components with a short description of the exposed port, responses, etc.
Popular monitoring systems (e.g. Nagios, Sensu, Prometheus) allow you to prepare custom monitoring scripts. These can be very simple bash scripts that report component health through their exit code.

Standard exit codes have the following interpretation:

Exit code | Meaning
0 | OK - Works fine
1 | WARN - Warning status
2 | CRIT - Critical status

The following list of components also contains the result interpretation that should be used when writing your own monitoring scripts.
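
A minimal sketch of such a monitoring script is shown below. It assumes that `curl` is available on the monitoring host; the helper names `exit_code_for_status` and `check_component` and the 5-second request limit are assumptions made for this example, not part of the product.

```shell
#!/bin/sh
# Hypothetical helpers. The mapping follows the "Result meaning" column used
# for the components below: HTTP 200 -> 0 (OK), anything else -> 2 (CRIT).
exit_code_for_status() {
  if [ "$1" = "200" ]; then
    echo 0
  else
    echo 2
  fi
}

# Sends a GET request to the given healthCheck URL and returns the monitoring
# exit code. The 5-second limit is an assumption for this sketch.
check_component() {
  check_url="$1"
  # curl prints only the HTTP status code; if the request fails, treat it as 000.
  http_status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 -XGET "$check_url") || http_status="000"
  return "$(exit_code_for_status "$http_status")"
}

# Local demonstration of the mapping (no network access needed).
echo "HTTP 200 -> exit code $(exit_code_for_status 200)"
echo "HTTP 503 -> exit code $(exit_code_for_status 503)"
```

In an actual check, the script would end with a call such as `check_component http://admin:5750/healthCheck` followed by `exit $?`, so that the monitoring system receives the exit code.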

Readiness / Liveness configuration

All settings but the initial delay should be common for all Automate applications (excluding NLU part). Here are the default parameters:

Probe | Initial delay | Period | Timeout | Failure threshold
readiness | 5s | 10s | 5s | 4
liveness | 60s | 20s | 5s | 4

Some applications have longer starting times, so for these applications both initial delays (readiness and liveness) need to be increased by the application's starting time. See the table below.

Application | Starting time
new-web | 30s
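
As a worked example of this adjustment, the sketch below adds the new-web starting time to both default initial delays. The helper name `effective_initial_delay` and the variable names are made up for this illustration.

```shell
#!/bin/sh
# Default initial delays from the table above (in seconds).
READINESS_INITIAL_DELAY=5
LIVENESS_INITIAL_DELAY=60

# Hypothetical helper: default initial delay + application starting time.
effective_initial_delay() {
  echo $(( $1 + $2 ))
}

# new-web has a starting time of 30s.
NEW_WEB_STARTING_TIME=30
echo "new-web readiness initial delay: $(effective_initial_delay "$READINESS_INITIAL_DELAY" "$NEW_WEB_STARTING_TIME")s"
echo "new-web liveness initial delay: $(effective_initial_delay "$LIVENESS_INITIAL_DELAY" "$NEW_WEB_STARTING_TIME")s"
```

For new-web this gives an initial delay of 35s for readiness and 90s for liveness.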

Note:
Some applications (e.g. analyser) start loading models only after receiving the first HTTP request. These applications may therefore produce a few readiness warnings. This is nothing to worry about, provided that they finish loading after 4-6 such warnings.

Components

📘

A description of each application is available on the Components page.

admin

Endpoint | Default TCP port | Result meaning
/healthCheck | 5750/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://admin:5750/healthCheck

Description

Admin panel

Critical service?

YES

How to ensure high availability

Minimum requirement: running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

It is possible to ensure high availability by configuring master-failover and/or “sticky sessions”, running 2 applications independently on different ports/addresses and configuring HA at the HTTP(s) proxy level (e.g. HaProxy, F5).

Effects of failure of all instances

It is not possible to configure the course of the conversation, manage the training set, start training the NLU engine, manage variables in the knowledge base, or access the bot analytics module.

It is not possible to load variables from the knowledge base in the bot process. The dialogs module keeps a 60-minute cache of this data; after this time, errors resulting from the lack of data will start to be returned.

The default k8s cluster restores the application itself in a few minutes.

Effects of a single instance failure

None. The default k8s cluster restores the second pod automatically in a few minutes.


dialogs

Endpoint | Default TCP port | Result meaning
/healthCheck | 5748/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://dialogs:5748/healthCheck

Description

Dialogs

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot does not respond to any requests

Effects of a single instance failure

Potential bot performance issues, possible slower response times


gateway

Endpoint | Default TCP port | Result meaning
/healthCheck | 5000/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://gateway:5000/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot does not respond to any requests

Effects of a single instance failure

None, the component acts as an intermediary and one instance should handle all incoming traffic.


nlu-facade

Endpoint | Default TCP port | Result meaning
/healthCheck | 5750/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://nlu-facade:5750/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot will return an error (HTTP 500 - Internal Server Error) for each request to the gateway.

Effects of a single instance failure

None, the component acts as an intermediary and one instance should handle all incoming traffic.


nlu-pipeline

Endpoint | Default TCP port | Result meaning
/healthCheck | 8080/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://nlu-pipeline:8080/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot will return an error (HTTP 500 - Internal Server Error) for each request to the gateway.

Effects of a single instance failure

None, the component acts as an intermediary and one instance should handle all incoming traffic.


web-chat

Endpoint | Default TCP port | Result meaning
/healthCheck | 5760/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://web-chat:5760/healthCheck

Description

WebChat

Critical service?

YES, limited only to webchat channel

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot does not respond to any requests on webchat

Effects of a single instance failure

None, the component acts as an intermediary and one instance should handle all incoming traffic.


cron-orchestrator

Endpoint | Default TCP port | Result meaning
/healthCheck | 5758/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://cron-orchestrator:5758/healthCheck

Critical service?

NO

How to ensure high availability

Not applicable. The component is responsible for triggering periodic tasks and should not be replicated. If the pod fails for any reason, Kubernetes restarts it automatically.

Effects of failure

No cron jobs are triggered while the application is unavailable.


twitter-bot

Endpoint | Default TCP port | Result meaning
/healthCheck | 5756/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://twitter-bot:5756/healthCheck

Description

Twitter Bot

Critical service?

YES, limited only to Twitter channel

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot does not respond to any requests on Twitter

Effects of a single instance failure

None, the component acts as an intermediary and one instance should handle all incoming traffic.


channels-connector

Endpoint | Default TCP port | Result meaning
/healthCheck | 5782/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://channels-connector:5782/healthCheck

Description

Channels Connector

Critical service?

YES, limited only to Facebook, Twilio, and Vonage channels

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot does not respond to any requests on Facebook, Twilio, or Vonage.

Effects of a single instance failure

None, the component acts as an intermediary and one instance should handle all incoming traffic.


skype-bot

🚧

TCP port can be changed

The TCP port of the healthCheck endpoint can be changed with the following configuration keys:

chatbots.skype-bot.http-app-status.host
chatbots.skype-bot.http-app-status.port

Endpoint | Default TCP port | Result meaning
/healthCheck | 8392/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://skype-bot:8392/healthCheck

Description

Skype Bot

Critical service?

YES, limited only to legacy Skype for Business channels

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot does not respond to any requests via Skype for Business

Effects of a single instance failure

None, the component acts as an intermediary and one instance should handle all incoming traffic.


sso

Endpoint | Default TCP port | Result meaning
/healthCheck | 9000/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://sso:9000/healthCheck

Critical service?

NO

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

It is not possible to log in to the admin panel through Active Directory SSO.

Effects of a single instance failure

None, the component acts as an intermediary and one instance should handle all incoming traffic.


thread-coordinator

Endpoint | Default TCP port | Result meaning
/healthCheck | 5762/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://thread-coordinator:5762/healthCheck

Critical service?

YES, if handover to agent through React application is enabled

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot does not respond to any requests on webchat, WhatsApp, or Facebook.

Effects of a single instance failure

None, the component acts as an intermediary and one instance should handle all incoming traffic.


platform-api

Endpoint | Default TCP port | Result meaning
/healthCheck | 5780/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://platform-api:5780/healthCheck

Critical service?

NO

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

It won't be possible to externally call exposed endpoints. It won't affect any crucial elements of the system.

Effects of a single instance failure

None, the component acts as an intermediary and one instance should handle all incoming traffic.


sentiduck

Endpoint | Default TCP port | Result meaning
/healthCheck | 2012/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://sentiduck:2012/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot does not respond to any requests

Effects of a single instance failure

Potential bot performance issues, possible slower response times


duckling

🚧

The duckling service is a dependency of the sentiduck service. To monitor its health, use the healthCheck endpoint of sentiduck.

Sample curl

curl -XGET http://sentiduck:2012/healthCheck

Sample error response

{
   "status": "ERROR",
   (...)
   "dependency_status": {
     "status": "ERROR",
     "msg": "(...)"
   }
}
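
The dependency status can be pulled out of such a response with standard shell tools. The sketch below is only an illustration: it assumes the response layout shown in the sample above (with "status" as the first key inside "dependency_status"), and the helper names `dependency_status` and `duckling_exit_code` are made up. Only the "ERROR" value is documented here, so the handling of every other value is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: reads a healthCheck response body on stdin and prints
# the nested dependency status (prints nothing if the field is not found).
dependency_status() {
  # Strip whitespace so the multi-line JSON becomes one line, then extract the value.
  tr -d ' \t\r\n' | sed -n 's/.*"dependency_status":{"status":"\([A-Za-z_]*\)".*/\1/p'
}

# Hypothetical helper: maps the dependency status to a monitoring exit code.
duckling_exit_code() {
  case "$1" in
    ERROR) echo 2 ;;   # CRIT: sentiduck reports its dependency as failing
    '')    echo 1 ;;   # WARN (assumption): the field could not be found
    *)     echo 0 ;;   # OK (assumption): any other value is treated as healthy
  esac
}

# Local demonstration on a canned response (no network access needed).
sample_body='{ "status":"ERROR", "dependency_status":{ "status":"ERROR", "msg":"example" } }'
dep=$(printf '%s' "$sample_body" | dependency_status)
echo "dependency status: $dep -> exit code $(duckling_exit_code "$dep")"
```

In a real check, the body would come from `curl -s -XGET http://sentiduck:2012/healthCheck` piped into `dependency_status`.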

inferrer

Endpoint | Default TCP port | Result meaning
/healthCheck | 12416/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://inferrer:12416/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot responds with an error to all requests.

Effects of a single instance failure

None, one instance should handle all incoming traffic.

intentizer-multi

Endpoint | Default TCP port | Result meaning
/healthCheck | 6543/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://intentizer-multi:6543/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot responds with an error to all requests.

Effects of a single instance failure

Potential bot performance issues, possible slower response times


intentizer-fitter

Endpoint | Default TCP port | Result meaning
/healthCheck | 6544/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://intentizer-fitter:6544/healthCheck

Critical service?

NO

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

No NLU training is possible

Effects of a single instance failure

Training in progress needs to be restarted.

keywords

Endpoint | Default TCP port | Result meaning
/healthCheck | 11234/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://keywords:11234/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot responds with an error to all requests.

Effects of a single instance failure

None, one instance should handle all incoming traffic.


name-service

Endpoint | Default TCP port | Result meaning
/healthCheck | 3456/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://name-service:3456/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot responds with an error to all requests.

Effects of a single instance failure

None, one instance should handle all incoming traffic.


ner-pl

Endpoint | Default TCP port | Result meaning
/healthCheck | 5000/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://ner-pl:5000/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot responds with an error to all requests.

Effects of a single instance failure

Potential bot performance issues, possible slower response times


tf-serving (deprecated)

Default TCP port: 8500

🚧

The tf-serving service is a dependency of the ner-pl service. To monitor its health, use the healthCheck endpoint of the ner-pl component.

Sample curl

curl -XGET http://ner-pl:5000/healthCheck

Sample error response

{
   "status": "ERROR",
   (...)
   "dependency_status": {
     "status": "ERROR",
     "msg": (...)
  }
}

pcre

Endpoint | Default TCP port | Result meaning
/healthCheck | 5000/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://pcre:5000/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot responds with an error to all requests.

Effects of a single instance failure

None, one instance should handle all incoming traffic.


tagger-pl

Endpoint | Default TCP port | Result meaning
/healthCheck | 9003/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://tagger-pl:9003/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot responds with an error to all requests.

Effects of a single instance failure

None, one instance should handle all incoming traffic.


pattern

Endpoint | Default TCP port | Result meaning
/healthCheck | 5000/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://pattern:5000/healthCheck

Critical service?

YES

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot responds with an error to all requests.

Effects of a single instance failure

Potential bot performance issues, possible slower response times


new-web

Endpoint | Default TCP port | Result meaning
/healthCheck | 9000/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://new-web:9000/healthCheck

Critical service?

NO

How to ensure high availability

Minimum configuration
Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin). With such a configuration, minor “glitches” may appear in the application's interface, although the behavior of business functions is preserved.

Optimal configuration
It is recommended to run the application as a single instance (in case of failure, the k8s cluster will restore the application itself within a few minutes, and the service is not critical)

It is also possible to ensure high availability by configuring master-failover and/or sticky-session, running 2 applications independently on different ports/addresses and configuring HA at the HTTP(s) proxy level (e.g. HaProxy, F5).

Effects of failure of all instances

No possibility to view transcripts (substitute functionality is possible in the admin module or in other tools such as Speech Analytics, provided that they are available in the client's infrastructure)

Agents cannot handle text conversations manually (if the project provides for such a possibility)

This is only an administrative service; the system's business processes offered to end users remain fully operational at all times, i.e. the bot continues to conduct conversations.

The default k8s cluster restores the application itself in a few minutes

Effects of a single instance failure

The default k8s cluster restores the application itself in a few minutes. If high availability is configured at the proxy level (e.g. F5, HaProxy), it should automatically reconnect to the second failover instance in a manner imperceptible to the user and the continuity of business functions is ensured.


analyser

Endpoint | Default TCP port | Result meaning
/healthCheck | 7080/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://analyser:7080/healthCheck

Critical service?

NO

How to ensure high availability

Minimum configuration
Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin). With such a configuration, long-running analyses will be performed twice, which may result in "glitches" in the user interface (applies to widgets and analyses: tag cloud, opinion leaders, reach analysis, the most popular hashtags, fanpage statistics), although the behavior of business functions is preserved.

Optimal configuration
It is recommended to run the application as a single instance (in case of failure, the k8s cluster will restore the application itself within a few minutes, and the service is not critical).

It is also possible to ensure high availability by configuring master-failover, running 2 applications independently on different ports/addresses and configuring HA at the HTTP(s) proxy level (e.g. HaProxy, F5). One instance of the analyser should be assigned to one new-web application.

Effects of failure of all instances

There is no possibility to view analyses, i.e. dashboards and charts.

This is only an analytical service; the system's business processes offered to end users remain fully operational at all times, i.e. the bot continues to conduct conversations.

The default k8s cluster restores the application itself in a few minutes

Effects of a single instance failure

The default k8s cluster restores the application itself in a few minutes. If high availability is configured at the proxy level (e.g. F5, HaProxy), it should automatically reconnect to the second failover instance in a manner imperceptible to the user and the continuity of business functions is ensured.

Addressing configuration

The address that new-web uses to communicate with the analyser module is set in the file config.yaml, which is added as a config map/secret (an object in k8s).

kubernetes:  
  JEE Server:  
    port: 7080  
    host: analyzer-react.example-cluster.local #Internal addressing within k8s cluster

It is possible either to pair an analyser instance with new-web 1:1 and let the pods communicate directly, or to configure a proxy server and route the communication between new-web and the analyser through it using "external" addressing. In the case of a 1:1 configuration (a pair of new-web and analyser modules), make sure that the pair runs in the same DC.


bot-integration

Endpoint | Default TCP port | Result meaning
/healthCheck | 9010/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://bot-integration:9010/healthCheck

Critical service?

NO

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The transcript analytics module is unavailable. The export will be retried several times while waiting for the bot-integration component to become available (see the export documentation for details). Transcripts waiting to be written are stored in the RabbitMQ queue and will be processed/saved when the application is restarted.

Effects of a single instance failure

None, one instance should handle all incoming traffic.


slim-uploader

Endpoint | Default TCP port | Result meaning
/healthCheck | 8765/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://slim-uploader:8765/healthCheck

Critical service?

NO

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

Current transcripts are not being saved in the ElasticSearch database. The transcripts are waiting in the RabbitMQ queue until the service is restarted.

Effects of a single instance failure

None, one instance should handle all incoming traffic.


refinery

Endpoint | Default TCP port | Result meaning
/healthCheck | 8765/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Sample curl

curl -XGET http://refinery:8765/healthCheck

Critical service?

NO

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

Transcripts are not enriched with semantic analysis and, as a result, are not subsequently saved in the ElasticSearch database. The transcripts wait in the RabbitMQ queue until the service is restarted.

Effects of a single instance failure

None, one instance should handle all incoming traffic.


hooks-server

Endpoint | Default TCP port | Result meaning
/healthCheck | 8069/TCP | 0 (OK) - if the HTTP status code is equal to 200; 2 (CRIT) - if the HTTP status code is not equal to 200

Description

Hooks server

Sample curl

curl -XGET http://hooks-server:8069/healthCheck

Critical service?

YES (Facebook, WhatsApp, Twitter and Automate platform)

How to ensure high availability

Running at least 2 instances. The k8s cluster will automatically ensure high availability at this point (load balancing: round-robin)

Effects of failure of all instances

The bot does not respond to conversations on Facebook, WhatsApp, Twitter, or webchat.

Effects of a single instance failure

None, one instance should handle all incoming traffic.