High availability in a Kubernetes (k8s) cluster relies on its built-in mechanisms. The vast majority of applications are designed to be stateless, so round-robin load balancing is possible. In this model, any state is either stored in a backing service (PostgreSQL, Elasticsearch, RabbitMQ, all of which support configuration in HA mode) or carried in requests, e.g. as encrypted HTTP cookies. All applications developed by SentiOne run within the k8s cluster.


Check for cluster events, pod failures, restarts, etc. with the command:

$ kubectl get events
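When diagnosing an incident, it can help to sort events chronologically and to list pods that are not healthy. The commands below are a generic sketch using standard kubectl flags; no SentiOne-specific names are assumed:

```shell
# List events across all namespaces, oldest first
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp

# Show pods that are not in the Running or Completed state
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'
```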

Consequences of failure

Single pod

The effects of a single pod failure are described for each component separately in the chapter Components monitoring.


Node

In the event of a node failure, all pods running on it will fail. The effects depend on the specific pods; however, the critical business services of the system will be preserved.

Data center

In a configuration that deploys the k8s cluster across two separate data centers, such that each center maintains at least one instance of every component, a DC failure should be invisible to users, assuming correct HA configuration (default k8s load balancing) or master/failover configuration as described for the individual components (e.g. new-web, analyzer).

High availability of stateful components

Stateful components

new-web, analyzer


The components are stateful by design and run in a master/failover configuration. Nevertheless, it is possible to run them in a "stateless" configuration and access them via round-robin load balancing (the default k8s configuration). With such a configuration, minor "glitches" may appear in the user interface, but the behavior of business functions is preserved.

To ensure high availability, it is best to configure an external reverse proxy (e.g. HAProxy, nginx, F5), while the application instances run as independent pods in the k8s cluster with distinct addresses (e.g. on different ports).

The proxy should be configured so that when the health check of the master instance fails, all traffic is automatically routed to the failover instance. Once the master instance is restored, traffic is redirected back to it. Optionally, the proxy can be configured with the "sticky sessions" mechanism (pinning a client to an instance via a cookie entry), although this is currently not used.

Example HAProxy configuration

backend 10_sense
    mode http
    redirect scheme https code 301 if !{ ssl_fc }
    http-response set-header Strict-Transport-Security max-age=16000000;\ preload
    option httpchk GET /status
    http-check expect string version
    use-server sen2 if { hdr(Host) -i master.domain.com }
    use-server sen1 if { hdr(Host) -i failover.domain.com }

    acl path_websocket path /websocket
    acl hdr_connection hdr(connection) -i upgrade
    acl hdr_upgrade hdr(upgrade) -i websocket
    http-request deny if path_websocket !hdr_upgrade or path_websocket !hdr_connection

    server sen1 react1.domain.com:8079 check port 8079 inter 40000 backup
    server sen2 react2.domain.com:8079 check port 8079 inter 40000

Effects of switching between instances

A failure and the resulting switch between the master and failover instances should not be visible to users. It may be necessary to log in again or refresh the page (e.g. to refresh data in the analytics dashboard of the SentiOne Listen&React tool). Notifications may appear that the websocket connection was broken and the page needs to be reloaded.

The main application state is kept in the database, so no business data is lost during a switch.

Disaster Recovery

At the time of writing, no manual actions are required to restore the system to full functionality. Components in the Kubernetes cluster restart themselves according to the default configuration.


In the event of a bot or system failure, resulting, for example, in the need to disconnect users from the system, the safe approach is to restart the application (the pods in the Kubernetes cluster), even without performing detailed diagnostics first.

As a rule, restarting the application is safe and does not have to be performed in a specific order. When errors appear in the application logs, or other components report a problem with a given component, restart it. If the restart restores the operation of the application, the logs can be analyzed later without time pressure. Do not restart entire VMs or worker nodes in the Kubernetes cluster (unless SentiOne directly recommends it in a given situation); such activity carries unnecessary risk.
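A pod restart can be triggered by deleting the pod (its Deployment recreates it automatically) or by a rolling restart of the whole Deployment. The namespace and resource names below are placeholders, not actual SentiOne component names:

```shell
# Restart a single pod; the Deployment controller recreates it
kubectl -n <namespace> delete pod <pod-name>

# Restart all pods of a deployment
kubectl -n <namespace> rollout restart deployment/<deployment-name>
```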



The number of replicas is increased by changing the configuration during deployment:

  replicas: 2
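For context, the replicas field sits under spec in the Deployment manifest. The sketch below is illustrative only; the metadata, labels, and image names are placeholders, not actual SentiOne values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deploy        # placeholder name
spec:
  replicas: 2                 # number of pod instances
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example
          image: example:latest   # placeholder image
```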

kubectl admin console

The number of replicas can also be changed with a command in the cluster administration console:

kubectl -n salt scale --replicas=2 deployment/pattern-chatbots-research-deploy-1
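The result can be verified by checking the deployment's replica count (same namespace and deployment name as above):

```shell
kubectl -n salt get deployment/pattern-chatbots-research-deploy-1
```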

Update strategy

Because the application automatically updates SQL database schemas during an update, performing the update in a way that leaves two different versions of a component running simultaneously in the k8s cluster is not recommended and carries the risk of service instability. We therefore recommend performing the update by completely shutting down the pods in the cluster first. To achieve this, set the Recreate strategy in the deployment manifests for the Kubernetes cluster, as in the example:

kind: Deployment
spec:
  strategy:
    type: Recreate