High availability in a Kubernetes (k8s) cluster relies on its built-in mechanisms. The vast majority of applications are designed to be stateless, so round-robin load balancing is possible. In this model, any state is either stored in a backing service (PostgreSQL, Elasticsearch, RabbitMQ, all of which support configuration in HA mode) or carried in requests, e.g. as encrypted HTTP cookies. All applications developed by SentiOne run within the k8s cluster.


Check for cluster events, pod failures, restarts, etc. with the command:

$ kubectl get events
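When diagnosing an incident, it can help to sort events chronologically and to list pods that are not healthy. The commands below are a generic sketch using standard kubectl flags; no SentiOne-specific names are assumed:

```shell
# List events across all namespaces, oldest first
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp

# Show pods that are not in the Running or Completed state
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'
```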

Consequences of failure

Single pod

The effects of a single pod failure are described for each component separately in the chapter Components monitoring.


Node

In the event of a node failure, all pods running on it will fail. The effects depend on the specific pods; however, the critical business services of the system will be preserved.

Data center

In a configuration that deploys the k8s cluster across two separate data centers, such that each center maintains at least one instance of every component, a DC failure should be invisible to users, assuming correct HA configuration (default k8s load balancing) or master/failover configuration as described for the individual components (e.g. new-web, analyzer).

High availability of stateful components

Stateful components

new-web, analyzer


The components are stateful by design and run in a master/failover configuration. Nevertheless, it is possible to run them in a "stateless" configuration and access them via round-robin load balancing (the default k8s configuration). With such a configuration, minor "glitches" may appear in the user interface, but the behavior of business functions is preserved.

To ensure high availability, it is best to configure an external reverse proxy (e.g. HAProxy, nginx, F5), while the application instances run as independent pods in the k8s cluster with distinct addresses (e.g. on different ports).

The proxy should be configured so that when the health check of the master instance fails, all traffic is automatically routed to the failover instance. Once the master instance is restored, traffic is redirected back to it. Optionally, the proxy can be configured with the "sticky sessions" mechanism (pinning a client to an instance via a cookie entry), although this is currently not used.

Example HAProxy configuration

backend 10_sense
    mode http
    redirect scheme https code 301 if !{ ssl_fc }
    http-response set-header Strict-Transport-Security max-age=16000000;\ preload
    option httpchk GET /status
    http-check expect string version
    use-server sen2 if { hdr(Host) -i master.domain.com }
    use-server sen1 if { hdr(Host) -i failover.domain.com }

    acl path_websocket path /websocket
    acl hdr_connection hdr(connection) -i upgrade
    acl hdr_upgrade hdr(upgrade) -i websocket
    http-request deny if path_websocket !hdr_upgrade or path_websocket !hdr_connection

    server sen1 react1.domain.com:8079 check port 8079 inter 40000 backup
    server sen2 react2.domain.com:8079 check port 8079 inter 40000

Effects of switching between instances

A failure and the resulting switch between the master and failover instances should not be visible to users. It may be necessary to log in again or refresh the page (e.g. to refresh data in the analytics dashboard of the SentiOne Listen&React tool). Notifications may appear that the websocket connection was broken and the page needs to be reloaded.

The main application state is kept in the database, so no business data is lost during a switch.

Disaster Recovery

At the time of writing, no manual actions are required to restore the system to full functionality. Components in the Kubernetes cluster restart themselves according to the default configuration.


In the event of a bot or system failure, resulting, for example, in the need to disconnect users from the system, the safe approach is to restart the application (the pods in the Kubernetes cluster), even without performing detailed diagnostics first.

As a rule, restarting the application is safe and does not have to be performed in a specific order. When errors appear in the application logs, or other components report a problem with a given component, restart it. If the restart restores the operation of the application, the logs can be analyzed later without time pressure. Do not restart entire VMs or worker nodes in the Kubernetes cluster (unless SentiOne directly recommends it in a given situation); such activity carries unnecessary risk.
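A pod restart can be triggered by deleting the pod (its Deployment recreates it automatically) or by a rolling restart of the whole Deployment. The namespace and resource names below are placeholders, not actual SentiOne component names:

```shell
# Restart a single pod; the Deployment controller recreates it
kubectl -n <namespace> delete pod <pod-name>

# Restart all pods of a deployment
kubectl -n <namespace> rollout restart deployment/<deployment-name>
```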



The number of replicas is increased by changing the configuration during deployment:

  replicas: 2
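For context, the replicas field sits under spec in the Deployment manifest. The sketch below is illustrative only; the metadata, labels, and image names are placeholders, not actual SentiOne values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deploy        # placeholder name
spec:
  replicas: 2                 # number of pod instances
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example
          image: example:latest   # placeholder image
```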

kubectl admin console

The number of replicas can also be changed with a command in the cluster administration console:

kubectl -n salt scale --replicas=2 deployment/pattern-chatbots-research-deploy-1
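The result can be verified by checking the deployment's replica count (same namespace and deployment name as above):

```shell
kubectl -n salt get deployment/pattern-chatbots-research-deploy-1
```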

Update strategy

Because the application automatically updates SQL database schemas during an update, performing the update in a way that leaves two different versions of a component running simultaneously in the k8s cluster is not recommended and carries the risk of service instability. We therefore recommend performing the update by completely shutting down the pods in the cluster first. To achieve this, set the Recreate strategy in the deployment manifests for the Kubernetes cluster, as in the example:

kind: Deployment
spec:
  strategy:
    type: Recreate