Administrator’s guide#
HA deployment#
To enhance high availability and minimize system downtime, deploy the Acura High Availability (HA) solution.
Acura HA keeps applications and services running consistently, which is essential for mission-critical systems. Running multiple nodes removes single points of failure and keeps the system operating even when hardware or software fails. Because the nodes are interconnected, data is not lost during failures and the system state can be recovered quickly. The load balancer (LB) distributes traffic between nodes, preventing overloads and keeping performance stable even under peak load. Nodes can be troubleshot without stopping the system, making this solution a reliable foundation for business processes.
HA systems are an important infrastructure element that maintains stable business operations and minimizes the risks associated with failures.
To deploy HA, you only need:
- three virtual machines running Ubuntu 24.04
- one virtual machine for load balancing.
Increase the number of nodes if necessary, adapting the system to the needs of your organization.
Preconditions#
To deploy an Acura HA cluster, you need a deploy VM with Python 3.11. You can use Ubuntu 24.04,
or if you use a later release, you can install pyenv to set up Python 3.11.
Instructions can be found here.
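For example, a minimal sketch of setting up Python 3.11 with pyenv on the deploy VM (the exact 3.11.x patch release below is only an example; adjust to your environment):
curl https://pyenv.run | bash
# Make pyenv available in the current shell
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
# pyenv builds Python from source; build dependencies (build-essential, libssl-dev, etc.) may be required
pyenv install 3.11.9
pyenv local 3.11.9      # pins Python 3.11 for the deployment working directory
python --version        # should report Python 3.11.9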
Deployment steps#
Step 1. Prepare Security Group#
Please review the following ingress rules before deployment:
| Ether Type | IP Protocol | Port Range | Remote IP Prefix | Remote Security Group | Description |
|---|---|---|---|---|---|
| IPv4 | TCP | 1 - 65535 | - | Acura_ha | All TCP ports must be open within this security group; this is required for Kubernetes connectivity between nodes (e.g., etcd and MariaDB clusters). |
| IPv4 | TCP | 22 (SSH) | <IP*>/32 | - | Allow TCP port 22 from my public IP for SSH and Ansible. |
| IPv4 | TCP | 80 (HTTP) | - | - | The Acura UI port redirects the user to the HTTPS port. |
| IPv4 | TCP | 443 (HTTPS) | 0.0.0.0/0 | - | Acura UI port |
| IPv4 | TCP | 4443 | 0.0.0.0/0 | - | Initial Configuration Port |
| IPv4 | TCP | 6443 | 0.0.0.0/0 | - | Port for the Kubernetes API service |
| IPv4 | UDP | 12201 | 0.0.0.0/0 | - | Logs port: Used by agents and pods to send logs to the ELK stack. |
IP* – the public IP address from which you will access Acura.
Note
The deploy VM should have access to port 22 for all three VMs and the load balancer (LB). In the security group (SG) above, only one public IP is allowed for port 22. Ensure your deploy VM uses this public IP, or add another rule to the SG.
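If your cluster runs on OpenStack, the rules above can be created with the CLI. This is only a sketch; the group name acura_ha and the placeholder <IP> are assumptions, so adjust them to your cloud:
openstack security group create acura_ha
# All TCP ports between members of the group (Kubernetes, etcd, MariaDB connectivity)
openstack security group rule create --protocol tcp --dst-port 1:65535 --remote-group acura_ha acura_ha
# SSH only from the deploy VM's public IP
openstack security group rule create --protocol tcp --dst-port 22 --remote-ip <IP>/32 acura_ha
# UI, initial configuration, Kubernetes API, and log ports
openstack security group rule create --protocol tcp --dst-port 80 acura_ha
openstack security group rule create --protocol tcp --dst-port 443 acura_ha
openstack security group rule create --protocol tcp --dst-port 4443 acura_ha
openstack security group rule create --protocol tcp --dst-port 6443 acura_ha
openstack security group rule create --protocol udp --dst-port 12201 acura_ha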
Step 2. Create three Ubuntu VMs#
Create three Ubuntu 24.04 instances with 200 GB volumes (SSD backend recommended). The entire 200 GB should be available as a single partition, without separate partitions for /home, etc. It is also recommended to allocate at least 16 GB of RAM and 8 vCPUs for each instance.
Example:

Verify that the DNS servers on all nodes (cat /etc/resolv.conf) are correct. They should be obtained from the DHCP server.
For example, the result of the nslookup command should resemble the following:
$ nslookup google.com
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
Name: google.com
Address: 64.233.165.113
Name: google.com
...
If the DNS servers don’t work, check the DNS configuration of the subnet (e.g., use 8.8.8.8 as the DNS server). Hostnames should not be fully qualified domain names (FQDNs); they should not contain dots.
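To verify this quickly, you can check the hostname on each node and shorten it if necessary (a sketch; the name node1 is just an example):
hostname                              # must not contain dots (no FQDN)
sudo hostnamectl set-hostname node1   # set a short name if the current one is fully qualified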

Step 3. Create a Load Balancer#
Use the following listeners:
| Name | Description | Protocol | Port |
|---|---|---|---|
| initconfig | Initial Configuration UI Listener | TCP | 4443 |
| http | HTTP Listener | TCP | 80 |
| k8s | Kubernetes API Service Listener | TCP | 6443 |
| https | HTTPS Listener for Acura UI | TCP | 443 |
Example of Cloud-Native LB Configuration

NGINX Configuration
Don’t forget to replace X.X.X.X, Y.Y.Y.Y, and Z.Z.Z.Z with the IP addresses of your nodes. Use the IP address of the first node for X.X.X.X, the second node for Y.Y.Y.Y, and the third node for Z.Z.Z.Z.
$ cat /etc/nginx/nginx.conf
user www-data;
worker_processes auto;
worker_rlimit_nofile 70000;
pid /run/nginx.pid;

events {
    worker_connections 70000;
    # multi_accept on;
}

stream {
    upstream stream_http {
        least_conn;
        server X.X.X.X:80;
        server Y.Y.Y.Y:80;
        server Z.Z.Z.Z:80;
    }
    upstream stream_https {
        least_conn;
        server X.X.X.X:443;
        server Y.Y.Y.Y:443;
        server Z.Z.Z.Z:443;
    }
    upstream stream_initial_config {
        least_conn;
        server X.X.X.X:4443;
        server Y.Y.Y.Y:4443;
        server Z.Z.Z.Z:4443;
    }
    upstream stream_kubernetes {
        least_conn;
        server X.X.X.X:6443;
        server Y.Y.Y.Y:6443;
        server Z.Z.Z.Z:6443;
    }
    upstream stream_elk {
        least_conn;
        server X.X.X.X:12201;
        server Y.Y.Y.Y:12201;
        server Z.Z.Z.Z:12201;
    }
    upstream stream_registry {
        least_conn;
        server X.X.X.X:15000;
        server Y.Y.Y.Y:15000;
        server Z.Z.Z.Z:15000;
    }

    server {
        listen 80;
        proxy_pass stream_http;
    }
    server {
        listen 443;
        proxy_pass stream_https;
        proxy_buffer_size 10m;
    }
    server {
        listen 4443;
        proxy_pass stream_initial_config;
    }
    server {
        listen 6443;
        proxy_pass stream_kubernetes;
        proxy_timeout 10s;
        proxy_connect_timeout 10s;
    }
    server {
        listen 12201 udp;
        proxy_pass stream_elk;
        proxy_timeout 30s;
        proxy_connect_timeout 60s;
    }
    server {
        listen 15000;
        proxy_pass stream_registry;
    }
}
Troubleshooting
- If you encounter issues starting or restarting NGINX after changing the configuration, verify the correctness of the config using the following command: nginx -t
- If you receive an error such as unknown directive 'stream' when starting NGINX, add the following line to the beginning of the configuration: load_module /usr/lib/nginx/modules/ngx_stream_module.so;
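After editing the configuration, a typical workflow is to validate it and reload NGINX; a minimal sketch, assuming NGINX runs under systemd on the LB VM:
sudo nginx -t                  # validate the configuration syntax
sudo systemctl reload nginx    # apply changes without dropping existing connections
sudo systemctl status nginx    # confirm the service is active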
Check SSH access from the deploy VM to the Acura VMs
To save time, check access to port 22 before running any installations. You can verify the access with the following simple Bash command:
$ for i in X.X.X.X Y.Y.Y.Y Z.Z.Z.Z; do nc -vz -w 2 $i 22 ; done
Connection to X.X.X.X 22 port [tcp/ssh] succeeded!
Connection to Y.Y.Y.Y 22 port [tcp/ssh] succeeded!
Connection to Z.Z.Z.Z 22 port [tcp/ssh] succeeded!
Where X.X.X.X, Y.Y.Y.Y, and Z.Z.Z.Z are the target public IPs to check access. If access is unavailable, you will see output similar to this:
$ for i in X.X.X.X Y.Y.Y.Y Z.Z.Z.Z; do nc -vz -w 2 $i 22 ; done
nc: connect to X.X.X.X port 22 (TCP) timed out: Operation now in progress
nc: connect to Y.Y.Y.Y port 22 (TCP) timed out: Operation now in progress
nc: connect to Z.Z.Z.Z port 22 (TCP) timed out: Operation now in progress
In that case, check your security group settings.
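If the nodes run on OpenStack, you can review the effective rules with the CLI (a sketch; acura_ha is the group name used in Step 1):
openstack security group rule list acura_ha   # check that port 22 is allowed from the deploy VM's IP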
Step 4. Prepare the environment on the localhost#
1. Install required packages
sudo apt install python3-pip sshpass
pip3 install --upgrade pip==20.3.4
pip3 install "virtualenv<20.0"
2. Download Hystax archive from the provided URL
3. Extract archive “hystax-ha-archive.tar.gz”
tar xvzf hystax-ha-archive.tar.gz
4. Install requirements
cd hystax_ha_archive
source venv.sh
pip install -r requirements.txt
If installation fails, install the libffi-dev Ubuntu package (sudo apt install libffi-dev) and then run pip install -r requirements.txt again.
5. Create “ha_env_file” HA environment file with the following content
<HOSTNAME1> ansible_ssh_host=<X.X.X.X>
<HOSTNAME2> ansible_ssh_host=<Y.Y.Y.Y>
<HOSTNAME3> ansible_ssh_host=<Z.Z.Z.Z>
Note
HOSTNAME1 is the hostname of the first Ubuntu machine, X.X.X.X is the public IP address of the first Ubuntu machine, and so on.
Troubleshooting
If the Acura hosts use separate SSH keys (not ~/.ssh/id_rsa), you can specify the key in the ha_env_file as follows:
ansible_ssh_private_key_file=<path_to_your_private_key>
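For illustration, a complete ha_env_file with a custom key could look like this (all hostnames, IP addresses, and the key path below are placeholders):
cat > ha_env_file <<'EOF'
node1 ansible_ssh_host=203.0.113.11 ansible_ssh_private_key_file=~/.ssh/acura_key
node2 ansible_ssh_host=203.0.113.12 ansible_ssh_private_key_file=~/.ssh/acura_key
node3 ansible_ssh_host=203.0.113.13 ansible_ssh_private_key_file=~/.ssh/acura_key
EOF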
If you're using password-protected SSH keys, add them to the SSH agent with the following commands:
1. Start the SSH agent in the background:
$ eval "$(ssh-agent -s)"
2. Add the SSH private key to the SSH agent:
$ ssh-add ~/.ssh/id_rsa
Step 5. Install Kubernetes master#
Deploy Kubernetes HA Master:
export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook -e "ansible_ssh_user=<SSH_USER_NAME>" \
-e "lb_address=<LOAD_BALANCER_IP>" \
-e "ansible_python_interpreter=/usr/bin/python3" \
-e "logstash_host=127.0.0.1" \
-i ha_env_file ansible/k8s-master.yaml
Troubleshooting
If you are prompted for a password when executing the command and want to use
the SSH keys from the instances created in Step 2, you can specify the SSH
private key in the ha_env_file (Step 4) as follows:
ansible_ssh_private_key_file=<path_to_your_private_key>
This will avoid issues with entering passwords manually.
If the user does not have root privileges, add the -K flag to the command
to prompt for the password immediately when the playbook starts.
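For example, with a non-root sudo user the same playbook might be run as follows (a sketch based on the command above; -K makes Ansible ask for the sudo password up front):
export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook -K -e "ansible_ssh_user=<SSH_USER_NAME>" \
-e "lb_address=<LOAD_BALANCER_IP>" \
-e "ansible_python_interpreter=/usr/bin/python3" \
-e "logstash_host=127.0.0.1" \
-i ha_env_file ansible/k8s-master.yaml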
Step 6. Deploy Acura#
Use the following command to deploy the Acura HA cluster:
export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook -e "ansible_ssh_user=<SSH_USER_NAME>" \
-e '{"overlay_list": ["overlay/ha.yaml", "overlay/golden-image.yml", "overlay/distr-dr.yaml", "overlay/startup_reconfigure.yaml"]}' \
-e "copy_patch=true" \
-e "copy_ca=true" \
-i ha_env_file ansible/ha-acura.yaml
Note
This command deploys Acura Disaster Recovery. If you want to deploy Acura Migration, replace distr-dr.yaml with distr-migration.yaml.
Post-Deployment
1. Check that all Acura Kubernetes pods have started (a quick check is shown after this list).
2. To configure Acura, go to the initial-config UI at:
https://<LOAD-BALANCER-PUBLIC-IP>
Example: https://89.208.86.10.
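To confirm the pods have started, you can list them from one of the nodes; a minimal sketch (exact pod names depend on your deployment):
kubectl get pods -A                                  # all pods should end up Running or Completed
kubectl get pods -A | grep -vE 'Running|Completed'   # anything listed here still needs attention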
How-tos#
Update an already deployed Acura#
1. Perform Step 4. Prepare the environment on the localhost. Get a list of existing overlays used when creating the cluster:
kubectl exec -ti etcd-0 -- etcdctl get /overlay_list
If necessary, change the list of overlays in etcd:
kubectl exec -ti etcd-0 -- etcdctl set /overlay_list '["old_overlay1","old_overlay2",<...>,"new_overlay"]'
2. Perform Step 6. Deploy Acura. When deploying Acura, add the new parameter to the
ansible-playbook command: -e "update_cluster=true". For example:
export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook -e "ansible_ssh_user=<SSH_USER_NAME>" \
-e '{"overlay_list": ["overlay/ha.yaml", "overlay/golden-image.yml", "overlay/distr-dr.yaml", "overlay/startup_reconfigure.yaml", "overlay/big-heat-stack-timeout.yaml"]}' \
-e "copy_patch=true" \
-e "copy_ca=true" \
-e "update_cluster=true" \
-i ha_env_file ansible/ha-acura.yaml
Remove a dead node from Acura HA#
1. Log into a live node.
2. Run the following command to identify the dead node names:
kubectl get nodes
3. Run kubectl delete node <DEAD_NODE_NAME> to remove the dead node from Kubernetes. Replace <DEAD_NODE_NAME> with the name of the dead node from the previous step.
4. Run the following command to identify the dead node's etcd member ID (the first column in the output). Replace <LIVE_NODE_NAME> with the name of a live node:
kubectl exec etcd-<LIVE_NODE_NAME> -n kube-system -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key member list
5. Run the following command to remove the dead node from the etcd cluster. Replace <DEAD_MEMBER_ID> with the member ID obtained in the previous step:
kubectl exec etcd-<LIVE_NODE_NAME> -n kube-system -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key member remove <DEAD_MEMBER_ID>
Add a new node#
1. Follow the steps in the previous section to remove any dead nodes from the cluster.
2. Perform deployment Steps 2-3 for a new Ubuntu machine.
3. Log into a live node and run the following commands:
- Create the join token:
kubeadm token create --print-join-command
This command will output master_token and master_discovery_token. Copy and store them securely.
- Upload the certificates:
sudo kubeadm init phase upload-certs --upload-certs
This command will output master_cert_key. Copy and store it securely.
4. Perform deployment Step 4. Prepare the environment on the localhost.
5. Use the tokens and certificate key generated earlier to add the new node to the cluster:
source venv.sh
pip install -r requirements.txt
patch_url='<link-to-patch>'
ca_url='<link-to-ca-image>'
export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook -e "ansible_ssh_user=<SSH_USER_NAME>" \
-e "ansible_python_interpreter=/usr/bin/python3" \
-e "lb_address=<LOAD_BALANCER_IP>" \
-e "logstash_host=127.0.0.1" \
-e "master_token=ae9wn..." \
-e "master_discovery_token=bf917..." \
-e "master_cert_key=4cc2c..." \
-e "copy_patch=true" \
-e "copy_ca=true" \
-i "<NEWLY_PROVISIONED_INSTANCE_IP>," \
ansible/k8s-add-node.yaml
Notes:
- Set lb_address to your load balancer's address.
- Set master_token, master_discovery_token, and master_cert_key to the tokens you obtained previously. Refer to the example to confirm their correct format.
- Set the address after -i to the IP address of your newly provisioned instance.
Backup internal Acura DBs#
Overview#
Sometimes issues can occur with the machine hosting the Acura deployment. These can be resolved with backups of the etcd, MariaDB, and MongoDB databases, with the option to restore them on an external virtual machine. Backups must be performed regularly, which is why the process is automated using a cron job. It can be divided into two stages: creating backups and uploading the archive to external storage.
etcd and DBs Backup#
Cron job parameters are defined through the overlay. Below is an example with default values:
external_backup:
  suspend: false
  mountpoint: /acura/backup
  schedule: "@daily"
  keep: 5
  name: &external_backup_name
  startingDeadlineSeconds: 100
  image:
    repository: *external_backup_name
    tag: local
- mountpoint – local backup storage path on the Acura machine
- keep – maximum number of backups to retain
- name – name of the external backup archive
The backup saves all etcd keys to acura_etcd-<timestamp>.dump. It also runs shell scripts to export MariaDB to acura_db_<timestamp>.sql.gz and the MongoDB events collection to acura_mongo_<timestamp>.json.gz. Only the last n files are retained (see the keep parameter).
Find all three files archived in the acura_etcd-<timestamp>.tar.gz file.
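To confirm that the cron job is producing backups, you can list the backup directory on the Acura machine; a quick check, assuming the default mountpoint /acura/backup:
ls -lt /acura/backup | head   # expect acura_etcd-<timestamp>.tar.gz archives, newest first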
External Backup#
The external backup copies the Acura DBs backup stored locally to the remote VM.
The external backup parameters are configured in etcd. If they are not specified, backups are assumed to be stored locally only. Example:
external_backup:
  type: object-storage
  key: username
  secret: password
  url: http://172.22.5.88:9000
  bucket: some-bucket
  region: some-region (optional)
  backup_path: some/path/folder (optional)
- type – expected values: object-storage or local
- url – URL of the object storage
- bucket – name of the bucket to store backups
- region – storage region (optional parameter)
- backup_path – path where the external backup archive will be stored (optional parameter)
If any parameter is missing (except for the region and backup_path), the external backup process fails. However, it's important to note that in case of any error, the individual local backups still remain on the machine.
The backup archives the latest backups from the etcd and dbs backup stage into a file named acura_backup_<timestamp>.tar.gz. If the appropriate parameters are set, the archive is uploaded to object storage (see the backup_path parameter). If no storage parameters are specified, the backups remain stored locally.
Recovery Process#
After the local backups of the Acura databases are uploaded to the remote VM, restore them using the alias: hx_restore_from_backup.
Port Security & Aliases#
Network ports and services#
Acura controllers are hardened to reduce the attack surface and simplify access management. The following services are exposed on these ports:
- SSH – tcp/22
- HTTP (redirects to 443 if enabled) – tcp/80
- HTTPS – tcp/443
- Initial Configuration (closes automatically after setup) – tcp/4443
- GELF logging – tcp+udp/12201
- Internal Docker registry (agent auto-update, external replication agent for oVirt/OpenStack docker-compose mode) – tcp/15000
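From an external machine you can probe these ports the same way as in Step 3; a sketch (replace <ACURA_IP_ADDRESS> with the controller or load balancer address):
for p in 22 80 443 4443 6443 15000; do nc -vz -w 2 <ACURA_IP_ADDRESS> $p; done
nc -vzu -w 2 <ACURA_IP_ADDRESS> 12201   # UDP probe for the GELF logging port (UDP results may be inconclusive)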
Web tools access#
phpMyAdmin, the Metrics Dashboard, and Logs can be found at:
- phpMyAdmin (disabled by default) → https://<ACURA_IP_ADDRESS>/tools/db
- Metrics Dashboard (Grafana) → https://<ACURA_IP_ADDRESS>/tools/metrics (credentials: user: grafana_user; password: LxY83mXr)
- Logs → https://<ACURA_IP_ADDRESS>/tools/logs (credentials: user: grafana_user; password: LxY83mXr)
Managing features via scripts#
To balance security and usability, convenient aliases (hx_* scripts) are provided for enabling or disabling features such as phpMyAdmin, SSL, and others.
These scripts apply changes immediately, without requiring a pod restart.
To use them:
1. Access the Acura box:
ssh user@<ACURA_IP_ADDRESS>
Replace <ACURA_IP_ADDRESS> with the device's real IP or hostname. When prompted, enter the password: px%96jA-8zX{.
2. Run the desired hx_* script:
- Show the Acura version: hx_cluster_info
- Enable the engineer user (for specific boxes where it's disabled by default, e.g. AWS): hx_enable_engineer
- Restore Acura databases from a backup file (see Backup internal Acura DBs): hx_restore_from_backup
- Enable or disable phpMyAdmin access:
  - enable for 12 hours: hx_enable_phpmyadmin
  - disable: hx_disable_phpmyadmin
- Enable or disable SSL redirection (HTTP → HTTPS):
  - enable: hx_enable_ssl_redirect
  - disable: hx_disable_ssl_redirect
Grafana#
Grafana is a powerful visualization and monitoring tool that helps to understand system performance at a glance. In the Acura environment, it provides ready-made dashboards with real-time metrics on replication, network usage, and resource bottlenecks. By using these dashboards, administrators can quickly identify issues, optimize system performance, and track the health of replication processes.
How to interpret data using Grafana#
To access the Grafana dashboards, log in at https://<ACURA_IP_ADDRESS>/tools/metrics.
User: grafana_user; password: LxY83mXr.
On the Home Dashboard page find:
- the list of useful dashboards
- global replication speed and network utilization.

We’ve starred three dashboards for your convenience. Find them on the left panel of the Home Dashboard: the stats for customers, devices, and snapshots. They include the following panels:
- Data Block Flow measures replication speed, with a breakdown of block types.
- Network Utilization measures real network traffic (currently only between the Replication Agent and Acura, without Cloud Agent traffic).
- Receiver Operation Breakdown measures the percentage of time the Receiver spends on its operations. This is useful for finding bottlenecks:
  - If the Receiver spends most of the time waiting for input, the Replication Agent can't read or send fast enough.
  - If the Receiver spends most of the time on storage processing, the Cloud Agent can't receive or write fast enough.
- Replication Agent → Receiver Bandwidth optimization metric measures how much network traffic was saved through compression and WAN optimization.
- Processing speed measures the speed of individual Receiver and Replication Agent operations.
- Receiver threads measures how many threads in the Receiver are active.
Hover over a chart to see its tooltip.
How to view logs#
There are two ways to access logs. The credentials are the same for both:
user: grafana_user;
password: LxY83mXr.
1. Go to https://<ACURA_IP_ADDRESS>/tools/logs.
2. Go to https://<ACURA_IP>/tools/metrics. Navigate to Drilldown -> Logs.
On the service tab of the main page, find logs grouped by service_name. Use this view to observe the general situation on the Acura cluster.
For detailed information about a specific service, use the Labels field to set filters according to your needs.

Use Log levels and Fields to filter the data further.

Expand any log line to see all available metadata and access quick filter controls. Explore the different tabs to get complete information.
Tips:
1. Use time and refresh controls in the upper right corner to set log interval and auto-refresh.
2. Use the “Labels” input to filter by labels (service_name, agent_id, etc.). Remove all labels to return to the log overview.
3. Use “Fields” input to filter by structured metadata (all fields that aren’t labels: customer_id, image_name, etc).
4. Use “Search in log lines” for search. Spaces and special characters are fine – unlike Kibana, Grafana runs a simple substring search, similar to grep.
- If you want to search with an AND condition (apple AND banana), type apple, hit enter, then type banana and hit enter again – you’ll add two line filters.
- If you want to search with an OR condition (apple OR banana), use the .* button to enable regex search and then write the usual regex syntax: apple|banana.
5. The right bar near log lines is very useful: you can hide/show timestamps, automatically expand multi-line logs and download displayed log lines.
6. If you want to see all Acura logs in one place, you can use the catch-all product label, which is always dr/mgr.
7. Group by labels you need by pressing “Add label” (on the right of the service tab).
8. To add a filter:
- Click Include to add multiple label values.
- Click Show logs to display logs for the selected labels.