Galera Awareness and ProxySQL Scheduler

  • Date:
  • Tags: galera cluster

One of the most common requests for ProxySQL is to support Galera Cluster.

Galera cluseter in the wild

The Issue

Support for Galera requires a lot of features to be added to ProxySQL: checking if a node is in sync state, if it is a garbd process, auto-detect of new nodes, send writes to only one node, remote DC support, and more. Furthermore, as Galera is a clustering solution and not all setups have the same requirements, to support all the possible configurations seemed to be too complex a task.

To top it off, support for Galera in all its possible configurations wasn't the only complex request that needed to be implemented in the core of ProxySQL. It also had to handle additional requests not related to Galera, like changing the weight on a node while overloaded (as measured by different metrics! CPU usage, disk IO, number of threads) or removing a node from a pool while a backup was running.

I always describe ProxySQL as easily configurable: anyone can write a script that configures it at runtime through the Admin interface. The script can have any complex logic or algorithm, and then configures ProxySQL accordingly. In short, I always advised that ProxySQL should be a core component, giving the possiblity for tools to be built around it.

Enter Scheduler

But a part was still missing: the glue to link ProxySQL and the external script(s). For this reason, ProxySQL recently supported a new feature: jobs Scheduler.

Scheduler is a cron-like solution embedded directly inside ProxySQL. For more details, please have a look at the documentation.

It can run any sort of external executable. Therefore we can write a script that monitors all the backends of a specific hostgroup (a Galera cluster), remove hosts that are not in sync and re-add them once in sync. A ready to use script is distributed with ProxySQL itself: proxysql_galera_checker.sh.

This script is inspired by Percona clustercheck, and accepts 3 arguments mandatory and one optional:

$ ./proxysql_galera_checker.sh
Usage: ./proxysql_galera_checker.sh <hostname> <port> <hostgroup_id> <log_file>

How it works

When running the script we need to pass ProxySQL's hostname and the port where the Admin interface is running. The script will connect (using hardcoded credential) to the Admin interface, it will retrieve credentials to connect to the backends and a list of backends for a specific hostgroup, it will check the value of wsrep_local_state on all these backends, and if they change status will automatically reconfigure ProxySQL.

As an example, we have a Galera cluster with 3 nodes:

Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------+
| hostgroup_id | hostname    | port | status |
+--------------+-------------+------+--------+
| 0            | 10.10.10.11 | 3306 | ONLINE |
| 0            | 10.10.10.12 | 3306 | ONLINE |
| 0            | 10.10.10.13 | 3306 | ONLINE |
+--------------+-------------+------+--------+
3 rows in set (0.00 sec)

Next, we need to configure the scheduler to run the script at regular interval. In this example, we will run the job every 10000 milliseconds (10 seconds):

Admin> SHOW CREATE TABLE scheduler\G
*************************** 1. row ***************************
       table: scheduler
Create Table: CREATE TABLE scheduler (
id INTEGER NOT NULL,
interval_ms INTEGER CHECK (interval_ms>=100 AND interval_ms<=100000000) NOT NULL,
filename VARCHAR NOT NULL,
arg1 VARCHAR,
arg2 VARCHAR,
arg3 VARCHAR,
arg4 VARCHAR,
arg5 VARCHAR,
PRIMARY KEY(id))
1 row in set (0.00 sec)

Admin> INSERT INTO scheduler(id,interval_ms,filename,arg1,arg2,arg3,arg4) VALUES
(1,'10000','/var/lib/proxysql/proxysql_galera_checker.sh','127.0.0.1','6032','0',
'/var/lib/proxysql/proxysql_galera_checker.log');
Query OK, 1 row affected (0.00 sec)

Admin> select * from scheduler\G
*************************** 1. row ***************************
         id: 1
interval_ms: 10000
   filename: /var/lib/proxysql/proxysql_galera_checker.sh
       arg1: 127.0.0.1
       arg2: 6032
       arg3: 0
       arg4: /var/lib/proxysql/proxysql_galera_checker.log
       arg5: NULL
1 row in set (0.00 sec)

Admin> SELECT * FROM runtime_scheduler;
Empty set (0.00 sec)

At this stage the job is not loaded at runtime yet. So next we must run LOAD SCHEDULER TO RUNTIME:

Admin> LOAD SCHEDULER TO RUNTIME;
Query OK, 0 rows affected (0.00 sec)

Admin> select * from runtime_scheduler\G
*************************** 1. row ***************************
         id: 1
interval_ms: 10000
   filename: /var/lib/proxysql/proxysql_galera_checker.sh
       arg1: 127.0.0.1
       arg2: 6032
       arg3: 0
       arg4: /var/lib/proxysql/proxysql_galera_checker.log
       arg5: NULL
1 row in set (0.00 sec)

Bringing it to the test

The first test we will run is to shutdown a node:

vagrant@n3:~$ sudo service mysql stop
 * Stopping MySQL (Percona XtraDB Cluster) mysqld

From the error log of this check, we can see that the script detected that a node is not reachable, and set it to offline:

vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
Sat Jul 16 18:41:33 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:34 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:44 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:44 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:44 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:54 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:54 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:54 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:04 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:04 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:04 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:14 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:42:14 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state
Sat Jul 16 18:42:14 UTC 2016 Changing server 10.10.10.12:3306 to status OFFLINE_SOFT
Sat Jul 16 18:42:14 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:25 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:42:25 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:42:25 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4

Also, mysql_servers confirms the change:

Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------------+
| hostgroup_id | hostname    | port | status       |
+--------------+-------------+------+--------------+
| 0            | 10.10.10.11 | 3306 | ONLINE       |
| 0            | 10.10.10.12 | 3306 | OFFLINE_SOFT |
| 0            | 10.10.10.13 | 3306 | ONLINE       |
+--------------+-------------+------+--------------+
3 rows in set (0.00 sec)

To continue our tests, we will set a host in desync mode:

mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync  | OFF   |
+---------------+-------+
1 row in set (0.00 sec)

mysql> set global wsrep_desync=ON;
Query OK, 0 rows affected (0.05 sec)

mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync  | ON    |
+---------------+-------+
1 row in set (0.00 sec)

The script will notice it in this case as well:

vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:01 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:01 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:11 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:11 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:11 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:21 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:21 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:21 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:31 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 2
Sat Jul 16 18:48:31 UTC 2016 Changing server 10.10.10.11:3306 to status OFFLINE_SOFT
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:31 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:31 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:41 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:42 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:42 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4

And mysql_servers will confirm the change:

Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------------+
| hostgroup_id | hostname    | port | status       |
+--------------+-------------+------+--------------+
| 0            | 10.10.10.11 | 3306 | OFFLINE_SOFT |
| 0            | 10.10.10.12 | 3306 | OFFLINE_SOFT |
| 0            | 10.10.10.13 | 3306 | ONLINE       |
+--------------+-------------+------+--------------+
3 rows in set (0.00 sec)

To continue this proof of concept, we will now put the last node in sync:

mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync  | ON    |
+---------------+-------+
1 row in set (0.00 sec)

mysql> set global wsrep_desync=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync  | OFF   |
+---------------+-------+
1 row in set (0.00 sec)

The script will detect that the node is back ONLINE:

vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:49:53 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:49:53 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:03 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:03 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:03 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:13 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:13 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:13 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:23 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:24 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:24 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:34 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 4
Sat Jul 16 18:50:34 UTC 2016 Changing server 10.10.10.11:3306 to status ONLINE
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:34 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:34 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4

And correctly reconfigure mysql_servers:

Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------------+
| hostgroup_id | hostname    | port | status       |
+--------------+-------------+------+--------------+
| 0            | 10.10.10.11 | 3306 | ONLINE       |
| 0            | 10.10.10.12 | 3306 | OFFLINE_SOFT |
| 0            | 10.10.10.13 | 3306 | ONLINE       |
+--------------+-------------+------+--------------+
3 rows in set (0.01 sec)

Finally, we will restart the node we previously shut down:

vagrant@n3:~$ sudo service mysql start
 * Starting MySQL (Percona XtraDB Cluster) database server mysqld   [ OK ]

Once again, the script will detect that a node is now healthy:

vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
Sat Jul 16 18:51:15 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:25 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:51:25 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:51:25 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:35 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:51:35 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:51:35 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:45 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:51:45 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:51:45 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:55 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:55 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state 4
Sat Jul 16 18:51:55 UTC 2016 Changing server 10.10.10.12:3306 to status ONLINE
Sat Jul 16 18:51:55 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:52:06 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:52:06 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:52:06 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4

And correctly reconfigure mysql_servers:

Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------+
| hostgroup_id | hostname    | port | status |
+--------------+-------------+------+--------+
| 0            | 10.10.10.11 | 3306 | ONLINE |
| 0            | 10.10.10.12 | 3306 | ONLINE |
| 0            | 10.10.10.13 | 3306 | ONLINE |
+--------------+-------------+------+--------+
3 rows in set (0.00 sec)

Finally, do not forget to save the job to disk:

Admin> SAVE SCHEDULER TO DISK;
Query OK, 0 rows affected (0.01 sec)

Conclusion

While complete Support for Galera Cluster in ProxySQL is still not ready, this is a huge step forward.

The ability of ProxySQL to be easily reconfigurable at runtime, together with its Scheduler, will quickly extend ProxySQL's capabilities. This includes a better integration with Galera in complex setups