Keepalived详解:原理、编译安装与高可用集群配置(部分实现,待脚本跟新)
PG流复制集群-CLup一键创建一、简介和工作原理Keepalived 是一种利用 VRRP 协议实现服务器集群高可用的软件通过实时监测节点状态并在故障发生时自动进行流量切换确保服务始终稳定运行避免因单点故障导致业务中断。1. VRRP协议想象一个场景有两台路由器像“双胞胎值班员”一样提供相同服务一台是“主值班员”另一台是“候补值班员”。它们共用一个“虚拟电话号码”VIP客户只知道拨打这个号码。正常工作时“主值班员”负责接听电话持有VIP同时每隔一段时间就通过内部广播地址224.0.0.18向“候补值班员”喊一句“我还活着”。故障发生时如果“候补值班员”长时间没听到“主值班员”的喊声就会认为对方“掉线了”。自动切换此时“候补值班员”会根据预设的优先级顺序自动晋升为“新主值班员”立刻接管虚拟电话号码让客户的所有呼叫无缝转移到自己身上。整个过程中客户完全不知情服务就像从未中断过一样这就是Keepalived通过VRRP协议实现的高可用“隐形交接术”。2. Keepalived的三大模块Keepalived主要由三个模块构成Core模块作为核心负责主进程的启动、维护及全局配置文件的加载和解析。Check模块负责健康检查支持对负载均衡器后端的真实服务器进行TCP检查、HTTP_GET检查等确保只有健康的服务器才接收流量。VRRP模块这是实现VRRP协议的功能模块负责处理主备节点间的通信和状态切换。二、安装Keepalived1、前期准备安装前我们需要创建数据库Keepalived并且创建表探测表sr_delay后续Keepalived探测刷新sr_delay表的last_alive字段为当前探测时间。这张表用来判断主备延迟情况数据库故障切换时会查询用到这张表。create user keepalived password keepalived CONNECTION LIMIT 4 ; create database keepalived owner keepalived; \c keepalived keepalived create table sr_delay(id int4, last_alive timestamp(0) without time zone); #表sr_delay只允许写入一条记录并且不允许删除此表数据通过触发器实现。创建触发器函数如下所示 CREATE FUNCTION cannt_delete () RETURNS trigger LANGUAGE plpgsql AS $$ BEGIN RAISE EXCEPTION Table sr_delay can not delete !; END; $$; #创建触发器 CREATE TRIGGER trigger_sr_delay_del BEFORE DELETE ON sr_delay FOR EACH ROW EXECUTE PROCEDURE cannt_delete() ; CREATE TRIGGER trigger_sr_delay_tru BEFORE TRUNCATE ON sr_delay FOR STATEMENT EXECUTE PROCEDURE cannt_delete() ; #插入数据 INSERT INTO sr_delay VALUES(1,now()) ;后续Keepalived会每隔指定时间探测PostgreSQL数据库存活并且以Keepalived用户登录Keepalived数据库刷新这张表 配置主备库pg_hba.conf增加如下内容echo host keepalived keepalived 10.197.167.121/20 md5 host keepalived keepalived 10.197.167.122/20 md5 pg_hba.confpg_hba.conf配置文件一般在PGDATA里面2、安装keepalived方法一yum安装快速上手yum install keepalived -y会在/etc/keepalived/生成一个keepalived.conf配置文件安装后可以使用systemctl命令来管理服务现在还没编写配置文件启动不起来systemctl start keepalived.service # 启动 systemctl enable keepalived.service # 设置开机自启 systemctl status keepalived.service # 查看状态方法二源码编译安装安装依赖包yum install -y gcc openssl-devel libnl3-devel libnfnetlink-devel net-snmp-devel curl make下载安装编译# 进入常用源码目录下载请替换为最新稳定版链接 cd /usr/local/src/ curl -O http://keepalived.org/software/keepalived-2.2.4.tar.gz # 或使用 wget https://www.keepalived.org/software/keepalived-2.2.4.tar.gz # 解压并进入目录 tar xvf keepalived-2.2.4.tar.gz cd keepalived-2.2.4 # 配置、编译并安装 ./configure --prefix/usr/local/keepalived make make install配置系统服务# 将启动脚本复制到系统目录 cp /usr/local/src/keepalived-2.2.4/keepalived/keepalived.service /usr/lib/systemd/system/ # 重新加载systemd配置 systemctl daemon-reload三、配置Keepalived实现PG主备高可用主节点Master物理IP为192.168.18.11备节点Backup物理IP为192.168.18.12虚拟IPVIP192.168.18.100主节点配置keepalived配置文件/etc/keepalived/keepalived.conf! Configuration File for keepalived global_defs { router_id pg_xc1 #主备机必须不一样 script_user postgres #如果是涉及到pg数据库的相关操作则需要选择pg用户 enable_script_security } vrrp_script check_pg_alived { script /etc/keepalived/check_pg.sh #检查pg数据库状态的脚本主要靠这个判断是否进行主备切换 interval 5 weight 30 #很重要判断主备切换的权重 fall 3 # require 3 failures for KO } vrrp_instance VI_1 { state BACKUP #BACKUP 必须大写 nopreempt #no preempt为非抢占模式 interface ens10 #vip使用的网络接口 virtual_router_id 10 #主备必须一样 priority 100 #设置节点的优先级跟weight值组合使用决定选择哪个为主 advert_int 1 authentication { auth_type PASS auth_pass keepalived } track_script { check_pg_alived } virtual_ipaddress { 192.168.18.100 #vip } notify_master /etc/keepalived/failover.sh #当节点成为master时执行的脚本 notify_fault /etc/keepalived/fault.sh #节点出现故障执行的脚本 } ~其中interface的值通过ip a可以查看网络接口是什么脚本编写1、脚本要放在/etc/keepalived目录下否则无法切换可能是基于安全的原因。2、脚本的权限很重要权限太多执行不成功一般为755check_pg.sh配置文件/etc/keepalived/check_pg.sh#!/bin/bash export PGDATABASEkeepalived export PGPORT5432 export PGUSERkeepalived export PGHOME/usr/pgsql-16 export PATH$PGHOME/bin:$PATH:$HOME/bin export PGMIP192.168.18.11 LOGFILE/etc/keepalived/log/pg_status.log SQL1SELECT pg_is_in_recovery from pg_is_in_recovery(); SQL2update sr_delay set last_alive now() where id 1; SQL3SELECT 1; db_roleecho $SQL1 | $PGHOME/bin/psql -h $PGMIP -p 5432 -d keepalived -U keepalived -At -w if [ $db_role t ]; then echo -e date %F %T Attention1:the current database is standby DB! $LOGFILE exit 0 fi # 备库不检查存活,主库更新状态 echo $SQL3 | $PGHOME/bin/psql -h $PGMIP -p $PGPORT -d $PGDATABASE -U $PGUSER -At -w if [ $? -eq 0 ] ; then echo $SQL2 | $PGHOME/bin/psql -h $PGMIP -p $PGPORT -d $PGDATABASE -U $PGUSER -At -w echo -e date %F %T Success: update the master sr_delay successed! $LOGFILE exit 0 else echo -e date %F %T Error:Is the server is running? $LOGFILE exit 1 fi切换脚本failover.sh主库宕机后keepalived调用执行切换脚本/etc/keepalived/failover.sh#!/bin/bash export PGPORT5432 export PGUSERkeepalived export PG_OS_USERkeepalived export PGDATA/data/pgdata export PGDBNAMEkeepalived export LANGzh_CN.UTF-8 export PGPATH/usr/pgsql-16/bin export PATH$PATH:$PGPATH export PGMIP192.168.18.11 LOGFILE/etc/keepalived/log/failover.log # master-to-slave delay sr_allowed_delay_time10 SQL1select pg_is_in_recovery from pg_is_in_recovery(); SQL2select last_alive as delay_time from sr_delay where now()-last_alive interval 60; db_roleecho $SQL1 | psql -At -h $PGMIP -p $PGPORT -U $PGUSER -d $PGDBNAME -w db_sr_delaytimeecho $SQL2 | psql -h $PGMIP -p $PGPORT -d $PGDBNAME -U $PGUSER -At -w SWITCH_COMMANDpg_ctl promote -D $PGDATA # slave switchover to master if delay large than specical second if [ $db_role f ];then echo -e date %F %T Attention:The current postgreSQL DB is master database,cannot switched! $LOGFILE exit 0 fi if [ $db_sr_delaytime -gt 0 ];then echo -e date %F %T Attention:The current master database is health,the standby DB cannot switched! $LOGFILE exit 0 fi echo $db_sr_delaytime if [ -z $db_sr_delaytime ];then echo -e date %F %T Attention:The current database is statndby,ready to switch master database! $LOGFILE pg_ctl promote -D $PGDATA sed -i s/primary_conninfo/#primary_conninfo/ $PGDATA/postgresql.auto.conf pg_ctl restart -D $PGDATA elif [ $? eq 0 ];then echo -e date %F %T success:The current standby database successed to switched the primary PG database ! $LOGFILE exit 0 else echo -e date %F %T Error: the standby database failed to switch the primary PG database !,pelease checked it! $LOGFILE exit 1 fi脚本fault.shkeepalived 进入错误状态时执行的脚本。#!/bin/bash LOGGFILE/etc/keepalived/log/pg_db_fault.log PGPORT5432 PGMIP192.168.18.11 echo -e date %F %T Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database! $LOGFILE export PGPIDnetstat -anp|grep $PGPORT |awk {printf $7}|cut -d/ -f1 service keepalived stop kill -9 $PGPID if [ $? eq 0 ];then echo -e date %F %T Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database! $LOGFILE service keepalived stop exit 1 fi备节点配置备库配置1、把master的keepalived.conf文件和sh脚本文件复制到备机。2、修改keepalived.conf的router_id为pg_xc2priority为90即可。四、keepakived启动及状态查看1、MASTER启动keepalived 服务器# service keepalived start2、MASTER查看keepalived 状态# systemctl status keepalived[rootclup-host-2422 keepalived]# systemctl status keepalived.service ● keepalived.service - LVS and VRRP High Availability Monitor Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled) Active: active (running) since Tue 2026-05-26 14:47:43 CST; 22min ago Process: 64028 ExecStart/usr/sbin/keepalived $KEEPALIVED_OPTIONS (codeexited, status0/SUCCESS) Main PID: 64029 (keepalived) Tasks: 2 (limit: 11145) Memory: 1.6M CGroup: /system.slice/keepalived.service ├─64029 /usr/sbin/keepalived -D └─64030 /usr/sbin/keepalived -D3、查看后台日志# tail -f /var/log/messages4、查看虚拟ip(ifconfig看不到虚拟IP)# ip a[postgresclup-host-2423 pgdata]$ ip a 1: lo: LOOPBACK,UP,LOWER_UP mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: ens10: BROADCAST,MULTICAST,UP,LOWER_UP mtu 1500 qdisc fq_codel state UP group default qlen 1000 link/ether 52:54:00:c5:a7:7a brd ff:ff:ff:ff:ff:ff altname enp0s10 inet 192.168.18.11/20 brd 10.197.175.255 scope global dynamic noprefixroute ens10 valid_lft 3333sec preferred_lft 3333sec inet 192.168.18.100/32 scope global ens10 valid_lft forever preferred_lft forever五、主备切换进入主库把主库停了pg_ctl stop -D $PGDATA查看keepalived日志tail -f /var/log/messagesMay 26 15:28:49 clup-host-2422 Keepalived[71251]: Starting VRRP child process, pid71252 May 26 15:28:49 clup-host-2422 systemd[1]: Started LVS and VRRP High Availability Monitor. May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: Registering Kernel netlink reflector May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: Registering Kernel netlink command channel May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: Opening file /etc/keepalived/keepalived.conf. May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: (/etc/keepalived/keepalived.conf: Line 22) Truncating auth_pass to 8 characters May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: Assigned address 192.168.18.11 for interface ens10 May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: Registering gratuitous ARP shared channel May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: (VI_1) removing VIPs. May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: (VI_1) Entering BACKUP STATE (init) May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: VRRP sockpool: [ifindex( 2), family(IPv4), proto(112), fd(11,12)] May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: VRRP_Script(check_pg_alived) succeeded May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: (VI_1) Changing effective priority from 70 to 100 May 26 15:29:44 clup-host-2422 su[71403]: (to postgres) root on pts/1 May 26 15:29:59 clup-host-2422 Keepalived_vrrp[71252]: Script check_pg_alived now returning 1 May 26 15:30:05 clup-host-2422 systemd[1]: Starting system activity accounting tool... May 26 15:30:05 clup-host-2422 systemd[1]: sysstat-collect.service: Succeeded. May 26 15:30:05 clup-host-2422 systemd[1]: Started system activity accounting tool. May 26 15:30:09 clup-host-2422 Keepalived_vrrp[71252]: VRRP_Script(check_pg_alived) failed (exited with status 1) May 26 15:30:09 clup-host-2422 Keepalived_vrrp[71252]: (VI_1) Changing effective priority from 100 to 70VRRP_Script(check_pg_alived) failed (exited with status 1)可以看到检测到pg未存活然后优先权-权重变低小于备库的优先权VIP漂移到备库然后我们在主备库查看一下vip在哪# 主库 [postgresclup-host-2422 ~]$ ip a 1: lo: LOOPBACK,UP,LOWER_UP mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: ens10: BROADCAST,MULTICAST,UP,LOWER_UP mtu 1500 qdisc fq_codel state UP group default qlen 1000 link/ether 52:54:00:c5:a7:79 brd ff:ff:ff:ff:ff:ff altname enp0s10 inet 192.168.18.11/20 brd 10.197.175.255 scope global dynamic noprefixroute ens10 valid_lft 1776sec preferred_lft 1776sec # 备库 [rootclup-host-2423 keepalived]# ip a 1: lo: LOOPBACK,UP,LOWER_UP mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: ens10: BROADCAST,MULTICAST,UP,LOWER_UP mtu 1500 qdisc fq_codel state UP group default qlen 1000 link/ether 52:54:00:c5:a7:7a brd ff:ff:ff:ff:ff:ff altname enp0s10 inet 192.168.18.12/20 brd 10.197.175.255 scope global dynamic noprefixroute ens10 valid_lft 1939sec preferred_lft 1939sec inet 192.168.18.100/32 scope global ens10 valid_lft forever preferred_lft forever可以看到此时第三方主机通过vip连接到的是新的主库数据库(说明vip和备库切换到主库成功)备库切换到主库需要额外操作完成所有切换(或者在切换脚本实现)可以在pg配置文件修改实现也可以在切换脚本中直接实现切换更新中ing....问题问题一、因为我设置的初始优先权是100权重是30在我关闭keepalived的时候给我加上权重算为130的优先权10020【(VI_1) Changing effective priority from 100 to 130】启动后看到是130-30变成100的优先权还是比备库高当时我的备库的优先权是90导致漂移失败了后面把优先权变成70就成功了问题二、执行脚本出错这个脚本使用psql -At -h localhost -p $PGPORT -U $PGUSER -d $PGDBNAME -w因为这个-w禁止密码提醒在设.pgpass的时候放在的目录是postgres家目录下启动keepalived的时候没有成功用到.pgpass发现是要用到root下所以把文件复制到/root.pgpass文件localhost:5432:keepalived:keepalived:keepalived 192.168.18.11:5432:keepalived:keepalived:keepalived 192.168.18.12:5432:keepalived:keepalived:keepalivedKeepalived主备竞选规则Keepalived的主备角色是可以改变的其改变策略是1、如果weight值设置为正整数当主库的vrrp_script脚本执行成功其权值为priorityweight而备库的权值为priorityweight的值。那么此时主库的和大于备库的和。2、如果weight值设置为正整数当主库的vrrp_script脚本执行失败其权值为priority设置的值;而备库的权值为priorityweight如果其和大于主库的priority则备库会变成主库。注意weight设置的值必须要大于主备priority之间的差。比如主库priority设为100备库priority设为80weight则要设置大于100-8020的差。通过观察/var/log/messages可以看到权值的变化。后续更新自动切换脚本CLup简介https://www.csudata.com/clup/manual