Background

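During maintenance, a Slave node of Redis cluster A (a little over 400 nodes) was restarted. After the restart, business teams began reporting that some reads and writes on cluster A were failing, and that data in cluster B was missing and inaccessible. Clusters A and B each have more than 300 nodes, so two very large Redis clusters failed at the same time.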

Incident handling

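Each of the two Redis clusters stores several TB of data, and the business depends heavily on the Redis service, so the business was hit as soon as the clusters failed.

Because the maintenance had been performed on cluster A, when business A reported problems we did not realize that cluster B had failed as well, and we focused on analyzing cluster A. Some slots in cluster A were unassigned, and the fix command of redis-trib, the official Redis tool, failed. We then ran CLUSTER DELSLOTS and CLUSTER SETSLOT against the unassigned slots of cluster A. After that, business A reported that service had still not recovered, and at the same time business B reported that its cluster returned no data and that some reads and writes were failing.

A combined analysis showed that the metadata of cluster A and cluster B had become mixed up: the two clusters had "merged". CLUSTER INFO reported more than 700 nodes, and the total node count equaled the node count of cluster A plus the node count of cluster B. Based on this assessment, the following plan was drawn up: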

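  • Try to repair cluster A, and at the same time prepare to rebuild it
  • Rebuild cluster B, with the business re-initializing the data in cluster B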
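
The symptom that exposed the merge, a node count in CLUSTER INFO larger than the cluster should have, can be checked mechanically. The following is a minimal sketch, not the tooling used during the incident. It assumes the raw text of CLUSTER INFO is already available as a string; the helper names `parse_cluster_info` and `looks_merged` are hypothetical, and the sample values are illustrative (8 and 6 are the sizes of the two test clusters used later in this article).

```python
def parse_cluster_info(text):
    """Turn the `key:value` lines of CLUSTER INFO into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # ignore lines that are not key:value pairs
            info[key] = value
    return info


def looks_merged(info, expected_nodes):
    """True when the node table holds more nodes than the cluster should have."""
    return int(info["cluster_known_nodes"]) > expected_nodes


# Illustrative sample: an 8-node cluster that has absorbed a 6-node cluster.
SAMPLE_INFO = """\
cluster_state:fail
cluster_known_nodes:14
"""

info = parse_cluster_info(SAMPLE_INFO)
print("merged:", looks_merged(info, expected_nodes=8))
```

Comparing `cluster_known_nodes` against an inventory count is only a coarse alarm; it says that foreign nodes are present, not which ones they are.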

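During execution, traffic was not switched to the standby cluster for A promptly once that cluster had been built. The difficulty and complexity of repairing the metadata of a very large Redis cluster were underestimated, which prolonged the outage. The work done to repair cluster A was as follows: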

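  • Removed the master nodes in cluster A that had no slots assigned
  • Removed the 300-plus Slave nodes in cluster A that had not previously belonged to it
  • Analyzed and compared the slots and nodes whose node information was inconsistent within cluster A
  • Tried to fix the nodes with inconsistent metadata by hand, which failed
  • Rebuilt cluster A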
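
The first steps of that list amount to classifying the lines of CLUSTER NODES output. The sketch below shows one way to do it, under stated assumptions: the set of node IDs that legitimately belong to cluster A comes from some external inventory, the function names are hypothetical, and the sample mixes three lines from the cluster A listing and one line from the cluster B listing shown later in this article to imitate a merged node table.

```python
TOTAL_SLOTS = 16384  # number of hash slots in Redis Cluster


def parse_cluster_nodes(text):
    """Parse CLUSTER NODES output into a list of dicts."""
    nodes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # skip blank or truncated lines
        nodes.append({
            "id": parts[0],
            "addr": parts[1],
            "flags": set(parts[2].split(",")),
            "slots": parts[8:],  # empty for slaves and slotless masters
        })
    return nodes


def unassigned_slots(nodes):
    """Count the slots that no node in the table claims."""
    covered = set()
    for node in nodes:
        for item in node["slots"]:
            if item.startswith("["):
                continue  # migrating/importing marker, not an owned range
            first, _, last = item.partition("-")
            covered.update(range(int(first), int(last or first) + 1))
    return TOTAL_SLOTS - len(covered)


def cleanup_candidates(nodes, expected_ids):
    """Return (foreign node IDs, IDs of masters that own no slots)."""
    foreign = [n["id"] for n in nodes if n["id"] not in expected_ids]
    slotless = [n["id"] for n in nodes
                if "master" in n["flags"] and not n["slots"]]
    return foreign, slotless


# Illustrative merged view: three cluster A nodes plus one cluster B slave.
SAMPLE = """\
c5d1bae337c49a765fd61c388ba3910c9e34022e 192.168.93.82:6379 master - 0 1540210770601 1 connected 0-5460
22150a5ae29b0a502cec1453ee5247df9e04e7e8 192.168.17.136:6379 myself,master - 0 0 2 connected 5461-10922
ba1d2b004dbc0a9d66c915a58a8a1214ff862d26 192.168.17.171:6381 slave 22150a5ae29b0a502cec1453ee5247df9e04e7e8 0 1540210771601 2 connected
807b014766a88a8415d28693d102af6e592f43ab 192.168.17.171:6383 slave a68eda5e3e7a38233320148cccad4068b858af8f 0 1540211962363 3 connected
"""

EXPECTED_A = {
    "c5d1bae337c49a765fd61c388ba3910c9e34022e",
    "22150a5ae29b0a502cec1453ee5247df9e04e7e8",
    "ba1d2b004dbc0a9d66c915a58a8a1214ff862d26",
}

nodes = parse_cluster_nodes(SAMPLE)
foreign, slotless = cleanup_candidates(nodes, EXPECTED_A)
for node_id in foreign:
    print("CLUSTER FORGET", node_id)
print("unassigned slots:", unassigned_slots(nodes))
```

CLUSTER FORGET has to be sent to every remaining node within the ban window, otherwise gossip from a node that still remembers the forgotten node brings it back. That is part of why cleaning up hundreds of nodes by hand is so error-prone.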

Root cause analysis

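The root cause of this incident was that the two clusters used the same instances, so the two clusters exchanged topology information with each other and their topologies became scrambled.
The CLUSTER MEET command is used to connect different Redis nodes that have cluster support enabled into a working cluster. The basic idea is that nodes do not trust each other by default and are considered unknown, so that it is unlikely that nodes of different clusters get mixed into a single cluster because of a system administration error or a change of network addresses. Therefore, for a given node to accept another node into the list of nodes that make up a Redis Cluster, there are only two ways: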

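  • The system administrator sends a CLUSTER MEET command to force a node to meet another node
  • An already known node sends a list of nodes in its gossip section that includes unknown nodes. If the receiving node already trusts the sender as a known node, it processes the gossip section and sends a handshake message to the unknown nodes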

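This incident was triggered by the second rule. A Redis Cluster needs to form a full mesh (every node connected to every other node), but creating a cluster does not require sending every CLUSTER MEET command needed to form that mesh. It is enough to send CLUSTER MEET messages so that each node can reach every other node through a chain of known nodes. Because gossip information is exchanged in heartbeat packets, the information in the gossip section is used to create the missing links between nodes. So, if we link node A with node B via CLUSTER MEET, and node B is linked with node C, then node A and node C will find a way to handshake and create a link.
Another example: if we imagine a cluster formed of four nodes called A, B, C and D, we may send just the following set of commands to node A: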

CLUSTER MEET B-ip B-port
CLUSTER MEET C-ip C-port
CLUSTER MEET D-ip D-port

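As a side effect of A knowing, and being known by, all the other nodes, A includes a gossip section in the heartbeat packets it sends. This allows every other node to create a link with each of the others, forming a full mesh within seconds even if the cluster is large. CLUSTER MEET also does not need to be reciprocal: if the command is sent to A to add B, there is no need to send it to B to add A as well. When exchanging gossip, nodes use MEET packets and ordinary PING packets. The two packet types have exactly the same format; the only difference is that a MEET packet forces the receiver to accept the sender as a trusted node.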
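
The claim that a chain of known nodes is enough can be checked with a small reachability sketch. This is an abstraction, not Redis code: each CLUSTER MEET is treated as an undirected edge (the command does not need to be reciprocal), and gossip is assumed to eventually link every pair of nodes that can reach each other. The function name `mesh_after_gossip` is hypothetical.

```python
from collections import defaultdict


def mesh_after_gossip(meets):
    """Map each node to the peers it ends up linked with.

    `meets` is an iterable of (a, b) pairs, one per CLUSTER MEET command.
    """
    adjacent = defaultdict(set)
    for a, b in meets:
        adjacent[a].add(b)
        adjacent[b].add(a)

    mesh = {}
    for start in adjacent:
        seen = {start}
        stack = [start]
        while stack:  # depth-first walk over the chain of known nodes
            node = stack.pop()
            for peer in adjacent[node]:
                if peer not in seen:
                    seen.add(peer)
                    stack.append(peer)
        mesh[start] = seen - {start}
    return mesh


# The three commands sent to node A in the example above.
star = mesh_after_gossip([("A", "B"), ("A", "C"), ("A", "D")])
# A chain of MEETs works equally well.
chain = mesh_after_gossip([("A", "B"), ("B", "C"), ("C", "D")])

for name in sorted(star):
    print(name, sorted(star[name]))
```

The same property is what makes a single accidental link between two clusters so dangerous: one edge is enough for every node on both sides to become reachable.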

Reproducing and tracing the fault

  • Initial state of cluster A
a19aba0786f82818e94d101de5920afefe82b7b2 192.168.17.136:6380 slave fb0e649e5708cf48cd7aa6095f317e46c1421337 0 1540210772603 3 connected
22150a5ae29b0a502cec1453ee5247df9e04e7e8 192.168.17.136:6379 myself,master - 0 0 2 connected 5461-10922
ba1d2b004dbc0a9d66c915a58a8a1214ff862d26 192.168.17.171:6381 slave 22150a5ae29b0a502cec1453ee5247df9e04e7e8 0 1540210771601 2 connected
1fba1402f46edd3fa5d7433261ace5c857c12ce6 192.168.93.82:6380 slave c5d1bae337c49a765fd61c388ba3910c9e34022e 0 1540210766594 1 connected
fb0e649e5708cf48cd7aa6095f317e46c1421337 192.168.17.171:6379 master - 0 1540210768596 3 connected 10923-16383
3d7eaeb39afd9b95e20da53f319fa12863ce5ea2 192.168.17.136:6381 slave 22150a5ae29b0a502cec1453ee5247df9e04e7e8 0 1540210769599 2 connected
ffcbd7d9a110ef6770ff6187438f745846871b4f 192.168.17.171:6380 slave 22150a5ae29b0a502cec1453ee5247df9e04e7e8 0 1540210767095 2 connected
c5d1bae337c49a765fd61c388ba3910c9e34022e 192.168.93.82:6379 master - 0 1540210770601 1 connected 0-5460
  • Shut down one Slave node in cluster A

    //Cluster A after one Slave node has been shut down
    c5d1bae337c49a765fd61c388ba3910c9e34022e 192.168.93.82:6379 master - 0 1540210890437 1 connected 0-5460
    3d7eaeb39afd9b95e20da53f319fa12863ce5ea2 192.168.17.136:6381 slave 22150a5ae29b0a502cec1453ee5247df9e04e7e8 0 1540210888435 2 connected
    fb0e649e5708cf48cd7aa6095f317e46c1421337 192.168.17.171:6379 myself,master - 0 0 3 connected 10923-16383
    1fba1402f46edd3fa5d7433261ace5c857c12ce6 192.168.93.82:6380 slave c5d1bae337c49a765fd61c388ba3910c9e34022e 0 1540210889437 1 connected
    ba1d2b004dbc0a9d66c915a58a8a1214ff862d26 192.168.17.171:6381 slave,fail 22150a5ae29b0a502cec1453ee5247df9e04e7e8 1540210828604 1540210827302 2 disconnected
    ffcbd7d9a110ef6770ff6187438f745846871b4f 192.168.17.171:6380 slave 22150a5ae29b0a502cec1453ee5247df9e04e7e8 0 1540210889437 2 connected
    22150a5ae29b0a502cec1453ee5247df9e04e7e8 192.168.17.136:6379 master - 0 1540210886931 2 connected 5461-10922
    a19aba0786f82818e94d101de5920afefe82b7b2 192.168.17.136:6380 slave fb0e649e5708cf48cd7aa6095f317e46c1421337 0 1540210887432 3 connected

    //Partial log
    29065:S 22 Oct 12:21:52.989 . --- Processing packet of type 1, 2520 bytes
    29065:S 22 Oct 12:21:52.989 . pong packet received: 0x7fdc75c68800
    29065:S 22 Oct 12:21:52.989 . GOSSIP a19aba0786f82818e94d101de5920afefe82b7b2 192.168.17.136:6380 slave
    29065:S 22 Oct 12:21:52.989 . GOSSIP fb0e649e5708cf48cd7aa6095f317e46c1421337 192.168.17.171:6379 master
    29065:S 22 Oct 12:21:52.989 . GOSSIP 3d7eaeb39afd9b95e20da53f319fa12863ce5ea2 192.168.17.136:6381 slave
    29065:S 22 Oct 12:21:53.089 . Connecting with Node ba1d2b004dbc0a9d66c915a58a8a1214ff862d26 at 192.168.17.171:16381
    29065:S 22 Oct 12:21:53.089 . I/O error reading from node link: Connection refused
  • Create cluster B (on 192.168.17.171, node.conf is wiped before the instance joins cluster B; an instance that still has its old node.conf cannot join cluster B)

//Cluster B topology
e71583731ec437dc82b862f613c7415fbdc696b7 192.168.17.171:6382 master - 0 1540211963366 1 connected 0-5460
ec8beb0b2bee99d604fb51772c04482eab830555 192.168.93.82:6382 master - 0 1540211965371 5 connected 10923-16383
807b014766a88a8415d28693d102af6e592f43ab 192.168.17.171:6383 slave a68eda5e3e7a38233320148cccad4068b858af8f 0 1540211962363 3 connected
a68eda5e3e7a38233320148cccad4068b858af8f 192.168.17.136:6382 myself,master - 0 0 3 connected 5461-10922
14151d57aa7f22bb63ef72882b4a28fa9805620e 192.168.93.82:6383 slave ec8beb0b2bee99d604fb51772c04482eab830555 0 1540211964370 6 connected
b0f1c2e636eb0e73d26f3dd26d046959dc188bb4 192.168.17.136:6383 slave e71583731ec437dc82b862f613c7415fbdc696b7 0 1540211961361 4 connected
//Cluster A topology
a19aba0786f82818e94d101de5920afefe82b7b2 192.168.17.136:6380 slave fb0e649e5708cf48cd7aa6095f317e46c1421337 0 1540212153724 3 connected
22150a5ae29b0a502cec1453ee5247df9e04e7e8 192.168.17.136:6379 myself,master - 0 0 2 connected 5461-10922
ba1d2b004dbc0a9d66c915a58a8a1214ff862d26 :0 slave,fail,noaddr 22150a5ae29b0a502cec1453ee5247df9e04e7e8 1540210828103 1540210827703 2 disconnected
1fba1402f46edd3fa5d7433261ace5c857c12ce6 192.168.93.82:6380 slave c5d1bae337c49a765fd61c388ba3910c9e34022e 0 1540212152223 1 connected
fb0e649e5708cf48cd7aa6095f317e46c1421337 192.168.17.171:6379 master - 0 1540212154726 3 connected 10923-16383
3d7eaeb39afd9b95e20da53f319fa12863ce5ea2 192.168.17.136:6381 slave 22150a5ae29b0a502cec1453ee5247df9e04e7e8 0 1540212155729 2 connected
ffcbd7d9a110ef6770ff6187438f745846871b4f 192.168.17.171:6380 slave 22150a5ae29b0a502cec1453ee5247df9e04e7e8 0 1540212156730 2 connected
c5d1bae337c49a765fd61c388ba3910c9e34022e 192.168.93.82:6379 master - 0 1540212152725 1 connected 0-5460

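At this point both clusters work normally. Restarting any node in cluster A does not trigger a MEET merge, although the two clusters can still be merged by issuing MEET manually.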

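Next, add the information of a node from cluster B to the node.conf of a node in cluster A, keeping the node name identical, and restart that cluster A node. Cluster B is now joined to cluster A via MEET. The log is as follows: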

32016:S 02 Nov 19:07:29.988 # Configuration change detected. Reconfiguring myself as a replica of 44001fb54734a7fbd7d6941e841fc428d3fe9a9f
32016:S 02 Nov 19:07:29.988 # Connection with master lost.
32016:S 02 Nov 19:07:29.988 * Caching the disconnected master state.
32016:S 02 Nov 19:07:29.988 * Discarding previously cached master state.
32016:S 02 Nov 19:07:30.920 * Connecting to MASTER 192.168.17.171:6379
32016:S 02 Nov 19:07:30.920 * MASTER <-> SLAVE sync started
32016:S 02 Nov 19:07:30.920 * Non blocking connect for SYNC fired the event.
32016:S 02 Nov 19:07:30.920 * Master replied to PING, replication can continue...
32016:S 02 Nov 19:07:30.921 * Partial resynchronization not possible (no cached master)
32016:S 02 Nov 19:07:30.922 * Full resync from master: 500dc623866f255ccb6d9ce2947ea19547a60499:757
32016:S 02 Nov 19:07:30.944 * MASTER <-> SLAVE sync: receiving 18 bytes from master
32016:S 02 Nov 19:07:30.945 * MASTER <-> SLAVE sync: Flushing old data
32016:S 02 Nov 19:07:30.945 * MASTER <-> SLAVE sync: Loading DB in memory
32016:S 02 Nov 19:07:30.945 * MASTER <-> SLAVE sync: Finished with success
32016:S 02 Nov 19:07:31.626 # Cluster state changed: fail
32016:S 02 Nov 19:07:34.500 # Cluster state changed: ok
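
The two experiments above (a reused host with a fresh node ID, then a node.conf entry whose node name matches a live cluster B node) can be modeled with a toy simulation of the trust rule. This is a deliberately simplified sketch, not the Redis implementation: every packet is assumed to gossip the sender's full node table, node IDs are short made-up strings, a packet addressed to an ID that no longer runs is simply dropped, and all function names are hypothetical.

```python
def receive(known, sender, receiver, kind):
    """Apply one bus packet; return the node IDs the receiver newly learned."""
    if sender not in known[receiver]:
        if kind != "MEET":
            return set()  # PING/PONG from a stranger: gossip is ignored
        known[receiver].add(sender)  # MEET forces trust in the sender
    learned = known[sender] - known[receiver] - {receiver}
    known[receiver] |= learned
    return learned


def handshake(known, alive, pairs):
    """Send MEET for each (src, dst) pair until nothing new is learned."""
    todo = list(pairs)
    while todo:
        src, dst = todo.pop()
        if dst not in alive:
            continue  # connection refused, nothing is delivered
        for node in receive(known, src, dst, "MEET"):
            todo.append((dst, node))


def heartbeat_round(known, alive):
    """Each live node PINGs every node it knows and processes the PONG."""
    changed = False
    for a in sorted(alive):
        for b in sorted(known[a]):
            if b not in alive:
                continue
            for src, dst in ((a, b), (b, a)):  # PING, then PONG
                learned = receive(known, src, dst, "PING")
                if learned:
                    changed = True
                    handshake(known, alive, [(dst, n) for n in learned])
    return changed


def settle(known, alive, max_rounds=20):
    """Run heartbeat rounds until the node tables stop changing."""
    for _ in range(max_rounds):
        if not heartbeat_round(known, alive):
            break
    return known


def build():
    """Cluster A = a1, a2, a3 (a3 is stopped); cluster B = b1, b2."""
    known = {
        "a1": {"a2", "a3"}, "a2": {"a1", "a3"},
        "b1": {"b2"}, "b2": {"b1"},
    }
    alive = {"a1", "a2", "b1", "b2"}
    return known, alive


# Experiment 1: no node of A has an entry for any node ID of B.
separate = settle(*build())

# Experiment 2: a1 restarts with an entry whose ID matches live node b1.
known, alive = build()
known["a1"].add("b1")
merged = settle(known, alive)

for name in sorted(alive):
    print(name, sorted(merged[name]))
```

In the second run a single trusted entry is enough: a1 accepts the gossip in b1's PONG, handshakes with the nodes it learns about using MEET, and the MEET packets force the cluster B nodes to trust cluster A in return.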

Key code

/* Excerpt from clusterProcessPacket() in cluster.c */

/* If this is a MEET packet from an unknown node, we still process
 * the gossip section here since we have to trust the sender because
 * of the message type. */
/* Author's note: the sender is an unknown node and the packet is a MEET,
 * so the nodes listed in its gossip section are merged into our own
 * nodes dictionary. */
if (!sender && type == CLUSTERMSG_TYPE_MEET)
    clusterProcessGossipSection(hdr,link);

/* Reply with a PONG */
clusterSendPing(link,CLUSTERMSG_TYPE_PONG);
...

/* Excerpt from clusterCron() in cluster.c */
if (node->link == NULL) {
    int fd;
    mstime_t old_ping_sent;
    clusterLink *link;

    /* Connect to the cluster bus port of the node */
    fd = anetTcpNonBlockBindConnect(server.neterr, node->ip,
        node->port+REDIS_CLUSTER_PORT_INCR, REDIS_BIND_ADDR);
    ...
    old_ping_sent = node->ping_sent;
    /* A node still flagged MEET gets a MEET packet instead of a PING */
    clusterSendPing(link, node->flags & REDIS_NODE_MEET ?
            CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);
    ...
}
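
The three decisions made in these excerpts are small enough to restate as plain functions. The sketch below is an illustrative Python rendering, not a port of the Redis source: the function names are hypothetical, and flags are modeled as a set of strings rather than a bit field. The port offset of 10000 is the value of REDIS_CLUSTER_PORT_INCR, which matches the "Connecting with Node ... at 192.168.17.171:16381" line in the log earlier for a node listening on 6381.

```python
CLUSTER_PORT_INCR = 10000  # value of REDIS_CLUSTER_PORT_INCR


def bus_address(ip, port):
    """Address of a node's cluster bus, derived from its data port."""
    return ip, port + CLUSTER_PORT_INCR


def first_packet_type(flags):
    """Packet sent on a freshly created link: MEET while the flag is set."""
    return "MEET" if "meet" in flags else "PING"


def should_process_gossip(sender_known, packet_type):
    """Gossip is accepted from known senders, or from anyone sending MEET."""
    return sender_known or packet_type == "MEET"


print(bus_address("192.168.17.171", 6381))
print(first_packet_type({"handshake", "meet"}))
print(should_process_gossip(sender_known=False, packet_type="PING"))
```

Read together, the last two functions describe the merge path: a node discovered through gossip is contacted with MEET, and the receiver of a MEET accepts the sender's gossip even though it has never seen that sender before.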