Rejoining Old Primary Node as Secondary in SQL Server Always On Availability Group Post Forced Failover

cbeh67ev  于 2024-01-05  发布在  SQL Server
关注(0)|答案(2)|浏览(200)

Encountering synchronization issues after a forced failover in a three-node SQL Server Always On Availability Group setup.

I am facing a challenge in a SQL Server Always On Availability Group setup with a three-node configuration. Initially, node-1 was the primary node, and nodes-2 and -3 were secondary nodes.

A network failure occurred, causing node-1 to disconnect from the other two nodes. To address this, I performed a forced failover to node-2 using the command ALTER AVAILABILITY GROUP AG_NAME FORCE_FAILOVER_ALLOW_DATA_LOSS. Consequently, node-2 became the new primary node, and write operations continued.

Meanwhile, node-1, unaware of the forced failover, also continued its write operations independently.

Upon recovery from the network failure, attempting to rejoin node-1 to the availability group as a secondary node using the standard commands (ALTER AVAILABILITY GROUP [AG_NAME] SET (ROLE = SECONDARY); and ALTER DATABASE [AG_DB_NAME] SET HADR RESUME;) did not resolve the synchronization issue as outlined in Microsoft's documentation.

While it is acceptable to lose data on the old primary node, I am seeking a solution to rejoin it to the new primary without resorting to a fresh node approach, which involves dropping the entire database. Considering the substantial data size (5-6TB), such an approach is not feasible.

Any recommendations or solutions to address this issue would be highly appreciated. Thank you.

oyt4ldly

oyt4ldly1#

Since node 1 continued to take writes after node 2 was forced to be the primary instance, you very likely have a forked history scenario. The last thing I'd try before doing a full restore from backups is: can you restore differential/t-log backups (all specifying norecovery !) from node 2 onto node 1? If yes, do that until you've restored enough t-log such that node 1 is caught up to node 2. Once that's the case, you should be able to join that database into the AG.

But I suspect that you will not be able to do that. Trying to restore that differential backup will tell you pretty quickly whether you can or not. If not, your only option is to restore the entire database from backups onto node 1.

Lastly, regardless of the path you take, I'd do alter availability group [«your AG»] remove replica [node1]; while you're doing the restores. Why? As far as node2 is concerned, node1 could come back at any time and is holding t-log for that eventuality. You know better and that it's going to take a long time for that to happen. It's better to remove the replica now so that your active t-log doesn't exhaust your free disk space and add it back in as you're closer to being able to do so (i.e. you've done all the t-log restores you need to do).

nnsrf1az

nnsrf1az2#

Considering the substantial data size (5-6TB), such an approach is not feasible.

That’s the only way.

If you can't restore a database from a backup you can't operate. So if this is really "not feasible" you don't have a supportable, production-ready solution.

Operating a 6TB database requires infrastructure that can not only handle the normal day-to-day transaction processing, but also can handle these kinds of one-time expensive operations.

相关问题