I have observed the same situation a couple of times in the past year. Everything looks normal in Event Viewer, server component state, service status, and database copy health, but database activation fails every time I try to activate a database on one of the nodes in a 4-node DAG cluster. All operations on the databases work on the other 3 nodes, but not on this one. The issue occurs after a reboot, and it does not resolve no matter how long we wait or even after another reboot.
The errors received are listed below:
An Active Manager operation failed. Error: The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error:
A server-side administrative operation has failed. The Microsoft Exchange Replication service may not be running on server server.domain.local. Specific RPC error message: Error 0x6ba (The RPC server is unavailable) from cli_RpcsGetCopyStatusWithHealthState [Server: server.domain.local]
[Database: DB01, Server: PAM Server]
Failed to mount database " DB01". Error: An Active Manager operation failed. Error: The database action failed. Error: The Microsoft Exchange Replication service may not be running on server server.domain.local. Specific RPC error message: Error 0x6ba (The RPC server is unavailable) from cli_AmMountDatabaseDirect3 [Database: DB01, Server: PAM Server]
The error is very common, and a search will flood you with resolutions. However, all of them involve lengthy procedures such as (but not limited to):
- Verify the health of AD
- Look at the event log for AD replication errors
- Check the integrity of database
- Verify the cluster quorum status
- And so on
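For reference, the checks listed above can be sketched with a few standard commands. This is a minimal sketch, not a full diagnostic procedure: the server name is a placeholder, and it assumes the commands are run from the Exchange Management Shell on a DAG member, with repadmin available on the machine.

```powershell
# Placeholder: replace with the DAG member where activation fails
$server = "server.domain.local"

# AD replication summary (repadmin ships with the AD DS management tools)
repadmin /replsummary

# Replication service state and DAG replication health on the affected node
Get-Service -ComputerName $server -Name MSExchangeRepl
Test-ReplicationHealth -Identity $server

# Health of every database copy hosted on the affected node
Get-MailboxDatabaseCopyStatus -Server $server

# Cluster membership and quorum configuration
Import-Module FailoverClusters
Get-ClusterNode
Get-ClusterQuorum
```

In the situation described in this post, checks like these all came back healthy, which is why the workaround below is worth trying first.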
Solution that worked for me (on both occasions):
Move the PAM to a different healthy node in the DAG (not to the server where activation failed), then repeat the operation. If you only have a 2-node DAG, you may have no choice but to try the other possible solutions mentioned earlier in this post. However, it is still worth checking whether more than one IP address is assigned to the DAG members in the production network range.
Move-ClusterGroup "Cluster Group" -Node NODENAME
After running the above command, verify that the PAM role has moved:
Get-DatabaseAvailabilityGroup -Status -Identity DAGNAME | fl name,primaryActiveManager
Once you confirm that the PAM role is now held by another healthy node, try to mount/activate the database on the server where the issue was observed. It should just work!
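The retry itself might look like the sketch below. DB01 is the database name from the errors above; NODENAME is a placeholder for the server where activation previously failed.

```powershell
# Placeholder: the DAG member where activation previously failed
$target = "NODENAME"

# Activate the DB01 copy on the previously failing server
Move-ActiveMailboxDatabase -Identity "DB01" -ActivateOnServer $target

# Confirm that the copy on that server is now the mounted (active) one
Get-MailboxDatabaseCopyStatus -Identity "DB01\$target" | fl Name,Status,ActiveDatabaseCopy
```

If the database is already active on that server but dismounted, Mount-Database -Identity "DB01" is the equivalent retry.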
Additional Info: I found this when more than one IP address was assigned to a single network card on the Exchange servers for relay/HA purposes. If those IPs are reassigned to other nodes of the cluster, a communication gap between the PAM and the node can occur. There is no harm in moving the PAM before you take major steps to restore cluster communication.
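To check whether a node carries more than one IP address on a single network card, something like the following can be run on each DAG member. This is a rough sketch, assuming Windows Server 2012 R2 or later (for Get-NetIPAddress and Test-NetConnection); the server name is a placeholder.

```powershell
# List network interfaces that have more than one IPv4 address assigned
Get-NetIPAddress -AddressFamily IPv4 |
    Group-Object InterfaceAlias |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group | Select-Object InterfaceAlias, IPAddress }

# Show which addresses the cluster itself sees for each node
Import-Module FailoverClusters
Get-ClusterNetworkInterface | ft Node,Network,Address,State

# From the PAM, test basic RPC reachability (endpoint mapper, TCP 135) of the affected node
Test-NetConnection -ComputerName "server.domain.local" -Port 135
```

A successful connection to port 135 only shows that the endpoint mapper is reachable. It does not prove that the Replication service's RPC endpoint is reachable on the address the PAM is using.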