How to configure VM Monitoring in Windows Server 2012

Windows Server 2012 Hyper-V: VM Monitoring

Overview

Do you have a large number of virtualized workloads in your cluster? Have you been looking for a solution that allows you to detect if any of the virtualized workloads in your cluster are behaving abnormally? Would you like the cluster service to take recovery actions when these workloads are in an unhealthy state? Windows Server 2012 includes a great new Failover Clustering feature called “VM Monitoring”, which does exactly that – it allows you to monitor the health state of applications running within a virtual machine and report it to the host level, so that the host can take recovery actions. You can monitor any Windows service (such as SQL or IIS) in your virtual machine, or any ETW event occurring in your virtual machine. When the condition you are monitoring is triggered, the Cluster service logs an event in the error channel on the host and takes recovery actions.

In this blog, I will provide a step by step guide of how you can configure VM Monitoring using the Failover Cluster Manager in Windows Server 2012.

Note: There are multiple ways to configure VM Monitoring. In this blog, I will cover the most common method. In a future blog, I will cover the many different flexible options for configuring VM Monitoring.

In Windows Server 2012 Failover Clustering, Microsoft offers a new feature called VM Monitoring. This feature allows you to monitor the health of applications running inside the guest operating system of a Hyper-V virtual machine. So how exactly does this work, and what happens when a service fails?

When a monitored service fails, the recovery settings configured for that service take action first.

On the first failure, the service is restarted by the Service Control Manager inside the guest operating system; if the service fails a second time, it is again restarted via the guest operating system. On the third failure, the Service Control Manager takes no action, and the Cluster service running on the Hyper-V host takes over the recovery actions.
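
For reference, you can inspect, and if necessary adjust, these per-service recovery settings from inside the guest with the Service Control Manager command-line tool. The snippet below is only a sketch that uses the Print Spooler as an example service; the reset period and restart delays are placeholder values:

 # Query the failure actions currently configured for the Print Spooler service
 sc.exe qfailure spooler

 # Restart on the first and second failure after a 60-second delay;
 # the failure count resets after 86400 seconds (one day)
 sc.exe failure spooler reset= 86400 actions= restart/60000/restart/60000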

 

[Figure: VM Monitoring – application monitoring sequence]

 

The Cluster service monitors the service through periodic health checks. When the Cluster service recognizes a failed service, it changes the status of the virtual machine to unhealthy, which triggers the following recovery actions:

  • An event log entry with Event ID 1250 is created in the host event log. This event can be picked up by monitoring software such as System Center Operations Manager, which also allows you to run other actions or trigger System Center Orchestrator runbooks.
  • The virtual machine state is changed to “Application in VM Critical”.
  • The virtual machine is restarted on the same node; if the service fails again, the virtual machine is restarted and failed over to another node in the cluster.

Of course you can configure the Recovery Settings in the Cluster.

[Figure: VM Monitoring – application monitoring recovery settings]

Configuring VM Monitoring

Pre-requisites

Before you can configure monitoring from Failover Cluster Manager on a management console, the following pre-steps are required:

1)      Configure the guest operating system running inside the virtual machine

a)      The guest operating system running inside the virtual machine must be running Windows Server 2012

b)      Ensure that the guest OS is a member of the same domain as the host, or of a domain that has a trust relationship with the host domain.

2)      Grant the cluster administrator permissions to manage the guest

a)      The administrator running Failover Cluster Manager must be a member of the local administrators group in the guest

3)      Enable the “Virtual Machine Monitoring” firewall rule on the guest

a)      Open the Windows Firewall console

b)      Select “Allow an app or feature through Windows Firewall”

c)       Click on “change settings” and enable the “Virtual Machine Monitoring” rule.

Note:

You can also enable the “Virtual Machine Monitoring” firewall rule using the Windows PowerShell® cmdlet Set-NetFirewallRule:

 Set-NetFirewallRule -DisplayGroup "Virtual Machine Monitoring" -Enabled True
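
If you would rather enable the rule from the host instead of logging on to each guest, something along the lines of the following should work, assuming PowerShell remoting is enabled in the guest and it is reachable by name (TestVM here is just a placeholder):

 # Enable the Virtual Machine Monitoring rule group inside the guest over PowerShell remoting
 Invoke-Command -ComputerName TestVM -ScriptBlock {
     Set-NetFirewallRule -DisplayGroup "Virtual Machine Monitoring" -Enabled True
 }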

Configuration

VM Monitoring can be easily configured using the Failover Cluster Manager through the following steps:

1)      Right-click the virtual machine role for which you want to configure monitoring

2)      Select “More Actions” and then the “Configure Monitoring” option

3)      You will then see a list of services that can be configured for monitoring using the Failover Cluster Manager

  

Note:

You will only see services listed that run in their own process, e.g., SQL, Exchange. The IIS and Print Spooler services are exempt from this rule. You can, however, set up monitoring for any NT service using Windows PowerShell® with the Add-ClusterVMMonitoredItem cmdlet – with no restrictions:

 Add-ClusterVMMonitoredItem -VirtualMachine TestVM -Service spooler
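
To review or undo what is being monitored, the companion cmdlets in the FailoverClusters module can be used. In the sketch below, the event log, source and ID in the last line are placeholders for whatever event condition you want to monitor, not values from this article:

 # List everything currently monitored inside the virtual machine
 Get-ClusterVMMonitoredItem -VirtualMachine TestVM

 # Stop monitoring the Print Spooler service
 Remove-ClusterVMMonitoredItem -VirtualMachine TestVM -Service spooler

 # Monitor a specific event instead of a service (placeholder log/source/ID)
 Add-ClusterVMMonitoredItem -VirtualMachine TestVM -EventLog "Application" -EventSource "MyApp" -EventId 1000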

How does VM Monitoring work?

When a monitored service encounters an unexpected failure, the sequence of recovery actions is determined by the recovery actions on failure configured for the service. These recovery actions can be viewed and configured using the Service Control Manager inside the guest. Typically, on the first and second service failures, the Service Control Manager restarts the service. On the third failure, the Service Control Manager takes no action and defers recovery to the Cluster service running on the host.

The Cluster service monitors the status of clustered virtual machines through periodic health checks. When the Cluster service determines that a virtual machine is in a “critical” state, i.e., an application or service inside the virtual machine is in an unhealthy state, the Cluster service takes the following recovery actions:

1)      Event ID 1250 is logged on the host

a.       This event can be monitored with tools such as System Center Operations Manager to trigger further customized actions 

2)      The virtual machine status in Failover Cluster Manager will indicate that the virtual machine is in an “Application Critical” state.

Note:  

  •          Verbose information is logged to the Cluster debug log for post-mortem analysis of failures.
  •          The StatusInformation resource common property for a virtual machine in “Application Critical” state has the value 2, compared to a value of 0 during normal operation. The Windows PowerShell® cmdlet Get-ClusterResource can be used to query this property (a short query sketch follows below).

Get-ClusterResource "TestVM" | fl StatusInformation

3)      Recovery action is taken on the virtual machine in “Application Critical” state

a.       The virtual machine is first restarted on the same node

Note: The restart of the virtual machine is forced but graceful

b.      On the second failure, the virtual machine is restarted and failed over to another node in the cluster.

Note: The decision on whether to failover or restart on the same node is configurable and determined by the failover properties for the virtual machine.
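
Building on the StatusInformation note and the failover properties mentioned above, the sketch below shows one way to spot virtual machines that are in the “Application Critical” state and to review the group properties that govern restart-versus-failover; TestVM is simply the example name used earlier:

 # List clustered VM resources whose application health monitoring flagged them as critical
 Get-ClusterResource | Where-Object {
     $_.ResourceType.Name -eq "Virtual Machine" -and $_.StatusInformation -eq 2
 } | Format-Table Name, OwnerGroup, OwnerNode, State

 # Review the failover properties that decide between restart-in-place and failover
 Get-ClusterGroup "TestVM" | Format-List Name, FailoverThreshold, FailoverPeriod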

 

That’s the VM Monitoring feature in Windows Server 2012 in a nutshell!

 

Subhasish Bhattacharya                                                                                                                
Program Manager                                                                                                           
Clustering & High Availability                                                                                       
Microsoft

Also refer to: Guest Clustering and VM Monitoring in Windows Server 2012

Understanding Quorum Configurations in a Failover Cluster – TechNet

Understanding Quorum Configurations in a Failover Cluster

Applies To: Windows Server 2008 R2

For information about how to configure quorum options, see Select Quorum Options for a Failover Cluster.

How the quorum configuration affects the cluster

The quorum configuration in a failover cluster determines the number of failures that the cluster can sustain. If an additional failure occurs, the cluster must stop running. The relevant failures in this context are failures of nodes or, in some cases, of a disk witness (which contains a copy of the cluster configuration) or file share witness. It is essential that the cluster stop running if too many failures occur or if there is a problem with communication between the cluster nodes. For a more detailed explanation, see Why quorum is necessary later in this topic.

Important
In most situations, use the quorum configuration that the cluster software identifies as appropriate for your cluster. Change the quorum configuration only if you have determined that the change is appropriate for your cluster.

Note that full function of a cluster depends not just on quorum, but on the capacity of each node to support the services and applications that fail over to that node. For example, a cluster that has five nodes could still have quorum after two nodes fail, but the level of service provided by each remaining cluster node would depend on the capacity of that node to support the services and applications that failed over to it.

Quorum configuration choices

You can choose from among four possible quorum configurations:

  • Node Majority (recommended for clusters with an odd number of nodes)

    Can sustain failures of half the nodes (rounding up) minus one. For example, a seven node cluster can sustain three node failures.

  • Node and Disk Majority (recommended for clusters with an even number of nodes)

    Can sustain failures of half the nodes (rounding up) if the disk witness remains online. For example, a six node cluster in which the disk witness is online could sustain three node failures.

    Can sustain failures of half the nodes (rounding up) minus one if the disk witness goes offline or fails. For example, a six node cluster with a failed disk witness could sustain two (3-1=2) node failures.

  • Node and File Share Majority (for clusters with special configurations)

    Works in a similar way to Node and Disk Majority, but instead of a disk witness, this cluster uses a file share witness.

    Note that if you use Node and File Share Majority, at least one of the available cluster nodes must contain a current copy of the cluster configuration before you can start the cluster. Otherwise, you must force the starting of the cluster through a particular node (see the example commands after this list). For more information, see “Additional considerations” in Start or Stop the Cluster Service on a Cluster Node.

  • No Majority: Disk Only (not recommended)

    Can sustain failures of all nodes except one (if the disk is online). However, this configuration is not recommended because the disk might be a single point of failure.
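
As referenced in the Node and File Share Majority bullet above, if no available node holds a current copy of the cluster configuration, the cluster must be force-started on a node you choose. The commands below are a hedged sketch (NODE1 is a placeholder name); run them only on the node you deliberately pick as authoritative:

 # Force the Cluster service to start without quorum on this node
 net start clussvc /forcequorum

 # PowerShell equivalent
 Start-ClusterNode -Name NODE1 -FixQuorum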

Illustrations of quorum configurations

The following illustrations show how three of the quorum configurations work. A fourth configuration is described in words, because it is similar to the Node and Disk Majority configuration illustration.

Note
In the illustrations, for all configurations other than Disk Only, notice whether a majority of the relevant elements are in communication (regardless of the number of elements). When they are, the cluster continues to function. When they are not, the cluster stops functioning.

[Illustration: Cluster with Node Majority quorum configuration]

As shown in the preceding illustration, in a cluster with the Node Majority configuration, only nodes are counted when calculating a majority.

[Illustration: Cluster with Node and Disk Majority quorum configuration]

As shown in the preceding illustration, in a cluster with the Node and Disk Majority configuration, the nodes and the disk witness are counted when calculating a majority.

Node and File Share Majority Quorum Configuration

In a cluster with the Node and File Share Majority configuration, the nodes and the file share witness are counted when calculating a majority. This is similar to the Node and Disk Majority quorum configuration shown in the previous illustration, except that the witness is a file share that all nodes in the cluster can access instead of a disk in cluster storage.

[Illustration: Cluster with Disk Only quorum configuration]

In a cluster with the Disk Only configuration (No Majority: Disk Only), the number of nodes does not affect how quorum is achieved. The disk is the quorum. However, if communication with the disk is lost, the cluster becomes unavailable.

Why quorum is necessary

When network problems occur, they can interfere with communication between cluster nodes. A small set of nodes might be able to communicate together across a functioning part of a network but not be able to communicate with a different set of nodes in another part of the network. This can cause serious issues. In this “split” situation, at least one of the sets of nodes must stop running as a cluster.

To prevent the issues that are caused by a split in the cluster, the cluster software requires that any set of nodes running as a cluster must use a voting algorithm to determine whether, at a given time, that set has quorum. Because a given cluster has a specific set of nodes and a specific quorum configuration, the cluster will know how many “votes” constitutes a majority (that is, a quorum). If the number drops below the majority, the cluster stops running. Nodes will still listen for the presence of other nodes, in case another node appears again on the network, but the nodes will not begin to function as a cluster until the quorum exists again.

For example, in a five node cluster that is using a node majority, consider what happens if nodes 1, 2, and 3 can communicate with each other but not with nodes 4 and 5. Nodes 1, 2, and 3 constitute a majority, and they continue running as a cluster. Nodes 4 and 5, being a minority, stop running as a cluster. If node 3 loses communication with other nodes, all nodes stop running as a cluster. However, all functioning nodes will continue to listen for communication, so that when the network begins working again, the cluster can form and begin to run.

Additional references

Select Quorum Options for a Failover Cluster – TechNet

Select Quorum Options for a Failover Cluster

Applies To: Windows Server 2008 R2

If you have special requirements or make changes to your cluster, you might want to change the quorum options for your cluster.

Important
In most situations, use the quorum configuration that the cluster software identifies as appropriate for your cluster. Change the quorum configuration only if you have determined that the change is appropriate for your cluster.

 

 

For important conceptual information about quorum configuration options, see Understanding Quorum Configurations in a Failover Cluster.

Membership in the local Administrators group on each clustered server, or equivalent, is the minimum required to complete this procedure. Also, the account you use must be a domain account. Review details about using the appropriate accounts and group memberships at Local and Domain Default Groups.

To select quorum options for a cluster
  1. In the Failover Cluster Manager snap-in, if the cluster that you want to configure is not displayed, in the console tree, right-click Failover Cluster Manager, click Manage a Cluster, and then select or specify the cluster that you want.

  2. With the cluster selected, in the Actions pane, click More Actions, and then click Configure Cluster Quorum Settings.

  3. Follow the instructions in the wizard to select the quorum configuration for your cluster. If you choose a configuration that includes a disk witness or file share witness, follow the instructions for specifying the witness.

  4. After the wizard runs and the Summary page appears, if you want to view a report of the tasks that the wizard performed, click View Report.

Additional considerations

  • To open the failover cluster snap-in, click Start, click Administrative Tools, and then click Failover Cluster Manager. If the User Account Control dialog box appears, confirm that the action it displays is what you want, and then click Yes.

Additional references

Understanding Quorum in a Failover Cluster – Clustering and High-Availability – MSDN Blogs

 

Understanding Quorum in a Failover Cluster

Hi Cluster Fans,

This blog post will clarify planning considerations around quorum in a Failover Cluster and answer some of the most common questions we hear.

The quorum configuration in a failover cluster determines the number of failures that the cluster can sustain while still remaining online. If an additional failure occurs beyond this threshold, the cluster will stop running. A common perception is that the cluster stops running when too many failures occur in order to prevent the remaining nodes from taking on too many workloads and becoming overcommitted. In fact, the cluster does not know your capacity limitations or whether you would be willing to take a performance hit in order to keep it online. Rather, quorum is designed to handle the scenario where there is a problem with communication between sets of cluster nodes, so that two servers do not try to simultaneously host a resource group and write to the same disk at the same time. This is known as “split brain”, and we want to prevent it to avoid any potential disk corruption from having two simultaneous group owners. With this concept of quorum, the cluster forces the cluster service to stop in one of the subsets of nodes to ensure that there is only one true owner of a particular resource group. Once the nodes that were stopped can communicate with the main group of nodes again, they will automatically rejoin the cluster and start their cluster service.

For more information about quorum in a cluster, visit: http://technet.microsoft.com/en-us/library/cc731739.aspx.

Voting Towards Quorum

Having ‘quorum’, or a majority of voters, is based on a voting algorithm in which more than half of the voters must be online and able to communicate with each other. Because a given cluster has a specific set of nodes and a specific quorum configuration, the cluster knows how many “votes” constitute a majority of votes, or quorum. If the number of voters drops below the majority, the cluster service stops on the nodes in that group. These nodes will still listen for the presence of other nodes, in case another node appears again on the network, but the nodes will not begin to function as a cluster until quorum exists again.

It is important to realize that the cluster requires more than half of the total votes to achieve quorum. This is to avoid having a ‘tie’ in the number of votes in a partition, since a majority always means that the other partition has less than half the votes. In a 5-node cluster, 3 voters must be online; yet in a 4-node cluster, 3 voters must also be online to have a majority. Because of this logic, it is recommended to always have an odd number of total voters in the cluster. This does not necessarily mean an odd number of nodes is needed, since either a disk or a file share can contribute a vote, depending on the quorum model.

A voter can be:

  • A node
    • 1 Vote
    • Every node in the cluster has 1 vote
  • A “Disk Witness” or “File Share Witness”
    • 1 Vote
    • Either 1 Disk Witness or 1 File Share Witness may have a vote in the cluster, but not multiple disks, multiple file shares nor any combination of the two
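
On Windows Server 2012 (and on Windows Server 2008 R2 with the hotfix from KB 2494036), these vote assignments can be inspected from PowerShell. This is purely an illustrative query, not a required configuration step:

 # Show the current quorum type and witness resource (if any)
 Get-ClusterQuorum

 # Show each node's vote: a NodeWeight of 1 means the node counts toward quorum
 Get-ClusterNode | Format-Table Name, State, NodeWeight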

Quorum Types

There are four quorum types.  This information is also available here: http://technet.microsoft.com/en-us/library/cc731739.aspx#BKMK_choices.

Node Majority

This is the easiest quorum type to understand and is recommended for clusters with an odd number of nodes (3-nodes, 5-nodes, etc.). In this configuration, every node has 1 vote, so there is an odd number of total votes in the cluster. If there is a partition between two subsets of nodes, the subset with more than half the nodes will maintain quorum. For example, if a 5-node cluster partitions into a 3-node subset and a 2-node subset, the 3-node subset will stay online and the 2-node subset will go offline until it can reconnect with the other 3 nodes.

Node & Disk Majority

This quorum configuration is most commonly used since it works well with 2-node and 4-node clusters which are the most common deployments.  This configuration is used when there is an even number of nodes in the cluster.  In this configuration, every node gets 1 vote, and additionally 1 disk gets 1 vote, so there is generally an odd number of total votes.

This disk is called the Disk Witness (sometimes referred to as the ‘quorum disk’) and is simply a small clustered disk in the Cluster Available Storage group. This disk is highly available and can fail over between nodes. It is considered part of the Cluster Core Resources group; however, it is generally hidden from view in Failover Cluster Manager since it does not need to be interacted with.

Since there are an even number of nodes and 1 additional Disk Witness vote, in total there will be an odd number of votes. If there is a partition between two subsets of nodes, the subset with more than half the votes will maintain quorum. For example, if a 4-node cluster with a Disk Witness partitions into two 2-node subsets, one of those subsets will also own the Disk Witness, so it will have 3 total votes and will stay online. The other 2-node subset will go offline until it can reconnect with the other 3 voters. This means that the cluster can lose communication with any two voters, whether they are 2 nodes, or 1 node and the Disk Witness.

Node & File Share Majority

This quorum configuration is usually used in multi-site clusters.  This configuration is used when there is an even number of nodes in the cluster, so it can be used interchangeably with the Node and Disk Majority quorum mode.  In this configuration every node gets 1 vote, and additionally 1 remote file share gets 1 vote.

This file share is called the File Share Witness (FSW) and is simply a file share on any server in the same AD forest that all the cluster nodes have access to. One node in the cluster will place a lock on the file share to consider itself the ‘owner’ of that file share, and another node will grab the lock if the original owning node fails. On a standalone server, the file share by itself is not highly available; however, the file share can also be put on a clustered file share on an independent cluster, making the FSW clustered and giving it the ability to fail over between nodes. It is important that you do not put this vote on a node in the same cluster, nor within a VM running on the same cluster, because losing that node would cause you to lose the FSW vote, meaning two votes would be lost from a single failure. A single file server can host multiple FSWs for multiple clusters.

Generally, multi-site clusters have two sites with an equal number of nodes at each site, giving an even number of nodes. By adding this additional vote at a 3rd site, there is an odd number of votes in the cluster, at very little expense compared to deploying a 3rd site with an active cluster node and a writable DC. This means that either site or the FSW can be lost and the cluster can still maintain quorum. For example, in a multi-site cluster with 2 nodes at Site1, 2 nodes at Site2 and a FSW at Site3, there are 5 total votes. If there is a partition between the sites, one of the nodes at a site will own the lock on the FSW, so that site will have 3 total votes and will stay online. The 2-node site will go offline until it can reconnect with the other 3 voters.
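
For reference, creating the witness share on a file server at the third site might look roughly like the sketch below. The server, folder, share and cluster names are all hypothetical, and the cluster's computer account (the CNO, shown here as CONTOSO\Cluster1$) also needs matching NTFS permissions on the folder; the quorum model itself is then switched as described in the Changing Quorum Types section later in this post:

 # On the file server at the third site: create the folder and share it to the cluster's computer account
 New-Item -Path "C:\FSW\Cluster1" -ItemType Directory
 New-SmbShare -Name "Cluster1FSW" -Path "C:\FSW\Cluster1" -FullAccess 'CONTOSO\Cluster1$'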

Legacy: Disk Only

Important: This quorum type is not recommended as it has a single point of failure.

The Disk Only quorum type was available in Windows Server 2003 and has been maintained for compatibility reasons; however, it is strongly recommended never to use this mode unless directed by a storage vendor. In this mode, only the Disk Witness contains a vote and there are no other voters in the cluster. This means that if the disk becomes unavailable, the entire cluster will go offline, so this is considered a single point of failure. Some customers nevertheless choose to deploy this configuration to get a “last man standing” setup in which the cluster remains online as long as any one node is still operational and can access the cluster disk. With this deployment objective, however, it is important to consider whether that last remaining node can even handle the capacity of all the workloads that have moved to it from other nodes.

Default Quorum Selection

When the cluster is created using Failover Cluster Manager, Cluster.exe or PowerShell, the cluster will automatically select the best quorum type for you to simplify the deployment.  This choice is based on the number of nodes and available storage.  The logic is as follows:

  • Odd Number of Nodes – use Node Majority
  • Even Number of Nodes
    • Available Cluster Disks – use Node & Disk Majority
    • No Available Cluster Disk – use Node Majority

The cluster will never select Node and File Share Majority or Legacy: Disk Only.  The quorum type is still fully configurable by the admin if the default selections are not preferred.
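
The default-selection logic above is simple enough to express in a few lines of PowerShell. The function below is purely illustrative, mirroring the decision list rather than calling anything the cluster actually runs:

 # Illustrative only: mirrors the default quorum selection logic described above
 function Get-DefaultQuorumType {
     param(
         [int]$NodeCount,
         [bool]$HasAvailableClusterDisk
     )

     if ($NodeCount % 2 -eq 1) { return 'NodeMajority' }             # odd number of nodes
     if ($HasAvailableClusterDisk) { return 'NodeAndDiskMajority' }  # even nodes, disk available
     return 'NodeMajority'                                           # even nodes, no disk available
 }

 Get-DefaultQuorumType -NodeCount 4 -HasAvailableClusterDisk $true   # returns NodeAndDiskMajority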

Changing Quorum Types

Changing the quorum type is easy through Failover Cluster Manager. Right-click the name of the cluster, select More Actions…, then select Configure Cluster Quorum Settings… to launch the Configure Cluster Quorum Wizard. From the wizard it is possible to configure all four quorum types and to change the Disk Witness or File Share Witness. The wizard will even tell you the number of failures that can be sustained based on your configuration.

For a step-by-step guide of configuring quorum, visit: http://technet.microsoft.com/en-us/library/cc733130.aspx.
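
The same changes can also be scripted with the Set-ClusterQuorum cmdlet from the FailoverClusters module; the disk and share names below are placeholders for your own witness resources:

 # Node Majority (no witness)
 Set-ClusterQuorum -NodeMajority

 # Node and Disk Majority, using a small clustered disk as the witness
 Set-ClusterQuorum -NodeAndDiskMajority "Cluster Disk 1"

 # Node and File Share Majority, using a remote file share as the witness
 Set-ClusterQuorum -NodeAndFileShareMajority "\\FileServer1\FSW"

 # Legacy Disk Only (not recommended – single point of failure)
 Set-ClusterQuorum -DiskOnly "Cluster Disk 1"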

Thanks!
Symon Perriman
Technical Evangelist
Private Cloud Technologies
Microsoft

  • Hi, can you shed some light on the new switch to start the cluster (/PQ) and the NodeWeight concept described in KB 2494036?

    Thanks

  • Hi, I would like to know how to migrate the quorum disk. Also, is downtime required for the migration?

  • Hi, can I configure 2 FSWs in a cluster (from a 3rd and 4th location)? This is not to increase the votes but just to have high availability at the share level: if the connection to Site3 is lost, the Site4 FSW provides a vote, and if the connection to Site4 is lost, the Site3 FSW provides the vote.

  • You can only configure 1 FSW per cluster.

    Please look at the ‘Node & File Share Majority’ section above to understand how the quorum votes are calculated.

    If your cluster nodes are up and running and you lose connectivity to the File Share Witness (3rd site), the cluster will continue to run provided you have enough cluster nodes up and running.

    Thanks,

    Amitabh

  • None of the MS documentation makes this clear to me. If I have a two-node SQL cluster and it’s set as Node & Disk Majority, then if the disk goes offline, does the cluster stay up because the nodes are still running? How is this supposed to work? If the disk is offline then most likely your data drive is too, in which case SQL isn’t going to run well. Can somebody clear this up for me?

  • The quorum disk is a witness disk that holds an extra copy of the cluster database; this helps cluster availability. Normally this disk is not used for any other purpose. It doesn’t mean the other disks will be offline or failed if the quorum disk fails. What disks, and how many, are in your cluster?
