How to Use Windows Virtual Desktop To Improve Security and Control Disaster Recovery Costs







Many of our customers run production workloads in Azure but have limited budgets for a full second regional deployment. We believed Azure regional disaster recovery (DR) built on automation and Platform-as-a-Service (PaaS) could help. Rather than deploying DR one-to-one, with high-cost storage and server capacity sitting online but unused, we wanted to use as much automation and PaaS as possible to see how far we could limit DR costs. Active/active DR can represent a significant ongoing expense when compared to activating, and paying for, capacity only while testing or during a real disaster.

How does Windows Virtual Desktop (WVD) play into all of this? The first and most obvious use case is that a WVD-based solution is part of a client's production environment, and therefore needs to be represented cost-effectively in the organization's DR plan. The second opportunity is that a WVD deployment consists of a large mix of differing solutions within the Azure ecosystem, leaving open a plethora of Azure services to address. One item of particular interest was Microsoft's Azure-exclusive "Windows 10 multi-session" support, which allows multiple users to work off a single desktop OS at the same time; before this became available, only Microsoft server operating systems could provide multi-session support. Given this backdrop of possible use cases, and knowing how Microsoft continues to evolve its WVD licensing approach, we began to experiment with the inner workings and capabilities of WVD and how to protect them.


Deployment

One big goal for both production and DR in our experiment was to take advantage of as much PaaS as possible, and we also wanted a deep dive into the DR options around each of these services. While Windows Virtual Desktop is itself a PaaS service, several other services and solutions are required to create a usable product for end users. These include storage for software, scripts, and user profile data; Active Directory for authentication and authorization; and automation such as ARM (Azure Resource Manager) templates, Azure Automation Accounts, and Group Policies. There are also some additional considerations, covered later, driven by conditional access policy requirements. Before digging deeper into Azure regional DR, we need to go into the details of the production deployment. During our experiment we had to resolve some unexpected issues as we progressed; a few avenues had limitations that drove us in other directions, and we will highlight these accordingly. Below is a simple high-level diagram of our design, which should be helpful as we get into the details of the major components as well as some of the options for DR.






Active Directory

Prior to deploying any resources or Host Pool worker nodes (VDI hosts), authentication and authorization needed to be addressed for both the production and DR regions. Like many organizations, we use Azure Active Directory as our identity source. In the overall design this is a point of failure we had to accept: if Azure Active Directory goes down, several other Azure and Microsoft resources would be unavailable anyway, so protecting our deployment from a failure at that level would provide little benefit for our assumed scenarios.

On top of Azure Active Directory, WVD needs a Windows Server Active Directory join for Host Pool worker nodes, and syncing that directory with Azure AD is a requirement for WVD to supply a seamless login experience. We used Azure Active Directory Domain Services (AAD DS) for this function. AAD DS is a PaaS service that supplies many of the services and functions traditional Active Directory provides. One important caveat: AAD DS does not support MSIX, because it lacks computer object sync with Azure AD. That was something we didn't discover until later in our testing; had we gone with traditional AD we would not have been limited. (MSIX is a modern application packaging format that allows layered applications to be added to WVD images.)

The MSIX issue aside, what is great about AAD DS is that there are no servers to manage; Microsoft does all of that for you. This is just as beneficial for DR, since added regions carry no management overhead. For each region where you need domain services, you add a replica-set within AAD DS and that's it; each replica-set deployment includes all the resources the PaaS service needs to function in that region. This setup significantly limited our need for additional supporting servers. One note: we did require an AAD DS SKU higher than Standard, such as Enterprise or Premium, to support replica-sets. Luckily, upgrading an already deployed Standard SKU is just a drop-down menu choice; just be aware of the added SKU cost when doing a cost analysis against any POC or pilot that did not include the DR requirements up front. If you're interested in more details, below is an ARM template example of an AAD DS deployment with the required "Enterprise" SKU parameter defaulted to include replica-set support:

{
  "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "apiVersion": {
      "type": "string"
    },
    "sku": {
      "type": "string",
      "defaultValue": "Enterprise"
    },
    "domainConfigurationType": {
      "type": "string"
    },
    "domainName": {
      "type": "string"
    },
    "filteredSync": {
      "type": "string"
    },
    "location": {
      "type": "string"
    },
    "notificationSettings": {
      "type": "object"
    },
    "subnetName": {
      "type": "string"
    },
    "tags": {
      "type": "object"
    },
    "vnetName": {
      "type": "string"
    },
    "vnetAddressPrefixes": {
      "type": "array"
    },
    "subnetAddressPrefix": {
      "type": "string"
    },
    "nsgName": {
      "type": "string"
    }
  },
  "resources": [
    {
      "apiVersion": "2021-03-01",
      "type": "Microsoft.AAD/DomainServices",
      "name": "[parameters('domainName')]",
      "location": "[parameters('location')]",
      "tags": "[parameters('tags')]",
      "dependsOn": [
        "[concat('Microsoft.Network/virtualNetworks/', parameters('vnetName'))]"
      ],
      "properties": {
        "domainName": "[parameters('domainName')]",
        "filteredSync": "[parameters('filteredSync')]",
        "domainConfigurationType": "[parameters('domainConfigurationType')]",
        "notificationSettings": "[parameters('notificationSettings')]",
        "replicaSets": [
          {
            "subnetId": "[concat('/subscriptions/', subscription().subscriptionId, '/resourceGroups/', resourceGroup().name, '/providers/Microsoft.Network/virtualNetworks/', parameters('vnetName'), '/subnets/', parameters('subnetName'))]",
            "location": "[parameters('location')]"
          }
        ],
        "sku": "[parameters('sku')]"
      }
    },
    {
      "type": "Microsoft.Network/NetworkSecurityGroups",
      "name": "[parameters('nsgName')]",
      "location": "[parameters('location')]",
      "apiVersion": "2019-09-01",
      "properties": {
        "securityRules": [
          {
            "name": "AllowPSRemoting",
            "properties": {
              "access": "Allow",
              "priority": 301,
              "direction": "Inbound",
              "protocol": "Tcp",
              "sourceAddressPrefix": "AzureActiveDirectoryDomainServices",
              "sourcePortRange": "*",
              "destinationAddressPrefix": "*",
              "destinationPortRange": "5986"
            }
          },
          {
            "name": "AllowRD",
            "properties": {
              "access": "Allow",
              "priority": 201,
              "direction": "Inbound",
              "protocol": "Tcp",
              "sourceAddressPrefix": "CorpNetSaw",
              "sourcePortRange": "*",
              "destinationAddressPrefix": "*",
              "destinationPortRange": "3389"
            }
          }
        ]
      }
    },
    {
      "type": "Microsoft.Network/virtualNetworks",
      "name": "[parameters('vnetName')]",
      "location": "[parameters('location')]",
      "apiVersion": "2019-09-01",
      "dependsOn": [
        "[concat('Microsoft.Network/NetworkSecurityGroups/', parameters('nsgName'))]"
      ],
      "properties": {
        "addressSpace": {
          "addressPrefixes": "[parameters('vnetAddressPrefixes')]"
        },
        "subnets": [
          {
            "name": "[parameters('subnetName')]",
            "properties": {
              "addressPrefix": "[parameters('subnetAddressPrefix')]",
              "networkSecurityGroup": {
                "id": "[concat('/subscriptions/', subscription().subscriptionId, '/resourceGroups/', resourceGroup().name, '/providers/Microsoft.Network/NetworkSecurityGroups/', parameters('nsgName'))]"
              }
            }
          }
        ]
      }
    }
  ],
  "outputs": {}
}


Azure Storage Account

An Azure Storage Account is a construct that houses Azure storage services such as Azure Files and Blob Storage; it was a core part of the deployment, and we used it in multiple ways. First, we used Azure Files for FSLogix profile storage. FSLogix creates a dedicated virtual differencing disk for each user to provide a roaming profile as the user moves between worker nodes over time. This disk is layered on top of the running WVD worker node image to supply all the user's desktop customizations and a consistent experience no matter which worker node the user is assigned. Microsoft documents a detailed process for provisioning FSLogix on Azure Files shares within a storage account. We had no issues getting this up and running from the documentation, but we did need to make some tweaks with GPOs and the Administrative Templates supplied by Microsoft. The second core use was Azure Blob Storage, which held the static executables, files, and scripts needed during automated creation of the WVD worker nodes; most of these we called by way of Group Policy.
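The FSLogix redirection itself boils down to a couple of registry values on each worker node. As a sketch, the same settings our GPOs applied could instead be set at deployment time with a custom script extension in the worker node ARM template; the storage account and share names below are hypothetical placeholders:

```json
{
  "type": "Microsoft.Compute/virtualMachines/extensions",
  "name": "[concat(parameters('vmName'), '/fslogixProfiles')]",
  "apiVersion": "2021-03-01",
  "location": "[parameters('location')]",
  "dependsOn": [
    "[concat('Microsoft.Compute/virtualMachines/', parameters('vmName'))]"
  ],
  "properties": {
    "publisher": "Microsoft.Compute",
    "type": "CustomScriptExtension",
    "typeHandlerVersion": "1.10",
    "autoUpgradeMinorVersion": true,
    "settings": {
      "commandToExecute": "reg add HKLM\\SOFTWARE\\FSLogix\\Profiles /v Enabled /t REG_DWORD /d 1 /f && reg add HKLM\\SOFTWARE\\FSLogix\\Profiles /v VHDLocations /t REG_MULTI_SZ /d \\\\mystorageacct.file.core.windows.net\\profiles /f"
    }
  }
}
```

We stayed with GPO so the setting followed the machine account rather than the template, but either approach ends up writing the same `Enabled` and `VHDLocations` values.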


As with anything that requires resiliency, Microsoft recommends geo-redundant storage for anything needing protection beyond a single region. This type of storage account replicates data not only three times locally within the region but also to a remote region, enabling failover with minimal data loss via asynchronous replication between regions. For FSLogix profile storage, the potential data loss from asynchronous replication should have limited impact. So far we have used OS images directly from Microsoft's public gallery; if customized OS images become a requirement for future use cases, storing them on Azure Blob Storage protected in the same manner will protect those images as well.
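For reference, the storage account resource behind this pattern is small. Below is a sketch of a geo-redundant StorageV2 account as it might appear in an ARM template; the parameter names are placeholders of our choosing:

```json
{
  "type": "Microsoft.Storage/storageAccounts",
  "apiVersion": "2021-04-01",
  "name": "[parameters('storageAccountName')]",
  "location": "[parameters('location')]",
  "kind": "StorageV2",
  "sku": {
    "name": "Standard_GRS"
  },
  "properties": {
    "supportsHttpsTrafficOnly": true,
    "minimumTlsVersion": "TLS1_2"
  }
}
```

The `sku.name` value is what drives the redundancy level; swapping `Standard_GRS` for `Standard_LRS` is all it takes to lose the cross-region protection, which is exactly what happens silently after a failover as described below.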

We ran across a gotcha worth keeping an eye on, related to continued protection after a failover of an Azure Storage Account. By default, after any geo-replicated storage account is failed over, the account's replication reverts to local redundancy only. In our case we had to re-configure the account for geo-redundant storage after the failover completed in order to retain protection and enable failback.


Automation

For our deployment we wanted to use images directly from the Microsoft image gallery. Microsoft publishes a Windows 10 Enterprise multi-session + M365 Apps image ready for use with WVD. Using this image directly from Microsoft provided many benefits, but the semi-ephemeral way we planned to create and manage our WVD worker nodes meant we had to automate nearly every custom configuration beyond the base image. We went into this aware that custom images stored on storage blobs are supported for WVD, but one of our objectives was to recreate worker nodes from scratch in lieu of an image update process; we really wanted to limit the care and feeding of gold images. While some month-to-month updates may be pushed directly to running worker nodes, we planned to pick up most updates through automated node recreation, taking advantage of Microsoft's monthly maintenance of its gallery images.
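Pointing a worker node ARM template at the gallery image is just an `imageReference` block in the VM's `storageProfile`. The SKU below was one of the Windows 10 Enterprise multi-session + M365 Apps SKUs available at the time we ran this; the current SKU names may differ, so treat it as an example rather than a fixed value:

```json
"imageReference": {
  "publisher": "MicrosoftWindowsDesktop",
  "offer": "office-365",
  "sku": "20h2-evd-o365pp",
  "version": "latest"
}
```

Using `"version": "latest"` is what lets node recreation automatically pick up Microsoft's monthly image maintenance without any gold image upkeep on our side.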


There is no single “best” way to make all the gears come together to create the automation needed to support the various requirements we had for our deployment. We went forward with a mixed brew of both traditional system management with some added cloud-based automation. This required a few differing options and thought processes around DR.

The automation we used for deployment and updating of the worker nodes themselves was all based on ARM templates. ARM templates can be deployed many ways, and many organizations have already invested in higher-level automation. For our experiment, having basic ARM templates stored in a secure location met the requirements, so we used GitHub as the repository, knowing we could still use GitHub Actions or another CI/CD solution to take it to another level; that was not yet a requirement.

With our ARM templates secured and protected in a code repository, we could spin up worker nodes on the fly, but what about application installation and configuration? For these tasks we went with a traditional solution: Active Directory Group Policy. With Group Policy we configured "run-once" scheduled tasks and applied item-level targeting to skip installation when the targeted paths already exist, preventing repeated re-installation of applications on Group Policy refreshes. As covered in the Active Directory section, the DR design here is already established, including the replication of all Group Policy Objects (GPOs) to all replica-sets. While the GPOs replicate as-is, you may want different settings in a DR site, which was a requirement in our deployment. We addressed this by creating dedicated OUs for each of our planned VDI production and DR sites and tailoring a slightly modified second set of Group Policies for DR; not all use cases may require this. By updating our ARM templates for the DR worker node deployments, we could target the specific OUs whose configurations addressed the new regional requirements. In both our production and DR regions we use Azure Key Vault (AKV) to store any credentials our ARM templates require for the AD join of the worker nodes.
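As an illustration of the Key Vault integration, the domain-join password never appears in our templates or repository; the ARM parameter file references the secret instead. The subscription, resource group, vault, and secret names below are placeholders:

```json
"domainJoinPassword": {
  "reference": {
    "keyVault": {
      "id": "/subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/rg-wvd-core/providers/Microsoft.KeyVault/vaults/kv-wvd-example"
    },
    "secretName": "domainJoinPassword"
  }
}
```

Because the reference is resolved at deployment time, the same template works in the DR region by pointing the parameter file at a vault in that region.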

While these automation tools covered a lot of our needs, we still had a few additional requirements to address. Most important for cost savings was powering worker nodes up and down on a schedule. The workforce for our use case was US east coast based and very much 9-5, which meant automation could keep these systems powered off during off-hours to keep costs down. For this we used Azure Automation Account Runbooks; there are many pre-built Runbooks available in the gallery directly from Microsoft, which are easy to deploy and include VM power control among other automation options. For DR, Automation Accounts can be built ahead of time and can use tagging and resource group placement to apply against resources that may not exist yet, as was our case. Additionally, in our stress testing we found that adding GPU acceleration to our worker nodes significantly improved performance and increased user density at limited additional cost. This required specific drivers to be installed on the worker nodes after deployment. For anyone who has managed video card drivers before, they update constantly, and maintaining a current copy of the drivers can be a task in itself. To solve this, Azure provides VM extensions to automate the process; below is an example we added to the Resources section of our WVD worker node ARM templates, which deploys the latest drivers for the AMD video cards used by our Azure NVv4 virtual machines.

{
  "name": "myExtensionName",
  "type": "extensions",
  "location": "[resourceGroup().location]",
  "apiVersion": "2015-06-15",
  "dependsOn": [
    "[concat('Microsoft.Compute/virtualMachines/', parameters('vmName'))]"
  ],
  "properties": {
    "publisher": "Microsoft.HpcCompute",
    "type": "AmdGpuDriverWindows",
    "typeHandlerVersion": "1.0",
    "autoUpgradeMinorVersion": true,
    "settings": {}
  }
}
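The time-based power management mentioned above can likewise be declared up front rather than clicked together in the portal. Below is a sketch of an Automation Account schedule resource that a start/stop Runbook job can be linked to; the account name, schedule name, start time, and time zone are hypothetical values for illustration:

```json
{
  "type": "Microsoft.Automation/automationAccounts/schedules",
  "apiVersion": "2020-01-13-preview",
  "name": "[concat(parameters('automationAccountName'), '/NightlyShutdown')]",
  "properties": {
    "description": "Stop WVD worker nodes outside of business hours",
    "startTime": "2021-06-01T19:00:00-04:00",
    "frequency": "Day",
    "interval": 1,
    "timeZone": "America/New_York"
  }
}
```

Deploying the schedule (and its Runbook link) as part of the DR template set means the cost controls come up with the region instead of being a post-failover task.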


Connectivity and Conditional Access

A few roadblocks we had to work out during initial testing were related to existing conditional access policies that enforce Microsoft Intune requirements. Intune support for Windows 10 multi-session was not available at the time of our experiment, so we needed work-around policies to allow WVD worker nodes access to Office and SharePoint data. First, we needed a mechanism to bypass the existing Intune conditional access requirements. Of the policy-exclusion options available, after reviewing the pros and cons we found the best fit was to create a NAT Gateway for this deployment with a dedicated public IP assigned. This gave us a single source IP we could use in the exclude options of the existing policies that had to be bypassed due to the lack of Intune support for Windows 10 multi-session. Once that was validated, we created a policy targeted at a specific cloud application, in our case "Windows Virtual Desktop", which let us apply additional custom policies directly to this service without impacting any existing deployed assets. One additional DR consideration was to pre-build the NAT Gateway and public IP addressing in the DR site. This allowed us to pre-populate the DR IP addressing in our bypass policies, ensuring the access policies would function as expected in the event of a disaster failover.
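The NAT Gateway piece is small enough to pre-stage in both regions. Below is a sketch of the gateway with a static public IP attached, as the two resources might appear in a template's resources section; the parameter names are placeholders:

```json
{
  "type": "Microsoft.Network/publicIPAddresses",
  "apiVersion": "2020-06-01",
  "name": "[parameters('publicIpName')]",
  "location": "[parameters('location')]",
  "sku": { "name": "Standard" },
  "properties": {
    "publicIPAllocationMethod": "Static"
  }
},
{
  "type": "Microsoft.Network/natGateways",
  "apiVersion": "2020-06-01",
  "name": "[parameters('natGatewayName')]",
  "location": "[parameters('location')]",
  "sku": { "name": "Standard" },
  "dependsOn": [
    "[resourceId('Microsoft.Network/publicIPAddresses', parameters('publicIpName'))]"
  ],
  "properties": {
    "publicIpAddresses": [
      { "id": "[resourceId('Microsoft.Network/publicIPAddresses', parameters('publicIpName'))]" }
    ]
  }
}
```

Because the public IP is static, the DR region's egress address is known at build time and can be added to the conditional access exclusions long before any failover.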







Management Server

Microsoft requires a server OS to install several of the Active Directory administrative tools; because our worker nodes run Windows 10 multi-session, they do not support installing the required roles for these tools. As a solution, we created a small virtual machine instance that we power on as needed to manage AAD DS, Group Policy, and other management functions requiring a server OS. While nothing on this server is required for DR, a like-for-like management system is needed to manage the deployment in a secondary region in the event of a disaster. The best option for us was to create a new dedicated virtual machine in the DR site on the fly, during or after a failover, to avoid maintaining and paying for another server in a remote site. This can impact RTO; if that is a concern in your deployment, it may be better to keep the system running in DR or use a tool like Azure Site Recovery to protect a management machine for failover.


User Entry Point

With Windows Virtual Desktop, all clients subscribe to a service URL or use the web client URL, and these URLs are the same for all users regardless of region. This allowed us to keep all our existing client configurations in the event of a failover. Additionally, because we used ARM templates (stored in GitHub), base images (directly from the Microsoft gallery), and Group Policy (replicated as part of the AAD DS replica-sets), the core Windows Virtual Desktop configurations and images are inherently protected against a regional disaster.


Ending Thoughts

The use case for this type of DR model certainly extends beyond WVD, but investigating it with WVD as the centerpiece proved out the concept of cost-effective DR that combines limited replication with automated provisioning. Leveraging PaaS offerings that include availability and recovery features, and utilizing Azure's native software-defined infrastructure, are two practices that will set you up for success in both production and disaster recovery environments. Additionally, although it was not a primary objective of our investigation, it is clear that WVD can be an enabler for moving latency-sensitive applications from on-prem into the cloud.


In an always evolving landscape you might find you are already entitled to WVD by way of your Microsoft licensing. Consistent with their approach of bundling capability into their subscription licensing, Microsoft has included the user connection licensing for Windows Virtual Desktop in many of Microsoft 365 product suites, as well as adding in