Welcome!

Industrial IoT Authors: Pat Romanski, Elizabeth White, Maria C. Horton, Liz McMillan, Ed Featherston

Blog Feed Post

Lync 2013 High Availability deep dive: Architecture

With the release of Lync 2013 the story of high availability has changed significantly. Since these changes contain some new technologies I think it would be best to break this article out into multiple parts. For this first part, we will discuss the Architecture and that will be followed by the end user experience. There is a lot to discuss so let's get started.

Windows Fabric

One of the major changes is how Lync provides highly available services with a pool. In Lync 2010 this responsibility was handled by Cluster Manager. With Lync 2013 we now leverage the Windows Fabric to provide high availability to these services. The Windows Fabric services allows Lync to increase our scale for both On-Premise and Online deployments. Windows Fabric will handle the following:

  • Election of Primary, Secondary, and Backup Secondary replica
  • Failover Management including Fast Failover with full service
  • Replication between primary and secondary replicas
  • Manages and Monitors which servers are functional and participating in pool operations
  • Services for MCU Factory, Conference Directory, Routing Group, LYSS
  • Load Balancing within a pool

Data Replication within the Pool

  1. Persistent User Data – synchronous replication to all replicas
    1. Presence, Contacts and Groups, and User Voice settings
    2. Minimal Conference data including the Active Conference Roster and MCUs
    3. Transient User Data – not replicated across all replicas
      1. Presence changes due to user activity including Calendar, Inactivity, and Phone Calls.

The fabric will also lazily write this data to the backend end databases. Doing this lazily means when there are free cycles we will use them to replicate this data instead of impacting the server when its busy doing other replication to front ends. The backend data can be used for both rehydration of front end servers (restarts) and disaster recovery (pool failovers and failbacks).

The following ports must be opened between the Lync front end servers within a pool for the fabric replication process to be successful.

  • Federation (TCP/5090) - Windows Fabric ring communication.
  • Lease Agent (TCP/5091)
  • Client Connection (TCP/5092) - These are not user clients but Lync Services that register to Windows Fabric.
  • Runtime Service (TCP/5093)
  • Replication (TCP/5094)

 

Quorum

In order to function the Windows Fabric needs to achieve a status of Quorum. This means a certain number of servers (Voters) must be able to communicate with each other to ensure that the pool is capable of maintaining consistency (Quorum). There are 2 different pool states in which we must maintain\achieve quorum which differ in how many voters must be available.

  1. Cold Pool Startup\Quorum Lost – when a pool is first starting or after quorum has been lost, that pool must have roughly 85% of its servers available to achieve quorum1.
  2. Operating Pool – after a pool has been started we need at least 50% of the servers available to maintain quorum2.

1 This means that when defining a pool within topology builder only add the servers that will be deployed at that time, not in the future. If you put 6 servers in a pool within topology builder and only deploy 3, we will never achieve quorum because we don’t have approximitley 85% of the servers operational. Thus, the front end services will never start.

2 If quorum is lost then all Front End services will be shutdown to protect the cluster from data loss.

To identify the Quorum voters a given servers Windows Fabric identifies check C:\ProgramData\Windows Fabric\Settings.xml. This file is updated when the Windows Fabric Host service starts up, thus any changes to pool members requires a service restart on all Front Ends in the pool, one at a time. The To-Do list in topology builder lets you know that after publishing the topology. If this isn't completed your servers will have a skewed view of the topology which could cause issues.  Depending on the number of servers within the pool, this will dictate which voters are required.

  1. Even # Server Pool (Figure 1) – each server will have 1 vote and the SQL instance will have 1 vote. The SQL server will act as a tiebreaker.
  2. Odd #Server Pool (Figure 2) – each server will have 1 vote.

Figure 1: Even number of servers in Front End Pool

Figure 2: Odd number of servers in Front End pool

 

Routing Groups

When users are provisioned for Lync they are assigned to Routing groups. Each routing group will contain 3 replicas: primary, secondary, and backup secondary which together are called a replica set. User services for all users will be handled by their primary replica. All users in the same routing group will be homed on the primary replica. If the primary replica fails, all users in that Routing Group will failover to the same secondary replica.

Routing groups will get rebalanced when Front End servers are added or removed from the pool. If the primary replica goes down another will be promoted to primary. Depending on both user count and how the primary became unavailable will dictate when the Windows Fabric rebuilds another replica. The replica knows the pools user count and if it decides creating another replica is too expensive of an operation it might wait 15-30 minutes. Also, if the primary server was taken down gracefully for patching (stop-cswindowsservice –graceful), Windows Fabric might wait longer than if the server crashed or was just powered off. We don’t have any control over how Windows Fabric handles these operations.

In regards to SBA and SBS as it pertains to routing groups each unit gets its OWN unique routing group (by design). Those would be in addition to the routing created for users homed on that pool. This means that if you have 3000 users on an SBS all of those users will have the same primary, secondary, and backup secondary replicas. The reason this is important is because this needs to be factored into a design when building out a highly available infrastructure that can support all Lync services.

 

Quorum Loss Recovery

If we lose all replicas within a routing group before Windows Fabric has a chance to rebuild more the replica set will go into quorum loss. When that routing group is in this type of quorum loss users will NOT be able to login. If 2 replicas are lost (1 remaining online) in a short period of time we will also go into quorum loss but users will be logged in with limited functionality mode (no contact list or presence but can still IM). The routing group will stay in that state until either the servers come back online or we manually run Reset-CsPoolRegistrarState with the QuorumLossRecovery option.  Remember these scenarios will not affect all users but only the users that were a part of that routing group.

You can verify the routing groups are functioning correctly by running the Get-CsPoolFabricState command (Figure 3). If there are routing groups in QuorumLoss you will see services that reference "no primaries" (Figure 4).

 

Figure 3: Get-CsPoolFabricState healthy

Figure 4: Get-CsPoolFabricState with routing groups in QuorumLoss

In order to recover from quorum loss you will need to run the Reset-CsPoolRegistrarState cmdlet with a –ResetType of QuorumLossRecovery (Figure 5).

Figure 5: Reset-CsPoolRegistrarState QuorumLossRecovery

The Reset-CsPoolRegistrarState cmdlet has multiple parameters which can be used for various changes to the cluster.

Reset-CsPoolRegistrarState –PoolFqdn <pool name fqdn> –ResetType {FullReset | QuorumLossRecovery | ServiceReset | MachineStateRemoved}

  • FullReset1 - cluster changes, any modification either bringing a pool into or out of 1-2 server pool, and changes to upgrade domains. In addition, this parameter rebuilds the local Lync Server databases. This type of reset can be potentially long and resource-intensive.
  • QuorumLossRecovery - Reloads user data from the backup store for any routing groups currently in quorum loss. (A quorum loss occurs when neither a database nor its replicas are available.) Data not yet written to the database could be lost when you do this type of reset.
  • ServiceReset1 - The RtcSrv and fabricHostSvc services are stopped and restarted. A service reset will be performed if the ResetType is not specified.
  • MachineStateRemoved - Removes the specified server from the pool. This type of reset should be used only when the server in question (or its databases) have been permanently lost.

1 Both the FullReset and ServiceReset will cause services to restart which will cause outages. This is why it's not a good idea to add front end servers to the pool during your production hours. When you add server it states you must restart front end services on all front ends in the pool.

To verify which replicas a user's data resides on we can run the Get-CsUserPoolInfo cmdlet (Figure 3). This is a useful command that will show us a ton of useful info about our registrars. If there aren't any servers listed then the users routing group might be in quorum loss.

Figure 6: Get-CsUserPoolInfo cmdlet

 

Upgrade Domains

Upgrade Domains are logical groupings of servers on which software maintenance such as upgrades and updates are performed at the same time. If you don’t adhere to upgrade domain guidelines for server patching, you will impact routing groups and potentially the pool itself. A routing group will never have two of its three servers in a single upgrade domain.

Depending on how many Front End servers are in a pool when it's created will dictate how many servers will be in each upgrade domain. The algorithm used to generate the upgrade domains is implemented by Topology Builder and can be viewed in the tbxml file after creating the topology. It is not supported to manually edit the tbxml file and change the upgrade domains. In general creating more servers in the pool will result in more servers per upgrade domain. This why it is critical to plan out your front ends in the design phase instead of adding servers here and there. If servers have to be added due to growth then there is nothing that can be done to decrease the amount of upgrade domains windows fabric creates.

So how do we monitor these upgrade domains and incorporate them into our patch process? We can use the Get-CsPoolUpgradeReadinessState cmdlet to display this data (Figure 4). There are a couple different upgrade domain states we might see which are listed below and how to handle each one.

  1. Busy – wait 10 minutes and check again. The fabric might still be balancing services which can cause this to read busy
  2. Busy 3x or InsufficientActiveFrontEnds1– there might be an insufficient amount of active front end servers in the pool or something is wrong with pool\upgrade domain. If this is the case check out Event Viewer to see what issues are occurring on these servers.
  3. Ready – Drain (Stop-CsWindowsService), Patch, and then reboot the server

After you have rebooted the server, just wait a couple of minutes before running the Get-CsPoolUpgradeReadinessState cmdlet again to find out if the pool is ready to take another upgrade domain down. Remember you should only patch ONE upgrade domain at a time.

1 If your front end pool only has 1 servers you will always get InsufficientActiveFrontEnds. In this case you know there will be an outage so you can just patch and restart that server.

 

Figure 7: Get-CsPoolUpgradeReadinessState with expanded properties

 

Best Practices "Amount of Servers in Pool"

Our recommendation is to build pools with a minimum of 3 servers. Although we support 2 server pools there are considerations with this configuration which everyone should be aware of.

  • The first issue is centered on how servers are rebooted and patched. If you following our best practice recommendation (http://technet.microsoft.com/en-us/library/gg412996.aspx) for patching it states both front end servers should be rebooted at the same time. If at any time both servers are taken offline at the same time you must bring them up in reverse order. In other works if you reboot server 1 then server 2 you must bring server 2 back up FIRST. The reason for this is because server 2 would be the last node to write quorum info to the backend and the cluster would not make quorum unless he was started first. If reverse order is not possible, use the Reset-CsPoolRegistrarState cmdlet with the QuorumLossRecovery parameter we discussed above.
  • Also keep in mind the fact that you will only have 2 copies of your data (Primary and Secondary).

 
 Troubleshooting Service Startup

  • If a routing group is down, RtCSrv will not start. (32169 - Server startup is being delayed because fabric pool manager is initializing). Use Reset-CsPoolRegistrarState cmdlet with the QuorumLossRecovery parameter
  • If a pool has gone down and stopped services, it requires 85% of servers to start back up.

 

SQL Queries

List Routings Group Names, Users in those groups, and the primary server for each routing group ordered by home server (Figure 5). This should be run against a front end server rtclocal sql instance dbo.Resource table.

Select FrontEnd.Fqdn, RoutingGroupAssignment.RoutingGroupName, Resource.UserAtHost

From ResourceDirectory

INNER JOIN Resource

ON ResourceDirectory.ResourceId=Resource.ResourceId

INNER JOIN RoutingGroupAssignment

ON ResourceDirectory.RoutingGroupId=RoutingGroupAssignment.RoutingGroupId

INNER JOIN FrontEnd

ON RoutingGroupAssignment.FrontEndId=FrontEnd.FrontEndId

ORDER BY FrontEnd.Fqdn

 

Figure 8: SQL query to list routing groups, primary server, and users

Thanks to both Bryan Nyce and Daniel Hernandez for helping document this popular technical information request. Also thanks to Tommy Mhire for providing some of this excellent material.

Read the original blog entry...

More Stories By Richard Schwendiman

My name is Richard Schwendiman and I am currently working for Microsoft as a (PFE) Premier Field Engineer specializing in both Exchange and Lync. I have been working as an IT Consultant for 13+ years focusing on a wide array of Infrastructure technologies. These technologies include Messaging, UC, Networking, Platforms, Active Directory, Virtualization, etc... I am currently certified as an MCSE (Microsoft Certified Systems Engineer), MCSE Messaging 2013, MCSE Communications 2013, MCSA 2012, MCITP Enterprise Messaging, MCTS-Lync, CCNA (Cisco Certified Network Associate), Commvault, CCNP (Cisco Certified Network Professional), and JNCIA-ER (Juniper Enterprise Routing). I am hoping that through this blog I can bring knowledge from the field and keep everyone informed about our ever changing Industry. Please feel free to email me any questions, comments, or concerns pertaining to this blog or any technology related things. Thanks and look forward to providing some good content. http://blogs.technet.com/b/rischwen/

@ThingsExpo Stories
Organizations planning enterprise data center consolidation and modernization projects are faced with a challenging, costly reality. Requirements to deploy modern, cloud-native applications simultaneously with traditional client/server applications are almost impossible to achieve with hardware-centric enterprise infrastructure. Compute and network infrastructure are fast moving down a software-defined path, but storage has been a laggard. Until now.
In his general session at 19th Cloud Expo, Manish Dixit, VP of Product and Engineering at Dice, discussed how Dice leverages data insights and tools to help both tech professionals and recruiters better understand how skills relate to each other and which skills are in high demand using interactive visualizations and salary indicator tools to maximize earning potential. Manish Dixit is VP of Product and Engineering at Dice. As the leader of the Product, Engineering and Data Sciences team at D...
DXWorldEXPO LLC announced today that the upcoming DXWorldEXPO | CloudEXPO New York event will feature 10 companies from Poland to participate at the "Poland Digital Transformation Pavilion" on November 12-13, 2018.
Digital Transformation is much more than a buzzword. The radical shift to digital mechanisms for almost every process is evident across all industries and verticals. This is often especially true in financial services, where the legacy environment is many times unable to keep up with the rapidly shifting demands of the consumer. The constant pressure to provide complete, omnichannel delivery of customer-facing solutions to meet both regulatory and customer demands is putting enormous pressure on...
The best way to leverage your CloudEXPO | DXWorldEXPO presence as a sponsor and exhibitor is to plan your news announcements around our events. The press covering CloudEXPO | DXWorldEXPO will have access to these releases and will amplify your news announcements. More than two dozen Cloud companies either set deals at our shows or have announced their mergers and acquisitions at CloudEXPO. Product announcements during our show provide your company with the most reach through our targeted audienc...
JETRO showcased Japan Digital Transformation Pavilion at SYS-CON's 21st International Cloud Expo® at the Santa Clara Convention Center in Santa Clara, CA. The Japan External Trade Organization (JETRO) is a non-profit organization that provides business support services to companies expanding to Japan. With the support of JETRO's dedicated staff, clients can incorporate their business; receive visa, immigration, and HR support; find dedicated office space; identify local government subsidies; get...
DXWorldEXPO LLC announced today that All in Mobile, a mobile app development company from Poland, will exhibit at the 22nd International CloudEXPO | DXWorldEXPO. All In Mobile is a mobile app development company from Poland. Since 2014, they maintain passion for developing mobile applications for enterprises and startups worldwide.
@DevOpsSummit at Cloud Expo, taking place November 12-13 in New York City, NY, is co-located with 22nd international CloudEXPO | first international DXWorldEXPO and will feature technical sessions from a rock star conference faculty and the leading industry players in the world.
"Akvelon is a software development company and we also provide consultancy services to folks who are looking to scale or accelerate their engineering roadmaps," explained Jeremiah Mothersell, Marketing Manager at Akvelon, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
As data explodes in quantity, importance and from new sources, the need for managing and protecting data residing across physical, virtual, and cloud environments grow with it. Managing data includes protecting it, indexing and classifying it for true, long-term management, compliance and E-Discovery. Commvault can ensure this with a single pane of glass solution – whether in a private cloud, a Service Provider delivered public cloud or a hybrid cloud environment – across the heterogeneous enter...
DXWorldEXPO LLC announced today that ICC-USA, a computer systems integrator and server manufacturing company focused on developing products and product appliances, will exhibit at the 22nd International CloudEXPO | DXWorldEXPO. DXWordEXPO New York 2018, colocated with CloudEXPO New York 2018 will be held November 11-13, 2018, in New York City. ICC is a computer systems integrator and server manufacturing company focused on developing products and product appliances to meet a wide range of ...
More and more brands have jumped on the IoT bandwagon. We have an excess of wearables – activity trackers, smartwatches, smart glasses and sneakers, and more that track seemingly endless datapoints. However, most consumers have no idea what “IoT” means. Creating more wearables that track data shouldn't be the aim of brands; delivering meaningful, tangible relevance to their users should be. We're in a period in which the IoT pendulum is still swinging. Initially, it swung toward "smart for smart...
Headquartered in Plainsboro, NJ, Synametrics Technologies has provided IT professionals and computer systems developers since 1997. Based on the success of their initial product offerings (WinSQL and DeltaCopy), the company continues to create and hone innovative products that help its customers get more from their computer applications, databases and infrastructure. To date, over one million users around the world have chosen Synametrics solutions to help power their accelerated business or per...
In an era of historic innovation fueled by unprecedented access to data and technology, the low cost and risk of entering new markets has leveled the playing field for business. Today, any ambitious innovator can easily introduce a new application or product that can reinvent business models and transform the client experience. In their Day 2 Keynote at 19th Cloud Expo, Mercer Rowe, IBM Vice President of Strategic Alliances, and Raejeanne Skillern, Intel Vice President of Data Center Group and ...
Founded in 2000, Chetu Inc. is a global provider of customized software development solutions and IT staff augmentation services for software technology providers. By providing clients with unparalleled niche technology expertise and industry experience, Chetu has become the premiere long-term, back-end software development partner for start-ups, SMBs, and Fortune 500 companies. Chetu is headquartered in Plantation, Florida, with thirteen offices throughout the U.S. and abroad.
Dion Hinchcliffe is an internationally recognized digital expert, bestselling book author, frequent keynote speaker, analyst, futurist, and transformation expert based in Washington, DC. He is currently Chief Strategy Officer at the industry-leading digital strategy and online community solutions firm, 7Summits.
Bill Schmarzo, author of "Big Data: Understanding How Data Powers Big Business" and "Big Data MBA: Driving Business Strategies with Data Science," is responsible for setting the strategy and defining the Big Data service offerings and capabilities for EMC Global Services Big Data Practice. As the CTO for the Big Data Practice, he is responsible for working with organizations to help them identify where and how to start their big data journeys. He's written several white papers, is an avid blogge...
"We are a well-established player in the application life cycle management market and we also have a very strong version control product," stated Flint Brenton, CEO of CollabNet,, in this SYS-CON.tv interview at 18th Cloud Expo at the Javits Center in New York City, NY.
It is of utmost importance for the future success of WebRTC to ensure that interoperability is operational between web browsers and any WebRTC-compliant client. To be guaranteed as operational and effective, interoperability must be tested extensively by establishing WebRTC data and media connections between different web browsers running on different devices and operating systems. In his session at WebRTC Summit at @ThingsExpo, Dr. Alex Gouaillard, CEO and Founder of CoSMo Software, presented ...
Most people haven’t heard the word, “gamification,” even though they probably, and perhaps unwittingly, participate in it every day. Gamification is “the process of adding games or game-like elements to something (as a task) so as to encourage participation.” Further, gamification is about bringing game mechanics – rules, constructs, processes, and methods – into the real world in an effort to engage people. In his session at @ThingsExpo, Robert Endo, owner and engagement manager of Intrepid D...