Keys to Avoiding Data Center Downtime
Ask any data center facility manager what keeps them up at night, and the answer is usually the same: fear of downtime.
That’s because with each minute of data center downtime, the risk of serious loss increases. Around a third of all reported outages cost more than $250,000, with many exceeding $1 million.
The good news? An Uptime Institute report shows that most IT service disruptions are preventable, with 80% of respondents admitting that their most recent outage could have been avoided. In this post we’ll review some of the most common causes of data center downtime, including power outages, human error, security issues and IT cooling problems, along with a few simple steps you can take to mitigate your risks.
Causes of Downtime in IT Data Centers
- Power Outages. Look at the biggest public service outages tracked by Uptime Institute in a recent survey and you’ll see that power failures were responsible for nearly 40% of them (and 33% of respondents said they’d had this type of failure within the past year). This makes power outages the leading cause of downtime, with network problems and IT system failures close behind. (If you treat third-party cloud, colocation and hosting providers as a single group, however, that group ranks #2 among the top causes.)
- Human Error. The Uptime Institute reported that nearly 70% of all data center outages are the direct or indirect result of human error. These errors range from very simple mistakes (inadvertently pulling cords out of equipment or overloading a circuit) to more complex issues (decisions regarding equipment layout, or inadequate procedures or training). As an example, data centers that don’t follow cable management best practices are vulnerable to downtime when power cables are packed so tightly that power flow is restricted, or when the wrong cable for the application is used.
- Security Issues. With their massive stores of data and applications, data centers are prime targets for hackers. Ransomware, external access services, application attacks and distributed denial of service (DDoS) are common methods used to compromise systems. In some data centers, equipment can be remotely controlled and configured, making it possible for hackers to interrupt power. Those looking to compromise data might also try to attack the physical structure of the IT enclosure itself. One cyber security software company suggests that “While the data center contains the highest concentration of sensitive data and critical business applications, it tends to have the weakest security controls, leaving much of this highly sensitive and business-critical data vulnerable to cyber attack.”
- IT Enclosure Cooling Issues. The importance of proper data center and IT enclosure cooling cannot be overstated. When you look at the primary cause of the most serious outages between January 2016 and June 2018 (Uptime Institute, Risk & Resiliency research, June 2018), cooling was responsible for 20% of these incidents. Servers and processors generate a significant amount of heat (and more every day as densities increase), creating an ongoing risk of overheating and failure if proper cooling methods and equipment aren’t applied. Conversely, too much cooling can create moisture or condensate, which could also lead to failure caused by short-circuiting and corrosion at the IT appliances.
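The two cooling failure modes described above, too little cooling (overheating) and too much (condensation), can both be checked from ordinary sensor readings. The following is a minimal Python sketch, not a vendor method: it estimates the dew point with the widely used Magnus approximation and flags a reading as risky. The function names and the 27 °C inlet limit are illustrative assumptions; substitute the limits from your own equipment specifications.

```python
import math

# Magnus approximation constants (temperatures in degrees Celsius).
MAGNUS_A = 17.62
MAGNUS_B = 243.12


def dew_point_c(air_temp_c: float, relative_humidity_pct: float) -> float:
    """Approximate dew point in Celsius; relative humidity must be above 0."""
    gamma = math.log(relative_humidity_pct / 100.0) + (
        MAGNUS_A * air_temp_c / (MAGNUS_B + air_temp_c)
    )
    return MAGNUS_B * gamma / (MAGNUS_A - gamma)


def cooling_status(
    inlet_temp_c: float,
    relative_humidity_pct: float,
    coldest_surface_c: float,
    max_inlet_c: float = 27.0,  # hypothetical limit; use your equipment's spec
) -> str:
    """Classify one enclosure reading as overheating, condensation risk, or ok."""
    if inlet_temp_c > max_inlet_c:
        return "overheating risk"
    # Moisture condenses on any surface at or below the dew point of the air.
    if coldest_surface_c <= dew_point_c(inlet_temp_c, relative_humidity_pct):
        return "condensation risk"
    return "ok"


if __name__ == "__main__":
    print(cooling_status(25.0, 50.0, coldest_surface_c=12.0))
```

In this sketch the overheating check takes priority, on the assumption that an over-temperature inlet is the more urgent alarm; a real monitoring setup might report both conditions separately.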
Use the Right Cooling to Improve Data Center Uptime
There are three general types of cooling: room-based, row-based (sometimes referred to as in-line) and rack-based. On a continuum, room-based is the furthest from the sources of heat generated and rack-based is the closest – right in or attached to the IT enclosure itself. It’s critical for data center facility managers and IT managers to carefully consider both the most efficient cooling methods available today and the vendor partner with which they’ll work to find the right solution for their needs. You can learn more about these cooling methods in our post, “The State of Data Center Cooling & 2020 Trends.”
Additional Tips for Preventing Data Center Downtime
The risks of downtime require vigilance across the organization. Some threats are more severe, while others can be minimized with some simple steps, such as:
- Understanding where issues are happening – and where you have opportunities to improve – by monitoring the performance of all your equipment. Analytics software can target both data-center-level and server-level performance metrics, and insights from the data can be used to ensure optimum uptime.
- Making sure servers have the processing power and storage capacity aligned with customers’ needs. “Right-sizing” servers improves overall data center performance and minimizes the likelihood of failure.
- Grouping equipment and IT enclosures in a way that puts equipment with similar heat load densities together. This makes it simpler to target cooling and prevent overheating.
- Doing some “housekeeping.” Over time, dust and debris collect and can affect airflow, decreasing the efficiency of the cooling system. Both can cause static electricity to build up, too, leading to short circuits that can take down the system. Clean above and below IT enclosures to avoid these issues.
- Learning from others’ failures. Most types of system failure have happened and are well documented, giving you insights that allow you to take precautionary measures.
- Asking your data center’s service providers (cloud, hosting and colocation providers and carriers) to provide you with detailed risk/resiliency reports. This step can highlight potential risks and indicate the best steps to prevent an outage of your own.
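The tip above about grouping equipment with similar heat load densities can be illustrated with a short Python sketch. It is only an example under stated assumptions: the enclosure names, the kW-per-enclosure figures and the 5 kW and 15 kW cut-offs are hypothetical, and the function name is not taken from any product or library.

```python
from bisect import bisect_right


def group_by_heat_load(
    heat_load_kw: dict[str, float],
    boundaries_kw: tuple[float, ...] = (5.0, 15.0),  # hypothetical cut-offs
    labels: tuple[str, ...] = ("low", "medium", "high"),
) -> dict[str, list[str]]:
    """Bucket enclosures by heat load so each bucket can share a cooling zone."""
    if len(labels) != len(boundaries_kw) + 1:
        raise ValueError("need exactly one more label than boundaries")
    groups: dict[str, list[str]] = {label: [] for label in labels}
    # Sort by name so the output order is stable and easy to read.
    for name, load in sorted(heat_load_kw.items()):
        # bisect_right returns how many boundaries are <= this load,
        # which is the index of the bucket the enclosure belongs to.
        groups[labels[bisect_right(boundaries_kw, load)]].append(name)
    return groups


if __name__ == "__main__":
    # Hypothetical inventory: enclosure name -> measured heat load in kW.
    inventory = {"A1": 3.0, "A2": 12.0, "B1": 22.0, "B2": 4.5}
    print(group_by_heat_load(inventory))
```

Each resulting group is a candidate for placement together, so that higher-density enclosures can be served by cooling located closer to the heat source.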
The wrong cooling approach and equipment put you at risk for downtime. But how do you know what’s right for your facility? You talk to the experts. Rittal is the world’s leading enclosure manufacturer, with an unparalleled combination of expertise, flexibility and engineering. To learn more about the different cooling methods and the applications they’re most appropriate for, read this post: “2020 Trends in Data Center Enclosures.”