We all know that IT downtime is expensive and damaging to organizations and their productivity. Yet, despite ongoing investments in technology, system outages still bedevil many enterprises, including those running in cloud-based environments. In 2018, for example, outages at companies such as Microsoft, Amazon Web Services, and Visa reminded us that even well-maintained IT environments are vulnerable to system disruptions. An extreme example is the 2017 Delta Air Lines 5-hour outage that caused the cancellation of 280 flights and cost the company $150 million.
With numerous IT assets spread across on-premises and cloud environments, downtime today can cause a lot more harm than mere customer inconvenience. System outages can directly affect productivity, throughput, profitability, and customer retention. You only need to experience one system outage to realize how much harm it can cause to a company’s reputation and standing in the marketplace. But rather than treating IT downtime as an inevitable cost of doing business, senior IT managers can take proactive steps to minimize its occurrence and impact. Quantifying the cost of system outages can also justify the spending needed to remediate them.
How Much Do System Outages Really Cost?
Catastrophic airline outages aside, just how costly is downtime to the typical enterprise? There are direct costs for system diagnosis and repair, as well as indirect costs such as the loss of organizational productivity and damage to corporate reputation. Costs may also include damage to (or loss of) mission-critical data and other assets, legal and regulatory impact, and repair costs for core business processes and systems. Add in the lost revenue from business disruptions and missed sales opportunities, and you can see how even one hour of downtime can cost several hundred thousand, or even millions of, dollars.
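To make the arithmetic concrete, here is a minimal sketch of how those cost categories might be tallied for a single outage. The function name, its parameters, and every input figure are hypothetical and chosen only for illustration; a real estimate would also need to account for harder-to-measure items such as reputational damage and regulatory exposure.

```python
def estimate_downtime_cost(
    outage_hours,
    revenue_per_hour,
    affected_employees,
    loaded_hourly_wage,
    productivity_loss=1.0,
    recovery_cost=0.0,
):
    """Rough outage cost: lost revenue + lost productivity + recovery spend.

    productivity_loss is the fraction (0.0 to 1.0) of work that affected
    employees cannot do while systems are down.
    """
    if outage_hours < 0:
        raise ValueError("outage_hours must be non-negative")
    lost_revenue = outage_hours * revenue_per_hour
    lost_productivity = (
        outage_hours * affected_employees * loaded_hourly_wage * productivity_loss
    )
    return {
        "lost_revenue": lost_revenue,
        "lost_productivity": lost_productivity,
        "recovery_cost": recovery_cost,
        "total": lost_revenue + lost_productivity + recovery_cost,
    }


# Hypothetical inputs, for illustration only.
example = estimate_downtime_cost(
    outage_hours=2,
    revenue_per_hour=120_000,
    affected_employees=500,
    loaded_hourly_wage=60,
    productivity_loss=0.5,
    recovery_cost=25_000,
)
print(example)
```

With these made-up inputs, lost revenue dominates the total, which is typical of revenue-generating systems; for internal systems the productivity term may matter more.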
To tally up estimated costs, analyst firms have surveyed clients who have experienced system downtime and can quantify its impact. Cost estimates vary by industry, size of the organization, and region. For example, according to Statista, in 2017/2018 the average cost of server downtime was approximately $300,000-400,000 per hour, while 44% of survey respondents reported costs of $1M per hour or more. Those costs, and the associated business impact, would be higher still if the unplanned outage occurred during peak traffic time.
The more complex, virtualized, and interconnected system environments become, the longer it takes to diagnose and resolve unplanned outages. The siloed, disparate nature of most hybrid IT infrastructures has caused Mean-Time-To-Recovery (MTTR) to escalate along with costs. In fact, ITIC’s Reliability and Hourly Cost of Downtime Trends Survey confirmed that 81% of organizations report that the cost of unplanned downtime typically exceeds $300,000/hour, with monetary costs exceeding millions of dollars per minute in extreme cases.
The ITIC study also shows that downtime costs vary by industry and enterprise size. For example, large enterprises with over 1,000 employees could see the cost of a single hour of downtime exceed $5 million in nine specific industries, including Banking/Finance; Government; Healthcare; Manufacturing; Media & Communications; Retail; and Transportation and Utilities.
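As a rough illustration of how MTTR feeds into cost, the sketch below averages recovery time over a handful of incidents and multiplies it by an hourly downtime figure. The incident timestamps are invented, the function name is hypothetical, and the hourly rate simply reuses the $300,000/hour threshold cited above; substitute your own numbers.

```python
from datetime import datetime


def mean_time_to_recovery_hours(incidents):
    """Average hours from outage start to service restoration.

    incidents is a list of (start, end) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total_seconds = sum((end - start).total_seconds() for start, end in incidents)
    return total_seconds / len(incidents) / 3600


# Invented incident log, for illustration only.
incidents = [
    (datetime(2018, 3, 1, 9, 0), datetime(2018, 3, 1, 10, 30)),     # 1.5 hours
    (datetime(2018, 6, 12, 22, 15), datetime(2018, 6, 13, 0, 45)),  # 2.5 hours
    (datetime(2018, 9, 5, 14, 0), datetime(2018, 9, 5, 16, 0)),     # 2.0 hours
]

HOURLY_COST = 300_000  # assumed hourly downtime cost in dollars

mttr_hours = mean_time_to_recovery_hours(incidents)
cost_per_incident = mttr_hours * HOURLY_COST
print(f"MTTR: {mttr_hours:.1f} h, estimated cost per incident: ${cost_per_incident:,.0f}")
```

Tracking this figure over time shows whether investments in tooling and process are actually shortening recovery, since every hour shaved off MTTR reduces the per-incident cost by the full hourly rate.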
Total Cost of System Downtime by Industry
(summary of Ponemon Institute study)
- Financial Services $994,000
- Healthcare $918,000
- eCommerce $909,000
- Industrial $761,000
- Retail $758,000
- Hospitality $514,000
- Public Sector $476,000
Take a look at the following charts (Ponemon Institute study). The first chart reveals the relatively consistent breakdown of cost categories associated with business disruption. The second chart illustrates that the average shutdown duration hasn’t changed in the last six years.
Why System Outages Happen
A 2018 study by Information Technology Intelligence Consulting (ITIC) found that human error and security issues top the list of causes of unplanned downtime, with network interruptions another contributing factor. Outdated processes can also lengthen the time it takes to resolve outages. Pinpointing the root cause of a system outage is complicated by the manual, time-consuming correlation of massive amounts of siloed operational data. Proactive planning, automation, and better resources can help minimize human error and hardware/software failures. For example, automated real-time correlation of data across business, application, and infrastructure components can help predict potential system issues so they can be resolved before they affect the business.
Proactively Manage Assets to Minimize System Downtime
To prevent downtime, you must be able to effectively monitor and maintain your assets in real time, and arm your IT staff with tools that help make sense of IT complexity. Artificial intelligence for IT operations (AIOps) can transform IT Ops by significantly reducing human error and the tedious repetition of cumbersome manual processes. With real-time, full-stack data correlation and visualization of the entire system environment, IT staff can gain actionable insights to optimize system performance and meet customer expectations. Viewing and understanding application dependencies can also help employees forecast the potential impact of system changes before they are implemented. This allows you to carefully plan transitions and migrations so they don’t affect the performance of business-critical applications.
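The dependency point can be sketched in a few lines. Assuming you already have a map of which component depends on which (the component names below are made up, and the function is a hypothetical helper rather than part of any AIOps product), a breadth-first walk over the inverted map lists everything that a planned change could touch.

```python
from collections import deque


def impacted_by(changed, depends_on):
    """Return every component that directly or transitively depends on `changed`.

    depends_on maps each component to the components it relies on.
    """
    # Invert the map: component -> set of components that depend on it.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    impacted = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


# Made-up dependency map, for illustration only.
depends_on = {
    "checkout-web": ["payments-api", "catalog-api"],
    "payments-api": ["orders-db"],
    "catalog-api": ["catalog-db"],
    "reporting": ["orders-db"],
}

affected = impacted_by("orders-db", depends_on)
print(sorted(affected))
```

In this toy example a change to the orders database reaches the customer-facing checkout application through the payments API, which is exactly the kind of indirect impact that is easy to miss without a dependency view.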