Automation of Cloud Outage Notification


In this digital era, organisations are under pressure to make the most of their budgets, and moving applications to the cloud at a reasonable cost has become the natural choice. Many organisations have recently made that leap. With this trend comes a real need to design systems that can handle outages and downtime. There are several ways to mitigate cloud outages, yet unexpected failures can still threaten business continuity. It is our responsibility to design applications and system architectures for failure.
Typically, software architects handle a cloud outage by automatically or manually turning off the affected service until the issue is resolved. This blog explains how to automate public cloud outage notifications to the relevant stakeholders of a project, product, or program hosted in a public cloud environment, and why such automation is needed.

Need for Outage Notification Automation

  1. On a day-to-day basis, you encounter public cloud outages at various levels:
    1. Service level
    2. Region level
    3. Global level
  2. Your team is distributed and works from remote locations
  3. Your product/project/program team members belong to various internal departments
  4. Each notification has to be monitored and rerouted to the internal teams that must work on alternative actions
  5. In some cases, you need to send the notification immediately to avoid business impact
  6. You also need to avoid the delay caused by teams not having outage information at every level
  7. Your operations team is in one time zone and the outage occurs in another

Solution
Based on the scenarios listed above, we propose automating outage notification with a third-party monitoring tool or a monitoring server. The server routes each outage to the respective team using a template, which ensures that it carries out the notification task successfully.

Outage Notification Automation Workflow and Diagram

[Diagram: outage notification automation workflow]

Pre-Requisites

  • Public cloud outage notifications at the service, region, and global levels, available through the RSS feeds from AWS, Azure, or Google Cloud
  • A notification aggregator template that groups and categorises the stakeholders for each notification
  • A monitoring server that runs a monitoring agent to read the RSS feed and dispatch each message to the respective stakeholders from the template
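
The notification aggregator template can be as simple as a mapping from outage level to stakeholder groups. The following is a minimal sketch in Python, not a prescribed format: the three levels come from the list in the Need section, while the TEMPLATE structure, the recipients_for helper, the service names, and all e-mail addresses are hypothetical placeholders.

```python
# Hypothetical notification aggregator template: outage level -> stakeholders.
# All addresses are placeholders, not real mailboxes.
TEMPLATE = {
    "service": {
        # Service-level outages go to the team that owns the affected service.
        "ec2": ["infra-team@example.com"],
        "s3": ["storage-team@example.com"],
        "*": ["ops-oncall@example.com"],  # fallback for unlisted services
    },
    "region": ["ops-oncall@example.com", "product-owners@example.com"],
    "global": [
        "ops-oncall@example.com",
        "product-owners@example.com",
        "management@example.com",
    ],
}


def recipients_for(level, service=None):
    """Return the sorted, de-duplicated stakeholder list for an outage."""
    entry = TEMPLATE[level]
    if isinstance(entry, dict):
        # Service level: look up the service, else use the fallback group.
        found = entry.get((service or "").lower(), entry["*"])
    else:
        found = entry
    return sorted(set(found))
```

Keeping the template as plain data means stakeholders can be regrouped without touching the agent that reads the feed.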

Implementation
Once the subscriptions for all levels of the respective public cloud are in place:

  1. Configure the monitoring server so that the monitoring agent (daemon) runs 24/7
  2. Configure the monitoring agent to read the desired RSS feed
  3. Configure the agent to route notifications to the respective stakeholders by email, using SMTP or another protocol as your policy requires
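
The three steps above can be sketched as a small polling agent. This is an illustration using only the Python standard library, not a description of any specific monitoring product. The feed URL, SMTP host, sender, recipient list, and polling interval are placeholder assumptions, and the feed is assumed to be plain RSS 2.0 with item, title, guid, and description elements.

```python
import smtplib
import time
import urllib.request
import xml.etree.ElementTree as ET
from email.message import EmailMessage

FEED_URL = "https://status.example.com/rss/all.rss"  # placeholder feed URL
SMTP_HOST = "smtp.example.com"                       # placeholder SMTP relay
SENDER = "outage-monitor@example.com"                # placeholder sender
RECIPIENTS = ["ops-oncall@example.com"]              # placeholder stakeholders
POLL_SECONDS = 300                                   # assumed polling interval


def parse_items(rss_text):
    """Extract (guid, title, description) tuples from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        # Fall back to the title when the feed omits <guid>.
        guid = item.findtext("guid", default=title).strip()
        desc = item.findtext("description", default="").strip()
        items.append((guid, title, desc))
    return items


def new_items(items, seen):
    """Return items whose guid is not in `seen`, and record them as seen."""
    fresh = [it for it in items if it[0] not in seen]
    seen.update(it[0] for it in fresh)
    return fresh


def build_message(title, desc, recipients):
    """Build the notification e-mail for one outage entry."""
    msg = EmailMessage()
    msg["Subject"] = f"[Cloud outage] {title}"
    msg["From"] = SENDER
    msg["To"] = ", ".join(recipients)
    msg.set_content(desc or title)
    return msg


def run_forever():
    """Poll the feed and e-mail every entry not seen before."""
    seen = set()
    while True:
        with urllib.request.urlopen(FEED_URL, timeout=30) as resp:
            rss_text = resp.read()
        for guid, title, desc in new_items(parse_items(rss_text), seen):
            with smtplib.SMTP(SMTP_HOST) as smtp:
                smtp.send_message(build_message(title, desc, RECIPIENTS))
        time.sleep(POLL_SECONDS)
```

Running run_forever() under a process supervisor would cover the 24/7 requirement. A production agent would also need error handling for feed and mail failures, and persistent storage of the entries already seen so that a restart does not resend old notifications.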

Conclusion
This solution is simple yet useful and reliable for notification and troubleshooting during outages or incidents. The alert mechanism also helps you narrow down an incident and confirm it. At the time of writing, public cloud providers are offering more notification services and automation of their own; however, this automation is simpler and can be built without much effort.

ABOUT THE AUTHOR
Madhavan Srinivasan

MADHAN RAMAKRISHNAN

Director - DevOps Practice/Program | Newt Global