OpenStack is a distributed-services architecture for orchestrating cloud infrastructure. To build an enterprise-grade orchestration engine that serves enterprises well, one needs to keep up with the committed service-level agreements (SLAs).
In this blog, I will discuss the challenges of achieving the high-availability SLAs demanded of enterprise-grade OpenStack solutions.
High-availability services are deployed on multiple nodes. These nodes are named after the services they host. For example, controller nodes host API and scheduler services, storage nodes serve storage disk needs, network nodes provide connectivity between virtual machines (VMs) and the internet, and compute nodes are where the VMs are created.
Since each of these services is hosted on multiple nodes, identifying failure points and root causes of issues becomes challenging. One needs to look at the logs of each node hosting that service to find where the error is.
Once you know the error, it is possible to identify the root cause. Unfortunately, this requires looking through each node again, which is time-consuming and can jeopardise meeting the high-availability SLA.
Finding a solution
Current debugging solutions commonly aggregate the logs from all nodes and services in one location and provide a Graphical User Interface (GUI) to look through the data. Given the volume of logs that must be examined, however, the search process offered by the GUI can be significantly enhanced, yielding more useful results in a fraction of the time.
To this end, Tata Communications has developed a Cloud Inspector framework, which makes debugging OpenStack-based cloud infrastructure much easier. We have added two new fields to each VM workflow as it is created – a request tracking identifier and a dynamic logging level. When the API receives a request, it saves these two new keys for further processing.
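To illustrate the idea, here is a minimal sketch of how an incoming API request could be tagged with the two extra keys. The field names, function name, and payload shape are assumptions for illustration, not the production schema:

```python
import logging
import uuid

def tag_request(payload, log_level=logging.INFO):
    """Attach a request tracking identifier and a dynamic logging level
    to an incoming API request payload (illustrative sketch; the field
    names are hypothetical, not Cloud Inspector's actual schema)."""
    tagged = dict(payload)  # avoid mutating the caller's payload
    # Unique ID that every downstream service logs alongside its entries
    tagged["request_tracking_id"] = str(uuid.uuid4())
    # Per-request verbosity, so a single workflow can be traced in detail
    tagged["dynamic_log_level"] = logging.getLevelName(log_level)
    return tagged

request = tag_request({"action": "create_vm", "flavor": "m1.small"})
```

Because the identifier travels with the request, every service that handles it can include the same key in its log entries, and the per-request logging level lets operators raise verbosity for one workflow without flooding the whole deployment with debug output.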
This makes it easier to search the logs for all the entries related to a specific action, which in turn makes it far easier to identify the root cause of a bug, understand the service node from which it originated and ultimately fix the issue.
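The search step described above can be sketched as a simple filter over the aggregated log stream. The log format and node names below are invented for illustration:

```python
def filter_by_request_id(log_lines, request_id):
    """Return only the log entries that mention the given request
    tracking identifier, regardless of which node emitted them."""
    return [line for line in log_lines if request_id in line]

# Hypothetical aggregated log entries from several nodes
logs = [
    "controller-1 nova-api req-abc123 accepted create_vm",
    "compute-2 nova-compute req-xyz999 spawning instance",
    "compute-1 nova-compute req-abc123 ERROR no valid host",
]

trace = filter_by_request_id(logs, "req-abc123")
```

The surviving entries form a cross-node trace of a single action, so the node that first reported the error (here, a hypothetical compute-1) stands out immediately instead of requiring a manual sweep of every node's log files.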
Inspecting the cloud
Moving forward, we will see services such as Cloud Inspector integrated with our IT Service Management (ITSM) interface, so that whenever an action fails, the system can filter log entries using the request tracking identifier and help users pinpoint what needs to be debugged.
As these systems become highly automated, and technologies such as AI enable platforms to self-maintain and self-heal, IT managers and decision makers can invest their time in more strategic areas than manually searching for errors and bugs in the IT infrastructure.
Read one of our previous blogs on modernising your cloud infrastructure.