IT service management has traditionally been a process-driven discipline, supported by the growing use of automation. In 2017, the increased popularity of cloud services offers up even more opportunities for greater use of automation – let’s call it “ITSM as code.”
In the ITIL Practitioner Guide from AXELOS, there's rightly a large focus on the soft skills required to manage modern IT services. That’s because we often focus too much on IT service management (ITSM) processes and technology, and then we underestimate how much the people are the real ITSM glue.
While ITIL provides best practice to ITSM professionals on how to improve service via policies, processes, metrics, and controls, cloud service providers (CSPs) such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) now offer capabilities that in effect enable this ITIL guidance to be programmatically applied to cloud services. Let’s call it “Service Management as Code.”
|AWS|Microsoft Azure|GCP|
|CloudFormation|Azure Resource Manager|Deployment Manager|
In the AWS Well-Architected Framework white paper, one of the five recommended pillars of the framework is “operational excellence.” This pillar aligns nicely with the ITIL Practitioner guidance, and all the design principles and best practices contained within it are actionable in code.
Although the above is an AWS paper, its practices apply to other CSPs too. In this blog I want to outline five ways in which CSPs enable Service Management as Code.
With CSPs, the metrics and controls for all services are available via an application programming interface (API). Thus, instead of using the familiar graphical user interface (GUI) console in a web browser, clicking and typing to control services, a person can use a command line interface (CLI) or a program to drive the same web services. Those commands can be saved in a script or program and run repeatedly.
So, whenever you select a technology today, ensure that it has a good set of API capabilities: it should be able to integrate with other cloud services and to be consumed by them in turn.
Example: the creation of a virtual machine in AWS EC2 can be done via multi-page screens in the browser, or via a one-line script on the command line.
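To make the one-line idea concrete, here is a minimal Python sketch that assembles the AWS CLI `ec2 run-instances` command. The function names and the image ID `ami-12345678` are placeholders of my own, and the sketch only prints the command unless you explicitly turn off the dry run.

```python
import shlex
import subprocess


def build_run_instances_command(image_id, instance_type="t2.micro", count=1):
    """Return the AWS CLI argument list that launches EC2 instances."""
    return [
        "aws", "ec2", "run-instances",
        "--image-id", image_id,
        "--instance-type", instance_type,
        "--count", str(count),
    ]


def launch(image_id, dry_run=True):
    """Print the one-line command; only execute it when dry_run is False."""
    command = build_run_instances_command(image_id)
    print(shlex.join(command))
    if not dry_run:
        # Requires the AWS CLI to be installed and credentials configured.
        subprocess.run(command, check=True)
    return command


# "ami-12345678" is a placeholder, not a real image ID.
launch("ami-12345678")
```

Because the command is built by a function, it can be reviewed, version controlled, and reused in a way that a sequence of console clicks cannot.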
When code is used to translate policies and controls into API commands, that code can be version controlled and access controlled, and it leaves no ambiguity about how a policy should be interpreted. Code can also be replayed for investigation or test purposes.
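As a rough illustration of a policy expressed as code, the sketch below encodes a made-up rule, "every storage volume must be encrypted", as a Python function. The record layout and the volume IDs are hypothetical; a real script would fetch the inventory from the provider's API.

```python
def unencrypted_volumes(volumes):
    """Policy: every storage volume must be encrypted. Return violating IDs."""
    return [v["id"] for v in volumes if not v.get("encrypted", False)]


# Hypothetical inventory records, hard-coded here for illustration only.
inventory = [
    {"id": "vol-a", "encrypted": True},
    {"id": "vol-b", "encrypted": False},
    {"id": "vol-c"},  # a missing flag is treated as unencrypted
]

violations = unencrypted_volumes(inventory)
print(violations)  # ['vol-b', 'vol-c']
```

The interpretation of the policy, including how a missing flag is treated, is now written down in one place and can be replayed against any inventory snapshot.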
As such, configuration management and responses to operational events are excellent candidates for codifying procedures.
Example: one of the most eye-opening illustrations of this is Netflix's fault-injection tool, Chaos Monkey. This is code that programmatically creates failures in production to verify that resilience measures actually work. The same programmatic approach can be used to monitor and correct configurations (AWS Config), and AWS Lambda now offers “function-as-a-service” where code runs in response to operational (and other) events.
Align the programmable operations to business objectives, for example improving the signal-to-noise ratio in metrics.
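The fault-injection idea can be sketched in a few lines of Python. This is not Netflix's implementation, only a toy version under my own assumptions: `inject_failure` and the instance names are hypothetical, and the `terminate` callable stands in for whatever API call would really stop an instance.

```python
import random


def inject_failure(instance_ids, terminate, rng=None):
    """Pick one instance at random and terminate it via the supplied callable."""
    if not instance_ids:
        return None
    rng = rng or random.Random()
    victim = rng.choice(instance_ids)
    terminate(victim)
    return victim


# Here "terminating" just records the name, so the sketch is safe to run.
terminated = []
fleet = ["web-1", "web-2", "web-3"]
victim = inject_failure(fleet, terminated.append, rng=random.Random(42))
print(victim, terminated)
```

Passing in the random generator and the terminate action keeps the experiment repeatable and lets you rehearse it without touching real servers.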
There are many monitoring services in AWS covering API calls, logging, security access, and more. These should be programmatically and incrementally turned on only if they align to clear business goals. The rule of thumb is: if you don’t know how to action an alert, there shouldn’t be an alert. How do you know that it’s important? It must be aligned to a business outcome.
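That rule of thumb can itself be written as code. The sketch below assumes a hypothetical `AlertDefinition` record of my own design and only lets through alerts that name both a business outcome and an action.

```python
from dataclasses import dataclass


@dataclass
class AlertDefinition:
    name: str
    business_outcome: str = ""
    runbook: str = ""


def alerts_to_enable(definitions):
    """Keep only alerts that name a business outcome and an action to take."""
    return [
        d for d in definitions
        if d.business_outcome.strip() and d.runbook.strip()
    ]


candidates = [
    AlertDefinition("checkout-latency", "customers can complete purchases",
                    "scale out the web tier"),
    AlertDefinition("disk-io-spike"),  # nobody knows how to action it: stays off
]
enabled = [d.name for d in alerts_to_enable(candidates)]
print(enabled)  # ['checkout-latency']
```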
Example: an important metric for online services is the response (wait) time that clients experience, as impatient customers will give up on a website if it's too slow. Using monitoring and response services like Amazon CloudWatch, it's possible to monitor across the entire application and identify bottlenecks and slowdowns that impact website response times.
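To show the kind of check involved, here is a small sketch that flags endpoints whose 95th-percentile response time exceeds a threshold. It does not call CloudWatch; the sample latencies, endpoint names, and the 1000 ms threshold are invented for illustration, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slow_endpoints(latencies_ms, threshold_ms, pct=95):
    """Return endpoints whose pct-th percentile latency exceeds the threshold."""
    return sorted(
        name for name, samples in latencies_ms.items()
        if samples and percentile(samples, pct) > threshold_ms
    )


# Hypothetical response times in milliseconds.
samples = {
    "/home": [120, 130, 110, 140, 125],
    "/checkout": [300, 2500, 320, 2800, 310],
}
print(slow_endpoints(samples, threshold_ms=1000))  # ['/checkout']
```

A percentile is used rather than an average because a handful of very slow requests is exactly what drives customers away, and an average would hide them.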
Use programmable Trusted Advisor services to continually employ the most cost-effective resources.
Think of those old-school, manual, consultant-led health checks that used to assess your IT environment against “best practices.” This is now automated in AWS with the Trusted Advisor service – which will check your services against best practices for cost optimization, performance, security, and fault tolerance.
Example: human error is a common cause of system outage. Trusted Advisor checks Identity and Access Management configurations to ensure principles such as least-privilege are in place, reducing the risk and consequences of human error.
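A least-privilege check of this kind is easy to sketch yourself. The code below is not how Trusted Advisor works internally; it is a simplified illustration that scans a standard IAM JSON policy document for statements allowing every action on every resource. The policy contents and the function names are my own.

```python
def as_list(value):
    """IAM allows a single item or a list for Statement, Action and Resource."""
    return value if isinstance(value, list) else [value]


def overly_broad_statements(policy):
    """Return Allow statements that grant every action on every resource."""
    findings = []
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        if "*" in actions and "*" in resources:
            findings.append(statement)
    return findings


# Hypothetical policy document for illustration.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "ReadReports", "Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-reports/*"},
        {"Sid": "Everything", "Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}
print([s["Sid"] for s in overly_broad_statements(policy)])  # ['Everything']
```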
Where systems administrators once ran scripts to configure servers and application stacks consistently, AWS has taken this one step further with configuration-as-a-service in AWS CloudFormation. It can control many AWS cloud resources, meaning that you can now version control your AWS cloud services just as you do with software.
In ITSM terms, services like CloudFormation allow you to programmatically define a business service as a collection of integrated cloud services that can be repeatedly and reliably reproduced in testing or investigation scenarios. And it can be driven by all the familiar enterprise configuration and release tools, such as PowerShell DSC, Chef Server, Puppet, Ansible Tower, Red Hat OpenShift, Docker Datacenter, and Spinnaker.
So, the choice is yours. You could manage cloud services like on-premises services, with humans interpreting documented procedures and clicking and typing into many graphical user interfaces. But you shouldn’t, unless you are happy with insufficient speed and the risk of adverse service impact.
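As a small illustration of a service defined as code, the sketch below generates a minimal CloudFormation template containing a single EC2 instance. The resource name `WebServer`, the description, and the image ID are placeholders I have chosen; a real business service would declare many more resources.

```python
import json


def web_server_template(image_id, instance_type="t2.micro"):
    """Build a minimal CloudFormation template with one EC2 instance."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative single-instance stack",
        "Resources": {
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": image_id,
                    "InstanceType": instance_type,
                },
            }
        },
    }


def render(template):
    """Serialize with sorted keys so the same input always gives the same file."""
    return json.dumps(template, indent=2, sort_keys=True)


# "ami-12345678" is a placeholder, not a real image ID.
print(render(web_server_template("ami-12345678")))
```

Rendering the template deterministically means any change to the service shows up as a clean diff in version control.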
Cloud services are programmatic and, as such, you can – and should – use code, scripts, and cloud services to codify your ITSM practices.