Environment management in general comes with its own set of complications, but what I am interested in knowing is how other Systems/Operations departments are handling auto-deployments. Several of the points in the articles above are moot for us and have been for some time. Access to production systems is heavily limited to our Systems Administrators, and many of our products are deployed with scripts triggered by TeamCity builds. Code and configuration files are rarely deployed manually to our end-to-end test environment; when a change is needed, it is generally the auto-deployment script that gets touched to alter the outcome rather than a human touching a file. Don’t get me wrong, there are still blunders that occur from one environment to another: “oh, I forgot that we needed to add that file, so I didn’t add it to the production release… oops”. However, my main concern is what to do when an auto-deployment script behaves badly.
First, let us discuss what auto-deployment means to our teams. When code is ready to be released to a test or production environment, an auto-deployment script is executed to place the code on the correct server(s). In the case where code is going to production, a Systems Administrator is assigned a ticket to take a few preparation steps and then run the script to deploy the code. Depending on the environment and product, one or all of the following things could happen during an auto-deployment run:
- Disable/enable monitors
- Disable/enable nodes in a cluster
- Run database scripts
- SVN checkout or switch of a directory or file
- Copy files from one location to another
- Start/stop services
- Start/stop web containers
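Condensed into a sketch, a run covering those steps might look like the shell script below. Every path, tag, and tool name here is a hypothetical placeholder rather than our actual tooling, and the real commands are left as comments:

```shell
#!/bin/sh
# Sketch of the deployment steps listed above.
# All paths, tags, and tool names are hypothetical placeholders.
set -eu   # abort on the first failed step rather than continuing blindly

APP_DIR="/opt/myapp"            # hypothetical install directory
RELEASE_TAG="release-1.2.3"     # hypothetical SVN tag to deploy

echo "Disabling monitors for myapp..."
# monitoring-cli disable myapp        # placeholder for your monitoring tool

echo "Removing this node from the cluster..."
# cluster-cli drain "$(hostname)"     # placeholder for your load balancer

echo "Running database scripts..."
# psql -f db/changes.sql              # placeholder migration step

echo "Switching code to $RELEASE_TAG..."
# svn switch "^/tags/$RELEASE_TAG" "$APP_DIR"

echo "Restarting services and web containers..."
# service myapp restart               # or your init system's equivalent

echo "Re-enabling node and monitors..."
# cluster-cli enable "$(hostname)"
# monitoring-cli enable myapp
```

The ordering matters: the node comes out of the cluster before anything changes, and monitors come back last so the release itself does not page anyone.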
The first time I ran an auto-deployment release, I felt naked. I was so used to knowing every piece of what was actually required for a product to become live that pressing a button felt completely foreign. I worried that something would be missed, and I just wouldn’t know. So what should you do in a case like this? How do you mitigate the risk of the system being compromised? You’re running a script that you know little about, other than that you just typed several passwords into a config file on a secure server and it’s using your permissions to execute. However, because you have done this manually before, you know what the behavior and outcome of the script should be, so verify it.
The first few times we were vigilant. Check to make sure the node is out, review all the database scripts prior to release, check log files during and after the release, etc. Over time though, we started to become more comfortable with the release runs, and this is when the problems started to arise.
Auto-deployment scripts in recent months have been known to produce any one of the following symptoms:
- Didn’t run the database scripts, or ran the incorrect database scripts
- Didn’t remove the node from the cluster or didn’t put it back in
- Deployed the wrong version
- Deployed the wrong application
- Failed because of a spelling mistake
- Showed “all green” in our monitoring system, but still caused a serious underlying problem
All of these problems compound one another, and it is the responsibility of both the development and systems teams to recognize and mitigate them. The real root of the solution is to treat the auto-deployment script as exactly what it is: code. What do you do with code? You test it, you unit test it, and you make sure that it throws exceptions and fails loudly when it doesn’t do what it is supposed to do. Most of the above problems could easily be avoided if the script were intelligent enough to know when it has made a mistake.
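As a minimal illustration of what “knowing when it has made a mistake” can look like, the sketch below verifies the outcome of a deploy step instead of trusting it. The version file, version number, and `fail` helper are all hypothetical, and the deploy step itself is simulated with a temp file:

```shell
#!/bin/sh
# Sketch: a deploy script that verifies its own work and fails loudly.
# The version file and version number are hypothetical examples.
set -eu

EXPECTED_VERSION="1.2.3"
VERSION_FILE="$(mktemp)"   # stand-in for e.g. /opt/myapp/VERSION

# --- deploy step (simulated): the build drops a VERSION file ---
printf '%s\n' "1.2.3" > "$VERSION_FILE"

fail() {
    echo "DEPLOY FAILURE: $1" >&2
    exit 1
}

# --- verification: never assume the copy worked ---
[ -s "$VERSION_FILE" ] || fail "version file missing or empty after deploy"
DEPLOYED=$(cat "$VERSION_FILE")
[ "$DEPLOYED" = "$EXPECTED_VERSION" ] \
    || fail "expected version $EXPECTED_VERSION but found $DEPLOYED"

echo "verified: deployed version $DEPLOYED"
rm -f "$VERSION_FILE"
```

A nonzero exit here is the whole point: it turns “showed all green but was actually broken” into a hard FAILURE that the build system and the operator both see.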
So, this is my list of remedies for saving an operations team from feeling naked when learning to run auto-deployment scripts.
- Get involved, have the script demoed to you and understand what it does before you run it. Ensure that every time it changes, you receive a new demo.
- Don’t be naive: yes, maybe one day this script will run without someone watching over it, but before that happens, sysadmins and developers should work out the kinks.
- Review database scripts with a developer before the script is executed, then check to ensure that each of the intended operations actually occurred.
- Any time a problem is encountered post-release that is found to be the fault of the script, have the script altered to create a FAILURE state if it ever happens again. Additionally, review the script for other possible failures and have those adjusted too.
- Check log files, review the script output and review the logs of your application containers.
- Ask developers to check log files too, you will not always catch an error.
- Consider having the script pause at critical stages (e.g., between nodes) so that things can be inspected before they go too far.
- Improve your monitoring systems. Create monitoring checks for the classes of errors that releases have caused in the past.
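The “pause at critical stages” remedy can be as simple as a confirmation prompt between nodes. This is a sketch with hypothetical node names; when no operator is attached (stdin is closed), it proceeds rather than hanging:

```shell
#!/bin/sh
# Sketch: pausing between critical deploy stages for human inspection.
# Node names and stages are hypothetical examples.
set -eu

confirm() {
    # In a real run this waits for the operator; at EOF (no tty,
    # e.g. an unattended run) it defaults to proceeding.
    printf '%s [y/N] ' "$1"
    read -r answer || answer=y
    case "$answer" in
        y|Y) ;;
        *) echo "aborted by operator" >&2; exit 1 ;;
    esac
}

echo "deploying to node1..."
confirm "node1 looks healthy -- continue to node2?"
echo "deploying to node2..."
```

The pause between nodes is exactly where “didn’t put the node back in” gets caught: with half the cluster still serving old code, an abort here is cheap.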
One day, after you have executed a script 10 times and it has not made a mistake, you will trust it. The important thing to remember is that the scripts will continue to evolve, so stay involved and help test every change. After all, as Systems Administrators or Operations staff, it is your job to ensure that the system is operational and take responsibility or alert the appropriate team when it is not.
Systems Administration Team Lead
When I started my career in IT I was a UI developer. The most I had learned in my formal post-secondary education was that projects consisted of functional requirements gathering, documentation, development, acceptance and release. I lived in a tiny bubble. Code was written and tested, and then it was magically deployed to a customer’s website by some process that I was obviously not involved in.
As my skills and responsibilities evolved, so did my aptitude for understanding everything that was going on in the development and release cycles. I wanted to know what was supposed to happen before, during and after a software release. When I read this post about non-functional requirements from Leading Answers, it reminded me how much I have learned over the past several years as well as how much I enjoy seeing things from several points of view.
Some people (of course not all…) find comfort in understanding more aspects of their development projects.
If you think that might be you, consider getting out of your bubble to explore something a little different.
By: Melanie Cey