The project I've been working on for the last year and a half has finally gone into production. That is, it was already running in production, but it wasn't yet steering the big furnaces it was meant to control. (These furnaces heat big slabs of steel so they can be rolled down into thin sheets.)
I'm proud to say that up until now, the switch from the old, Fortran-based system to the new, .NET-based system has gone without major glitches. Apart from some remarks on the UI by the end users, there were no errors or exceptions, and the furnace is happily heating up steel. Of course, this has been a collaborative effort, involving several other developers and applications. But here's an overview of the steps I took to achieve this smooth transition. It's no guarantee that nothing will go wrong, of course, but it can serve as a small guide to worthwhile steps for critical applications.
I code test-driven. This gives me a cleaner, loosely coupled design, making it easier to refactor or re-architect throughout the project. I changed several core mechanisms with limited effort and few problems.
For example, in the beginning, input came from several external sources, all at the same time, which occasionally caused race conditions. I first changed to a locking approach, and finally to a queuing mechanism. All the underlying parts could be reused and remained under test.
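The real system is .NET, so here is only a language-agnostic sketch in Python of the queuing idea: instead of letting every source mutate shared state concurrently, each source enqueues its messages and a single consumer drains them one at a time, so no locks are needed. All names here are invented for illustration.

```python
import queue
import threading

def process_inputs(sources):
    """Funnel messages from several concurrent sources through one queue.

    Each source runs on its own thread and only ever calls q.put();
    the shared result list is built by a single consumer afterwards,
    so updates are serialized and cannot race.
    """
    q = queue.Queue()
    threads = [
        threading.Thread(target=lambda msgs=msgs: [q.put(m) for m in msgs])
        for msgs in sources
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    state = []
    while not q.empty():
        state.append(q.get())  # single consumer: no race conditions
    return state
```

The interleaving of messages from different sources is arbitrary, but every message arrives exactly once, which is the property the queuing mechanism buys you over ad-hoc locking.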
An extra bonus is that 75% of the code (give or take) is verified to be working. When refactoring, tests did break. If I hadn't had these tests, I wouldn't have known until we were in production, or maybe not at all.
I also didn't limit myself to unit tests. There are integration tests (mainly database-related), but also end-to-end tests. For these, I introduced the great SpecFlow framework. It was also handy for simulating situations we couldn't reproduce in reality (because we can't ask the company to change the way they run the furnace just to test some piece of software).
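SpecFlow scenarios are written in Gherkin, so a test for one of those hard-to-reproduce situations might read something like the following. This is a hypothetical example, not one of the project's actual feature files:

```gherkin
Feature: Furnace setpoint fallback

  Scenario: Temperature measurement goes missing
    Given the furnace is heating a slab
    And the temperature measurement stops arriving
    When the next control cycle runs
    Then the fallback temperature is used
    And a warning is logged
```

Behind each step sits a binding method that drives the real application, which is what makes these scenarios true end-to-end tests rather than documentation.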
Don't trust external sources
The application controlling these furnaces receives 20+ message types from 6+ sources. Any of these inputs could contain invalid data, go missing, etc. If these situations lead to the wrong values being sent to the furnaces, costs could rise considerably. If they lead to NullReferenceExceptions, nothing might be sent at all, and the steel slabs would exit the furnace incompletely heated.
This would all lead to support calls in the middle of the night, which I want to avoid :)
So, where possible, I validate external input and refuse or correct it where appropriate. Where necessary, fallback values are used.
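As a sketch of what "refuse or correct" can look like (in Python for brevity; the function name, the ranges, and the fallback value are all invented, not taken from the real system):

```python
FALLBACK_TEMP = 1200.0  # hypothetical safe default, degrees Celsius

def sanitize_temperature(raw, low=900.0, high=1350.0):
    """Return a usable temperature, never raise.

    Garbled or missing input falls back to a safe default, and so does
    any value outside the physically plausible range.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return FALLBACK_TEMP  # missing or unparsable -> fallback
    if not (low <= value <= high):
        return FALLBACK_TEMP  # implausible reading -> fallback
    return value
```

The point is that every external value passes through a function like this before it can influence a command to the furnace, so bad input degrades to a conservative default instead of an exception.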
Make the system heal itself
There are cases where invalid data may come through. Or the application or the server might crash and restart, in which case we could have missed some messages. In short, there are situations where the system could end up in an invalid state.
You can't foresee all of these situations, but running the system in a QA environment for some time should surface several possible problems. Instead of just fixing them by hand, I tried to find ways for the system to fix itself automatically, as fast as possible.
A request to an external service goes unanswered? We can't stop the furnace, so use fallback values, and keep requesting until we do get a response (or it's no longer necessary).
Missing messages could pose problems? Use queues and/or check with the external systems at regular intervals to see whether our system is still in sync.
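The fallback-and-retry behavior can be sketched like this (again in Python, with invented names; the real mechanism lives in the .NET control loop):

```python
import itertools

def setpoints(request_service, fallback):
    """Yield one setpoint per control cycle, forever.

    Each cycle asks the external service again. While it stays silent
    (returns None), the fallback value keeps the furnace running; as
    soon as a real answer arrives, that answer is used instead.
    """
    for _ in itertools.count():
        answer = request_service()
        yield answer if answer is not None else fallback
```

The system never blocks waiting for the service: every cycle produces a value, and the self-healing is simply that the next successful response replaces the fallback automatically.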
Of course, these situations are also logged. But it means I can fix them when I arrive in the morning, instead of scrambling out of bed at night.
Run the system in production ASAP
A slight nuance here: while we had been running in production for about five months, the furnaces did nothing with the commands our system sent out. But it allowed us to run alongside real data, on the real servers, in real situations, etc.
It also made the switch easier. There were some flags to be set, and we were good to go.
This all reads as if everything went fine. It didn't. There were some remarks from the end users, our switch was postponed by one month, and for a few minutes we were sending duplicate commands. But all in all, I'm very happy with how it went. Plus, I wasn't woken up by a support call in the middle of the night.
Running a project like this also requires some infrastructure and organization:
- I have enough freedom to introduce new, valuable technologies such as SpecFlow.
- There is a nice system set up for rapid and easy deployments. With just a few clicks, I can deploy a new release. At times, I was deploying two to four times a day.