Wednesday, September 18, 2013

Software Engineering - Operational Requirements

Over the years I have observed that functional requirements always get higher priority than non-functional requirements (NFRs). Some of these NFRs cover the day-to-day operational aspects of any software product. I plan to write about scalability and availability patterns in the future, but here I am attempting to list some of the key operational patterns.

Automated Deployment: In recent times, continuous delivery of software has gained a lot of momentum. One of the requirements for delivering software quickly is to have the deployment process completely automated, ideally a single-click deployment. A precursor to automating deployment is to set up a continuous integration environment; tools like Hudson, Jenkins, and TeamCity can be used for this. Tools like Puppet and Chef can help not only with automated deployment but with infrastructure automation as well.
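As a sketch of what "single click" means in practice, the deployment can be reduced to an ordered list of commands that either all succeed or stop at the first failure. Everything below is a hypothetical placeholder (the test script, artifact name, and service are illustrative, not any particular project's setup):

```python
# Illustrative sketch of a single-click deployment pipeline.
# All paths, artifact names, and services here are hypothetical.
import subprocess

DEPLOY_STEPS = [
    ["git", "pull", "--ff-only"],                # fetch the latest release
    ["./run_tests.sh"],                          # gate the deploy on the test suite
    ["cp", "app.war", "/opt/tomcat/webapps/"],   # publish the artifact
    ["systemctl", "restart", "tomcat"],          # restart the service
]

def deploy(dry_run=True):
    """Run each step in order; stop on the first failure."""
    executed = []
    for step in DEPLOY_STEPS:
        if not dry_run:
            subprocess.run(step, check=True)     # raises on non-zero exit
        executed.append(" ".join(step))
    return executed
```

In a real setup the step list would live in a Jenkins job or a Puppet/Chef recipe rather than in application code, but the principle is the same: the whole sequence runs without a human typing commands.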

Monitoring and Alerting: A system should be able to raise alerts on events that identify an outage or a degradation of its performance. This means that all critical events should be identified and configured to generate an alert. These alerts should be not only tested but also validated in the production environment. Tools such as CA Unicenter, Zabbix, Nagios, and Ganglia can be used for this purpose. Most load balancers can also be used for application health checks.
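At its core, alerting is just comparing collected metrics against configured thresholds; a minimal sketch, assuming the metrics have already been gathered (tools like Nagios or Zabbix do the collection and notification for you):

```python
# Minimal sketch of threshold-based alerting. Metric names and limits
# are illustrative; a real system would page or email on each alert.
def check_thresholds(metrics, thresholds):
    """Return an alert message for every metric over its configured limit."""
    alerts = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            alerts.append(f"ALERT: {name}={value} exceeds threshold {limit}")
    return alerts
```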

Fault Tolerance: The key idea behind fault tolerance is that a system should be resilient and continue to function not only under normal conditions but also under unexpected circumstances. Failures are inevitable. Developing a fault-tolerant system also involves a change in mindset during the design, development, and testing phases. It means asking the question - "What are all the ways this can go wrong?" Every component of the system must be reviewed to determine its failure points, their impact, and how to recover from failure. An FMEA (Failure Mode and Effects Analysis) exercise can help with such an analysis. Most software systems have integration points and, as mentioned in Release It!, every integration point will eventually fail in some way, and you need to be prepared for that failure.
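One well-known way to be prepared for failing integration points is the Circuit Breaker pattern described in Release It!: after a run of consecutive failures, stop calling the remote system for a while and fail fast instead of letting every request hang. A minimal sketch (the thresholds are illustrative):

```python
# Sketch of the Circuit Breaker pattern: fail fast once an integration
# point has failed repeatedly, then allow a trial call after a cool-down.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed (healthy)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # trip the breaker
            raise
        self.failures = 0              # success closes the circuit again
        return result
```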

Throttling: This pattern refers to the idea that a system should protect itself and not harm other systems it is integrated with. This is achieved by setting a throttling parameter beyond which the system will reject additional requests. In essence -
  • Applications should throttle client requests
  • Databases must throttle total requests
  • Throttled events must be logged and monitored
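A minimal sketch of the idea - a fixed-window counter that rejects (and counts, so it can be logged and monitored) requests beyond a configured limit. The window size and limit are illustrative; production systems often use a token bucket instead:

```python
# Minimal sketch of request throttling with a fixed-window counter.
import time

class Throttle:
    def __init__(self, max_requests, window_seconds=1.0, clock=time.time):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock             # injectable for testing
        self.window_start = clock()
        self.count = 0
        self.rejected = 0              # throttled events should be logged/monitored

    def allow(self):
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            self.window_start, self.count = now, 0   # start a new window
        if self.count < self.max_requests:
            self.count += 1
            return True
        self.rejected += 1
        return False
```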

Timeouts: A system should protect itself by adding timeouts to all external connections. Well-placed timeouts provide fault isolation, preventing a problem in another system, subsystem, or device from impacting your system. Timeouts can also be relevant within a single application: any resource pool that blocks threads must have a timeout to ensure that threads are eventually unblocked whether resources become available or not. Related to timeouts is a mechanism for retries. Most explanations for a timeout involve problems in the network or the remote system that won't be resolved right away, so an immediate retry is likely to result in another timeout. In such cases it is recommended to queue the operation and retry it later.
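A sketch of that combination - bound the wait on a call, and if it times out, park the operation on a queue for a later attempt instead of retrying immediately. The timeout values are illustrative:

```python
# Sketch of a timeout with deferred retry: wait a bounded time for the
# call, and queue it for later instead of retrying right away.
import queue
from concurrent.futures import ThreadPoolExecutor, TimeoutError

retry_queue = queue.Queue()            # operations parked for a later attempt

def call_with_timeout(func, timeout_seconds, *args):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        retry_queue.put((func, args))  # retry later, not immediately
        return None
    finally:
        pool.shutdown(wait=False)      # don't block the caller on the slow call
```

A separate worker would drain `retry_queue` on a schedule, ideally with some backoff between attempts.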

Logging: A system informs about its activity through logging while not impairing its own operation. Logging must ensure not only that critical events are logged but also that key metrics are available to the infrastructure that consumes and reports on these events. Things to consider -
  • Decouple the application from logging resources
  • Application processing should continue even when logging resources are unavailable; this can be achieved through asynchronous logging
  • Automate log rotation and archiving
  • Error codes and their descriptions are defined and documented
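The standard library already supports the decoupling described above: with a queue between the application and the real handler, application threads only enqueue records, and a listener thread writes them to the (possibly slow) sink. A minimal sketch using Python's `logging.handlers`:

```python
# Sketch of asynchronous logging: the application thread only puts
# records on an in-memory queue; a listener drains them to the real sink.
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)                      # unbounded buffer
queue_handler = logging.handlers.QueueHandler(log_queue)

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(queue_handler)                 # app is decoupled from the sink

# The listener thread writes to the actual (possibly slow) handler.
sink = logging.StreamHandler()
listener = logging.handlers.QueueListener(log_queue, sink)
listener.start()

logger.info("order %d processed", 42)            # returns immediately
listener.stop()                                  # flush remaining records
```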

Instrumentation: A system should expose key system and application metrics, as well as errors, which can be used for tracking and alerting. Instrumentation also helps in troubleshooting issues and root cause analysis. It identifies the highs and lows of a system, from which historical patterns can be derived, and its data can be used for capacity planning too. Appropriate thresholds can be set on these metrics so that proactive action can be taken when a critical limit is reached. Two broad categories of instrumentation are
  1. System Metrics: These are metrics for infrastructure components such as hardware, OS, and databases - for example CPU, memory, disk space, network utilization, and file descriptors utilized.
  2. Application Metrics: The metrics here could include response times, error counts, heap/shared memory usage, threads, inbound/outbound connections etc
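Application metrics can start as simply as in-process counters and timings that a monitoring agent scrapes. A minimal sketch (the metric names are illustrative):

```python
# Minimal sketch of application-level instrumentation: counters and
# per-operation timings, exposed as a snapshot for a monitoring system.
import time
from collections import defaultdict

class Metrics:
    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)     # durations per operation, in seconds

    def incr(self, name, amount=1):
        self.counters[name] += amount

    def time(self, name, func, *args):
        """Run func, recording how long it took under the given name."""
        start = time.perf_counter()
        try:
            return func(*args)
        finally:
            self.timings[name].append(time.perf_counter() - start)

    def snapshot(self):
        """Expose current values, e.g. for a threshold-based alert check."""
        return {
            "counters": dict(self.counters),
            "avg_ms": {n: 1000 * sum(t) / len(t) for n, t in self.timings.items()},
        }
```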

Testing: Apart from manual and automated application testing, additional system-level tests should be executed. These are -
  • Performance Tests
  • Load Tests
  • Soak Tests
  • Destructive Tests
  • Testing alerts, timeouts and throttling
This is by no means a complete list, but these are key non-functional requirements that should be taken into consideration in the software development life cycle.
