Wednesday, September 18, 2013

Software Engineering - Operational Requirements

Over the years I have observed that functional requirements always get higher priority than non-functional requirements (NFRs). Some of these NFRs cover the day-to-day operational aspects of a software product. I plan to write about scalability and availability patterns in the future; here I attempt to list some of the key operational patterns.

Automated Deployment: In recent times, continuous delivery of software has gained a lot of momentum. One of the requirements for delivering software quickly is a completely automated deployment process, ideally a single-click deployment. A precursor to automating deployment is setting up a continuous integration environment. Tools like Hudson, Jenkins and TeamCity can be used for this. Tools like Puppet and Chef help not only with automated deployment but with infrastructure automation as well.

Monitoring and Alerting: A system should be able to alert on events that indicate an outage or a degradation of its performance. This means that all critical events should be identified and configured to generate an alert. These alerts should not only be tested but also validated in the production environment. Tools such as CA Unicenter, Zabbix, Nagios and Ganglia can be used for this purpose. Most load balancers can also be used for application health checks.

Fault tolerance: The key idea behind fault tolerance is that a system should be resilient and continue to function not only under normal conditions but also under unexpected circumstances. Failures are inevitable. Developing a fault-tolerant system also involves a change in mindset during the design, development and testing phases. It means asking the question: "What are all the ways this can go wrong?" Every component of the system must be reviewed to determine its failure points, their impact and the recovery from failure. An FMEA (failure mode and effects analysis) exercise can help with such an analysis. Most software systems have integration points and, as mentioned in Release It!, every integration point will eventually fail in some way, and you need to be prepared for that failure.

Throttling: This pattern refers to the idea that a system should protect itself and not harm the other systems it is integrated with. This is achieved by setting a throttling limit beyond which the system rejects additional requests. In essence:
  • Applications should throttle client requests
  • Databases must throttle total requests
  • Throttled events must be logged and monitored
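
As a rough illustration of the points above, here is a minimal Python sketch of an application-level throttle. The class name `Throttle`, the `try_handle` method and the `rejected` counter are hypothetical names chosen for this example; a real system would more likely rely on whatever limiter its framework or load balancer provides.

```python
import logging
import threading

logger = logging.getLogger("throttle")


class Throttle:
    """Hypothetical limiter: rejects work once `limit` requests are in flight."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.rejected = 0  # counter that a monitoring system could scrape

    def try_handle(self, handler, *args):
        """Run handler(*args) if a slot is free; return (accepted, result)."""
        # Non-blocking acquire: fail fast instead of making the caller wait.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
                count = self.rejected
            # Throttled events are logged so that they can be monitored.
            logger.warning("request throttled (rejected so far: %d)", count)
            return False, None
        try:
            return True, handler(*args)
        finally:
            self._slots.release()
```

A caller would wrap each incoming request, for example `accepted, result = throttle.try_handle(process, request)`, and translate `accepted == False` into whatever "busy, try again later" response its protocol offers.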

Timeouts: A system should protect itself by adding a timeout to all external connections. A well-placed timeout provides fault isolation, which prevents other systems, subsystems or devices from impacting your system. Timeouts can also be relevant within a single application. It is essential that any resource pool that blocks threads has a timeout, so that threads are eventually unblocked whether resources become available or not. Related to timeouts is a mechanism for retries. Most causes of a timeout involve problems in the network or the remote system that won't be resolved right away. Immediate retries may not be helpful, as they may simply result in another timeout. In such cases it is recommended to queue the operation and retry it later.
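
The resource pool and deferred retry ideas can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `ResourcePool`, `run_or_defer` and the `retry_later` queue are hypothetical names, and the in-memory queue stands in for whatever durable retry mechanism a real system would use.

```python
import queue


class ResourcePool:
    """Hypothetical fixed-size pool whose acquire() never blocks forever."""

    def __init__(self, resources):
        self._free = queue.Queue()
        for resource in resources:
            self._free.put(resource)

    def acquire(self, timeout_s):
        try:
            # The timeout guarantees the calling thread is eventually unblocked.
            return self._free.get(timeout=timeout_s)
        except queue.Empty:
            raise TimeoutError(f"no resource free within {timeout_s}s") from None

    def release(self, resource):
        self._free.put(resource)


# Operations parked for a later attempt (assumption: drained by a scheduler).
retry_later = queue.Queue()


def run_or_defer(pool, operation, timeout_s=0.05):
    """Run operation(resource); on timeout, park it instead of retrying at once."""
    try:
        resource = pool.acquire(timeout_s)
    except TimeoutError:
        retry_later.put(operation)
        return None
    try:
        return operation(resource)
    finally:
        pool.release(resource)
```

The design choice here is that a timeout does not trigger an immediate retry; the operation is handed to `retry_later`, and something else (a timer, a background worker) decides when to try again.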

Logging: A system reports on its activity through logging, without impairing its operation. Logging must ensure not only that critical events are logged but also that key metrics are available to the infrastructure that consumes and reports on these events. Things to consider:
  • Decouple the application from logging resources
  • Application processing continues when logging resources are unavailable. This can be achieved through asynchronous logging.
  • Automate log rotation and archiving
  • Error codes and their descriptions are defined and documented
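
Asynchronous logging, as mentioned in the list above, can be sketched with the queue-based handlers in Python's standard `logging.handlers` module. The helper name `build_async_logger` is hypothetical; `QueueHandler` and `QueueListener` are the standard library classes.

```python
import logging
import logging.handlers
import queue


def build_async_logger(name, slow_handler):
    """Return (logger, listener); log calls only enqueue, a thread does the I/O.

    `slow_handler` is any logging.Handler whose output may block
    (a file, a socket, a remote collector).
    """
    log_queue = queue.Queue(-1)  # unbounded, so the application never waits
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from the root handlers
    # The application thread only pays for a queue put.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    # The listener thread drains the queue into the slow handler.
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()
    return logger, listener
```

A typical use would be `logger, listener = build_async_logger("orders", logging.StreamHandler())`, followed by `listener.stop()` at shutdown so that queued records are flushed.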

Instrumentation: A system should expose key system and application metrics, as well as errors, which can be used for tracking and alerting. It also helps in troubleshooting issues and in root cause analysis. Instrumentation helps identify the highs and lows of a system, from which historic patterns can be derived. Data from the instrumentation can be used for capacity planning too. Appropriate thresholds can be set on these metrics so that proactive action can be taken when a critical limit is reached. Two broad categories of instrumentation are:
  1. System Metrics: These are metrics for infrastructure components such as hardware, OS and databases. They cover CPU, memory, disk space, network utilization and file descriptors in use.
  2. Application Metrics: The metrics here could include response times, error counts, heap/shared memory usage, threads, inbound/outbound connections, etc.
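
To make the idea of application metrics with thresholds concrete, here is a small Python sketch. The `Metrics` class and its method names are hypothetical and kept in memory only; a real system would export these values to its monitoring tool.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-memory registry for counters, timings and thresholds."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)  # metric name -> durations in seconds
        self.thresholds = {}

    def incr(self, name, by=1):
        self.counters[name] += by

    @contextmanager
    def timed(self, name):
        """Record how long the wrapped block takes, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)

    def set_threshold(self, name, limit):
        self.thresholds[name] = limit

    def breaches(self):
        """Names of counters that have reached their configured threshold."""
        return sorted(
            name
            for name, limit in self.thresholds.items()
            if self.counters.get(name, 0) >= limit
        )
```

An alerting job could call `breaches()` periodically and raise an alert for every name it returns, while the collected timings feed response-time reports and capacity planning.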

Testing: Apart from manual and automated application testing, additional system-level tests should be executed. These include:
  • Performance Tests
  • Load Tests
  • Soak Tests
  • Destructive Tests
  • Testing alerts, timeouts and throttling

This is by no means a complete list, but these are key non-functional requirements that should be taken into consideration in the software development life cycle.

Wednesday, August 28, 2013

Software Architecture - Message Queues

I strongly believe that a key aspect of software engineering is applying proven architectural patterns. Lately I have wanted to write about some of these patterns, which address many engineering concerns in enterprise-scale software. One such pattern is asynchronous processing through message queues.

There is a ton of information available on the internet on the usage of message queues. I have been in many software design discussions where the topic of message queues came up and someone asked: "Why should we use message queues?" Here is my attempt to address this question.

Asynchronous Interaction: Often, messages or events don't need to be processed immediately in real time. This means that critical processing can be done in real time while non-critical processing is delayed, which can reduce the response time of the synchronous path. Message queues enable asynchronous processing by allowing messages to be put into a queue and processed later. Notifications, such as sending e-mails, are one such example.
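
The e-mail example can be sketched in Python with an in-process queue standing in for a real message broker. Everything named here (`place_order`, `notification_worker`, the `sent` list that replaces an actual mail call) is a hypothetical illustration.

```python
import queue
import threading

notifications = queue.Queue()  # stands in for a broker queue
sent = []                      # stands in for the actual e-mail call


def place_order(order_id):
    """Critical path: accept the order, enqueue the e-mail, return at once."""
    notifications.put({"type": "order_confirmation", "order_id": order_id})
    return f"order {order_id} accepted"


def notification_worker():
    """Non-critical path: processes queued messages some time later."""
    while True:
        message = notifications.get()
        try:
            if message is None:  # sentinel used to stop the worker
                break
            sent.append(f"email for order {message['order_id']}")
        finally:
            notifications.task_done()
```

The caller of `place_order` gets its response without waiting for the e-mail; the worker runs in its own thread (or, with a real broker, in its own process) and catches up at its own pace.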

Decoupled Architecture: Decoupled software components allow each component to evolve and scale without impacting the others. A message queue forms an intermediary between two components that agree on a data-based (message type) interface.

Reliability through Guaranteed Delivery: Messaging infrastructure can provide guaranteed delivery of messages, typically by persisting them until a consumer acknowledges them. In the event of failures, messages are not lost and can be recovered and reprocessed.

Scalability: Message queues decouple the producer and consumer components. This allows scaling up the rates at which messages are added to the queue or processed. By adding new processes, the system can scale without requiring additional code changes.

Resiliency: Message queues provide isolation between components. This means that the entire system doesn't go down if some parts fail. Systems can be designed such that the critical components continue processing in the event of failures in other parts of the system.

Throttling: In order to protect the system from getting overloaded, it is necessary to throttle request processing. Typically, when a throttle limit is reached, the application rejects further requests. This may not be acceptable in a high-availability (HA) system. In such cases, requests can be queued when the throttle limit is reached. The queued requests can then be processed when the load on the system drops.
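
A minimal Python sketch of this "queue instead of reject" behaviour follows. The `QueueingThrottle` class is hypothetical and single-threaded for clarity; in practice the backlog would live in a message broker so that it survives restarts.

```python
from collections import deque


class QueueingThrottle:
    """Hypothetical throttle that queues excess requests instead of rejecting."""

    def __init__(self, limit):
        self.limit = limit
        self.in_flight = 0
        self.backlog = deque()

    def submit(self, request):
        """Return 'started' if under the limit, otherwise queue the request."""
        if self.in_flight < self.limit:
            self.in_flight += 1
            return "started"
        self.backlog.append(request)
        return "queued"

    def finish(self):
        """Mark one request done; return the queued request that takes its slot."""
        if self.in_flight == 0:
            raise RuntimeError("finish() called with nothing in flight")
        if self.backlog:
            # The freed slot passes straight to the oldest queued request.
            return self.backlog.popleft()
        self.in_flight -= 1
        return None
```

With a limit of 2, a third `submit` returns `"queued"`, and the next `finish()` hands that request back to the caller to start processing.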

Throughput: Message queues allow processes to execute concurrently. This means that system throughput can be tuned by adding processes. However, there is some tension between throughput and reliability. Message persistence is often turned off to increase message throughput, at the cost of losing in-flight messages if the broker fails.

Ordering: Driven by business needs, many applications require messages to be processed sequentially. Most message brokers support message ordering through server-side or client-side mechanisms.

Event Driven Processing: Messaging frameworks provide a mechanism to implement event-driven architectures. Such systems typically consist of event emitters (agents) and event consumers (sinks). Messaging frameworks fit naturally into this model.
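
The emitter and sink model can be illustrated with a tiny in-process event bus in Python. The `EventBus` class is a hypothetical stand-in: it delivers events synchronously, whereas a messaging framework would deliver them through a broker and usually asynchronously.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process bus connecting emitters to sinks by event type."""

    def __init__(self):
        self._sinks = defaultdict(list)

    def subscribe(self, event_type, sink):
        """Register a callable that consumes events of the given type."""
        self._sinks[event_type].append(sink)

    def emit(self, event_type, payload):
        """Deliver payload to every sink; return how many sinks received it."""
        sinks = self._sinks.get(event_type, [])
        for sink in sinks:
            sink(payload)
        return len(sinks)
```

The emitter only knows the event type, not who consumes it, so new sinks (auditing, notifications, reporting) can be added without touching the emitting code.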

This is not a complete list, but it covers some of the key benefits of message-based solutions.

Saturday, March 16, 2013

Agile software development manager

Agile software development methodologies like Scrum have become mainstream in the industry. Scrum is a framework that defines a set of roles and events which the team follows, with the goal of delivering working software at the end of each short iteration. The framework defines 3 roles:
  1. Product Owner
  2. Scrum Master
  3. Team

The team consists of developers, testers, business analysts, architects and DBAs. The Product Owner is responsible for the vision and manages the priorities of the product features. The Scrum Master is the owner of the process that the team follows. Although Scrum defines these roles, it doesn't say anything about the role of Development/Engineering Managers and Project Managers. The role of a project manager in an agile environment is described well in this book. Being a development manager, I wanted to have a clear understanding of a manager's role in an agile environment. Many have talked about it here and also in books such as Management 3.0. Based on my experience and understanding, I have attempted to describe the responsibilities of a development manager below.

Delivery of Software Releases: Depending on the organizational structure, accountability for software releases lies with either the project manager or the development manager. The development manager ensures that the Scrum team follows the release plan and that the release is ready to be shipped/deployed for customers. The project managers help with budgeting, risk management, milestone tracking and coordination of the releases. Although the Scrum Master is responsible for removing bottlenecks, that person may need help from management. The manager needs to provide such help.

Staffing: The development manager makes sure that the development team is fully staffed with the right people with the right skill sets. I have found this to be truly challenging, given how important it is to have the right team members. He/she must work with HR and the recruitment department to ensure that a consistent hiring process is followed.

Manage Environment and Relationships: The agile movement has changed how we develop software. It has moved us from the traditional command-and-control approach to the concept of a self-organizing team. The team commits to a sprint goal and tries to achieve it by the end of the sprint. The team decides how it wants to achieve the sprint goal. In this setting, the development manager ensures that a safe and fun environment exists in which creativity and innovation can emerge. He/she manages conflict and ensures that effective collaboration and communication take place within and outside the team. The development manager works with the Scrum Master to build a team where people trust each other and enjoy working together.

Manage processes and practices: The development manager is responsible for instituting processes that foster better collaboration and visibility within and outside the team. In many organizations, the Scrum team is responsible for getting the software ready to be shipped, but there are separate teams for software delivery, training, operations and support. In that case it is important to have clearly defined roles, responsibilities and processes in place. A RACI matrix helps in defining roles and responsibilities across multiple teams.

Coaching and Performance Improvements: As mentioned in the risk management book by Tom DeMarco, one of the risks to any software project is people turnover. Two major kinds of factors contribute to turnover: factors that push people out and factors that pull people to other companies. There is not much that can be done about the latter, but the push factors can be controlled. The manager needs to invest time in coaching team members and helping them with their career paths. He/she needs to actively support and encourage team members, share career opportunities, describe what it takes to get promoted, offer candid and actionable feedback and lead by example. The development manager also evaluates performance and provides input for improvement.

Technology Radar: The development manager keeps in touch with the changing technology landscape. He/she doesn't need to be an expert in the technology (that is better suited to architects and leads) but should be comfortable with the current and upcoming technologies that would impact the product. The manager tries to ensure that team members are aware of newer technologies that could solve business problems.

Reports and Metrics: People frown when the subject of metrics comes up. This is mainly because metrics don't get used properly in organizations. Metrics, when used properly, provide a means to make continuous improvements. This is well described in this article about the use of metrics. The development manager identifies essential metrics and reports which can be used not only to make decisions but also to drive continuous improvement.

This is by no means a complete list. However, at a high level it covers the areas on which the development manager should focus, especially in an agile environment.