This is the second post in a series in preparation for my presentation at the Lean Kanban Netherlands 2012 conference, about Enhanced Risk Management in Kanban via the Theory of Constraints, that I will deliver on October 26 in Utrecht. As described in the previous post, Critical Chain Project Management in the Theory of Constraints, the purpose of this series is propaedeutic to understanding how the ideas of the Theory of Constraints can be applied in contemporary software processes, and in particular to the Kanban Method. This series will provide some foundational knowledge in the areas of:

  • Schedule Management
  • Buffer Management
  • Risk Management
  • Root Cause Analysis
  • People Factors
  • Continuous Improvement

The previous post introduced Schedule Management; in this post we will learn more about Buffer Management and Risk Management.

Finding Herbie

The principal idea of the Theory of Constraints is the simple concept that there is always one constraint that limits the throughput of any system (the “weakest link of the chain”). It is not a surprise that all risk management practices revolve around finding and managing the constraint too.

“Herbie” was a character in “The Goal” [GOLDRATT-1992], the business novel where the Theory of Constraints was first described. Herbie was an overweight little boy, and he was the reason a line of young scouts moved slowly on a hike. Herbie’s claim to fame is that he represented the first constraint that the main character of the novel, Alex Rogo, managed to identify.

Since then, “Finding Herbie” is a colloquial way of saying: “Let’s find the constraint.”

Quite obviously, finding the constraint is very important in the TOC; but it is not always obvious how you can find your Herbie! Using a Kanban board can help identify constraints in your work flow, but it is of limited value in finding constraints in your overall process. We will discover how to do the latter.

The Five Focusing Steps and Kanban

In manufacturing and in other physical processes, constraints are easy to identify, because (typically) there will be work in process piling up in front of the constraint, and voids after the constraint. In processes where most activities are immaterial (like software development), the very nature (immaterial!) of the process makes it difficult to get similar visible clues that give away the constraint.

Kanban for Software, which has recently become successful, is a good way to make the work flow visible. Kanban makes it easier to find the piles of work in process, and to act once they are identified. Limiting work in process is another means of making such piles more immediately recognizable. It is only natural to apply the Theory of Constraints’ Five Focusing Steps (5FS) precisely to this situation.

For an introduction to the 5FS, see the earlier post: Theory of Constraints and Software Engineering. For an example of this kind of thinking, where the 5FS are applied to a Kanban board, see for instance [CHARLTON-2011].

Find the Constraint in the Process, and Not Only in the Work Flow

Unfortunately, this way of looking at a Kanban board (with the intent of finding the bottlenecks and the constraints in the work flow) is very simplistic and, more often than not, misleading. The “piles” that form on a Kanban board do not necessarily reveal the real constraints in your process.

To find the real process constraint, a more systematic approach is needed.

Note that while the Critical Chain is indeed the constraint in the project network, it is not necessarily the constraint in the process that the organization employs to produce software. The process constraint can have a much bigger impact, with larger consequences for the long-term bottom line (“The Goal”), than the constraint in a single project. Similarly, the piles of work revealed on a Kanban board do not necessarily reveal the constraint of the process.

Identifying the constraint in the project network will allow you to deliver that single project faster; but identifying the constraint in your overall process will allow you to improve the way you deliver all your software projects. It is a stepping stone towards process improvement.

The approach we are about to investigate will reveal the process constraint; but in order to do so, we must know more about how TOC deals with Buffer Management. Buffer Management is the second foundational knowledge topic we will cover, as summarized in the following figure (that was presented in the previous post).

Elements of CCPM

Buffer Management in Critical Chain Project Management

We have already seen how CCPM scheduling places a single buffer at the end of the project network. We have seen how important it is to appropriately size the buffer. We have seen how this results in project plans that are much shorter than comparable plans made with the Critical Path Method. Now we will focus on how the buffer can be used in practice.

Buffer Consumption

Buffer management plays a critical role, and the key is the concept of Buffer Consumption or Buffer Burn Rate. The Buffer Burn Rate is defined by [SULLIVAN-2012] as follows:

The rate at which the project buffer is being consumed […] The rate is calculated as the ratio of the percent of penetration into the project buffer and percent of completion of the critical chain.

High buffer consumption is a sure sign that something is wrong. Fortunately, early warning signals can be inferred.

An example given by Sullivan is this:

If the project buffer is 40% penetrated and the critical chain is only 20% complete, the buffer burn rate is 40/20 = 2.0. The project manager has a warning that there is a problem, and if it continues, it will possibly jeopardize the project due date.
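Sullivan's ratio can be sketched in a few lines of Python. This is only a minimal illustration of the definition quoted above; the function name and the guard against a zero completion value are assumptions made for the example, not part of [SULLIVAN-2012].

```python
def buffer_burn_rate(buffer_penetration_pct: float, chain_completion_pct: float) -> float:
    """Ratio of percent buffer penetration to percent critical chain completion."""
    if chain_completion_pct <= 0:
        # Assumption for this sketch: the ratio is undefined before any
        # critical chain work has been completed.
        raise ValueError("critical chain completion must be greater than zero")
    return buffer_penetration_pct / chain_completion_pct


# Sullivan's example: buffer 40% penetrated, critical chain 20% complete.
rate = buffer_burn_rate(40, 20)
print(rate)  # 2.0
```

A ratio above 1.0 means the buffer is being consumed faster than the critical chain is being completed.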

Buffer Usage

The project buffer is a safety instrument that is constantly monitored. It protects the project from disruptions that might happen when the activities on the Critical Chain are performed. (Note: Feeding Buffers also protect the Critical Chain, but from problems in a non-Critical Chain path.) It protects the due date. It is used to coordinate resources on the Critical Chain (and have them ready when needed), and to prioritize work. Any task consuming the buffer is given the highest priority. [WOEPPEL-2005] describes it clearly:

Project execution is THE most important part of achieving success […] Monitoring and responding to the condition of the buffers is the key to that. Rather than responding to individual tasks, the project team responds to the condition of the buffers. […] The [buffer burn] ratio tells us when a project is in danger of not being completed on time. […] By identifying which tasks are creating the highest buffer burn ratio, the project manager knows which tasks to focus on right now.

Monitoring the buffer is an operational activity performed during the project’s execution. The Critical Chain is used for planning the project; but the buffer is used for managing execution of the project. The buffer prompts self-expediting, assigns priority of resources, and solicits management actions when necessary.

The ability to have a running, leading status indicator — the buffer consumption ratio — is one of the strongest contributions of CCPM. Note well that this indicator does not report the amount of work done. This is very different from most other project management methods, which tend to report project status in terms of work done (“We are 90% done!” “Yeah! Right!”).

This indicator represents work done in relation to how much time has been set aside (the buffer) to absorb unforeseen problems.

The extent to which this margin is consumed is an indication of the project’s health or illness. Consequently, there are many operational advantages, which all enable improved risk management.

For example, one such operational advantage relates to frequency of reporting. [LEACH-2004] observes: “for buffer management to be fully useful, the buffer monitoring time must be at least as frequent as the shortest task duration.” If this is granted, then status reporting can be much more frequent than in traditional project management. If the shortest planned task is measured in hours, then reporting can be made on an hourly basis, rather than weekly as is typical in most project environments. This can give even earlier signals about problems. Consider how this relates to the recent move towards “Continuous Deployment”!

Buffer Zones, Thresholds and Signals

The project buffer is divided into three zones. The zones are often represented in Green, Yellow and Red, like a traffic light. (You can also see this in the figure presented in the previous post, where the project buffer shows the three colored zones.) Typically each zone is sized to one third of the buffer, though relative sizes may be changed dynamically in more advanced applications. (Naturally, this relates to the topic of how to appropriately size a buffer.)

The three zones give a more granular control about knowing when to act. In [COX-2010], C. Spoede Budd and J. Cerveny offer a crucial insight: the three zones are representative, respectively, of Expected Variation, Normal Variation, and Abnormal Variation. Monitoring buffer consumption with respect to the three zones gives visible and actionable signals:

  • Expected Variation (Green Zone): Everything is working “according to plan.” The green zone absorbs inherent task uncertainty. No special action is required. In fact any interference in this zone is most likely counter-productive, as it would produce what [DEMING-1993] referred to as “tampering”: a waste of productive time that causes loss of focus.

  • Normal Variation (Yellow Zone): Everything is under control; but prepare for action. The yellow zone absorbs the inherent uncertainty in task duration prediction. Time is consumed to cover task overages: prepare plans to recover lost time, but take no action yet (to avoid tampering). Focus on understanding what is causing time consumption, and what can be done about it.

  • Abnormal Variation (Red Zone): When the red zone is reached, you must act. Implement the plans prepared while buffer consumption was in the yellow zone. Most likely, unique events outside the normal course of the project’s operations have caused the problem.
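The three zones and their signals can be sketched as a simple lookup. The snippet below assumes the typical sizing of one third of the buffer per zone; the function name, the parameter names and the short action strings are illustrative assumptions that merely paraphrase the list above.

```python
def buffer_zone(consumed_pct: float,
                green_limit: float = 100 / 3,
                yellow_limit: float = 200 / 3) -> str:
    """Classify buffer consumption (0-100%) into a zone.

    The default limits assume three equally sized zones; advanced
    applications may size them differently.
    """
    if consumed_pct < green_limit:
        return "green"
    if consumed_pct < yellow_limit:
        return "yellow"
    return "red"


# Paraphrase of the signals described in the text (wording is ours).
ACTIONS = {
    "green": "no action; interfering would be tampering",
    "yellow": "investigate and prepare recovery plans, but do not act yet",
    "red": "act: implement the plans prepared in the yellow zone",
}

for consumed in (10, 50, 80):
    zone = buffer_zone(consumed)
    print(f"{consumed}% consumed -> {zone}: {ACTIONS[zone]}")
```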

Dividing the project buffer into three zones provides a powerful tool for anticipating and acting on risks. Other project management practices don’t detect the problems until later. [LEACH-2004] concludes:

Through this mechanism, buffer management provides a unique anticipatory project-management tool with clear decision criteria.

Buffer Fever Charts

To visualize the status of a buffer, you typically use a Fever Chart, where you plot the buffer consumption (as a percentage) against the project completion (again, as a percentage). For instance, you would have a diagram that looks like this:

Buffer Fever Chart

In the figure, the blue line represents the progress of the project’s execution. You can see that it started off in the yellow zone, and ran into problems when it penetrated into the red zone. Problems were addressed, and execution went back into the green zone. Towards the end, the project went back into the yellow zone, and finished only slightly overdue.

(Note: this figure is only illustrative. The actual placement and slope of the two threshold lines depend on how you have sized the buffer and the three zones.)
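On a fever chart the zone boundaries are lines, so the zone of a status point depends on both coordinates. The following is a rough sketch under stated assumptions: the two threshold lines are straight, and their end points (10% to 70% and 30% to 90%) are invented for illustration only. As noted above, real values depend on how the buffer and zones have been sized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdLine:
    """A straight threshold line on the fever chart (hypothetical shape)."""
    at_start: float  # buffer consumption (%) at 0% project completion
    at_end: float    # buffer consumption (%) at 100% project completion

    def value_at(self, completion_pct: float) -> float:
        # Linear interpolation between the two end points.
        return self.at_start + (self.at_end - self.at_start) * completion_pct / 100.0


# Hypothetical threshold lines, chosen only to make the example concrete.
GREEN_YELLOW = ThresholdLine(at_start=10.0, at_end=70.0)
YELLOW_RED = ThresholdLine(at_start=30.0, at_end=90.0)


def fever_chart_zone(completion_pct: float, consumed_pct: float) -> str:
    """Return the zone of one status point (completion %, buffer consumed %)."""
    if consumed_pct < GREEN_YELLOW.value_at(completion_pct):
        return "green"
    if consumed_pct < YELLOW_RED.value_at(completion_pct):
        return "yellow"
    return "red"


# The same 50% buffer consumption reads differently early and late in a project.
print(fever_chart_zone(20, 50))  # early in the project
print(fever_chart_zone(90, 50))  # near the end
```

With these assumed lines, consuming half the buffer is alarming at 20% completion but unremarkable at 90% completion, which is the point of plotting consumption against completion.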

Common Cause and Special Cause Variation

It is here that we see the connection with common and special cause variation. [DEMING-1982] identifies Common Cause and Special Cause variation: common cause variation is inherent in the process itself, while special cause variation has external origins.

In simple words: “common cause” variation is caused by something that is expected: like your being less efficient on a Monday morning. It does not come as a surprise. You know it is there. You try to deal with it. “Special cause” variation is caused by special events: like you being unable to work because you broke a leg. It is a surprise. You have to manage an emergency.

One of the most devastating errors in risk management is confusing the two kinds of risks. The Theory of Constraints clearly distinguishes between them, and provides leading indicators that enable fast risk detection, categorization and mitigation.

The key point is this: The yellow zone is there to absorb common cause variation, while the red zone is there to absorb special cause variation.

This is an extremely important insight, with far-reaching consequences! The Theory of Constraints, unlike most other project management and general management methods (but like [SHEWHART-1986]), makes the important distinction between Common Cause and Special Cause variation.

It is this handling of common and special causes that gives the Theory of Constraints an edge, and it can be successfully applied to and combined with other approaches (notably Kanban, as we will see).

Buffer Control Charts, Trend Analysis and Trigger Points

You can realize further refinements via Control Charts. In fact, [LEACH-2004] suggests that you “plot trends of buffer utilization. The buffer measure then becomes, in essence, a control chart and can use similar rules.” One advantage is that “trending buffer data preserves the time history of the data and shows the trend of buffer consumption vs. project time.” This certainly helps to improve control (of work vs. time).

Furthermore, you can then use Trend Analysis and Trigger Points. For instance, four points in a row trending towards a threshold might be enough to take action. You detect the oncoming trouble even earlier. Use judgment when deciding whether to act upon such signals or not; but at least you now have the possibility of considering such early signals.
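A trigger point of this kind can be sketched as a check on the most recent buffer readings. This is a minimal sketch, assuming that "trending towards a threshold" means each reading is strictly higher than the one before; the function name and the default run length of four are illustrative choices, not a rule from [LEACH-2004].

```python
from typing import Sequence


def trending_towards_threshold(readings: Sequence[float], run_length: int = 4) -> bool:
    """True if the last `run_length` buffer consumption readings rise strictly."""
    if len(readings) < run_length:
        return False  # not enough history to judge a trend
    tail = readings[-run_length:]
    # Compare each reading with its successor.
    return all(earlier < later for earlier, later in zip(tail, tail[1:]))


# Buffer consumption (%) sampled at successive status updates (made-up data).
print(trending_towards_threshold([20, 18, 22, 25, 31]))  # True
print(trending_towards_threshold([20, 25, 24, 26, 30]))  # False
```

The result is only a prompt to look closer. As the text says, judgment decides whether to act on it.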

Empirical Processes and Trend Analysis

If your process is not under statistical control, then, as noted by [SHEWHART-1986], trend information is even more important.

This is particularly relevant if these methods are combined with Agile methodologies. Most Agilists claim software development cannot be put under statistical control (probably rightly so!).

Hence the usage of trending data is even more important.

It is not coincidental that Scrum uses Burn Down charts, and Kanban employs Cumulative Flow diagrams; both of these kinds of charts reveal trends in an empirical process! All such charts can be used to reveal emerging trends and take decisions based on such trends.

With the Theory of Constraints’ buffer management, such decisions can be taken with better insight into the nature of the problems that you might be facing (i.e. common or special cause variation). We start to see that buffer management may have something to offer to those who view software as a dynamic, complex adaptive system that can only be controlled via empirical processes.

Empirical processes that are not under statistical control have to rely on trend analysis, and the Theory of Constraints’ buffer management techniques are a tool that allows us to do so when using Agile/Lean approaches.

Risk Detection and Classification, Constraint Identification and Continuous Improvement

So what happens when trends penetrate through the various zones of the buffer? We can start thinking about why this is happening while in the yellow zone, and then act in time as soon as the red zone is penetrated. Whenever the buffer penetration or trending lines raise a red signal, that is the time to act, before problems become critical. This is possible due to the leading nature of these signals. These signals indicate that a risk is about to materialize; or, at least, that the effects of its materialization are about to impact the overall project schedule negatively.

To facilitate risk management, when you encounter such a signal, identify the Triggering Reason. Do this systematically, by attributing Reason Codes and keeping track of them.

By deliberately finding a reason, you are actually identifying a risk that is about to become a problem. The signals announce risk materialization impact. Any time you discover a new reason, you have uncovered a new Unmanaged Risk. Whenever you decide to take action, annotate the corresponding reason code, and document the trigger condition, and the action taken (preventive, mitigation, avoidance, etc.).

Typically you will find it is about one of the following two situations:

  • A one time occurrence of a buffer penetration reason is likely due to Special Cause variation: use common risk analysis techniques to establish if that is the case and if exceptional action is required. If a reason is not due to exceptional circumstances, then it is likely due to Common Cause variation: but you have to resort to occurrence frequency analysis to decide if action is required.

  • A recurring reason indicates a systematic process problem due to Common Cause variation. It gives you the opportunity to initiate Process Improvement.
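The bookkeeping behind these two situations can be sketched with a simple tally of reason codes. The reason codes below are made up, and the rule that two or more occurrences count as "recurring" is an assumption for the example. The labels are only candidates: as described above, risk analysis and occurrence frequency analysis still decide what each reason really is.

```python
from collections import Counter
from typing import Dict, Iterable


def classify_reasons(reason_log: Iterable[str], recurring_at: int = 2) -> Dict[str, str]:
    """Label each reason code as a one-time or a recurring occurrence.

    `recurring_at` is a hypothetical cut-off chosen for this sketch.
    """
    counts = Counter(reason_log)
    return {
        code: "recurring" if occurrences >= recurring_at else "one-time"
        for code, occurrences in counts.items()
    }


# Hypothetical log of reason codes recorded at buffer penetrations.
log = [
    "UNCLEAR_REQUIREMENTS",
    "BUILD_SERVER_DOWN",
    "UNCLEAR_REQUIREMENTS",
    "UNCLEAR_REQUIREMENTS",
]
for code, label in classify_reasons(log).items():
    print(f"{code}: {label}")
```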

Notice this last finding: we have a systematic way to identify where we can improve the process. Compare this to pharaonic “process improvement” initiatives, like CMMI and similar ones, where all and everything is “improved” all the time — yet more “working software” is not delivered, despite all the “improvements.”

This is an instance where you can see the Theory of Constraints in action, giving you focus and leveraging power.

Root Cause Analysis

You can then use Root Cause Analysis to focus improvement initiatives where they have the most effect. Always record the causes of problems. Subsequently you can perform a Pareto Analysis to find the most common or expensive problems. Yes! This is how you find the process constraint, even when the process is immaterial. This is the constraint that prevents your software production process/organization/system from delivering as desired.

The TOC philosophy is clear: not everything is worth improving; only the most common or expensive problems. Focus efforts where they can have the most effect.
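A Pareto Analysis over the recorded causes can be sketched as follows. This is an illustrative sketch only: the record layout (a reason code plus the buffer days it cost), the reason codes and the numbers are all assumptions, and "expensive" could just as well be measured in money or effort.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def pareto(records: Iterable[Tuple[str, float]]) -> List[Tuple[str, float, float]]:
    """Rank reason codes by total cost, with the cumulative share of the total.

    Each record is (reason_code, buffer_days_lost). Returns a list of
    (reason_code, total_days, cumulative_share) sorted by descending total.
    """
    totals = defaultdict(float)
    for code, days in records:
        totals[code] += days

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # nothing recorded, or nothing lost

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    result = []
    running = 0.0
    for code, days in ranked:
        running += days
        result.append((code, days, running / grand_total))
    return result


# Made-up records: (reason code, buffer days consumed).
records = [
    ("UNCLEAR_REQUIREMENTS", 5),
    ("ENVIRONMENT_OUTAGE", 1),
    ("UNCLEAR_REQUIREMENTS", 3),
    ("LATE_CODE_REVIEW", 1),
]
for code, days, share in pareto(records):
    print(f"{code}: {days} days, cumulative {share:.0%}")
```

The entries at the top of the ranking are the candidates for focused improvement; the long tail is usually not worth the effort.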

By doing this systematically, many common cause variations (problems that are inherent and systematic to your process) can be eliminated or reduced, and preventive measures can be taken for the improvement of the entire process. The non-recurring reason codes indicate special cause variation, and need urgent action once confirmed.

Naturally, a reason code by itself is not sufficient to trigger an action. You must record all the reasons producing a red zone penetration. The reasons producing a yellow zone penetration should be recorded too, but they should not induce you to take action. The recorded reasons might not be the real cause of the incident; there might be Concomitant Causes, Cumulative Effects, and so on. Apply common risk management wisdom!

This is when Root Cause Analysis is needed to identify the ultimate reason or reasons, and to ensure the process improvement effort can be more focused and effective. In the next post we will see what tools the Theory of Constraints has to offer to allow you to perform root cause analysis. Thereafter we will tie all this together, and see how we can improve risk management in a Kanban process.

Stay tuned!