Charles Perrow takes a particularly negative view of the possibility of safe management of high-risk technologies in Normal Accidents: Living with High-Risk Technologies. His summary of the Three Mile Island accident is illustrative: “The system caused the accident, not the operators” (12). Perrow’s account of TMI is chiefly an account of complex and tightly coupled system processes, and of the difficulty these processes create for operators and managers when they go wrong. He is doubtful that the industry can safely manage its nuclear plants.
Systems engineer and safety expert Nancy Leveson addresses the same features of “system accidents” that Perrow addresses, but with greater confidence that engineering and organizational enhancements are possible. Recent expressions of her theory of technology safety are provided in Engineering a Safer World: Systems Thinking Applied to Safety and Resilience Engineering: Concepts and Precepts.
In examining the safety of high-risk industries, our goal should be to identify some of the behavioral, organizational, and regulatory dysfunctions that increase the likelihood and severity of accidents, and to consider organizational and behavioral changes that would serve to reduce the risk and severity of accidents. This is the approach taken by a group of organizational theorists, engineers, and safety experts who explore the idea and practice of a “high reliability organization”. Scott Sagan describes the HRO approach in these terms in The Limits of Safety:
The common assumption of the high reliability theorists is not a naive belief in the ability of human beings to behave with perfect rationality, it is the much more plausible belief that organizations, properly designed and managed, can compensate for well-known human frailties and can therefore be significantly more rational and effective than can individuals. (Sagan, 16)
Sagan lists several conclusions advanced by HRO theorists, based on a small number of studies of high-risk organizational environments. Researchers have identified a set of organizational features that appear to be common among HROs:
- Leadership safety objectives: priority on avoiding serious operational failures altogether
- Organizational leaders must place a high priority on safety in order to communicate this objective clearly and consistently to the rest of the organization
- Redundancy: multiple and independent channels of communication, decision-making, and implementation can produce a highly reliable overall system
- Decentralization: decentralized authority must exist in order to permit rapid and appropriate responses to dangers by the individuals closest to the problems
- Culture: recruit individuals who help maintain a strong organizational culture emphasizing safety and reliability
- Continuity: maintain continuous operations, vigilance, and training
- Organizational learning: learn from prior accidents and near-misses
- Improve the use of simulation and imagination of failure scenarios
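The redundancy claim in the list above has a simple arithmetic core, which can be sketched in a few lines. This is only an illustration under a strong assumption: every channel fails independently and with the same probability. The function name and the failure probability of 0.1 are hypothetical, chosen for the example and not drawn from Sagan or the HRO studies.

```python
def system_failure_probability(p_channel: float, n_channels: int) -> float:
    """Probability that every one of n redundant channels fails at once.

    Assumption (illustrative only): channels fail independently, each with
    the same probability p_channel. The system fails only if all of them fail.
    """
    if not 0.0 <= p_channel <= 1.0:
        raise ValueError("p_channel must be a probability between 0 and 1")
    if n_channels < 1:
        raise ValueError("n_channels must be at least 1")
    return p_channel ** n_channels


if __name__ == "__main__":
    # Hypothetical per-channel failure probability of 0.1.
    for n in (1, 2, 3):
        print(n, system_failure_probability(0.1, n))
```

Under these assumptions each added channel multiplies the chance of total failure by the per-channel failure probability. The independence assumption carries the whole result: if a single common cause can disable several channels together, the benefit of redundancy is smaller than this calculation suggests.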
The genuinely important question here is whether there are indeed organizational arrangements, design principles, and behavioral practices that are consistently effective in significantly reducing the incidence and harmfulness of accidents in high-risk enterprises, or whether, on the other hand, the ideal of a "High Reliability Organization" is more chimera than reality.
Karl Weick is a respected organizational theorist who has written extensively on high-reliability organizations and practices. He and Kathleen Sutcliffe attempt to draw some usable maxims for high reliability in Managing the Unexpected: Sustained Performance in a Complex World. They use several examples of real-world business failures to illustrate their central recommendations, including an in-depth case study of the Washington Mutual financial collapse in 2008.
The chief recommendations of their book come down to five maxims for enhancing reliability:
- Pay attention to weak signals of unexpected events
- Avoid extreme simplification
- Pay close attention to operations
- Maintain a commitment to resilience
- Defer to expertise
Maxim 2 addresses the common cognitive mistake of subsuming unusual or unexpected outcomes under more common and harmless categories. Managers should be reluctant to accept simplifications. The Columbia space shuttle disaster seems to fall in this category, where senior NASA managers dismissed evidence of foam strike during lift-off by subsuming it under many earlier instances of debris strikes.
Maxim 3 addresses the organizational failure associated with distant management -- top executives who are highly "hands-off" in their knowledge and actions with regard to ongoing operations of the business. (The current Boeing story seems to illustrate this failure; even the decision to move the corporate headquarters to Chicago, very distant from the engineering and manufacturing facilities in Seattle, illustrates a hands-off attitude towards operations.) Executives who look at their work as "the big picture" rather than ensuring high-quality activity within the actual operations of the organization are likely to oversee disaster at some point.
Maxim 4 is both cognitive and organizational. "Resilience" refers to the "ability of an organization (system) to maintain or regain a dynamically stable state, which allows it to continue operations after a major mishap and/or in the presence of a continuous stress". A resilient organization is one where process design has been carried out in order to avoid single-point failures, where resources and tools are available to address possible "off-design" failures, and where the interruption of one series of activities (electrical power) does not completely block another vital series of activities (flow of cooling water). A resilient team is one in which multiple capable individuals are ready to work together to solve problems, sometimes in novel ways, to ameliorate the consequences of unexpected failure.
Maxim 5 emphasizes the point that complex activities and processes need to be managed by teams incorporating experience, knowledge, and creativity in order to be able to confront and surmount unexpected failures. Weick and Sutcliffe give telling examples of instances where key expertise was lost at the frontline level through attrition or employee discouragement, and where senior executives substituted their judgment for the recommendations of more expert subordinates.
These maxims involve a substantial dose of cognitive practice, changing the way that employees, managers, and executives think: paying attention to signs of unexpected outcomes (pumps that repeatedly fail in a refinery), learning from near-misses, making full use of the expertise of members of the organization, and so on. It is also possible to see how various organizations could be evaluated in terms of their performance on these five maxims -- before a serious failure has occurred -- and could improve performance accordingly.
It is interesting to observe, however, that Weick and Sutcliffe do not highlight some factors that have been given strong priority in other treatments of high-reliability organizations: the importance of establishing a high priority for system safety at the highest management levels of the organization (which unavoidably competes with cost and profit pressures), the organizational feature of an empowered safety executive outside the scope of production and business executives in the organization, the possible benefits of a somewhat decentralized system of control, the possible benefits of redundancy, the importance of well-designed training aimed at enhancing system safety as well as personal safety, and the importance of creating a culture of honesty and compliance when it comes to safety. When mid-level managers are discouraged from bringing forward their concerns about the "signals" they perceive in their areas, this is a pre-catastrophe situation.
There is a place in the management literature for a handbook of research on high-reliability organizations; at present, such a resource does not exist.
(See also Sagan and Blandford's volume Learning from a Disaster: Improving Nuclear Safety and Security after Fukushima.)