By Michael D. Lam/Overhaul & Maintenance

This article appeared in Overhaul & Maintenance’s May 2009 issue.

We start, appropriately enough for a mystery, with a riddle:

Q: What do you get when you cross Murphy’s Law (“Whatever can go wrong, will”) with Lance’s Law (“If it ain’t broke, don’t fix it”)?

A: You get No Fault Found, something that seems broke but isn’t–or is…sometimes–or was, but not any more and no one knows why.

This might be funny if No Fault Found (NFF), more properly defined as a reported fault that can’t be reproduced or confirmed in a line or shop environment, or one for which no cause can be found, were not so common or so costly.

NFF afflicts many industries: automotive, consumer electronics, mobile telephony and others. But perhaps none is as persistently “plagued” as aviation, says Mitch Klink, newly elected chairman of the AMC (formerly the Avionics Maintenance Conference). NFF is a “complex issue,” according to Klink, who also is a member of the ARINC working group that wrote Report 672 (“Guidelines for the Reduction of No Fault Found”). The elusive error is one that “easily costs the world’s airlines tens of millions of dollars every year.”

While some peg the prevailing rate for NFFs at closer to 30%, Klink says avionics, which account for three-quarters of the industry’s NFFs, register an average rate of 50%–odds no better than a coin flip–with some components as low as 10-20% and others as high as 80-90%. If so, then aviation’s overall NFF rate has not changed in 20 years.

Why no progress? Klink and others blame the collision of an irresistible force–the ever-increasing sophistication and complexity of aircraft technology (the dense workings of the components themselves, the size and intricacy of the software programs that govern them, and the proliferating interrelationships among various systems and sub-systems)–with an immovable object: the commercial pressures that cap the time available to line maintenance technicians for troubleshooting.

The upshot is that the aviation industry, like Alice of Wonderland fame, finds itself running hard but getting nowhere fast. Is there some answer to No Fault Found? Or is the industry forced to follow the Red Queen’s advice to Alice: “If you want to get somewhere else, you must run at least twice as fast as that!”

Hold Your Applause, or What is the Sound of One Hand Clapping?

“Is there any other point to which you would wish to draw my attention?”

“To the curious incident of the dog in the night-time.”

“The dog did nothing in the night-time.”

“That was the curious incident,” remarked Sherlock Holmes.

–Sir Arthur Conan Doyle, The Memoirs of Sherlock Holmes (1893)

As pesky and puzzling as any particular NFF might be, the general phenomenon is even more peculiar. After all, an NFF arises from the absence of evidence. How does a no-thing give rise to a some-thing? As Donald Rumsfeld once correctly pointed out, absence of evidence is not evidence of absence. What’s more, the non-occurrence of an expected event is evidence, possibly highly revelatory proof, that something is amiss. Witness the Great Detective’s take on the curious incident of the dog in the night-time. However, a finding of No Fault Found, which amounts to what statisticians call a failure to reject the null hypothesis, is highly suggestive but proves nothing. Its meaning, at least at first, is not at all clear.

A No Fault Found can be divided into two parts. The first is a surface problem, namely a discrepancy between a positive report of a fault and a negative attempt to verify the fault. The other is the underlying cause of the discrepancy. Put another way, a determination of No Fault Found is a mixed message: an indication of a possible fault plus the revelation of a real error.

The error can arise anywhere along two related causal chains, the one that culminates in the positive report and the other that concludes with the negative test. This means the source of the error lies somewhere within at least one of three domains: (i) the report or reporter, (ii) the implicated component, relevant system, sub-system and software, or some interaction among these and the operating environment, or (iii) the test, test equipment or tester.

Until the source of the error is identified–and the NFF unraveled–there are several ways to interpret the ambiguity. Either something is indeed wrong with the implicated unit, or it is okay. If it turns out to be in acceptable working order, there may still be a fault somewhere–or there may be none, which means the report of the fault is at fault.

Pity the poor AMT. Given these uncertainties, the initial maintenance action entails a potentially fateful trade-off. If he takes the reasonable precaution of pulling the suspected line replaceable unit (LRU), he may not only be occupying himself unproductively by handling a healthy item and incurring the associated extra costs, he also may (unwittingly) neglect to replace some other component that is more likely to fail. But should he err on the side of parsimony and decide against removing any part–after all, no fault was validated–he may be raising the aircraft’s operational risk.

The Sign of 672

When tests designed to uncover a malfunction in a modern aircraft don’t do their job, increasing confusion instead of imparting clarity, the alternative explanations available to an investigator would overwhelm the contemplative capabilities of a chess grandmaster. Consider the case of a recurring autopilot malfunction AMC’s Klink once dealt with.

The mechanism controlling the activity worked, he says, like “a big logic gate,” with 16 inputs. All had to be “good” in order for the aircraft “to achieve a dual land.” Each of these inputs emerged from a tangle of sub-systems and subcomponents, “any one of which might be having intermittent problems,” Klink said. An attempt to exhaustively sketch the fault tree for this one problem–generated by a single autopilot function, just one of many–would quickly ramify, Klink agrees, in “layer upon layer,” creating a proliferation of branches approaching the size of a giant redwood.
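To see why a 16-input gate is so unforgiving, consider a rough sketch in Python. It assumes, purely for illustration, that the inputs behave independently and that each is “good” with the same probability at any given moment. Neither the 99% figure nor the function names come from Klink or Report 672.

```python
def dual_land_available(inputs):
    """Model the 'big logic gate': every input must be good."""
    return all(inputs)


def probability_all_good(p_good, n_inputs=16):
    """Chance that n independent inputs are all good at once.

    Independence and a shared p_good are simplifying assumptions.
    """
    return p_good ** n_inputs


inputs = [True] * 16
inputs[7] = False  # one intermittent input drops out
print(dual_land_available(inputs))           # False
print(round(probability_all_good(0.99), 3))  # 0.851
```

Under these assumptions, inputs that are each good 99% of the time leave the whole gate satisfied only about 85% of the time, and nothing in the output says which of the 16 inputs dropped out.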

Clearly, a complete enumeration of all the conceivable sources of all varieties of No Fault Found remains a practical impossibility. So when ARINC and AMC researched and wrote Report 672, they did the next best thing. They elevated the level of analysis from particulars to kinds, producing two related 20-cell matrices, each the result of the intersection of four domains (design/production, flight operations, line operations and shop operations) and five categories (system/components, testing, training, communication and documentation), one for types of sources/causes (64 in total), the other for the corresponding recommended remedies (79 in all).
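The shape of those matrices is easy to picture as a data structure. The sketch below builds two empty 20-cell grids from the domains and categories named above; the variable names are illustrative, and the single sample entry is hypothetical rather than quoted from Report 672.

```python
from itertools import product

DOMAINS = ("design/production", "flight operations",
           "line operations", "shop operations")
CATEGORIES = ("system/components", "testing", "training",
              "communication", "documentation")


def empty_matrix():
    """Return one 20-cell grid keyed by (domain, category)."""
    return {cell: [] for cell in product(DOMAINS, CATEGORIES)}


causes = empty_matrix()    # the report lists 64 source/cause types
remedies = empty_matrix()  # and 79 corresponding remedies

# Hypothetical entry, for illustration only; not taken from the report.
causes[("flight operations", "communication")].append("vague pilot squawk")

print(len(causes))  # 20
```

Keying both grids by the same (domain, category) pair is what lets an organization move from a suspected cause straight to the remedies filed in the matching cell.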

Although Report 672, officially released last June, carries the word “guidelines” in its title, it is less a manual for mechanics seeking immediate help with an NFF than a blueprint or logic map for identifying and locating the roots of No Fault Found, and taking the appropriate remedial or preventive steps to minimize it. The report is aimed at organizations, not individuals, and is intended to be applied company-wide.

Ideally, Klink says, efforts to implement Report 672 should be “spearheaded at a very high level.” Only then can all responsible parties, especially reliability departments, be energetically engaged. Everyone, he says, should “read the document, get a general understanding, then drill down through the categories and domains to isolate the main sources of No Fault Found.” This means comprehensively scouring and honestly evaluating every aspect of operations the Report highlights. Finally, he says, organizations need to collectively “develop a corrective action plan” to address the significant shortcomings that have been revealed.

Klink insists that Report 672 be thoroughly but flexibly adopted: “We try to stress examining all the variables to come up with your best chance for a deliverable solution.” But, he added, “You’re going to have to take what we suggest and tailor it for your operation to get it to work.”

Troubleshooting Troubleshooting

“How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?”

–Sir Arthur Conan Doyle, The Sign of the Four (1890)

Technologies, techniques, practices and policies for reducing No Fault Found are plentiful but particularized. Report 672 was written in full recognition of the fact that there is no simple answer or algorithm, no single solution and no general method for mitigating No Fault Found beyond, perhaps, constant vigilance. There are, however, topics that warrant wider discussion. One is troubleshooting, the valiant efforts made by aviation’s frontline NFF-fighters.

There may not be a comprehensive response to No Fault Found, but there is one Big Idea. Kevin Gulliver, president of Florida-based Nida Corp., a provider of computer-based training curriculum and equipment for troubleshooting, conveys in his own practical, hands-on way something of the philosophy that informs Report 672: holism. It sounds New Age, but “holism” here just means keeping the big picture in mind.

Gulliver said, “We used to teach down at the component level,” but not anymore. “We’re right down the street from Rockwell Collins… When you look inside some of the boxes they make, they’re jam-packed with miniature components. Nobody’s going to unsolder that in the field. So what we’re seeing is more systems training, teaching students to get an understanding of how systems fit together.”

The aim is to teach AMTs to be better diagnosticians. The aviation industry cannot afford technicians who are merely what Gulliver calls “black box changers”–mainly, as he said, “because there are so many black boxes on board.” Instead, “what we’re trying to get them to do is to understand that sometimes what appears to be the fault, isn’t.” This amounts to acquiring an accurate fault model, a mental picture of what can go wrong, implicit or explicit, that maps reality and so guides one to make the right inference, grab the right component, run the right tests.

Technicians cannot rely on automated test equipment to do their thinking for them. As the NFF rate shows, such equipment is not sufficiently reliable. The best corrective is a better understanding of the underlying mechanisms and their interconnections.

Take an example from medicine. The human body, like an airplane, is a system of systems, hence complex. Without a detailed understanding of human physiology, who would guess that acute pain near the left shoulder tip indicates, about one-fourth of the time, a ruptured spleen? This heuristic, called Kehr’s sign, is entirely unintuitive, a classic example of what is called “referred pain,” which is felt at a different place than the injured body part. (The connection here is the phrenic nerve, which links the diaphragm, near the spleen, on one end with cervical spinal nerves C3, C4, and C5, which serve the shoulder, on the other.) Similarly, what chance does an AMT have of successfully troubleshooting a fault if he doesn’t know enough about the plane’s “nervous system” to see the link between its symptoms and their source?

If the aviation industry is to limit NFFs, big-picture systems thinking–an understanding of far more than the immediate object or task at hand–needs to be everyone’s job. AMTs, for instance, need more complete information from pilots and crew. Klink says he’s “spent a lot of time trying to coach pilots into giving more detailed squawks.” Simple things, such as phase of flight and time of day, would be useful clues for time-limited troubleshooters. Surely, “on climb out, the autopilot disconnected” is more to work with and not much more demanding than what Klink calls “the old ‘autopilot inop.’” And, who knows, it might make the difference between solving a No Fault Found or not.

Localized Medicine

Another approach, not new but gaining popularity, is a tactic borrowed from epidemiology: the quarantine. Also known as save-on-shelf or ship-or-shelve, quarantine programs (discussed in Report 672, Appendix E) are proven, Klink says, “to reduce the negative impact of a No Fault Found.” The purpose of quarantine, like troubleshooting, is fault isolation. But each goes about it in a different way, at a different level and a different pace.

How does quarantine work? Take Klink’s example of an autopilot disconnect on climb-out. The pilot writes up his squawk, leading the AMT to remove the flight control computer. Immediate tests fail to confirm the reported complaint, thus: a condition of No Fault Found. The flight control computer is taken to a designated quarantine area, where it sits while the subsequent performance of the aircraft is carefully observed.

What unfolds is something like a natural experiment. If the aircraft is trouble-free for an interval of, say, five days or five flight legs, the quarantined item looks increasingly culpable. So, “even though you couldn’t reproduce [the fault] in a test environment,” Klink said, “you feel like you’ve successfully isolated the issue.”

If, however, the problem recurs, removal of the unit evidently failed to fix it. Either its replacement also is flawed or, more likely, neither unit is responsible. When completely exonerated, the suspect unit is returned to serviceable status and moves back to the stockroom as usable inventory. The quarantine has spared it the duress of needlessly entering the repair cycle. A host of other expenses are dodged as well, as Klink says: “In addition to reducing the No Fault Found service order, by not clogging the repair cycle with LRUs that aren’t broken, you don’t have to buy as many spares to cover your fleet.” Klink says he’s “aware of a quarantine program in use at a top-tier airline that was able to reduce its overall shop No Fault Found rate from 58% to 38%…at an annual savings of over $2 million per year–just for not having to process NFF components through the repair process.”
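The quarantine logic described above can be sketched as a simple decision rule. This is a minimal illustration, not the procedure in Report 672, Appendix E: the class and function names are hypothetical, the five-day and five-leg thresholds are only the “say” figures from Klink’s example, and a recurrence is treated as exonerating the unit even though, as noted, the replacement could also be flawed.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    KEEP_WATCHING = "keep watching"
    IMPLICATED = "implicated"    # aircraft stayed trouble-free without the unit
    EXONERATED = "exonerated"    # fault came back with the unit removed


@dataclass
class Observation:
    days_elapsed: int
    legs_flown: int
    fault_recurred: bool


def quarantine_verdict(obs, min_days=5, min_legs=5):
    """Classify a quarantined unit from the aircraft's later behavior.

    Thresholds are illustrative assumptions, not published limits.
    """
    if obs.fault_recurred:
        # Removing the unit did not cure the problem.
        return Verdict.EXONERATED
    if obs.days_elapsed >= min_days or obs.legs_flown >= min_legs:
        # Trouble-free for the full interval: the unit looks culpable.
        return Verdict.IMPLICATED
    return Verdict.KEEP_WATCHING


print(quarantine_verdict(Observation(2, 3, False)).value)  # keep watching
print(quarantine_verdict(Observation(6, 9, False)).value)  # implicated
print(quarantine_verdict(Observation(1, 2, True)).value)   # exonerated
```

Only an exonerated unit goes back to the stockroom as usable inventory; an implicated one has earned its trip into the repair cycle.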

Watching the Detectives

According to ARINC Report 672, “complete elimination of NFF is not a realistic expectation.” There always will be a certain degree of residual statistical error due to random fluctuations in the measurement apparatus or regime. This kind of error is simply the cost of being in the game. But that leaves systematic, non-random error–difficult but not impossible to remove. Realistically, though, “if we could get all of our LRUs down into the 20% and 30% No Fault Found range, we’d be happy,” Klink says.

Because there are so many potential contributing causes set in so many different scenarios, the struggle against No Fault Found is now and probably always will be a multi-front war of attrition. Klink affirms what others have said: “There is no silver bullet.”

But the absence of silver ammo may be a silver lining for some. Klink is philosophical:

“I’ve been in the business 23 years and every day I learn something new. I enjoy what I do.”

People think of aviation maintenance as routine, and much of it is. “Months of tedium punctuated by moments of terror” is the way war has long been described. Aviation maintenance is more like detective work. You could call it forensic engineering; both take a routine approach to the unusual. As Klink said, “It’s challenging. I get exposure to things that I hadn’t been exposed to before. To me, that’s rewarding.”

The public’s fascination with forensics has led to a proliferation of TV shows, including CSI: New York, CSI: Miami, and CSI: Las Vegas. What’s next? Could it be CSI: MRO?