And you thought that server crash at the office the other week was bad. Here are ten IT disasters that caused spectacular problems.
When computers go wrong, they often go spectacularly wrong. In most cases, catastrophic failures are blamed on “computer bugs”, although human error is normally the root cause of embarrassing failures.
No-one really knows where the term “bug” comes from. Some say it dates back to Thomas Edison’s pioneering work in the 19th century; others claim it refers to a moth that was evicted from the computer of one Grace Murray Hopper at Harvard University in 1947. What we do know is that bugs can make a mess of the best-laid plans.
Here, we look at ten IT disasters that highlight the precariously fickle nature of computers. Although our top ten mistakes may have caused lasting damage to the finances or reputations of those involved, nobody was physically harmed in the making of this list.
1 Gas pipe piracy
By the early 1980s, the Soviet Union was searching for better technology for its industrial control systems, and the simplest and cheapest way to develop state-of-the-art software was by pinching it from the West. This, however, was to prove a costly mistake.
The USA – deeply troubled by the emergence of Russia’s gas pipelines as a major economic beating-stick to wield over Europe – learned of the Soviets’ intentions to steal software, and thought the opportunity of giving them bogus code was too good to pass up. In a classic piece of espionage that would earn a standing ovation from John le Carré, the CIA uncovered a KGB operation to harvest technical details and set up a counter-intelligence sting.
Working on a tip-off from a French connection who had defected from Russia, US agents planted a specially modified version of the pipeline software at the Canadian company the KGB was targeting. The time bombs in the software were so cunningly hidden that the code passed Russian inspection and went into the master control system for a pipeline designed to carry over 40 billion cubic metres of gas a year to Europe.
The software wreaked havoc with the pipes. As valves, pumps and turbines turned on and off at random, internal pressure reached bursting point, rupturing the pipe and causing an explosion that could be seen from space. Remarkably, no-one was hurt, but the tactic was the sort of copyright protection the record industry would kill for.
2 Pentium flunks long division
Q: What do you get when you cross a Pentium PC with a research grant?
A: A mad scientist.
This is just one of the science community’s in-jokes after a glitch in Intel’s Pentium processors meant they spat out incorrect answers to calculations.
Intel had been promoting Pentium chips heavily in the early summer of 1994. Professor Thomas Nicely was using one of the early models to run a program that generated prime numbers, and their twin, triplet and quadruplet prime relatives. Taxing work at the best of times, but actually impossible if your calculator is on the blink.
Nicely noticed anomalies in the results of his research, but it took him five months to trace the problem, and he was understandably apoplectic to learn that the mistake came from his state-of-the-art processor.
Unlike earlier CPUs from Intel, the 486DX and Pentiums had a built-in floating-point unit (FPU) – also known as a maths co-processor – that handled calculations on floating-point numbers (numbers with a fractional part, or ones too large or too small to be held as integers).
At the heart of the problem were missing entries in the lookup table the FPU relied on for division, meaning that in certain circumstances sums were miscalculated. For example, dividing 4195835 by 3145727 yielded 1.33374 to six significant figures, instead of 1.33382, an error of 0.006%. This may not be a huge problem when you’re working out how much you owe the ATO, but it’s an absolute showstopper for those conducting mathematical research.
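For anyone who wants to check the sums, here is a minimal Python sketch of that division. It assumes a modern, correctly working FPU for the true quotient and takes the flawed figure of 1.33374 straight from the example above; the variable names are ours, for illustration only.

```python
# Operands from the example above.
numerator = 4195835
denominator = 3145727

# What a correctly working FPU returns.
correct = numerator / denominator

# The flawed Pentium result as quoted above, to six significant figures.
flawed = 1.33374

# Size of the mistake, as a percentage of the true answer.
relative_error_pct = abs(correct - flawed) / correct * 100

print(f"correct quotient: {correct:.6g}")
print(f"relative error:   {relative_error_pct:.3f}%")
```

The gap only shows up in the fifth significant figure, which is why it could hide in ordinary use and still wreck research that depends on exact results.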
Intel exacerbated the problem by at first playing down the seriousness of the issue. The company claimed “an error is only likely to occur [about] once in nine billion random floating point divides”, and that “an average spreadsheet user could encounter this subtle flaw once in every 27,000 years of use”. Intel even said it would only replace chips for people who could explain their requirements for complete accuracy in their calculations. Once IBM stepped in and stopped shipments of Pentium machines, Intel capitulated and offered full refunds.
With some five million defective chips in circulation, Intel was fortunate that most people didn’t bother replacing their processors, but it still cost the company around $500 million.
3 Mars Climate Orbiter loses plot
Minor mistakes can prove costly on a space mission – in this case, $327 million of your Earth money.
All appeared to be fine when the Mars Climate Orbiter approached the red planet on a mission to collect weather data, back in 1999, but disaster was lurking, all because the mission controllers didn’t know their feet from their metres.
The spacecraft’s thrusters, which dictated its rate of rotation, and thus direction, were controlled by software that underestimated the effect of the jets by a factor of 4.45. Not coincidentally, that’s the same ratio that links a pound-force, which is the standard unit of force in the US, and a newton, which is the standard metric unit.
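The factor of 4.45 is easy to reproduce. The Python sketch below is purely illustrative and is not the mission software: the function name, the unit labels and the figure of 100 are all hypothetical. It shows how the same number means very different things depending on whether it is read as pound-force seconds or newton seconds.

```python
# One pound-force in newtons (0.45359237 kg x 9.80665 m/s^2).
LBF_TO_NEWTON = 4.4482216152605


def impulse_in_newton_seconds(value, unit):
    """Convert a thruster impulse to newton seconds (hypothetical helper)."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_TO_NEWTON
    raise ValueError(f"unknown unit: {unit}")


# Hypothetical figure reported by one team in pound-force seconds.
reported = 100.0

# What the other team assumed: the number was already metric.
misread = impulse_in_newton_seconds(reported, "N*s")

# What the thrusters actually delivered.
actual = impulse_in_newton_seconds(reported, "lbf*s")

ratio = actual / misread
print(f"effect underestimated by a factor of {ratio:.2f}")
```

Carrying the unit alongside the value, rather than passing a bare number, is one simple way to make this kind of mix-up fail loudly instead of silently.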
During its 286-day voyage, the minor differences in the flight path went largely unnoticed by officials. Even when doubters expressed concern over the orbiter’s trajectory, they were told to prove something was wrong, then ignored.
The orbiter was supposed to enter the Martian atmosphere at a high trajectory about 200km above the surface, but actually made its approach much lower. Due to the imperial-metric mash-up, the sums were so far askew that when Ground Control initiated boosters to secure the pod in orbit, all they succeeded in doing was firing it closer to the planet, where it burnt up in the atmosphere.
The situation was compounded weeks later when the Mars Polar Lander disappeared without trace following an unrelated glitch. Experts believe sensors in the probe mistook the vibrations of the landing gear locking into place for touchdown, prompting the engine to switch off while the lander was still several miles above the planet’s surface. Game over.
4 Black day for power programmers
A simple software glitch couldn’t really plunge us into apocalyptic darkness, could it? It sounds like a low-budget movie script or a Daily Tele feature, but for 50 million residents across eight US states and Canada this was the Doomsday scenario triggered by an unglamorous box in Ohio.
On 14 August 2003, the biggest power crisis in American history was actually initiated in the bowels of a Unix-based XA/21 energy-management system. Deep in the four million lines of C code running the system, there was a race condition bug.
Race conditions occur when two separate threads of execution read and update the same piece of shared data at the same time, so the outcome depends on which one gets there first. If access isn’t properly synchronised, the threads can get themselves in a self-perpetuating tangle and bring the whole system down. On this occasion, data feeds from several network monitors created a “perfect storm” for the race condition and, in a matter of milliseconds, incoming data overwhelmed a system that should have alerted controllers to problems on the electricity grid.
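To see why timing matters, consider this small Python sketch. It is not the XA/21 code, which was written in C; the names `lost_update_demo`, `EventCounter` and `run_synchronised` are made up for illustration. The first function replays, step by step, one unlucky ordering in which two threads both read a counter before either writes it back. The second part shows the usual cure, a lock around the shared data.

```python
import threading


def lost_update_demo():
    """Replay an interleaving in which one of two updates is lost."""
    queued_events = 0
    # Thread A and thread B both read the counter before either writes.
    seen_by_a = queued_events
    seen_by_b = queued_events
    # Each writes back "what I saw, plus one", so B overwrites A.
    queued_events = seen_by_a + 1
    queued_events = seen_by_b + 1
    return queued_events  # two events arrived, only one was counted


class EventCounter:
    """Shared counter whose updates are protected by a lock."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def record(self):
        # Only one thread at a time may read, add and write back.
        with self._lock:
            self.count += 1


def run_synchronised(workers=4, events_per_worker=1000):
    """Feed events from several threads and return the final count."""
    counter = EventCounter()

    def feed():
        for _ in range(events_per_worker):
            counter.record()

    threads = [threading.Thread(target=feed) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.count


print(lost_update_demo())
print(run_synchronised())
```

With the lock in place, every event is counted no matter how the threads are scheduled. Without it, the result depends on luck, which is what makes this class of bug so hard to catch in testing.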
With the alarm system down, the doughnut-munching controllers remained unaware of relatively minor network events that soon spiralled out of control because they weren’t quickly resolved. Unprocessed events queued up and the primary server failed within 30 minutes, switching all operations to the backup server, which itself failed minutes later.
Oblivious to the impending nightmare, observers did nothing when a power line tripped out after making contact with an unkempt tree, which forced more power onto another overhead power line, causing that one to sag and trip out too. Within an hour, power lines and circuit breakers were tripping left, right and centre, as a power surge cascaded across the north-eastern states.
Tripped-out lines caused a sudden drop in demand, bringing generators offline, which immediately caused a power vacuum that was filled by currents surging in from other plants.
It was the electrical equivalent of rush hour, and a major crash was inevitable. The carnage eventually left 256 power plants offline, disrupting cellular communication and media distribution. The best form of communication was reported to be laptops using dial-up modems.
5 Sun sparks cold war
In scenes reminiscent of kiddy-hacker film WarGames, in 1983 the world teetered on the brink of World War III, thanks not to a Cold War bust-up between US and Russian leaders, but to an oversight in a missile detection system.
The Soviets had recently installed an Oko (eye) early-warning system, designed to spot Inter-Continental Ballistic Missile (ICBM) launches and feed information into command centres. While the US had opted for a top-down approach to spotting launches, the Russians chose a high-elliptical, long-range view of the horizon, aiming to spot ICBMs popping their heads over the curvature of the Earth. Ironically, this was meant to prevent false alarms being triggered by natural events on the ground.
However, shortly after the vodka bars shut in Moscow on 26 September 1983, the sun, satellite and US missile fields were perfectly aligned to produce an intense glare reflected from high-altitude clouds. The reflected sunlight (stronger than normal due to the autumn equinox) poured into the infrared sensors aboard the Cosmos 1382 satellite monitoring the missile fields, imitating the bright light of hot gases in a missile plume.
Whether the fault lay in the image-filtering and sensing software or the capture hardware is unknown, but the result was a big flashing red light on Soviet screens alerting them to five nukes coming their way. For Lieutenant Colonel Stanislav Petrov, this was the epitome of “squeaky bum time”. The only thing that stopped him passing the alert up to button-pressing superiors was a “gut feeling” that “when people start a war, they don’t start it with only five missiles”. That takes some guts.
6 Nuclear fallout
Russia doesn’t have the monopoly on terrifying computer-generated false alarms. On 9 November 1979, the US scrambled jets and put their entire nuclear forces on standby in response to their worst nightmare. In command centres across the country, screens were showing a massive Soviet nuclear strike aimed at destroying the US command system and nuclear hardware, prompting the launch of the president’s “Doomsday plane”.
Only once they had checked the raw data coming from their Defense Support Program satellites did the officials realise that a training tape had been inadvertently loaded into the mainframe running the entire US early-warning program.
7 Windows genuine mistake
In a spectacular PR failure even by its own standards, Microsoft accused thousands of its customers of being criminals after programmers inserted a glitch in the anti-piracy tool, Windows Genuine Advantage (WGA).
Already unpopular with consumers because of its fiddly authentication processes, WGA plumbed new depths in 2007 when it flagged thousands of perfectly legal copies of Windows as pirated. According to Microsoft, the mistake arose after a member of the WGA team incorrectly uploaded bug-ridden pre-production software onto the company’s servers on a Friday afternoon. The company did remove the code, but the WGA team didn’t check that doing so had resolved the problem before heading to the pub for TGIF drinkies.
The result was that until late on Saturday afternoon, anyone connecting to the WGA servers was told that their copy of Windows was dodgier than a Rolex dealer at Paddy’s Market. Windows XP customers were warned they were using pirated software, with all the legal implications that go with it. Windows Vista customers actually had features switched off until they went through the whole process of reactivating their software.
8 Switchboard meltdown
Network managers at AT&T could only stare in horror as their 72-screen display graphically showed angry red lines tracing the collapse of the company’s telephone system. On a good day, the network carried 70% of the US’s long-distance calls, some 115 million a day. But 15 January 1990 wasn’t a good day.
The problem started in New York, where one of the company’s 114 computer-operated electronic switches (each one capable of handling 700,000 calls an hour) turned itself off for a four-second maintenance reset because it was nearing capacity. The 114 switches were linked via a cascading network and a parallel signalling network to try to find the optimum route for calls, with each switch reporting its status to the rest of the network on an ongoing basis.
When the overburdened New York node switched itself back on after the reset, it sent out a signal that it was back online and ready to receive calls. This should have restored the status quo, but a software defect meant a second identical signal was sent less than ten milliseconds after the first, arriving before the initial signal had been processed.
This created an overwrite problem that sent a second node into a lather, and it closed itself down in a huff. When switch number two came back online, it also sent out contradictory messages, propagating the cycle across the entire network. For some nine hours AT&T was unable to process around 50% of its calls, a snag that cost a reported $60m in lost earnings.
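The knock-on effect can be pictured with a toy simulation in Python. This is a deliberately simplified sketch and not AT&T’s switch software: the six-switch ring, the `simulate_cascade` function and the `duplicate_signal` flag are all hypothetical, and each switch is allowed to fall over only once, whereas the real switches kept knocking each other down for hours.

```python
from collections import deque


def simulate_cascade(links, first_reset, duplicate_signal=True):
    """Return the switches that reset, in the order they went down.

    links maps each switch to the switches it reports its status to.
    """
    reset_order = [first_reset]
    recovering = deque([first_reset])
    while recovering:
        switch = recovering.popleft()
        # On recovery, the switch announces itself to every neighbour.
        for neighbour in links[switch]:
            # With the defect, a second announcement lands before the
            # first has been processed and knocks the neighbour over too.
            if duplicate_signal and neighbour not in reset_order:
                reset_order.append(neighbour)
                recovering.append(neighbour)
    return reset_order


# Hypothetical six-switch ring; the real network had 114 switches.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

with_defect = simulate_cascade(ring, first_reset=0)
without_defect = simulate_cascade(ring, first_reset=0, duplicate_signal=False)

print("with defect:   ", with_defect)
print("without defect:", without_defect)
```

Without the duplicate signal, the reset stays with the one overloaded switch. With it, a single routine reset reaches every switch in the toy network.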
9 Crash-test dummies
Volvo takes its safety record seriously and has been at the vanguard of new technologies geared to reducing accidents. But accidents will happen, and often at the most embarrassing moments. Twice in 2010 alone Volvo gathered the world’s media to show off new safety features. Twice they went spectacularly wrong.
The company was showing off the crash-avoidance system in its S60 when engineers fired the car out of a testing tunnel towards the back of a stationary truck. The car was supposed to foresee the impending collision, but a problem between the control system and the battery meant the shiny new vehicle ploughed into the back of the juggernaut.
Undeterred, the company followed the S60 test with a display of a pedestrian avoidance system, which predictably ended with the simulated deaths of the walking public. Although the system, which uses radar sensors and a camera to spot pedestrians and instigate an emergency stop, did halt the vehicle for nine out of 12 dummies, three others were sent flying like a stack of bowling pins.
Where will it end? Well if you ask security experts, the trend for smarter cars with ever more onboard computing power means it won’t just be Volvos you need to worry about, but actually anyone bearing a grudge. Researchers at the University of Washington recently hacked into several car systems using a variety of attack vectors and said they could “adversarially control a wide range of automotive functions and completely ignore driver input, including disabling the brakes or selectively braking individual wheels on demand”. Terrifying stuff. We’ll stick this one in the keep net for our next computer disasters feature.
10 Plane lost in translation
In a pan-European project to build the world’s biggest passenger plane you might expect the odd linguistic barrier between management and engineers, but you’d hope the computers would speak the same language.
In the spring of 2005, however, just as the Airbus A380 was taking shape in hangars outside Toulouse, engineers came across a jumbo software issue that reportedly cost the company $6bn by delaying the first flight by two years.
The French production facility had been using the latest version of the industry standard design software, CATIA 5, for its CAD designs. The Germans, on the other hand, had worked in CATIA 4, which handles 3D objects differently.
When they matched up their halves of the plane, it was like trying to weld the front of a Ford Mondeo to the back of a Mini Moke. The biggest problem was that the wiring plans were completely incompatible. Subtle differences in the software meant mismatched connections needed rerouting to connect the two disparate halves of the plane.
Even when developers wrote code to translate between the two versions, complications remained, with engineers suggesting there was insufficient space to carry power cables far enough away from signal wires to prevent interference. If you’re doing nothing harder than wiring a plug, a couple of late changes to the wiring diagram aren’t an issue, but the A380 contained 530km of cabling, more than 100,000 individual wires and 40,000 connectors.