Cassy Calvert

It’s finally happened. We’ve allowed software to become so ingrained in our everyday lives that whenever something goes wrong, it hurts. Enterprise has been at the forefront of this: with large IT estates and a heavy reliance on process automation, even the smallest outage can quickly bring an entire operation grinding to a halt. Millennials use a great word to describe this: #EpicFail. A total failure where success should be reasonably easy to attain.

Has big business taken its dependence on software too far? Or can it successfully mitigate risk through simple testing techniques?

The biggest #EpicFails provide the answer.

#EpicFail 1 – STARBUCKS

In 2015, an “internal failure” brought down the entire Starbucks Point of Sale system. With their tills not working, baristas had little choice but to hand their drinks out free of charge. It wasn’t long before head office intervened by closing over 8,000 stores, angering a client base deprived of their regular caffeine fix.

The problem was later identified. The Point of Sale table in the database had been erroneously deleted by a daily system refresh.

Modest changes often cause the greatest damage, and in this case a simple Regression check would have saved Starbucks a bean or two. One of the primary uses for Regression testing is to prove that an application still behaves correctly after a change. A basic smoke test, configured to prove the user journey for purchasing a Grande Vanilla Latte, for example, would have immediately flagged the issue.

Every application should have a Smoke test suite. This small regression pack should contain the tests that are critical to the application and should execute in less than an hour. For high-transaction environments like Starbucks, it is easy to develop a very large manual regression pack, which can quickly become too big to run or end up exercising the same path through the code each time. Smoke tests should ideally be automated. The pack can then be extended to include more critical tests that verify the build or application.
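As an illustration only, a smoke test of this kind could be sketched as follows. Nothing here reflects Starbucks' real systems: the table name, the price, the refresh() function and the in-memory SQLite database are all assumptions made for the example. The point is simply that one cheap check, run after every refresh, notices when the data behind a critical journey has vanished.

```python
import sqlite3

ITEM = "Grande Vanilla Latte"


def refresh(conn, drop_pos_table=False):
    """Hypothetical daily refresh. The flag simulates the faulty run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS point_of_sale "
        "(item TEXT PRIMARY KEY, price_cents INTEGER)"
    )
    # The price is made up for the example.
    conn.execute("INSERT OR REPLACE INTO point_of_sale VALUES (?, ?)", (ITEM, 445))
    if drop_pos_table:
        conn.execute("DROP TABLE point_of_sale")
    conn.commit()


def smoke_test_purchase(conn, item):
    """Return True if the item can be priced, so a sale could be rung up."""
    try:
        row = conn.execute(
            "SELECT price_cents FROM point_of_sale WHERE item = ?", (item,)
        ).fetchone()
    except sqlite3.OperationalError:
        # SQLite raises this when the table no longer exists.
        return False
    return row is not None and row[0] > 0


def run_smoke_suite(drop_pos_table=False):
    """Run the refresh, then the smoke test, against a throwaway database."""
    conn = sqlite3.connect(":memory:")
    try:
        refresh(conn, drop_pos_table)
        return smoke_test_purchase(conn, ITEM)
    finally:
        conn.close()
```

Calling run_smoke_suite() exercises the healthy path, while run_smoke_suite(drop_pos_table=True) reproduces the missing table and makes the check fail before any customer reaches the till.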

There are several different approaches for Regression testing. Automation and Smoke test packs are key, but they shouldn’t be solely relied upon to locate defects. Other techniques, such as exploratory testing, add richness to the regression approach.

#EpicFail 2 – APPLE MAPS

Apple faced severe criticism from its users when its new mapping solution was introduced in 2012. The service was slated for its unreliability with problems ranging from the Manhattan Bridge resembling a large roller coaster, to views being obscured by clouds, inaccurate location data and warped topology.

As a service, Apple Maps contrasted strongly with the company’s mantra of ‘it just works’. The software was so bad that Apple was forced to issue a rare and humiliating public apology, recommending that customers continue using competing solutions until the issues were resolved.

Apple never disclosed the root cause of the issue. Perhaps, in a case of technological spatial disorientation, the company was blinded by technology and features at the expense of quality.

Data quality and richness should never be overlooked. They help identify issues which normally remain hidden until tests occur. But tests need data, and one of the largest challenges of any testing project is sourcing production data. Representative, production-like, rich data is crucial to flagging and locating the defects that are often found in edge cases.

For heavily-used applications like Apple Maps, it is also vital to consider data volume. This is a challenge – not only for the richness of scenarios that a large dataset can provide, but also for the effects of volume, regardless of platform size.

In some cases, depersonalised data can be insecure. It often doesn’t comply with Data Protection legislation either. Instead, by starting with canned data, a dataset can be built by analysing data journeys. When performed early enough, it becomes possible to build a rich set of transitioned data alongside test cases throughout the progression of the project.

In this instance, using a wide and varied group of testers with a range of devices and configurations would have been effective. The concept is commonly known as “crowd sourcing”. As part of the crowd source, “bug hunt” type sessions could have been held. Exploratory testing across a wide range of mobile devices and operating systems (either released, or in the beta channel) would also have exposed defects which would otherwise have slipped through the net. Public beta testing is also beneficial as it provides a valuable ‘sneak peek’ into the software, and there is always a willing public ready to take on the challenge.

#EpicFail 3 – EUROPEAN SPACE AGENCY

It may have taken the European Space Agency 10 years to develop and build its new rocket, but Ariane 5 was destroyed just 37 seconds into its maiden flight.

In what was a $7 billion software bug, Ariane’s guidance system tried to convert a sideways velocity value from a 64-bit floating-point format into a 16-bit signed integer. Inevitably, the number became too big to fit, and an overflow error ensued.
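The class of defect is easy to reproduce in miniature. The sketch below is not the flight software (which was not written in Python); it only illustrates how a narrowing conversion behaves at the boundary of the 16-bit signed range, and how a boundary test with realistic input values would surface the problem. The function names are hypothetical, and the saturating variant is just one possible handling policy, shown for contrast.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16(value: float) -> int:
    """Narrow a float to a signed 16-bit integer.

    struct.pack raises struct.error when the truncated value does not fit,
    which stands in here for the unhandled conversion error.
    """
    packed = struct.pack("<h", int(value))
    return struct.unpack("<h", packed)[0]


def to_int16_saturating(value: float) -> int:
    """Alternative policy: clamp to the representable range instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


def fits_int16(value: float) -> bool:
    """Boundary check a test could apply to every expected input value."""
    return INT16_MIN <= int(value) <= INT16_MAX
```

A test that feeds the converter the largest values expected in flight, rather than only comfortable mid-range numbers, would fail on the first input for which fits_int16() returns False.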

The guidance system shut itself down, invoking a failover to the backup system, which had itself failed because it ran the same software and encountered the same issue. Ariane was programmed to automatically self-destruct if it veered off-course, and it was the decision to cease the processor operation which proved fatal.

A review recommended that future missions should prepare a test facility comprising as much real equipment as technically feasible, while also injecting realistic input data, and perform complete, closed-loop system testing.

Access to environments and production-like systems for testing is crucial. While any tester would dream of having a fully scaled, production-like system, the reality is almost always different. In Greenfield development, testing can be performed on the production environment before a system goes live. This enables the architecture to be proven, facilitates non-functional testing, and verifies the software being developed. Aim to perform this as early in the project as possible – the earlier, the greater the benefits!

#EpicFail 4 – KIDDICARE

Kiddicare wasn’t aware that it had fallen victim to a cyberattack until its customers started complaining about the highly personalised phishing messages they were receiving.

The data was stolen from Kiddicare’s online test environment. While security measures relied largely on simple password authentication, real customer data was stored on this test environment, enabling the hackers to obtain the names, phone numbers, mailing and email addresses of 800,000 Kiddicare customers.

Data has never been so valuable, yet enterprise tends to place a lower focus on it than revenue assurance. While it is common practice to use depersonalised data in development and test environments, robust security measures should be applied across all environments.
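One way to depersonalise records before they reach a test environment is sketched below, purely as an illustration. The field names, the record layout and the keyed-hash approach are assumptions for the example, not a description of Kiddicare's systems. Personal fields are replaced with a token derived from a secret key, so related records stay linked but the original values never leave production.

```python
import hashlib
import hmac

# Hypothetical list of personal fields to mask.
PII_FIELDS = ("name", "phone", "address", "email")


def pseudonym(value, key, length=12):
    """Derive a stable token from a value using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def depersonalise(record, key):
    """Return a copy of the record with personal fields replaced by tokens."""
    masked = dict(record)
    for field in PII_FIELDS:
        if field in masked:
            token = pseudonym(str(masked[field]), key)
            # Keep emails syntactically valid but undeliverable.
            masked[field] = f"{token}@example.invalid" if field == "email" else token
    return masked
```

The key must be kept out of the test environment; without it the tokens cannot easily be matched back to guessed values. Masking of this kind complements, and does not replace, proper access controls on the environment itself.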

Basic security and penetration testing would have helped Kiddicare secure its data. The basic objective of penetration testing is to determine security weaknesses by identifying entry points, breaking into the application, and reporting findings. It is common practice to employ penetration testing on production environments and, depending on needs, it can be automated or manually performed.

There are aids to penetration testing. For example, the Open Source Security Testing Methodology Manual (OSSTMM), published by ISECOM, sets out a general methodology for security testing, while the Testing Guide from the Open Web Application Security Project (OWASP) is a framework for testing the security of web applications. The latter provides a checklist of web application vulnerabilities and technical information on how to use penetration testing to look for specific issues.

Penetration testing is useful for understanding the resiliency of an application, but if it is performed incorrectly, it is of little value and creates a false sense of security.

#EpicFail 5 – HEATHROW TERMINAL 5

It was designed to make the “Heathrow hassle” a thing of the past, but the “calmer, smoother and simpler airport experience” promised by Heathrow’s flagship Terminal 5 (T5) descended quickly into a full-scale national embarrassment.

Over £175 million was invested in T5’s IT estate. The project involved over 180 specialist suppliers deploying 163 systems, hundreds of interfaces and tens of thousands of user devices. But within hours of its opening, all of T5’s baggage, car parking and security systems failed, leaving 36,584 passengers frustrated and a mountain of 23,205 bags waiting to be reunited with their owners.

During integration testing, in a bid to stop test examples from being delivered to live systems, integration messages were stubbed out. However, on its release to production, the code was erroneously left in place, preventing the system from receiving information about luggage transferring to British Airways from other airlines. Bags were sent for manual sorting, but as the messages backed up, the bags did too, missing their flights.

Official reports put the failures down to ‘inadequate system testing’. With the opening date of the new terminal rapidly approaching, and with test engineers unable to access full end-to-end systems, the scope of the trials was intentionally reduced and several were cancelled altogether.

There are many pitfalls in relying on testing as an ‘end of project’ activity. As demonstrated by T5, it can lead to System Testing becoming squeezed, de-scoped or worse – cancelled!

It is likely that T5 was delivered using Waterfall delivery techniques. By employing Agile principles instead, the project would have been better placed to innovate, and rapid change could have been delivered as working systems.

While Agile is a philosophy which requires players to adopt a particular mindset and way of working, small changes can be easily and quickly implemented.

The Waterfall delivery approach often views testing as a largely time-consuming and manual process. However, by implementing Continuous Delivery, the project benefits from combining Development and Testing. Continuous Delivery spreads the effort across the delivery. Through analysis, development and testing, quality is built in from the outset, creating a perception that testing time is reduced – even though it is more robust.

One of the biggest misconceptions about Agile is that it does not require documentation or planning. This is incorrect – necessary and sufficient documentation and planning are still needed. Test scripts should still be written because they define what should be done and will highlight the exact issue should the test fail.

Had T5 combined iterative development and testing, with the pre-production messages supplied by the integration vendors, the stubbed messages oversight would have been identified and remedied before go live.
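A simple automated release gate is one way such an oversight might be caught. The sketch below is hypothetical throughout: the configuration fields, the environment names and the functions are invented for illustration and say nothing about how T5's systems were actually configured. It shows a check that refuses to release when test-only stubs are still switched on for production.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IntegrationConfig:
    """Hypothetical deployment settings for an integration layer."""
    environment: str             # e.g. "test" or "production"
    stub_inbound_messages: bool  # True while messages are stubbed out for testing


def release_blockers(config: IntegrationConfig) -> List[str]:
    """List the reasons, if any, why this configuration must not go live."""
    blockers = []
    if config.environment == "production" and config.stub_inbound_messages:
        blockers.append("inbound integration messages are still stubbed")
    return blockers


def assert_releasable(config: IntegrationConfig) -> None:
    """Raise if the configuration has any release blockers."""
    blockers = release_blockers(config)
    if blockers:
        raise RuntimeError("release blocked: " + "; ".join(blockers))
```

Run as part of every build, a gate like this turns a forgotten stub into a failed pipeline rather than a go-live incident.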

Will proper testing ever end the #EpicFail?

Sadly, it won’t. We’re living in an era characterised by large-scale change and innovation. With new online services disrupting traditional players, there is a surge of investment in new IT systems. Existing IT investments are being optimised and tweaked too.

Enterprise IT is set to play an even greater role in daily life. There are likely to be more #EpicFails over the coming years. Some will be amusing, others inconvenient, and a few destructive. They will probably be bigger and more spectacular than those we’ve already seen, but by applying the most basic testing principles, their magnitude, impact and severity will be reduced.