All Things Distributed
This article, titled “Wartet nicht auf Perfektion – lernt aus euren Fehlern!” ("Don't wait for perfection – learn from your mistakes!"), appeared in German last week in the “Digitalisierung” column of WirtschaftsWoche.
<img src="/images/broken-car.jpg" width="650"/>
"Man errs as long as he doth strive." Goethe, the German prince of poets, knew that already more than 200 years ago. His words still ring true today, but with a crucial difference: Striving alone is not enough. You have to strive faster than the rest. And while there's nothing wrong with striving for perfection, in today's digital world you can no longer wait until your products are near perfection before offering them to your customers. If so, you will fall behind in your market.
So if you can't wait for perfection, what should you do instead? I believe the answer is to experiment aggressively with your product development, accepting the possibility that some of your experiments will fail.
Anyone who has listened to, or worked with, management gurus knows their mantra: Failure is a necessary part of progress. That's true, but there's often a big gap between the management theory and the reality on the ground. People want to experiment and learn from things that go wrong. But in the flurry of day-to-day business, they're not given enough time to really reflect on the cause of an error and what to do differently next time.
The solution is to find a systematic approach that prevents errors from repeating themselves.
From perfection to anti-fragility
In finding such a systematic way, you first need to distinguish between two types of errors that can happen in your company: those of technology and those of human decision-making. The nice thing is: if you know how to deal effectively with the first, you might end up being better in the second, making better decisions. The financial mathematician and essayist Nassim Taleb offers an interesting take on this issue. He has argued that errors are incredibly valuable because they lead to innovation. He uses the term 'anti-fragility' to make his point. Today's digital business models require smaller, frequent releases to reduce risk. That means the technologies underpinning these new business models must be more than just robust. They must be 'anti-fragile'. The main feature of anti-fragile technology is that it can 'err' without falling apart. In fact, a crisis can make it even better.
At Amazon, we also require our systems and customer solutions to be anti-fragile, and we do that by designing our systems to stand the test of time. Our systems must be able to evolve and become more resilient to failure. They must become more powerful and more feature-rich over time as a result of learning from customer feedback and any failure modes they may encounter while operating the systems.
An example of a German company that has become 'anti-fragile' is HARTING, the world's leading provider of heavy pluggable connectors for machines and plants. HARTING shows how to think a step ahead about the meaning of quality standards in the digital world. Quality and trust are the most important values for this traditional company, and Industry 4.0 and the digital transformation have been important focus areas for it since 2011. Even though it was hard to accept at first, HARTING has since realized that errors are inevitable. For that reason, its development teams switched to agile methods. It also uses the "minimum-viable-product" approach and relies on microservices for its software. Working this way, HARTING can discard things and create new things more easily. All in all, HARTING has become faster.
That can be seen with HARTING MICA, an edge computing solution that enables older machines and plants to get a digital retrofit. The body and hardware still reflect HARTING's standard of perfection. But for the software, the goal is "good enough", because a microservice is neither ever finished nor perfect. As a result, wrong decisions and mistakes can be corrected very quickly and systems can mature faster, approaching the state of antifragility. If the requirements change or better software technologies become available, each microservice can be thrown out and a new one created. That's how you gain speed and quickly digitize old machines and connect them to the cloud within a manageable cost framework.
Taking the dread out of mistakes
If you want to become anti-fragile, more than robust, like HARTING and other companies, you need to proactively look for the weak spots in a system as you experiment. In a system that should evolve, all sorts of errors will happen that you weren't able to predict, especially when systems need to scale into unknown territories. So subject your system to continuous failures and make subsystems artificially fail using tools like Netflix's Chaos Monkey.
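To make the idea of artificially failing subsystems concrete, here is a minimal sketch in Python. It is not Chaos Monkey itself and uses none of its API; the `Replica` class and the `serve` and `inject_failure` functions are hypothetical names invented for this illustration. The sketch takes down one randomly chosen replica and then checks that a request is still served by the remaining ones.

```python
import random


class Replica:
    """Hypothetical stand-in for one redundant subsystem."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve(replicas, request):
    """Try each replica in turn; fail only if every replica is down."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except RuntimeError:
            continue
    raise RuntimeError("no healthy replica left")


def inject_failure(replicas, rng):
    """Deliberately take down one randomly chosen healthy replica."""
    healthy = [r for r in replicas if r.healthy]
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


rng = random.Random(42)  # seeded so the experiment is repeatable
replicas = [Replica("web-1"), Replica("web-2"), Replica("web-3")]
victim = inject_failure(replicas, rng)
response = serve(replicas, "GET /")
print(f"killed {victim.name}; {response}")
```

If `serve` raised an error here, the experiment would have exposed a weak spot before a customer did, which is the point of running such failures continuously.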
If you do all of this, you will start to objectify errors at your company and make dealing with errors a matter of normality. And when errors become 'business as usual', no one will be afraid of taking a risk, trying out a new idea, a new product or a new service and seeing what happens when customers interact with it. That's how you quickly find solutions that really work in the future.
At Amazon, our approach for systematically and constructively dealing with errors is called the "cause of error" method. It refrains from seeking "culprits". Instead it documents learning experiences and derives actions that ultimately improve the availability of our systems.
From root cause to innovation
The method first calls for fixing an error by analyzing its immediate root cause and taking steps to mitigate the damage and restore the initial running state as quickly as possible. But we are not content with that result. We go further, trying to extract the maximum amount of insight from the incident. And this process begins as soon as everything is working again for the client.
A key element of our cause-of-error method is asking 5 'Why?' questions (a technique that originated in quality control in manufacturing). This is important because it determines the fundamental root of the problem.
Take the case of a website: Why was it down last Friday? Because the web servers reported timeouts. Why were there timeouts? Because our web servers were overloaded and couldn't cope with the high traffic. Why were the web servers overloaded? Because we didn't have enough web servers to handle all requests at peak times. Why didn't we have enough web servers? Because we didn't consider possible peaks in demand in our planning. Why didn't we take peaks in demand into account in our planning? By the end of this process, we know exactly what happened and which clients were affected. Then we're in a position to distill an action plan that ensures that specific error doesn't happen again.
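The chain of questions above can also be captured as a small data structure, which makes it easier to document an incident consistently. The following is only a sketch: the `FiveWhys` class and its methods are hypothetical names for this illustration, not part of any Amazon tooling. The fifth question is recorded without an answer, because the example above leaves it open.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FiveWhys:
    """Hypothetical record of one incident's chain of 'Why?' questions."""

    incident: str
    steps: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def ask(self, question: str, answer: Optional[str] = None) -> "FiveWhys":
        # Store the question; the answer stays None until someone finds it.
        self.steps.append((question, answer))
        return self

    def deepest_answer(self) -> Optional[str]:
        # The last answered question is the deepest cause known so far.
        answered = [a for _, a in self.steps if a is not None]
        return answered[-1] if answered else None

    def open_questions(self) -> List[str]:
        return [q for q, a in self.steps if a is None]


analysis = (
    FiveWhys("Website down last Friday")
    .ask("Why was the website down?",
         "The web servers reported timeouts.")
    .ask("Why were there timeouts?",
         "The web servers were overloaded by high traffic.")
    .ask("Why were the web servers overloaded?",
         "There were not enough web servers for peak times.")
    .ask("Why were there not enough web servers?",
         "Planning did not consider possible peaks in demand.")
    .ask("Why didn't we take peaks in demand into account in our planning?")
)

print(analysis.deepest_answer())
print(analysis.open_questions())
```

Keeping unanswered questions visible is a deliberate choice in this sketch: the action plan should not be written until the list of open questions is empty.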
Quite often, applying this cause-of-error approach allows us to find breakthrough innovations, in the spirit of Nassim Taleb. That's how Auto Scaling was created, after a certain client segment struggled with strongly fluctuating traffic on their websites. When the load on a website increases, Auto Scaling automatically spins up additional web servers to service the rising number of requests. Conversely, when the load subsides, Auto Scaling turns off web servers that are no longer needed in order to save cost.
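The scaling rule described above can be sketched in a few lines of Python. This is an illustration of the principle only, not the Auto Scaling service or its API; the function `desired_servers`, its parameters, and the sample load figures are all assumptions made for this example.

```python
import math


def desired_servers(requests_per_second, capacity_per_server,
                    min_servers=1, max_servers=10):
    """Return how many servers the current load calls for.

    All parameters are hypothetical: capacity_per_server is the number of
    requests per second one server is assumed to handle comfortably.
    """
    needed = math.ceil(requests_per_second / capacity_per_server)
    # Never drop below the minimum fleet or grow past the budgeted maximum.
    return max(min_servers, min(max_servers, needed))


# Assumed load samples (requests per second) over the course of a day.
loads = [80, 250, 900, 120]
fleet = [desired_servers(load, capacity_per_server=100) for load in loads]
print(fleet)  # [1, 3, 9, 2]
```

The fleet grows as traffic rises and shrinks again once it subsides, so capacity follows demand instead of having to be planned for the peak in advance.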
What this reveals is that organizations need to look beyond superficial success. This is true for the development of systems as well as business models. If you want to remain agile in a complex environment, you must follow this path, even if it means leaving the comfort zone. If we transfer these ideas into an organizational context, three aspects might be worth considering:
1. Embrace error as a matter of fact
Jeff Bezos once said about Amazon: "I believe we are the best place in the world to fail." That inspires a lot of our people to experiment, find errors and turn them into something innovative. A statement like this from leadership encourages people to look actively for errors. Go a step further and reward employees when they find them. What we have learned from our development work at Amazon is that you always need to look beyond the surface of an error. Some of our best products have been born from errors.
2. Make do with incomplete information
German companies have a tradition of being thorough and perfectionist. In the digital world, however, you need to loosen those principles a bit. Technology is changing so fast; you need to be fast too. Make decisions even if the information you have is not as complete as you would like. Jeff Bezos put his finger on that when he wrote in his most recent letter to shareholders that "most decisions should probably be made with somewhere around 70% of the information you wish you had. If you wait for 90%, in most cases, you're probably being slow. Plus, either way, you need to be good at quickly recognizing and correcting bad decisions. If you're good at course correcting, being wrong may be less costly than you think, whereas being slow is going to be expensive for sure."
3. Praise the value of learning
I've stressed the need for companies to have a systematic approach to how they deal with errors. But your approach will only work if it's part of your overall culture. Make sure you understand your DNA and know what people are thinking and talking about on the work floor. Openly praising experimentation in product development and encouraging people to find errors will come across as empty rhetoric if your employees have reason to fear personal repercussions when they make mistakes.
It is a matter of leadership to foster and shape a culture of experimentation that is practiced day in, day out.
Whatever companies come up with in order to systematically learn from mistakes, it will make them better at competing in the digital world. And it will give them the freedom and courage to take their systems, solutions and business models to a higher level.