The idea of local maximas was brought home to me recently by this post from an AI conference. The idea is applicable to many things, from adaptive filtering for telecommunications to seeing social trends. It is one of the things people are warned about in any introductory level machine learning or artificial intelligence class.
All optimization or machine learning boils down to optimizing a “cost function” which may be explicit or implicit. What this means is that we have some way of telling the machine what is a good or bad result or some rule by which the machine can figure out what is a good or bad result.
For example, suppose we want to build a machine that can look at satellite pictures and identify tanks moving on the ground. So we will train it by showing it a lot of pictures. Some of which we know have tanks in them. The machine will then be told to classify the pictures.
At the start, the machine will classify them arbitrarily into tank and non-tank pictures. The next step is crucial, we tell it which pictures it got wrong. That is which tank pictures it mis-classified as containing no tanks and which non-tank pictures it classified as having tanks. This allows the machine to refine its criteria for selecting pictures (using predefined rules). After the criteria refinement, it tries to classify the pictures again. Again it makes some mistakes and is told which pictures it got wrong. Again it refines its criteria and so on.
After thousands of such iterations, if the machine is smart enough (sufficiently sophisticated learning algorithm) and the data set is diverse enough (pictures have enough tanks in enough positions), the machine develops a pretty good idea of what to look for when searching for a tank.
In the above example, the cost function would be the number of pictures the machine mis-classified and we would optimize by minimizing this.
There is another class of machine learning (unsupervised) where we don’t have to explicitly tell what the machine is done wrong, but instead give some metric or formula by which the machine figures out for itself how good or bad a job it has done. But for our consideration, lets stick to the type I described above.
In many cases, the cost function may be a mathematical expression. Lets take an example, suppose we have some mathematical cost function,
y = f(x)
and we want to find the highest value this can take. This is an extremely common problem when you consider that the function may represent anything from speed achieved to audio quality to number of dollars saved.
Now one way to find the maximum is to start off at any random point. Move x in an arbitrary direction and see if y becomes larger or smaller. If it becomes larger then move 1 step in that direction and repeat.
For example, consider the plot below, say we start at point A.
Now lets say we check what happens if we decrease x and move in the left direction, we reach A1. This is smaller than A, so we won’t go in this direction. Instead, we see that if we increase x and move to the right on the graph to point A2, then we increase our value of y. So we take 1 step to the right and arrive at point A2. Now again we check both directions and see that moving to the right still increases our value, so we keep moving to the right. This continues till we reach point B. Now if we take any more steps either to the right or left, we get a smaller value (at B1 or B2). So we can say that we are at a maximum point at B. If we shift from here in any direction, the value we want becomes smaller. We are at a peak. So the solution to our maximization problem is point B.
What I have described is pretty much how many of these things are actually done in practice (though there are a bazillion ways of optimizing the process to make it faster, cheaper, robust or easier to work with on other types of data). Note that here I have only one independent variable x and the value of y depends only on the value of x. In the real world, the result may depend on a number of factors hence we may have many variables, y= f(x1, x2, x3 ….).
But there is a fundamental problem with the algorithm I have described above for finding the maximum value. It only finds a peak in the function, it does not guarantee that this is the highest peak.
To illustrate, see the figure below.
Suppose we start at point D. We see that if we move to the right, the function gets bigger, so we move to the right until we reach A. Now if we go further to the right or left, the value becomes smaller, so we declare A to be a peak and stop. We never even consider point C which is an even higher peak.
This is known as the local maxima problem, where our efforts yield a solution that looks optimum but really there is another even better solution which we can’t get to because we have stopped at the smaller peak.
In 2015, Google’s DeepMind produced a program called AlphaGo to play the ancient Chinese strategy game of Go. In Go, each player is assigned a colour and the players take turns to put one black or white piece on the grid like board. The object is to encircle the pieces placed by your opponent. Once a set of pieces are encircled by the other side, they are considered lost and are taken off the board. At the end of the game, the territory on the board held by a player and the number of pieces captured are counted to determine the winner.
The rules of Go are extremely simple. A friend taught me the game in about 10 minutes and we started playing. But the complexity of the game play far exceeds chess. The number of possibilities in a game of Go can exceed the number of atoms in the known universe.
White captures 2 black pieces by surrounding them
In March 2016, AlphaGo defeated professional Go player Lee Sedol of South Korea 4 games to 1. Sedol holds 18 international level Go titles which is the second highest held by anyone in the world today.
The interesting thing is not that machines have beaten humans at another sport, it is the fact that AlphaGo won using a strategy that human masters had dismissed as stupid centuries ago. One Go master said that he would have slapped a student for playing the strategy AlphaGo won with.
For centuries, Go strategy had been stuck on a local maxima, with people not experimenting with other techniques because they were considered inferior. Who knows what other local maximas we are stuck with in our society.