Joel T. Kaardal, Ph.D.
Neural computation and machine intelligence

For the uninitiated, one of the fundamental motivations for the development of the field of machine learning is the overwhelming difficulty of explicitly programming deterministic code to satisfy all contingencies of a complex system. For instance, writing a program that identifies faces has been difficult to solve by hand because it is not clear what logic is necessary to create a good quality face recognizer, especially in settings where lighting, viewing angle, or occlusions (say, a hat) can complicate matters. Through the development of deep learning methods, like the convolutional neural network, such difficult tasks are now possible (and facial recognition software has made giant strides in the last decade on social media) by having a computer use numerical optimization techniques to approximate the logic of complex tasks. This is usually done by minimizing the prediction error of some objective (or perhaps an "energy") function.

That is all wonderful, but what if we do not know the objective function? An example would be teaching a robot to play air hockey. To become a competent player, the robot needs to solve a vision task, predict the mechanics of the puck, and execute a motor task to change the course of the game; it is not clear what single objective function must be optimized to accomplish all of this. Such problems are the focus of reinforcement learning where, instead of optimizing a known objective function, the parameters of the system are adjusted to maximize reward (and avoid penalties) through trial and error, much like how we may tinker to learn a new skill.

In this application, reinforcement learning is applied to a simple example problem: learning to play the game tic-tac-toe. Instead of using Q-learning, the value and action functions are approximated by a mid-sized residual neural network. Since this is a simple example, only one residual bypass is used, which connects the input layer directly to the output layer. Despite the simplicity of the application, the code is rather lengthy because it includes the game code, the network construction code, and various other functions important to the reinforcement learning algorithm. Therefore, the following discussion will focus on some of the choices that were made in constructing the network and algorithm. With some wiggling of the hyperparameters, I managed to get a >95% win rate against an opponent that makes random feasible choices. Perhaps you can do better.


Tic-tac-toe is a game with finite length where the outcome of the game is not known until the final board state is reached and either one player wins or the game is a draw. Furthermore, the game length requires a minimum of three moves by the "x" player who goes first. Therefore, tic-tac-toe is an example of delayed reward reinforcement learning where the reward/penalty is assessed at the terminal state of the game and not at intermediate states. Rewards are then accumulated for each action taken by the player based on the reward/penalty, \( Y \), at the terminal state of the game:

$$ \small y_t = \gamma y_{t+2} \text{ for } t \in \{ 1+\delta, 3+\delta, 5+\delta, \cdots , \tau-3-\delta \} \\ \small \text{ and } 0 \leq \gamma \leq 1 \text{ where } y_{\tau-1-\delta} = Y. $$

The reward is recorded from every other state for a given player because a player only performs an action every other turn, starting from turn \( 1 + \delta = 1 \) if the player moves first ("x") and \( 1 + \delta = 2 \) if the opponent moves first. The variable \( \gamma \) is a discount factor that can be used to diminish rewards that are distant from the \( t \)th state. A discount factor can be helpful as a rough estimate of how likely the choice made at state \( t \) was to have caused the eventual reward. The accumulation of the rewards and penalties for a given state of the game is known as a value function.
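For concreteness, the reward backup above can be written in a few lines of NumPy. This is a hypothetical helper (name, signature, and defaults are my own, not the script's):

```python
import numpy as np

def accumulate_rewards(Y, tau, delta=0, gamma=0.9):
    """Back the terminal reward/penalty Y up through the turns on which the
    trained player acted: t = 1+delta, 3+delta, ..., tau-1-delta (assumes
    tau-1-delta has the same parity as 1+delta)."""
    turns = range(1 + delta, tau - delta, 2)  # the player's action turns
    y = np.empty(len(turns))
    y[-1] = Y                        # y_{tau-1-delta} = Y at the terminal state
    for i in range(len(turns) - 2, -1, -1):
        y[i] = gamma * y[i + 1]      # y_t = gamma * y_{t+2}
    return y
```

With \( \gamma = 1 \) every action in a won game is credited equally; with \( \gamma < 1 \), earlier moves receive exponentially less credit.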

Unfortunately, even for such a simple game, a non-parametric (tabular) estimate of the value function is impractical because it requires an entry for every possible board state (naively, \( 3^9 = 19{,}683 \) configurations, before pruning unreachable boards). Instead, the value function is estimated by a feed-forward neural network with a residual bypass to help limit the influence of the vanishing gradient problem on training of the network weights. The network architecture is shown in Figure 1 with a transduction layer, one hidden layer (the script provides infrastructure for more hidden layers, if desired), and an output layer.

Figure 1: a simple residual network is constructed from \( D = 9 \) transduction units that read in the prior state of the board, one single hidden layer with \( M = 200 \) units, and \( D = 9 \) output units that provide an estimate of the probability that a chosen game board position will lead to a reward.

The transduction layer receives the board state as input; each unit in this layer has an unknown bias term and an unknown nine-dimensional weight vector, one element for each of the nine positions on the game board. The output layer is composed of nine logistic units that correspond to each position on the board and can be used to predict the probability that a given move will (later) lead to a reward. With the exception of the output layer, the layers are composed of rectified linear units.
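A forward pass through this architecture might look like the following NumPy sketch. The parameter names (`W0`, `Wr`, etc.) and the choice to add the bypass before the logistic nonlinearity are assumptions on my part, not the script's API:

```python
import numpy as np

def forward(s, params):
    """Forward pass: transduction layer (ReLU), one hidden layer (ReLU),
    and a logistic output layer with a residual bypass Wr connecting the
    9-dimensional input directly to the output pre-activations."""
    relu = lambda x: np.maximum(x, 0.0)
    h0 = relu(params["W0"] @ s + params["b0"])                 # transduction, D = 9
    h1 = relu(params["W1"] @ h0 + params["b1"])                # hidden, M = 200
    z = params["W2"] @ h1 + params["b2"] + params["Wr"] @ s    # residual bypass
    return 1.0 / (1.0 + np.exp(-z))                            # 9 logistic units
```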

The game board is represented by a 3 by 3 matrix where open positions are set to zero. The trained player (the one subject to reinforcement learning) is represented by 1 while the opponent is represented by -1. The state vector, \( \mathbf{s}_t \), is then the game board matrix at time \( t \) (\( t \) is incremented by one each time either the player or opponent makes a move) unrolled into a vector. A game as a whole is represented by the concatenation of these state vectors into the sample matrix, \( \mathbf{S} = \left[ \mathbf{s}_1, \, \cdots, \, \mathbf{s}_\tau \right] \). If, at the final state, the trained player wins, the reward is +1; if the player loses, it is -1. A draw incurs neither a reward nor a penalty. The rewards calculated for each state over the course of a game can then be computed using the equation from above and stored in the rewards column vector, \( \mathbf{y}^\mathrm{T} = \left[ y_{1+\delta}, \, \cdots, \, y_{\tau-1-\delta} \right] \).
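The encoding and the terminal reward can be sketched as follows (`terminal_reward` is a hypothetical helper consistent with the description above, not the script's):

```python
import numpy as np

# +1 marks the trained player, -1 the opponent, 0 an open square.
# This example board is a win for the player along the main diagonal.
board = np.array([[ 1, -1,  0],
                  [ 0,  1, -1],
                  [ 0,  0,  1]])
s_t = board.flatten()   # the unrolled 9-dimensional state vector for this turn

def terminal_reward(board):
    """+1 if the trained player has three in a row, -1 if the opponent
    does, and 0 for a draw."""
    lines = np.concatenate([
        board.sum(axis=0),                               # columns
        board.sum(axis=1),                               # rows
        [np.trace(board), np.trace(np.fliplr(board))],   # diagonals
    ])
    if 3 in lines:
        return 1
    if -3 in lines:
        return -1
    return 0
```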

Once several games have been played, and rewards have been assessed, the network weights may be updated using the batches of matrices \( \{ \mathbf{S} \} \) and vectors \( \{ \mathbf{y} \} \). This is done by taking the gradient of the negative information gain:

$$ \small L = - \frac{\mathbf{1}^\mathrm{T}}{|\{ \mathbf{y} \}|} \sum_{\left(\mathbf{S} \in \{ \mathbf{S} \}, \mathbf{y} \in \{ \mathbf{y} \} \right)} \bigg[ \left( \mathbf{S}_{2+\delta:2:\tau-\delta} - \mathbf{S}_{1+\delta:2:\tau-1-\delta} \right) \circ {\log\mathbf{P}(\mathbf{S}_{2+\delta:2:\tau-\delta}|\mathbf{S}_{1+\delta:2:\tau-1-\delta})} \bigg] \mathbf{y} $$

where \( \mathbf{S}_{1:2:\tau-1} \) is a submatrix of \( \mathbf{S} \) composed of every other column between columns 1 through \( \tau-1 \) (e.g. 1, 3, 5, ...), \( \mathbf{P}(\mathbf{S}_{2+\delta:2:\tau-\delta}|\mathbf{S}_{1+\delta:2:\tau-1-\delta}) \) is the vectorized state of the network's output layer with rows corresponding to game board positions and columns corresponding to the predictions at times \( t = 2+\delta:2:\tau-\delta \), the logarithm is treated as an elementwise operation, \( \circ \) is the Hadamard product, \( \mathbf{1} \) is a vector of ones, and \( | \cdot | \) denotes the cardinality operator. If \( \mathbf{w} \) is a vector of all the unknown weights, then the weights are updated using the gradient of \( L(\mathbf{w}) \) via the update

$$ \small \mathbf{w} \leftarrow \mathbf{w} - \boldsymbol{\alpha} \circ \boldsymbol{\nabla} L|_\mathrm{w} $$

where \( \boldsymbol{\alpha} \) is an adaptive learning rate. The adaptive learning rate used is RMSprop with an annealing schedule on the step length to ensure convergence. The algorithm also takes advantage of invariance by noting that the moves and outcome of a game are unchanged when the game board is rotated by 90, 180, or 270 degrees or reflected (Figure 2), and augments the training data to account for these invariant transformations.
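To make the update concrete, here is a minimal NumPy sketch of the per-game term of \( L \) and of one RMSprop step. Function names, defaults, and the exact batching are my own assumptions, not the script's; note that because the player's move adds a single +1 to the board, the state difference in \( L \) is a one-hot vector marking the chosen square, so the Hadamard product simply picks out \( \log P \) of the action actually taken:

```python
import numpy as np

def game_loss(S, P, y, delta=0):
    """Per-game term of the negative information gain L. Columns of S are
    the unrolled board states s_1 ... s_tau; P holds the network outputs
    for the states preceding each of the trained player's moves; y is the
    reward vector. The full loss averages this over a batch of games."""
    tau = S.shape[1]
    before = S[:, delta:tau - 1 - delta:2]   # s_{1+delta}, s_{3+delta}, ...
    after = S[:, delta + 1:tau - delta:2]    # s_{2+delta}, s_{4+delta}, ...
    moves = after - before                   # one-hot columns (chosen squares)
    return -float((moves * np.log(P)).sum(axis=0) @ y)

def rmsprop_update(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSprop step on the flattened weight vector w; `cache` is the
    running mean of squared gradients and `lr` is the step length that an
    annealing schedule would shrink over the course of training."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```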

Figure 2: invariant transformations of the game board that will procure the same reward/penalty. The terminal state of a match is boxed in red. To the right, the terminal state is subjected to 90 degree rotations of the game board. Towards the bottom, the terminal state is flipped horizontally and then subjected to 90 degree rotations.
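With NumPy, the eight symmetry-equivalent boards of Figure 2 can be generated directly (a hypothetical helper; when augmenting, the same transformation must of course be applied to every state of a game so the recorded moves stay consistent):

```python
import numpy as np

def augment(board):
    """Return the eight symmetry variants of a board: the four rotations
    of the board itself and the four rotations of its horizontal flip."""
    return [np.rot90(b, k) for b in (board, np.fliplr(board)) for k in range(4)]
```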

To avoid a self-fulfilling prophecy where the reinforcement causes convergence to a suboptimal solution without exploring other possibilities that may improve play, an epsilon-greedy algorithm is implemented. To give an intuitive example of why this may occur, suppose one is using reinforcement learning to choose between three actions, but through an unfortunate initialization the first action is chosen with a probability of 1 (and the rest 0). If choosing the first action still leads to a net reward, the network may end up stuck choosing the first action and never try the others. The epsilon-greedy algorithm can help by having the reinforcement learning algorithm choose a random action with probability \( \epsilon \). Usually, this probability is high at the beginning of the optimization and decreases over its course.
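An epsilon-greedy move selection can be sketched as follows (a hypothetical helper: `p` is the network's nine output probabilities and `open_positions` indexes the empty squares):

```python
import numpy as np

def epsilon_greedy(p, open_positions, epsilon, rng):
    """With probability epsilon pick a random feasible square; otherwise
    pick the feasible square with the highest predicted reward."""
    if rng.random() < epsilon:
        return int(rng.choice(open_positions))
    return int(open_positions[np.argmax(p[open_positions])])
```

An annealing schedule on \( \epsilon \) (large early, small late) then shifts play from exploration toward exploitation as training proceeds.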

The algorithm makes it possible to have the player compete against a random opponent that chooses feasible board positions with uniform probability, against its own network (self-play), against another network (in some sense, similar to generative adversarial networks except that the networks have symmetric goals), and against a human opponent. The latter choice is not recommended because it takes on the order of \( 10^6 \) games for the network to become (somewhat) competent, which is likely to test the patience of just about everyone who has ever existed.