Method of finite differences

The first thing to understand is that ARS is a shallow-learning algorithm in which we only approximate gradients, which differs a lot from how we'd calculate gradients in traditional gradient descent.

Previously, with gradient descent, we'd say that:
$\hspace{2cm} \Delta w = -\alpha \dfrac{\partial\epsilon}{\partial w_{ij}}$ $\hspace{5.3cm}$ Where $\epsilon$ = error function, $\alpha$ = learning rate and $w$ = weight

But now:
$\hspace{2cm} \Delta w = \dfrac{\alpha}{N} \sum \limits_{k=1}^{N} [r(\pi_{j,k,+}) - r(\pi_{j,k,-})] \delta_k$ $\hspace{1.8cm}$ Where N = number of directions sampled per iteration, r = reward, $\pi_{j,k,\pm}$ = the policy with its weights
$\hspace{10.2cm}$ perturbed by $\pm\nu\delta_k$ ($\nu$ being the standard deviation of the exploration noise), $\delta$ = perturbation matrix (explained below)

This is based on the finite-difference approximation of the derivative, for a small step $h$:
$\hspace{2cm} f^\prime(x) \approx \dfrac{f(x + h) - f(x)}{h}$

So how this works is that the gradient of the reward w.r.t. the weights is approximated from the difference between the rewards obtained by applying two very small perturbations in opposite directions (the method of finite differences).
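
To make the idea concrete, here is a minimal sketch (not part of ARS itself) of estimating a gradient with two-sided finite differences; the `reward` function and the step size `h` are made-up placeholders:

```python
import numpy as np

def finite_difference_gradient(reward, w, h=1e-4):
    """Estimate d(reward)/dw by perturbing each weight in two opposite directions."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step.flat[i] = h
        # Central difference: compare the rewards of two opposite perturbations.
        grad.flat[i] = (reward(w + step) - reward(w - step)) / (2 * h)
    return grad

# Toy reward function (hypothetical, for illustration only).
w = np.array([0.5, -1.0, 2.0])
print(finite_difference_gradient(lambda x: -np.sum(x ** 2), w))
```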

Now suppose we had a perceptron with:

  • 3 input values
  • 2 output values
  • Therefore 3 x 2 = 6 weights, one for each of the 6 synaptic connections

Our matrix of weights will then be shaped as:

$\begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \\ w_{3,1} & w_{3,2} \\ \end{bmatrix}$

And our matrix of positively perturbed weights will be:

$\begin{bmatrix} w_{1,1} + kp & w_{1,2} + kp \\ w_{2,1} + kp & w_{2,2} + kp \\ w_{3,1} + kp & w_{3,2} + kp \\ \end{bmatrix}$

With negatively perturbed weights being:

$\begin{bmatrix} w_{1,1} - kp & w_{1,2} - kp \\ w_{2,1} - kp & w_{2,2} - kp \\ w_{3,1} - kp & w_{3,2} - kp \\ \end{bmatrix}$

$p$ is a random number between 0 and 1 used to perturb the weights (in practice a fresh random value is drawn for each weight, forming the perturbation matrix $\delta$), and $k$ is the exploration noise, a small constant that scales the size of the perturbation.
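
A minimal NumPy sketch of how these perturbed matrices could be built, assuming a 3 x 2 weight matrix and treating `k` (the exploration noise) and the random matrix `p` as described above; the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

weights = rng.standard_normal((3, 2))   # 3 inputs x 2 outputs
k = 0.03                                # exploration noise (small scaling constant)
p = rng.random((3, 2))                  # random numbers in [0, 1), one per weight
delta = k * p                           # the perturbation matrix

weights_pos = weights + delta           # positively perturbed weights
weights_neg = weights - delta           # negatively perturbed weights
```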

"Parameter noise lets us teach agents tasks much more rapidly than with other approaches. After learning for 20 episodes on the HalfCheetah Gym environment (shown above), the policy achieves a score of around 3,000, whereas a policy trained with traditional action noise only achieves around 1,500." - https://blog.openai.com/better-exploration-with-parameter-noise/

In this program we'll generate 16 instances of positively and negatively perturbed weights, labelled $a$ - $p$ for generalisation purposes (as $p$ is the $16^{th}$ letter of the alphabet). Each perturbed weight matrix will have its own sample of episodes, which will be averaged out at the end of training.

Updating weights with method of finite differences

$w = w_{prev} + \alpha ((Reward_{a-pos} - Reward_{a-neg}) \times \delta{_a}$ $ + (Reward_{b-pos} - Reward_{b-neg}) \times \delta{_b}$ $ + (Reward_{c-pos} - Reward_{c-neg}) \times \delta{_c}$ $\dots$ $ + (Reward_{p-pos} - Reward_{p-neg}) \times \delta{_p})$

$\delta{_x}$ are the small added/subtracted values used to perturb weights $x$ - it's the perturbation matrix.

$\alpha$ is the learning rate divided by the number of perturbations.

So we can think of the first term (direction $a$) like this:
$w = w_{prev} + ((Reward_{a-pos} \times \delta{_a}) - (Reward_{a-neg} \times \delta{_a}))$

Expanding out the expression, we can think of the $Reward_{x-posOrNeg}$ as a coefficient to the perturbation matrix $\delta{_x}$, which stops the update term from collapsing to zero, because if we instead had:

$w = w_{prev} + ((a_{pos} - a_{neg}) \times \delta{_a})$

We'd have an issue where $a_{pos}$ and $a_{neg}$ would cancel out. But by doing it the way portrayed above - looking at the rewards obtained from the different perturbations - we can move the new weight in the direction of the better reward, since $((Reward_{a-pos} \times \delta{_a}) - (Reward_{a-neg} \times \delta{_a}))$ gives a value with both magnitude and direction, acting as a coefficient to the perturbation matrix.
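
A minimal sketch of this update step in NumPy, assuming we already have the positive/negative rollout rewards and the corresponding perturbation matrices for the sampled directions (the names here are illustrative, not from any specific implementation):

```python
import numpy as np

def update_weights(weights, rewards_pos, rewards_neg, deltas, learning_rate):
    """Finite-difference update: move the weights towards the better-rewarded perturbations.

    rewards_pos / rewards_neg: length-N sequences of rewards from the +/- perturbed policies.
    deltas: sequence of N perturbation matrices, each the same shape as `weights`.
    """
    step = sum((r_pos - r_neg) * delta
               for r_pos, r_neg, delta in zip(rewards_pos, rewards_neg, deltas))
    # alpha here is the learning rate divided by the number of perturbations.
    return weights + (learning_rate / len(deltas)) * step
```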

Augmenting Basic Random Search

There were three main updates done in ARS which make it "augmented":

  1. Scaling update step by standard deviation of rewards.
  2. Online normalisation of states.
  3. Discarding directions that yield lowest rewards.

1) This is simply taking the equation arrived at earlier, where:
$w = w_{prev} + \alpha ((Reward_{a-pos} - Reward_{a-neg}) \times \delta{_a}$ $ + (Reward_{b-pos} - Reward_{b-neg}) \times \delta{_b}$ $ + (Reward_{c-pos} - Reward_{c-neg}) \times \delta{_c}$ $\dots$ $ + (Reward_{p-pos} - Reward_{p-neg}) \times \delta{_p})$

And then dividing the update term by the standard deviation of the rewards involved.

As explained in section 3.1 of the research paper, standard deviation scaling accounts for the large variance that results from a random search in the parameter space of policies, which otherwise makes it difficult to choose an optimal learning rate $\alpha$.
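
Building on the `update_weights` sketch above, the only change is to scale the step by the standard deviation of all the rewards collected in that iteration (again an illustrative sketch, not a definitive implementation):

```python
import numpy as np

def update_weights_scaled(weights, rewards_pos, rewards_neg, deltas, learning_rate):
    """As update_weights above, but the step is scaled by the reward standard deviation."""
    sigma_r = np.std(np.concatenate([rewards_pos, rewards_neg]))
    step = sum((r_pos - r_neg) * delta
               for r_pos, r_neg, delta in zip(rewards_pos, rewards_neg, deltas))
    return weights + learning_rate / (len(deltas) * sigma_r) * step
```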

2) Online normalisation of states:
This is where the information of our agent's state (the signals into the input neurons) will be normalised in real time as the agent is learning and navigating through its environment. We will normalise the values not only based on what they are but also based on what the network has already seen, to account for the stochastic nature of the environment.

This is because changing weights by a factor of maybe 0.1 will have a much larger effect on a signal of magnitude 100 than on a signal of magnitude 1, and therefore causes a drastic change in the output (this is made worse by the fact that we're using shallow learning).

So without normalisation, the slightest change in weights could have a hugely different, undesired magnitude of impact on the output value of our network. This is one of the big updates which helped ARS take on the challenge of training a virtual humanoid to walk.
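
One common way to implement this online normalisation is with running mean and variance statistics that are updated on every state the agent observes, roughly along these lines (a sketch under those assumptions, not the exact code of any particular implementation):

```python
import numpy as np

class OnlineNormalizer:
    """Keeps a running mean and variance of observed states and normalises new ones."""

    def __init__(self, n_inputs):
        self.n = 0
        self.mean = np.zeros(n_inputs)
        self.mean_diff = np.zeros(n_inputs)   # running sum of squared deviations

    def observe(self, state):
        """Update the running statistics with a newly observed state."""
        self.n += 1
        last_mean = self.mean.copy()
        self.mean += (state - self.mean) / self.n
        self.mean_diff += (state - last_mean) * (state - self.mean)

    def normalize(self, state):
        """Normalise a state using everything the network has seen so far."""
        var = self.mean_diff / max(self.n, 1)
        std = np.sqrt(np.clip(var, 1e-8, None))
        return (state - self.mean) / std
```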

3) Discarding directions that yield lowest rewards:
Here we rank the directions by the rewards they yield and keep only the top $k$, together with their associated perturbation matrices, meaning that the weights will only evolve in the directions of the perturbation matrices observed to be most successful.

E.g.
$w = w_{prev} + \alpha \cdot ((Reward_{a-pos} - Reward_{a-neg}) \times \delta{_a}$ $+ (Reward_{b-pos} - Reward_{b-neg}) \times \delta{_b}$ $+ (Reward_{p-pos} - Reward_{p-neg}) \times \delta{_p})$

This is one way in which we can start optimising our weights once we discard step vector $c$ and all other directions from $d$ through $o$, keeping only the most successful directions $a$, $b$ and $p$.
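
A sketch of how this top-$k$ selection could look in code, assuming the rewards are stored in NumPy arrays and the perturbation matrices in a list, one entry per direction (the names are illustrative):

```python
import numpy as np

def select_top_directions(rewards_pos, rewards_neg, deltas, k):
    """Keep only the k directions whose best reward (max of the +/- rollouts) is highest."""
    scores = np.maximum(rewards_pos, rewards_neg)
    best = np.argsort(scores)[-k:]     # indices of the k highest-scoring directions
    return rewards_pos[best], rewards_neg[best], [deltas[i] for i in best]
```

Ranking each direction by the better of its two rollouts is one reasonable choice; the surviving directions are then fed into the update step described above.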

Why ARS?

  1. ARS performs exploration in the policy space, where other AIs perform exploration in the action space.
  2. Method of finite differences vs Gradient descent.
  3. Perceptron based shallow learning vs deep learning.

1) What this means is that rewards are accumulated throughout the entire episode into a total reward, and ARS then looks at that total reward after the episode. This is in contrast to other AIs which will update the weights after each action. So we're exploring whole episodes, meaning that we're exploring the policy space, whereas with other AIs we would be exploring the actions and therefore the action space.

2) As ARS assesses the policy space, we don't have a value function derived from an action space, so we can't have a loss function to backpropagate against, which is why we can't use gradient descent. However, ARS still gives a good enough approximation of the gradient without being as computationally expensive, as discussed in the paper.

3) There are no hidden layers: the input layer is connected directly to the output layer, which is simpler and less involved.

You would think ARS wouldn't perform as well as other methods of creating AIs, but it actually performs a lot better in many cases despite being so much less involved. It approximates gradients, assesses the policy space rather than the action space, and begins with a random search.

Yet despite this, the paper shows that ARS can perform up to 15$\times$ faster and will yield higher rewards than other methods in many specific cases.

Updating the weights with method of finite differences (Heart of ARS algorithm)

New weight matrix = $\theta_{j+1} = \theta_{j} + \dfrac{\alpha}{b \cdot \sigma_{r}} \sum \limits_{k=1}^{b} [r(\pi_{j,k,+}) - r(\pi_{j,k,-})] \delta_k$ $\hspace{1cm}$ Where $b$ = number of top-performing directions kept and $\sigma_r$ = standard deviation of the rewards

Unlike gradient descent, we don't need to and can't use an error function, because ARS evaluates the policy space as opposed to the action space, which means that we can't work out the Q-values of states and therefore temporal differences.

So instead of minimising a loss function, we directly have access to the rewards, and we try to maximise them by updating our weights in the direction of the perturbations that yield the most lucrative rewards.
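
Putting the pieces together, a sketch of what this full update step might look like in code, reusing the ideas from the earlier sketches (the function name, arguments and structure are illustrative, not a definitive implementation):

```python
import numpy as np

def ars_update(theta, rewards_pos, rewards_neg, deltas, learning_rate, top_b):
    """One ARS weight update: keep the top-b directions, then step scaled by the reward std."""
    # Rank directions by their best-case reward and keep the b most successful ones.
    scores = np.maximum(rewards_pos, rewards_neg)
    best = np.argsort(scores)[-top_b:]
    r_pos, r_neg = rewards_pos[best], rewards_neg[best]
    kept_deltas = [deltas[i] for i in best]

    # sigma_r: standard deviation of the rewards used in this update step.
    sigma_r = np.std(np.concatenate([r_pos, r_neg]))

    # Finite-difference step in the direction of the better-rewarded perturbations.
    step = sum((rp - rn) * d for rp, rn, d in zip(r_pos, r_neg, kept_deltas))
    return theta + learning_rate / (top_b * sigma_r) * step
```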