Feedback

Figure 4.6: Evaluation of the Vital State
\includegraphics[width=4.5in]{animat2_evaluation.eps}

Figure 4.6 shows the general idea of the specialized feedback based on the vital state. The first assumption is that the perception (P) has a built-in causal relationship impact() onto the vital state (V): an event in the perception can increase or decrease ENERGY. The absolute level of ENERGY is then compared with a threshold $ \theta$. If ENERGY is above the threshold $ \theta$ then the parameter VITAL (V) is set to '1'; otherwise it is set to '0'.


\begin{displaymath}
\begin{array}{lclr}
impact1 & : & P \longmapsto ENERGY & (4.75)\\
impact2 & : & ENERGY \times \{\theta\} \longmapsto \cal{V} & (4.76)\\
impact & = & impact1 \otimes impact2 & (4.77)
\end{array}
\end{displaymath}

\begin{displaymath}
impact2 = \left\{
\begin{array}{lcl}
1 & , & \mbox{if } ENERGY > \theta\\
0 & , & \mbox{if } ENERGY \leq \theta
\end{array} \right.
\end{displaymath}
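
The mapping $ impact2$ can be illustrated with a minimal sketch in Python. The threshold value THETA, the function name energy_to_vital, and the numeric example are assumptions for illustration only.

\begin{verbatim}
THETA = 100.0  # assumed threshold theta (illustrative value)

def energy_to_vital(energy: float, theta: float = THETA) -> int:
    """Sketch of impact2: return VITAL = 1 if ENERGY lies above theta, else 0."""
    return 1 if energy > theta else 0

# Example: 120 units of energy give VITAL = 1, 80 units give VITAL = 0.
assert energy_to_vital(120.0) == 1
assert energy_to_vital(80.0) == 0
\end{verbatim}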

Based on the actual perception $ P$ and the actual vital state $ V$, the memory, represented by the set of classifiers CLASSIF, is matched: all classifiers $ c_{i} \in CLASSIF$ which agree with $ (P,V)$ are collected into the match-set (M). From this match-set $ M$ the classifier with the highest reward value (R) is selected; if there is more than one such classifier, one of them is selected randomly. From the selected classifier the action (A) is then taken and executed.


\begin{displaymath}
\begin{array}{lclr}
match & : & \cal{P} \times \cal{V} \times \cal{C} \longmapsto \cal{M} & (4.78)\\
action & : & \cal{M} \longmapsto ACT & (4.79)\\
aout & : & ACT \longmapsto POS & (4.80)
\end{array}
\end{displaymath}
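
One possible reading of the mappings $ match$ and $ action$ as a procedure is sketched below in Python. The tuple representation of a classifier (condition on P, condition on V, action A, reward value R) and all names and example values are assumptions.

\begin{verbatim}
import random
from typing import List, NamedTuple

class Classifier(NamedTuple):
    """Illustrative classifier: condition (perception, vital), action A, reward R."""
    perception: str
    vital: int
    action: str
    reward: float

def match(p: str, v: int, classif: List[Classifier]) -> List[Classifier]:
    """Build the match-set M: all classifiers that agree with (P, V)."""
    return [c for c in classif if c.perception == p and c.vital == v]

def select_action(match_set: List[Classifier]) -> str:
    """Select the action of a highest-reward classifier; ties are broken randomly."""
    best = max(c.reward for c in match_set)
    return random.choice([c for c in match_set if c.reward == best]).action

# Example usage with invented classifiers:
classif = [Classifier("food-ahead", 1, "move", 0.5),
           Classifier("food-ahead", 1, "eat", 0.9),
           Classifier("wall-ahead", 0, "turn", 0.2)]
print(select_action(match("food-ahead", 1, classif)))   # -> "eat"
\end{verbatim}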

This action $ A$, as part of a classifier $ C \in CLASSIF$, written $ C[A]$, generates a new perception $ P'$ which again can change the energy level and with it the vital state $ V'$. When the energy difference $ EDIFF > 0$ is positive, signaling an increase in energy, this causes a reward action (REW+) for all classifiers $ C_{i}[A_{j}]$ of CLASSIF whose actions preceded the last action $ A$, written as (with $ E := ENERGY$, $ CL := CLASSIF$, $ L := LASTACTS$):


\begin{displaymath}
\begin{array}{lclr}
eval & : & E_{OLD} \times E_{NEW} \times CL_{L} \longmapsto CL & (4.81)\\
CL_{L} & \subseteq & C^{n} & (4.82)\\
CL_{L} & = & C_{i}[A_{n}], C_{j}[A_{n-1}], C_{k}[A_{n-2}], \cdots, C_{r}[A_{1}], C[A] & (4.83)\\
C & \in & CL & (4.84)
\end{array}
\end{displaymath}
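
The evaluation $ eval$ can be sketched as follows, assuming ENERGY is read before and after the action. The function name and the reward amount $ R$ are assumptions; the actual distribution of $ R$ over $ CL_{L}$ is the topic of the following paragraphs.

\begin{verbatim}
from typing import List, Optional, Tuple

def eval_feedback(e_old: float, e_new: float,
                  cl_l: List[str], r: float = 1.0) -> Optional[Tuple[float, List[str]]]:
    """Sketch of eval: if EDIFF = E_NEW - E_OLD is positive, return the reward
    amount R together with the ordered list CL_L of classifiers whose actions
    preceded the last action; otherwise return None (no REW+)."""
    ediff = e_new - e_old
    return (r, list(cl_l)) if ediff > 0 else None
\end{verbatim}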

Here arises the crucial question of how the reward should be distributed over the participating actions, which can be understood as a sequence of actions $ \langle A_{1}, \cdots, A_{n}\rangle$ with the last action $ A_{n}$ being the action which caused the change of the internal states. The first action $ A_{1}$ is the first action after the last reward. In a general sense one has to assume that all actions before $ A_{n}$ have somehow contributed to the final success. But how much did every single action contribute?

From the point of view of the acting system it is interesting to learn some kind of statistic telling the likelihood of success when doing a certain action in a certain situation. Thus if an action $ A_{i}$ precedes a successful action $ A^{*}$ more often than an action $ A_{j}$ does, then this should somehow be encoded. Because we do not assume any kind of sequence memory here (not yet), we have to find an alternative mechanism.

An alternative approach could be to accumulate amounts of reward, saying that an action, independent of all other actions, has gained some amount of reward, therefore making this action preferable. An experimental approach could be the following one, where the success is partitioned proportionally:

  1. Only $ a_{n} $ with $ a_{n}+ R$
  2. Only $ a_{n-1}, a_{n}$ with 2:1, saying $ a_{n-1}+ \frac{1}{3}R, a_{n} + \frac{2}{3}R$
  3. Only $ a_{n-2}, a_{n-1}, a_{n}$ with 3:2:1, saying $ a_{n-2} + \frac{1}{6}R, a_{n-1}+ \frac{2}{6}R, a_{n} + \frac{3}{6}R$
  4. ....

As a general formula, the reward $ R$ is distributed in the proportion

\begin{displaymath}
1 : 2 : \cdots : n \quad\mbox{normalized by}\quad \sum_{i=1}^{n} i ,
\end{displaymath}

so that the $ i$-th action of the sequence receives the share $ \frac{i}{\sum_{j=1}^{n} j}\, R$.
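
Under the assumptions used so far, this proportional partitioning can be written as a small Python sketch; the representation of the action sequence as a list of score values is an assumption.

\begin{verbatim}
from typing import List

def distribute_reward(scores: List[float], r: float) -> List[float]:
    """Distribute the reward R over the action sequence <a_1, ..., a_n> in the
    proportion 1 : 2 : ... : n, so that the last action a_n gets the largest share."""
    n = len(scores)
    total = n * (n + 1) // 2                # sum_{i=1}^{n} i
    return [s + (i * r) / total for i, s in enumerate(scores, start=1)]

# Example for n = 3 and R = 1: shares 1/6, 2/6, 3/6 as in case 3 of the list above.
print(distribute_reward([0.0, 0.0, 0.0], 1.0))
\end{verbatim}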

In the long run this allows some ordering: the actions with the highest scores are those leading directly to the goal, and those with lower scores are those which precede the higher scoring ones.

Such a kind of proportional scoring implies that one has to assume some minimal action memory if $ n > 1$.

Gerd Doeben-Henisch 2012-03-31