Figure 4.6 shows the general idea of the specialized feedback based on the vital state. The first assumption is that the perception (P) has a built-in causal relationship impact() onto the vital state (V). If something happens, this can increase or decrease ENERGY. The absolute level of ENERGY is then compared with a certain threshold: if ENERGY is above the threshold, the parameter VITAL (V) is set to '1'; otherwise it is set to '0'.
(4.75)   impact : P x V --> V   (a perception can increase or decrease ENERGY)
(4.76)   threshold : ENERGY --> {0, 1}
(4.77)   VITAL = 1 if ENERGY is above the threshold, otherwise VITAL = 0
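This threshold mapping can be sketched minimally in Python; the concrete threshold value and the encoding of perceptions as energy effects are illustrative assumptions, not part of the original model:

```python
# Sketch of the vital-state thresholding described above.
# THRESHOLD and the perception effects are illustrative assumptions.

THRESHOLD = 50  # assumed energy threshold

def impact(perception: str, energy: int) -> int:
    """Assumed causal mapping impact(): a perception raises or lowers ENERGY."""
    effects = {"food": +10, "obstacle": -5}  # hypothetical perception effects
    return energy + effects.get(perception, 0)

def vital(energy: int) -> int:
    """Map the absolute ENERGY level onto the binary VITAL parameter."""
    return 1 if energy > THRESHOLD else 0

energy = impact("food", 48)   # ENERGY rises from 48 to 58
print(vital(energy))          # prints 1, since 58 is above the threshold
```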
Based on the current perception and the current vital state, the memory, represented by the set of classifiers CLASSIF, is matched for those classifiers which agree with the current perception and vital state. This generates the match set (M). From this match set the classifier with the highest reward value (R) is selected; if there is more than one such classifier, one of the highest-valued classifiers is selected at random. From the selected classifier the action (A) is then taken and executed.
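The matching and selection steps above can be sketched as follows; the tuple representation of a classifier as (condition, action, reward) and the example entries are assumptions for illustration:

```python
import random

# Illustrative CLASSIF set; each classifier is (condition, action, reward),
# where a condition is the pair (perception, vital state).
CLASSIF = [
    (("food", 1), "eat", 7.0),
    (("food", 1), "move", 7.0),
    (("food", 0), "eat", 3.0),
    (("obstacle", 1), "turn", 2.0),
]

def match_set(classif, perception, vital_state):
    """Collect all classifiers whose condition agrees with (P, V)."""
    return [c for c in classif if c[0] == (perception, vital_state)]

def select_action(m):
    """Pick the classifier with the highest reward; break ties randomly."""
    best = max(c[2] for c in m)
    return random.choice([c for c in m if c[2] == best])[1]

M = match_set(CLASSIF, "food", 1)
action = select_action(M)   # "eat" or "move": both carry the top reward 7.0
```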
This action, as part of the selected classifier, generates a new perception, which again can change the energy level and with it the vital state. When the energy difference is positive -signaling an increase in energy- this causes a reward action (REW+) for all classifiers of CLASSIF whose actions preceded the last action.
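A minimal sketch of the REW+ mechanism, assuming a flat bonus per classifier and a simple list holding the classifiers fired since the last reward (both assumptions, since the text does not fix these details):

```python
# Sketch of REW+: a positive energy difference rewards every classifier
# whose action preceded the last action. The dict representation and the
# flat bonus of 1.0 are illustrative assumptions.

def rew_plus(history, energy_before, energy_after, bonus=1.0):
    """Apply REW+ when the energy difference is positive."""
    if energy_after - energy_before > 0:
        for classifier in history:
            classifier["reward"] += bonus
        history.clear()  # start a fresh episode after the reward

c1 = {"action": "move", "reward": 0.0}
c2 = {"action": "eat", "reward": 0.0}
history = [c1, c2]          # classifiers fired since the last reward
rew_plus(history, 48, 58)   # ENERGY rose by 10, so both get rewarded
print(c1["reward"], c2["reward"])   # prints: 1.0 1.0
```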
Here arises the crucial question of how the reward should be distributed over the participating actions, which can be understood as a sequence of actions whose last action is the one that caused the change of the internal state. The first action of the sequence is the first action after the last reward. In a general sense one has to assume that all preceding actions have somehow contributed to the final success. But how much did every single action contribute?
From the point of view of the acting system it is interesting to learn some kind of statistic telling the likelihood of success when doing a certain action in a certain situation. Thus, if one action precedes a successful action more often than another action, this should somehow be encoded. Because we do not assume any kind of sequence memory here (not yet), we have to find an alternative mechanism.
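Such a precedence statistic could be sketched as a simple frequency count; the example trace of (action, success) pairs is invented for illustration:

```python
from collections import Counter

# Sketch of the statistic mentioned above: count how often each action
# immediately precedes a successful action. The trace is an invented example.

trace = [("move", False), ("turn", False), ("move", False), ("eat", True),
         ("move", False), ("eat", True)]

precedes_success = Counter()
for (action, _), (_, next_ok) in zip(trace, trace[1:]):
    if next_ok:
        precedes_success[action] += 1

print(precedes_success)  # "move" precedes a successful action twice
```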
An alternative approach could be to accumulate amounts of reward, saying that an action -independent of all other actions- has gained some amount of reward, therefore making this action preferable. An experimental approach could be the following one, where the success is partitioned proportionally:
As a general formula, the reward gained is distributed proportionally over the sequence of preceding actions.
This allows in the long run some ordering: the actions with the highest scores are those leading directly to the goal, and those with lower scores are those which precede the higher-scoring ones.
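The proportional partitioning could be sketched as follows; since the general formula is not reproduced here, the linear weighting scheme (later actions receive larger shares) is an assumption chosen to produce the ordering just described:

```python
def partition_reward(actions, total_reward):
    """Distribute total_reward proportionally over a sequence of actions:
    the last action, which caused the success, gets the largest share,
    and earlier actions get proportionally less. The linear weights
    1, 2, ..., n are an illustrative assumption."""
    n = len(actions)
    weights = range(1, n + 1)
    total_weight = sum(weights)
    return {a: total_reward * w / total_weight
            for a, w in zip(actions, weights)}

shares = partition_reward(["a1", "a2", "a3"], 6.0)
print(shares)   # prints: {'a1': 1.0, 'a2': 2.0, 'a3': 3.0}
```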
Such a kind of proportional scoring implies that one has to assume some minimal action memory.
Gerd Doeben-Henisch 2012-03-31