Figure 4.6 shows the general idea of the specialized feedback based on the vital state. The first assumption is that the perception (P) has a built-in causal relationship impact() onto the vital state (V): if something happens, this can increase or decrease ENERGY. The absolute level of ENERGY is then compared with a certain threshold $\theta$. If ENERGY is above the threshold, the parameter VITAL (V) is set to '1'; otherwise it is set to '0'.
$$impact : P \longrightarrow V \qquad (4.75)$$

$$thresh : ENERGY \longrightarrow \{0, 1\} \qquad (4.76)$$

$$VITAL = thresh(ENERGY) = \begin{cases} 1 & \text{if } ENERGY > \theta \\ 0 & \text{otherwise} \end{cases} \qquad (4.77)$$
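A minimal Python sketch of this mechanism may help to fix the idea; the threshold value, the concrete impact mapping, and all names below are assumptions for illustration, not part of the original model:

```python
THETA = 50  # assumed threshold value

def impact(perception: str, energy: int) -> int:
    """Assumed causal effect of a perception onto ENERGY (cf. Eq. 4.75)."""
    effects = {"food": +10, "obstacle": -5}  # illustrative mapping only
    return energy + effects.get(perception, 0)

def vital(energy: int, theta: int = THETA) -> int:
    """VITAL is '1' if ENERGY is above the threshold, else '0' (cf. Eq. 4.77)."""
    return 1 if energy > theta else 0
```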
Based on the actual perception $p$ and the actual vital state $v$, the memory, as represented by the set of classifiers CLASSIF, will be matched with regard to those classifiers which agree with $\langle p, v\rangle$. This generates the match set (M). From this match set that classifier will be selected which either has the highest reward value (R) or, if there is more than one such classifier, has been selected at random from the set of classifiers with the highest reward value. From this selected classifier the action (A) will then be selected and executed.
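The following sketch illustrates this match-and-select step in Python. A classifier is assumed here to be a dict with keys "condition", "action", and "reward"; the names follow the text (CLASSIF, M, R, A), while the concrete representation is an assumption:

```python
import random

def select_action(classif, perception, vital_state):
    # Match set M: all classifiers whose condition agrees with (p, v).
    m = [c for c in classif if c["condition"] == (perception, vital_state)]
    if not m:
        return None
    # Highest reward value R; ties are broken by random selection.
    best_r = max(c["reward"] for c in m)
    best = [c for c in m if c["reward"] == best_r]
    return random.choice(best)["action"]  # action A to be executed
```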
This action, as part of the selected classifier, generates a new perception $p'$, which again can change the energy level and with this the vital state $v'$. When the energy difference $\Delta E$ is positive, signaling an increase in energy, this causes a reward action (REW+) for all classifiers of CLASSIF which have with their actions preceded the last action $a_{n}$, written as $\langle a_{1}, \ldots, a_{n}\rangle$.
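A minimal sketch of this feedback trigger, using the same assumed dict representation as above; the uniform increment is only a placeholder, since how the reward should actually be split is exactly the question discussed next:

```python
def feedback(executed, energy_before, energy_after, amount=1):
    # executed: classifiers whose actions form the sequence (a_1, ..., a_n)
    # since the last reward event.
    delta_e = energy_after - energy_before
    if delta_e > 0:                # an increase in energy triggers REW+
        for c in executed:
            c["reward"] += amount  # placeholder: equal share for everyone
```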
Here arises the crucial question of how the reward should be distributed over the participating actions, which can be understood as a sequence of actions $\langle a_{1}, \ldots, a_{n}\rangle$, with the last action $a_{n}$ as that action which caused the change of the internal states. The first action $a_{1}$ is that action which is the first action after the last reward. In a general sense one has to assume that all actions before $a_{n}$ have somehow contributed to the final success. But how much did every single action contribute?
From the point of view of the acting system it is interesting to learn some kind of statistic telling the likelihood of success when doing some action in a certain situation. Thus, if an action $a_{i}$ precedes a successful action $a_{n}$ more often than an action $a_{j}$ does, then this should somehow be encoded. Because we do not assume any kind of sequence memory here (not yet), we have to find an alternative mechanism.
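For illustration only, such a statistic could be kept as relative frequencies; note that this bookkeeping records actions per episode and is thus precisely the kind of memory the text wants to avoid, so it only serves to clarify what is being estimated:

```python
from collections import Counter

occurrences = Counter()  # how often each action was executed at all
successes = Counter()    # how often it preceded a rewarded final action

def record_episode(actions, rewarded):
    occurrences.update(actions)
    if rewarded:
        successes.update(actions)

def success_likelihood(action):
    # Relative frequency with which `action` was followed by success.
    return successes[action] / occurrences[action] if occurrences[action] else 0.0
```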
An alternative approach could be to cumulate amounts of reward, saying that an action, independent of all other actions, has gained some amount of reward, therefore making this action preferable. An experimental approach could be the following one, where the success is partitioned proportionally:
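For illustration (an assumed numeric instance, as the original example is not preserved in this copy): with a sequence of $n = 3$ actions and a success reward of $R = 60$, a position-proportional partition assigns $60 \cdot 1/6 = 10$ to $a_{1}$, $60 \cdot 2/6 = 20$ to $a_{2}$, and $60 \cdot 3/6 = 30$ to the final action $a_{3}$; the shares sum to $R$, and the last action scores highest.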
As general formula:
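The original formula is not preserved in this copy; a reconstruction consistent with the surrounding description (the reward $R$ is split into shares that grow with the position of an action, the last action receiving the largest share) would be:

$$reward(a_{i}) \leftarrow reward(a_{i}) + R \cdot \frac{i}{\sum_{j=1}^{n} j} = reward(a_{i}) + R \cdot \frac{2i}{n(n+1)}, \qquad i = 1, \ldots, n$$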
This allows, in the long run, some ordering: the actions with the highest scores are those leading directly to the goal, and those with lower scores are those which precede the higher-scoring ones.
Such a kind of proportional scoring implies that one has to assume some minimal action memory $\langle a_{1}, \ldots, a_{n}\rangle$ covering the actions since the last reward.
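A minimal Python sketch of this proportional scoring, assuming the action memory is simply the list of actions executed since the last reward (the names, the action labels, and the score table are assumptions):

```python
def distribute_reward(scores, memory, reward):
    # Proportional partition of `reward` over the remembered sequence
    # (a_1, ..., a_n): the share grows with the position of the action.
    n = len(memory)
    total = n * (n + 1) / 2               # 1 + 2 + ... + n
    for i, action in enumerate(memory, start=1):
        scores[action] = scores.get(action, 0.0) + reward * i / total

# Usage: the minimal action memory is the list of actions since the
# last reward; it is cleared once the reward has been distributed.
scores = {}
memory = ["a1", "a2", "a3"]               # assumed action labels
distribute_reward(scores, memory, 60)     # -> a1: 10.0, a2: 20.0, a3: 30.0
memory.clear()
```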