Learning automata that update their action probabilities on the basis of the responses they get from an environment are considered. The automata update the probabilities whether the environment responds with a reward or a penalty. An automaton is said to possess ergodicity of the mean (EM) if the mean action probability is the total state probability of an ergodic Markov chain. The only known EM algorithm is the linear reward-penalty (L_RP) scheme. For the two-action case, necessary and sufficient conditions have been derived for nonlinear updating schemes to be EM. A method of controlling the rate of convergence of this scheme has been presented. In particular, a generalized linear algorithm has been proposed which is superior to the L_RP scheme. The expression for the variance of the limiting action probabilities of this scheme has been derived. The technique of designing the optimal linear automaton in this family has also been considered. Methods to decrease the variance for the general nonlinear scheme have been discussed. The set of absolutely expedient schemes and the set of schemes which possess ergodicity of the mean are shown to be mutually disjoint.
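
For readers unfamiliar with the L_RP scheme named in the abstract, the following is a minimal Python sketch of a two-action automaton. It assumes the textbook symmetric form of the scheme (a single step size `a` for both reward and penalty) and a stationary environment with penalty probabilities `c[0]` and `c[1]`. The function names, the parameter values, and the simulation setup are illustrative assumptions and are not taken from the paper. Under these assumptions the conditional mean update is linear in the action probability, with fixed point `c[1] / (c[0] + c[1])`.

```python
import random


def lrp_update(p1, action, penalty, a):
    """One symmetric L_RP step (assumed textbook form, not from the paper).

    p1 is the probability of choosing action 0; the updated p1 is returned.
    """
    # Reward moves probability toward the chosen action,
    # penalty moves it toward the other action.
    toward_action0 = (action == 0) != penalty
    if toward_action0:
        return p1 + a * (1.0 - p1)
    return (1.0 - a) * p1


def expected_step(p1, c, a):
    """Exact conditional mean of the next p1, enumerating the four outcomes."""
    action_probs = (p1, 1.0 - p1)
    mean = 0.0
    for action in (0, 1):
        for penalty in (False, True):
            outcome_prob = c[action] if penalty else 1.0 - c[action]
            mean += action_probs[action] * outcome_prob * lrp_update(p1, action, penalty, a)
    return mean


def simulate(c, a=0.05, steps=20000, burn_in=2000, seed=0):
    """Time-average of p1 after a burn-in period (hypothetical experiment)."""
    rng = random.Random(seed)  # local generator so runs are reproducible
    p1 = 0.5
    total = 0.0
    for t in range(steps):
        action = 0 if rng.random() < p1 else 1
        penalty = rng.random() < c[action]  # environment penalizes with prob. c[action]
        p1 = lrp_update(p1, action, penalty, a)
        if t >= burn_in:
            total += p1
    return total / (steps - burn_in)


if __name__ == "__main__":
    c = (0.2, 0.6)  # assumed penalty probabilities for actions 0 and 1
    print("time-averaged p1:", simulate(c))
    print("fixed point of the mean:", c[1] / (c[0] + c[1]))
```

The action probability itself keeps fluctuating; only its mean settles, and a smaller `a` trades slower convergence for a smaller spread around that mean.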

Additional Metadata
Persistent URL dx.doi.org/10.1109/TSMC.1983.6313191
Journal IEEE Transactions on Systems, Man and Cybernetics
Thathachar, M. A. L., & Oommen, J. (1983). Learning Automata Possessing Ergodicity of the Mean: The Two-Action Case. IEEE Transactions on Systems, Man and Cybernetics, SMC-13(6), 1143–1148. doi:10.1109/TSMC.1983.6313191