This paper investigates reinforcement learning problems in which the reinforcement signal is subject to a stochastic time delay that is unknown to the learning agent. This work allows the agent to receive individual reinforcements out of order, relaxing an important assumption made in previous work in the literature. To study this setting, a stochastic time delay is introduced into a mobile robot line-following application. The main contribution of this work is a novel stochastic approximation algorithm, an extension of Q-learning, for the time-delayed reinforcement problem. The paper includes a proof of convergence, grid world simulation results from MATLAB, line-following simulation results from the Cyberbotics Webots mobile robot simulator, and, finally, experimental results in which an e-Puck mobile robot follows a real track despite large, stochastic time delays in its reinforcement signal.
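To illustrate why a stochastic delay can make reinforcements arrive out of order, here is a minimal sketch of the problem setting only. It is not the paper's multiple model Q-learning algorithm. It assumes a hypothetical model in which the reward issued at step `t` reaches the agent at step `t + d_t`, where `d_t` is a non-negative integer delay; the function name `arrival_order` is illustrative.

```python
def arrival_order(rewards, delays):
    """Return (issue_step, reward) pairs in the order they reach the agent.

    Hypothetical delay model: the reward issued at step t arrives at step
    t + delays[t]. Rewards arriving at the same step are ordered by issue step.
    In the setting described above the agent does not know the delays, so it
    observes only the reward values, not the issue_step labels kept here.
    """
    arrivals = sorted(
        (t + d, t, r) for t, (r, d) in enumerate(zip(rewards, delays))
    )
    return [(t, r) for _, t, r in arrivals]


# Reward from step 0 is delayed by 3 steps, so step 1's reward arrives first.
print(arrival_order([1.0, 0.0, -1.0], [3, 0, 1]))
# → [(1, 0.0), (0, 1.0), (2, -1.0)]
```

With all delays equal to zero the rewards arrive in issue order; any delay larger than the gap to the next reward can swap them, which is the situation the abstract describes.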

Delayed reinforcement, Jitter, Markov decision process, Multiple models, Reinforcement learning, Stochastic time delay
Journal of Intelligent and Robotic Systems: Theory and Applications
Department of Systems and Computer Engineering

Campbell, J.S. (Jeffrey S.), Givigi, S.N. (Sidney N.), & Schwartz, H.M. (2016). Multiple Model Q-Learning for Stochastic Asynchronous Rewards. Journal of Intelligent and Robotic Systems: Theory and Applications, 81(3-4), 407–422. doi:10.1007/s10846-015-0222-2