Dabbling in Reinforcement Learning

Reinforcement Learning of a Markov Adversarial Game through Stochastic Fictitious Play

This is a paper that started out as a group project for an Artificial Intelligence class. The co-authors and I continued developing the idea after the class ended.


Our work models a game in which police agents play mixed strategies in order to catch drivers who decide to exceed the speed limit. We are interested in understanding how driver agents use decision theory and game theory to decide on which roads to speed and on which not to. Additionally, we are interested in which kinds of police deployment are most effective in the context of certain geographies.

Our game is modeled on a graph. Driver agents plan a path from origin to destination and traverse the graph to reach it. Police agents seek the nodes that maximize their probability of catching speeding drivers. The game is modeled as a Bayesian Stackelberg game: the leader (the police agent) commits to a strategy first, and, given the police strategy, the follower (the driver agent) selfishly chooses, with some probability, the strategy that maximizes its profit. In turn, the leader may choose to play against the follower's Stackelberg strategies so as to catch the follower. We make use of two different learning models to simulate behavior: Opponent Modeling Q-Learning and Experience-Weighted Attraction (EWA).
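To make the leader-follower structure concrete, here is a minimal sketch of Stackelberg commitment in a toy 2x2 game. The payoff numbers and the two "roads" are purely illustrative, not taken from our model: the leader grid-searches over its mixed strategies, assuming the follower will best-respond to whatever it commits to.

```python
# Hypothetical 2x2 game: leader = police (patrol road A or B),
# follower = driver (speed on road A or B). Payoffs are illustrative.
# leader_payoff[i][j]: leader plays action i, follower plays action j.
leader_payoff = [[1, -1],
                 [-1, 1]]
follower_payoff = [[-2, 1],
                   [1, -2]]

def follower_best_response(p):
    """Follower's best pure action when the leader plays action 0 w.p. p."""
    ev = [p * follower_payoff[0][j] + (1 - p) * follower_payoff[1][j]
          for j in range(2)]
    return max(range(2), key=lambda j: ev[j])

def leader_expected_payoff(p, j):
    return p * leader_payoff[0][j] + (1 - p) * leader_payoff[1][j]

# The leader commits first: search for the mixed strategy that maximizes
# its payoff given that the follower best-responds to the commitment.
best_p, best_val = max(
    ((p / 100, leader_expected_payoff(p / 100, follower_best_response(p / 100)))
     for p in range(101)),
    key=lambda t: t[1])
```

In this toy game the leader does best by mixing evenly between the two roads, which is the familiar intuition for why patrols are randomized at all.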

Opponent Modeling Q-Learning (OM) allows a player to take advantage of the suboptimal moves an opponent may make during a game. OM uses all of the same information as Minimax Q-Learning, but also keeps track of how many times the opponent chooses each action in each state. This extra information lets the player overcome Minimax Q-Learning's requirement that the opponent try each move from each state infinitely often.
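A minimal tabular sketch of the opponent-modeling update, assuming placeholder states, actions, and rewards (the class and parameter names are ours for illustration; the paper's road-network game would supply its own):

```python
from collections import defaultdict

class OpponentModelQ:
    """Sketch of Opponent Modeling Q-learning: Q-values are indexed by
    (state, own action, opponent action), and the state value weights
    opponent actions by their observed empirical frequency."""

    def __init__(self, actions, opp_actions, alpha=0.1, gamma=0.9):
        self.actions = actions
        self.opp_actions = opp_actions
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.Q = defaultdict(float)   # Q[(s, a, o)]
        self.counts = defaultdict(int)  # counts[(s, o)]: opponent action tallies
        self.visits = defaultdict(int)  # visits[s]: total observations in s

    def value(self, s):
        """V(s): best expected Q-value against the empirical opponent model."""
        if self.visits[s] == 0:
            return 0.0
        return max(
            sum(self.counts[(s, o)] / self.visits[s] * self.Q[(s, a, o)]
                for o in self.opp_actions)
            for a in self.actions)

    def update(self, s, a, o, r, s_next):
        # Record the opponent's observed action in state s...
        self.counts[(s, o)] += 1
        self.visits[s] += 1
        # ...then do a standard Q-learning backup over (s, a, o).
        target = r + self.gamma * self.value(s_next)
        self.Q[(s, a, o)] += self.alpha * (target - self.Q[(s, a, o)])
```

The empirical frequencies are what let the learner exploit an opponent who plays some actions rarely or never, instead of assuming worst-case play as Minimax Q-Learning does.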

Experience-Weighted Attraction is a learning model that combines two seemingly disparate learning models: belief learning and choice reinforcement. In a belief learning model, a player keeps track of the history of the other players' moves and develops a belief about how they act. Then, given these beliefs, the player chooses a best response that maximizes its expected payout. In a choice reinforcement model, the previous payouts of chosen strategies affect which strategy is currently chosen. Most of the time players don't have a belief about how other players will play; they only care about the payouts received in the past, not how play evolved to yield those payouts. These two learning models are usually treated as fundamentally different approaches, but EWA shows that they are related: it defines a single model in which both are special cases. This allows EWA to learn from actions and experiences that are not directly reinforced by the choice of action at each step.
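The EWA update can be sketched as follows. The parameter names follow Camerer and Ho's formulation (phi decays old attractions, rho decays the experience weight, delta weights forgone payoffs, lambda is the logit sensitivity), but the class itself is illustrative code of ours, not the paper's implementation:

```python
import math

class EWA:
    """Sketch of Experience-Weighted Attraction learning. Roughly:
    delta = 1 recovers belief learning (weighted fictitious play), and
    delta = 0 recovers choice reinforcement, as special cases."""

    def __init__(self, n_strategies, payoff, phi=0.9, rho=0.9, delta=0.5, lam=1.0):
        self.payoff = payoff    # payoff(my_strategy, opp_action)
        self.phi = phi          # decay of old attractions
        self.rho = rho          # decay of the experience weight
        self.delta = delta      # weight on forgone (unchosen) payoffs
        self.lam = lam          # logit response sensitivity
        self.N = 1.0            # experience weight N(t)
        self.A = [0.0] * n_strategies  # attractions A_j(t)

    def update(self, chosen, opp_action):
        N_prev = self.N
        self.N = self.rho * N_prev + 1.0
        for j in range(len(self.A)):
            # The chosen strategy is reinforced by its realized payoff;
            # unchosen strategies by delta times what they *would* have earned.
            w = 1.0 if j == chosen else self.delta
            self.A[j] = (self.phi * N_prev * self.A[j]
                         + w * self.payoff(j, opp_action)) / self.N

    def probabilities(self):
        # Logit (softmax) choice rule over attractions.
        exps = [math.exp(self.lam * a) for a in self.A]
        z = sum(exps)
        return [e / z for e in exps]
```

The delta term is exactly the "learning from actions not directly reinforced" mentioned above: even strategies the player never chose have their attractions updated by the payoffs they would have received.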