You Are Now >> Home >> Resaerch >> Artificial Intelligence >> Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processe


Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processe



 

Abstract

We study an approach to policy selection for large relational Markov Decision Processes (MDPs). We consider a variant of approximate policy iteration (API) that replaces the usual value-function learning step with a learning step in policy space. This is advantageous in domains where good policies are easier to represent and learn than the corresponding value functions, which is often the case for the relational MDPs we are interested in. In order to apply API to such problems, we introduce a relational policy language and corresponding learner. In addition, we introduce a new bootstrapping routine for goalbased planning domains, based on random walks. Such bootstrapping is necessary for many large relational MDPs, where reward is extremely sparse, as API is ineffective in such domains when initialized with an uninformed policy. Our experiments show that the resulting system is able to find good policies for a number of classical planning domains and their stochastic variants by solving them as extremely large relational MDPs. The experiments also point to some limitations of our approach, suggesting future work.

Introduction

Many planning domains are most naturally represented in terms of objects and relations among them. Accordingly, AI researchers have long studied algorithms for planning and learning-to-plan in relational state and action spaces. These include, for example, “classical” STRIPS domains such as the blocks world and logistics.

A common criticism of such domains and algorithms is the assumption of an idealized, deterministic world model. This, in part, has led AI researchers to study planning and learning within a decision-theoretic framework, which explicitly handles stochastic environments and generalized reward-based objectives. However, most of this work is based on explicit or propositional state-space models, and so far has not demonstrated scalability to the large relational domains that are commonly addressed in classical planning. Intelligent agents must be able to simultaneously deal with both the complexity arising from relational structure and the complexity arising from uncertainty. The primary goal of this research is to move toward such agents by bridging the gap between classical and decision-theoretic techniques.

In this paper, we describe a straightforward and practical method for solving very large, relational MDPs. Our work can be viewed as a form of relational reinforcement learning (RRL) where we assume a strong simulation model of the environment. That is, we assume access to a black-box simulator, for which we can provide any (relationally represented) state/action pair and receive a sample from the appropriate next-state and reward distributions. The goal is to interact with the simulator in order to learn a policy for achieving high expected reward. It is a separate challenge, not considered here, to combine our work with methods for learning the environment simulator to avoid dependence on being provided such a simulator.

Dynamic-programming approaches to finding optimal control policies in MDPs (Bellman, 1957; Howard, 1960), using explicit (flat) state space representations, break down when the state space becomes extremely large. More recent work extends these algorithms to use propositional (Boutilier & Dearden, 1996; Dean & Givan, 1997; Dean, Givan, & Leach, 1997; Boutilier, Dearden, & Goldszmidt, 2000; Givan, Dean, & Greig, 2003; Guestrin, Koller, Parr, & Venkataraman, 2003b) as well as relational (Boutilier, Reiter, & Price, 2001; Guestrin, Koller, Gearhart, & Kanodia, 2003a) state-space representations. These extensions have significantly expanded the set of approachable problems, but have not yet shown the capacity to solve large classical planning problems such as the benchmark problems used in planning competitions (Bacchus, 2001), let alone their stochastic variants. One possible reason for this is that these methods are based on calculating and representing value functions. For familiar STRIPS planning domains (among others), useful value functions can be difficult to represent compactly, and their manipulation becomes a bottle-neck.

Most of the above techniques are purely deductive—that is, each value function is guaranteed to have a certain level of accuracy. Rather, in this work, we will focus on inductive techniques that make no such guarantees in practice. Most existing inductive forms of approximate policy iteration (API) utilize machine learning to select compactly represented approximate value functions at each iteration of dynamic programming (Bertsekas & Tsitsiklis, 1996). As with any machine learning algorithm, the selection of the hypothesis space, here a space of value functions, is critical to performance. An example space used frequently is the space of linear combinations of a human-selected feature set.

To our knowledge, there has been no previous work that applies any form of API to benchmark problems from classical planning, or their stochastic variants. 1 Again, one reason for this is the high complexity of typical value functions for these large relational domains, making it difficult to specify good value-function spaces that facilitate learning. Comparably, it is often much easier to compactly specify good policies, and accordingly good policy spaces for learning. This observation is the basis for recent work on inductive policy selection in relational planning domains, both deterministic (Khardon, 1999a; Martin & Geffner, 2000), and probabilistic (Yoon, Fern, & Givan, 2002). These techniques show that useful policies can be learned using a policy-space bias described by a generic (relational) knowledge representation language. Here we incorporate those ideas into a variant of API, that achieves significant success without representing or learning approximate value functions. Of course, a natural direction for future work is to combine policy-space techniques with value-function techniques, to leverage the advantages of both. Given an initial policy, our approach uses the simulation technique of policy rollout (Tesauro & Galperin, 1996) to generate trajectories of an improved policy. These trajectories are then given to a classification learner, which searches for a classifier, or policy, that “matches” the trajectory data, resulting in an approximately improved policy.

 


Sponsored link:

Read Full Research
Research Person : Alan Fern,Sungwook Yoon,Robert Givan
Contact Person : 1.afern@cs.orst.edu
School of Electrical Engineering and Computer Science, Oregon State University
2.sy@purdue.edu
School of Electrical and Computer Engineering, Purdue University
3.givan@purdue.edu
School of Electrical and Computer Engineering, Purdue University
Year : 2011

Category: Artificial Intelligence
Related Research:
Related Product:
Related News:
Related Dictionay: