Decision-Theoretic Planning with non-Markovian Rewards
Abstract
A decision process in which rewards depend on history rather than merely on the current state is called a decision process with non-Markovian rewards (NMRDP). In decisiontheoretic planning, where many desirable behaviours are more naturally expressed as properties of execution sequences rather than as properties of states, NMRDPs form a more natural model than the commonly adopted fully Markovian decision process (MDP) model.
While the more tractable solution methods developed for MDPs do not directly apply in the presence of non-Markovian rewards, a number of solution methods for NMRDPs have been proposed in the literature. These all exploit a compact specification of the non-Markovian reward function in temporal logic, to automatically translate the NMRDP into an equivalent MDP which is solved using efficient MDP solution methods. This paper presents nmrdpp(Non-Markovian Reward Decision Process Planner), a software platform for the development and experimentation of methods for decision-theoretic planning with nonMarkovian rewards. The current version of nmrdpp implements, under a single interface, a family of methods based on existing as well as new approaches which we describe in detail. These include dynamic programming, heuristic search, and structured methods.
Using nmrdpp, we compare the methods and identify certain problem features that affect their performance. nmrdpp’s treatment of non-Markovian rewards is inspired by the treatment of domain-specific search control knowledge in the TLPlan planner, which it incorporates as a special case. In the First International Probabilistic Planning Competition, nmrdpp was able to compete and perform well in both the domain-independent and hand-coded tracks, using search control knowledge in the latter.
The Problem
Markov decision processes (MDPs) are now widely accepted as the preferred model for decision-theoretic planning problems (Boutilier, Dean, & Hanks, 1999). The fundamental assumption behind the MDP formulation is that not only the system dynamics but also the reward function are Markovian. Therefore, all information needed to determine the reward at a given state must be encoded in the state itself. This requirement is not always easy to meet for planning problems, as many desirable behaviours are naturally expressed as properties of execution sequences (see e.g., Drummond, 1989; Haddawy & Hanks, 1992; Bacchus & Kabanza, 1998; Pistore & Traverso, 2001). Typical cases include rewards for the maintenance of some property, for the periodic achievement of some goal, for the achievement of a goal within a given number of steps of the request being made, or even simply for the very first achievement of a goal which becomes irrelevant afterwards. For instance, consider a health care robot which assists ederly or disabled people by achieving simple goals such as reminding them to do important tasks (e.g. taking a pill), entertaining them, checking or transporting objects for them (e.g. checking the stove’s temperature or bringing coffee), escorting them, or searching (e.g. for glasses or for the nurse) (Cesta et al., 2003). In this domain, we might want to reward the robot for making sure a given patient takes his pill exactly once every 8 hours (and penalise it if it fails to prevent the patient from doing this more than once within this time frame!), we may reward it for repeatedly visiting all rooms in the ward in a given order and reporting any problem it detects, it may also receive a reward once for each patient’s request answered within the appropriate time-frame, etc. Another example is the elevator control domain (Koehler & Schuster, 2000), in which an elevator must get passengers from their origin to their destination as efficiently as possible, while attempting to satisfying a range of other conditions such as providing priority services to critical customers. In this domain, some trajectories of the elevator are more desirable than others, which makes it natural to encode the problem by assigning rewards to those trajectories.
A decision process in which rewards depend on the sequence of states passed through rather than merely on the current state is called a decision process with non-Markovian rewards (NMRDP) (Bacchus, Boutilier, & Grove, 1996). A difficulty with NMRDPs is that the most efficient MDP solution methods do not directly apply to them. The traditional way to circumvent this problem is to formulate the NMRDP as an equivalent MDP, whose states result from augmenting those of the original NMRDP with extra information capturing enough history to make the reward Markovian. Hand crafting such an MDP can however be very difficult in general. This is exacerbated by the fact that the size of the MDP impacts the effectiveness of many solution methods. Therefore, there has been interest in automating the translation into an MDP, starting from a natural specification of nonMarkovian rewards and of the system’s dynamics (Bacchus et al., 1996; Bacchus, Boutilier, & Grove, 1997). This is the problem we focus on.
Existing Approaches
When solving NMRDPs in this setting, the central issue is to define a non-Markovian reward specification language and a translation into an MDP adapted to the class of MDP solution methods and representations we would like to use for the type of problems at hand. More precisely, there is a tradeoff between the effort spent in the translation, e.g. in producing a small equivalent MDP without many irrelevant history distinctions, and the effort required to solve it. Appropriate resolution of this tradeoff depends on the type of representations and solution methods envisioned for the MDP. For instance, structured representations and solution methods which have some ability to ignore irrelevant information may cope with a crude translation, while state-based (flat) representations and methods will require a more sophisticated translation producing an MDP as small as feasible.
Both the two previous proposals within this line of research rely on past linear temporal logic (PLTL) formulae to specify the behaviours to be rewarded (Bacchus et al., 1996, 1997). A nice feature of PLTL is that it yields a straightforward semantics of non-Markovian rewards, and lends itself to a range of translations from the crudest to the finest. The two proposals adopt very different translations adapted to two very different types of solution methods and representations. The first (Bacchus et al., 1996) targets classical state-based solution methods such as policy iteration (Howard, 1960) which generate complete policies at the cost of enumerating all states in the entire MDP. Consequently, it adopts an expensive translation which attempts to produce a minimal MDP. By contrast, the second translation (Bacchus et al., 1997) is very efficient but crude, and targets structured solution methods and representations (see e.g., Hoey, St-Aubin, Hu, & Boutilier, 1999; Boutilier, Dearden, & Goldszmidt, 2000; Feng & Hansen, 2002), which do not require explicit state enumeration.
A New Approach
The first contribution of this paper is to provide a language and a translation adapted to another class of solution methods which have proven quite effective in dealing with large MDPs, namely anytime state-based heuristic search methods such as LAO* (Hansen & Zilberstein, 2001), LRTDP (Bonet & Geffner, 2003), and ancestors (Barto, Bardtke, & Singh, 1995; Dean, Kaelbling, Kirman, & Nicholson, 1995; Thi´ebaux, Hertzberg, Shoaff, & Schneider, 1995). These methods typically start with a compact representation of the MDP based on probabilistic planning operators, and search forward from an initial state, constructing new states by expanding the envelope of the policy as time permits. They may produce an approximate and even incomplete policy, but explicitly construct and explore only a fraction of the MDP. Neither of the two previous proposals is well-suited to such solution methods, the first because the cost of the translation (most of which is performed prior to the solution phase) annihilates the benefits of anytime algorithms, and the second because the size of the MDP obtained is an obstacle to the applicability of state-based methods. Since here both the cost of the translation and the size of the MDP it results in will severely impact on the quality of the policy obtainable by the deadline, we need an appropriate resolution of the tradeoff between the two.
Sponsored link:
Canberra, ACT 0200, Australi
email: Sylvie.Thiebaux@anu.edu.au
Category: Artificial Intelligence
- 8-Valent Fuzzy Logic for Iris Recognition and Biometry
- A Heuristic Search Planner for First-Order MDPs
- A Visual Entity-Relationship Model for Constraint-Based
- Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processe
- Asynchronous Partial Overlay: A New Algorithm for Solving Distributed Constraint Satisfaction Problems

Artificial Intelligence