2.2 PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES
In POMDPs [24], the exact state of the world is not directly observable. Instead, observations are made that depend probabilistically on the current state of the world. The probability that observation o is made from state s is given by O(s, o). If the observation space, denoted O, is continuous, then O(s, o) is a density. If O is finite, O(s, o) is a probability mass.
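For a finite observation space, the observation model can be represented as a row-stochastic matrix with one row per state and one column per observation. The sketch below is a minimal illustration under that assumption; the state and observation labels are hypothetical and are not taken from the collision avoidance model.

    import numpy as np

    # Hypothetical finite observation model O(s, o): rows index states,
    # columns index observations, and each row is a probability mass function.
    states = ["safe", "unsafe"]
    observations = ["clear", "alert"]

    O = np.array([
        [0.9, 0.1],   # Pr(o | s = "safe")
        [0.2, 0.8],   # Pr(o | s = "unsafe")
    ])

    assert np.allclose(O.sum(axis=1), 1.0)  # each row must sum to one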
In general, the initial state is unknown. The uncertainty in the initial state is represented by a distribution, often called a belief state. A belief state b assigns probability b(s) to being in state
s. The space of possible beliefs is denoted B, which is a continuous space (each point in this space represents a probability distribution over S) with dimensionality |S|.
If the belief state is b, action a is executed, and observation o is made, the new belief state b' is given by
\begin{align}
b'(s') &= \Pr(s' \mid o, a, b) \tag{4} \\
&\propto \Pr(o \mid s', a, b)\,\Pr(s' \mid a, b) \tag{5} \\
&= \Pr(o \mid s') \sum_s T(s, a, s')\, b(s) \tag{6} \\
&= O(s', o) \sum_s T(s, a, s')\, b(s). \tag{7}
\end{align}
This process of updating the belief state follows directly from Bayes' rule and is known as state estimation or filtering [25].
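For a finite state space, the belief update of Equation 7 can be written in a few lines. The following sketch assumes the transition model T(s, a, s') and observation model O(s', o) are stored as NumPy arrays; the array shapes and names are illustrative choices, not the report's implementation.

    import numpy as np

    def update_belief(b, a, o, T, O):
        """Bayesian belief update (Equation 7) for a finite-state POMDP.

        b : (|S|,) current belief, sums to 1
        a : int, index of the executed action
        o : int, index of the received observation
        T : (|S|, |A|, |S|) transition probabilities T[s, a, s']
        O : (|S|, |O|) observation probabilities O[s', o]
        """
        predicted = b @ T[:, a, :]            # sum_s T(s, a, s') b(s) for each s'
        unnormalized = O[:, o] * predicted    # weight by observation likelihood O(s', o)
        return unnormalized / unnormalized.sum()  # normalize (the proportionality in Equation 5)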
In a POMDP, the belief state is used to determine which action to execute. Hence, policies are mappings from belief states to actions. As with MDPs, various optimality criteria can be defined for POMDPs, including the minimization of expected cost. Because the belief state space is high-dimensional and continuous, representing the optimal policy for a POMDP is much more difficult than for an MDP. Computing the optimal policy exactly, even for finite horizon problems, is typically infeasible [26]. However, many different approximation methods have been proposed in the literature [27–29].
2.3 FULL OBSERVABILITY APPROXIMATION
Several different POMDP solution strategies involve solving the problem assuming full observability using a DP algorithm [30–33]. If $\pi_{\text{MDP}}$ is the policy assuming full observability, then one strategy, called the "most likely state" (MLS) strategy [34], is as follows:
\begin{equation}
\pi(b) = \pi_{\text{MDP}}\!\left(\arg\max_s b(s)\right), \tag{8}
\end{equation}
where b is the current belief state. The problem with the MLS strategy is that it does not take into account the possibility of being in different states. In the collision avoidance problem, for example, if the aircraft is slightly more likely to be in a "safe" state than an "unsafe" state, MLS will ignore the possibility of being in the unsafe state and recommend an overly optimistic action. The current version of TCAS uses point estimates of certain state quantities, although it incorporates some heuristics to assess the confidence in the estimates.
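As a concrete illustration of Equation 8, the MLS strategy commits to the single most probable state and applies the fully observable policy at that state. The sketch below assumes a tabular MDP policy indexed by state; the variable names are illustrative.

    import numpy as np

    def mls_action(b, pi_mdp):
        """Most likely state (MLS) strategy, Equation 8.

        b      : (|S|,) belief state
        pi_mdp : (|S|,) action index chosen by the fully observable (MDP) policy
        """
        most_likely_state = int(np.argmax(b))   # arg max_s b(s)
        return pi_mdp[most_likely_state]        # all other states in the belief are ignored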
The POMDP approximation method adopted in this report involves choosing the optimal action under the assumption that the world will become fully observable at the next decision point. This method involves computing the state-action costs $J_{\text{MDP}}(s, a)$ assuming full observability. If the current belief state is b, then the action to be selected is given by
\begin{equation}
\pi(b) = \arg\min_a \sum_s b(s)\, J_{\text{MDP}}(s, a). \tag{9}
\end{equation}
This heuristic, often called $Q_{\text{MDP}}$, has been studied by others and has been found to perform well on many problems, but it tends to fail in problems where taking particular actions results in a significant reduction in state uncertainty [30–33]. Because this method takes the full belief distribution into consideration when making decisions, it can provide more robust performance than MLS when state uncertainty is significant.
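The following sketch implements Equation 9 for finite state and action spaces, assuming the fully observable state-action costs have been precomputed by dynamic programming and stored in a matrix J_mdp; the variable names are illustrative.

    import numpy as np

    def qmdp_action(b, J_mdp):
        """QMDP approximation, Equation 9.

        b     : (|S|,) belief state
        J_mdp : (|S|, |A|) expected cost J_MDP(s, a) assuming full observability
        """
        expected_cost = b @ J_mdp             # sum_s b(s) J_MDP(s, a) for each action a
        return int(np.argmin(expected_cost))  # arg min over actions

Unlike MLS, every state with nonzero belief contributes to the expected cost, which is why this approximation degrades more gracefully when the belief is spread across safe and unsafe states.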