A Dec-POMDP is a 7-tuple $(S, \{A_i\}, T, R, \{\Omega_i\}, O, \gamma)$, where
- $S$ is a set of states,
- $A_i$ is a set of actions for agent $i$, with $A = \times_i A_i$ the set of joint actions,
- $T$ is the state-transition function, with $T(s, a, s') = P(s' \mid s, a)$,
- $R(s, a)$ is the reward function shared by the team,
- $\Omega_i$ is a set of observations for agent $i$, with $\Omega = \times_i \Omega_i$ the set of joint observations,
- $O$ is the observation function, with $O(s', a, o) = P(o \mid s', a)$, and
- $\gamma \in [0, 1)$ is the discount factor.
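As a concrete illustration, the 7-tuple can be written down directly as a data structure. The sketch below is a minimal, hypothetical encoding in Python; the names `DecPOMDP`, `transition`, `observation`, and `reward` are illustrative choices, not part of any standard library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
JointAction = Tuple[int, ...]   # one action index per agent
JointObs = Tuple[int, ...]      # one observation index per agent

@dataclass
class DecPOMDP:
    """Hypothetical container for the tuple (S, {A_i}, T, R, {Omega_i}, O, gamma)."""
    states: List[State]                          # S
    actions: List[List[int]]                     # A_i: one action set per agent
    transition: Callable[[State, JointAction], Dict[State, float]]      # T(s, a, .) = P(s' | s, a)
    reward: Callable[[State, JointAction], float]                       # R(s, a)
    observations: List[List[int]]                # Omega_i: one observation set per agent
    observation: Callable[[State, JointAction], Dict[JointObs, float]]  # O(s', a, .) = P(o | s', a)
    gamma: float                                 # discount factor in [0, 1)
```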
At each time step, each agent takes an action $a_i \in A_i$; the state updates according to the transition function $T(s, a, s')$ (which depends on the current state and the joint action); each agent receives an observation according to the observation function $O(s', a, o)$ (which depends on the next state and the joint action); and a single reward is generated for the whole team according to the reward function $R(s, a)$. These time steps repeat either up to some given horizon (the finite-horizon case) or forever (the infinite-horizon case), and the goal is to maximize the expected cumulative reward over that horizon. The discount factor $\gamma \in [0, 1)$ keeps the sum finite in the infinite-horizon case.
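To make the per-step dynamics concrete, the following sketch rolls out one episode and accumulates the discounted team reward. It assumes the hypothetical `DecPOMDP` container above and, for simplicity, per-agent policies that map only the latest observation to an action; a true Dec-POMDP policy conditions each agent's action on its entire action-observation history.

```python
import random

def simulate(model: DecPOMDP, policies, s0: State, horizon: int) -> float:
    """Roll out one episode and return the discounted cumulative team reward.

    `policies[i]` is a (hypothetical) function mapping agent i's latest
    observation to an action in A_i.
    """
    s = s0
    obs = tuple(0 for _ in policies)   # arbitrary initial observations
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        # Each agent picks its own action from its own observation (decentralized).
        a = tuple(pi(o) for pi, o in zip(policies, obs))
        total += discount * model.reward(s, a)          # shared team reward R(s, a)
        # Sample the next state s' ~ T(s, a, .).
        dist = model.transition(s, a)
        s = random.choices(list(dist), weights=list(dist.values()))[0]
        # Sample the joint observation o ~ O(s', a, .).
        odist = model.observation(s, a)
        obs = random.choices(list(odist), weights=list(odist.values()))[0]
        discount *= model.gamma
    return total
```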