Differential dynamic programming

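Differential dynamic programming (DDP) is an optimal control algorithm of the trajectory optimization class. The algorithm was introduced in 1966 by Mayne[1] and subsequently analysed in Jacobson and Mayne's eponymous book.[2] The algorithm uses locally-quadratic models of the dynamics and cost functions, and displays quadratic convergence. It is closely related to Pantoja's step-wise Newton's method.[3][4]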

Finite-horizon discrete-time problems

The dynamics

\mathbf {x} _{i+1}=\mathbf {f}(\mathbf {x} _{i},\mathbf {u} _{i})

(1)

describe the evolution of the state $\mathbf {x}$ given the control $\mathbf {u}$ from time $i$ to time $i+1$. The total cost $J_{0}$ is the sum of the running costs $\ell$ and the final cost $\ell _{f}$, incurred when starting from state $\mathbf {x}$ and applying the control sequence $\mathbf {U} \equiv \{\mathbf {u} _{0},\mathbf {u} _{1},\dots ,\mathbf {u} _{N-1}\}$ until the horizon is reached:

J_{0}(\mathbf {x},\mathbf {U})=\sum _{i=0}^{N-1}\ell(\mathbf {x} _{i},\mathbf {u} _{i})+\ell _{f}(\mathbf {x} _{N}),

where $\mathbf {x} _{0}\equiv \mathbf {x}$, and the $\mathbf {x} _{i}$ for $i>0$ are given by Eq. 1. The solution of the optimal control problem is the minimizing control sequence $\mathbf {U} ^{*}(\mathbf {x} )\equiv \operatorname {argmin} _{\mathbf {U} }J_{0}(\mathbf {x} ,\mathbf {U} )$. Trajectory optimization means finding $\mathbf {U} ^{*}(\mathbf {x} _{0})$ for a particular $\mathbf {x} _{0}$, rather than for all possible initial states.
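As a minimal sketch of the problem statement, the code below rolls out Eq. 1 and evaluates $J_{0}$ for a scalar system. The dynamics $f(x,u)=x+0.1\,u$, the running cost $\ell=\tfrac12(x^2+u^2)$ and the final cost $\ell_f=5x^2$ are hypothetical choices made only for illustration.

```python
from typing import List


def f(x: float, u: float, dt: float = 0.1) -> float:
    """Hypothetical dynamics (Eq. 1): a discrete-time integrator."""
    return x + dt * u


def running_cost(x: float, u: float) -> float:
    """Hypothetical running cost ell(x, u)."""
    return 0.5 * (x * x + u * u)


def final_cost(x: float) -> float:
    """Hypothetical final cost ell_f(x_N)."""
    return 5.0 * x * x


def rollout(x0: float, U: List[float]) -> List[float]:
    """Apply Eq. 1 repeatedly; returns [x_0, x_1, ..., x_N]."""
    xs = [x0]
    for u in U:
        xs.append(f(xs[-1], u))
    return xs


def total_cost(x0: float, U: List[float]) -> float:
    """J_0(x, U) = sum_{i=0}^{N-1} ell(x_i, u_i) + ell_f(x_N)."""
    xs = rollout(x0, U)
    J = sum(running_cost(xs[i], U[i]) for i in range(len(U)))
    return J + final_cost(xs[-1])
```

For example, with $x_0=1$ and $N=3$, applying zero control keeps the state at 1 and gives $J_0=3\cdot\tfrac12+5=6.5$.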

Dynamic programming

Let $\mathbf {U} _{i}$ be the partial control sequence $\mathbf {U} _{i}\equiv \{\mathbf {u} _{i},\mathbf {u} _{i+1},\dots ,\mathbf {u} _{N-1}\}$ and define the cost-to-go $J_{i}$ as the partial sum of costs from $i$ to $N$:

J_{i}(\mathbf {x},\mathbf {U} _{i})=\sum _{j=i}^{N-1}\ell(\mathbf {x} _{j},\mathbf {u} _{j})+\ell _{f}(\mathbf {x} _{N}).

The optimal cost-to-go or value function at time $i$ is the cost-to-go given the minimizing control sequence:

V(\mathbf {x},i)\equiv \min _{\mathbf {U} _{i}}J_{i}(\mathbf {x},\mathbf {U} _{i}).

Setting $V(\mathbf {x} ,N)\equiv \ell _{f}(\mathbf {x} _{N})$, the dynamic programming principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:

V(\mathbf {x},i)=\min _{\mathbf {u}}[\ell(\mathbf {x},\mathbf {u})+V(\mathbf {f}(\mathbf {x},\mathbf {u}),i+1)]

(2)

This is the Bellman equation.
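The backward recursion of Eq. 2 can be tabulated exactly when states and controls are finite. The sketch below assumes a hypothetical integer state in $\{-3,\dots,3\}$, controls $\{-1,0,1\}$ and simple quadratic costs; it only shows how the minimization over whole control sequences is replaced by one minimization over a single control per state and time step.

```python
from typing import Dict, List, Tuple

STATES = range(-3, 4)      # hypothetical finite state set {-3, ..., 3}
CONTROLS = (-1, 0, 1)      # hypothetical finite control set


def f(x: int, u: int) -> int:
    """Hypothetical dynamics: step by u, saturating at the state bounds."""
    return max(-3, min(3, x + u))


def ell(x: int, u: int) -> float:
    """Hypothetical running cost."""
    return x * x + 0.5 * abs(u)


def ell_f(x: int) -> float:
    """Hypothetical final cost."""
    return 10.0 * x * x


def bellman_backward(N: int) -> Tuple[List[Dict[int, float]], List[Dict[int, int]]]:
    """Tabulate V(x, i) for i = N, N-1, ..., 0 using Eq. 2, plus the minimizing u."""
    V: List[Dict[int, float]] = [{} for _ in range(N + 1)]
    policy: List[Dict[int, int]] = [{} for _ in range(N)]
    V[N] = {x: ell_f(x) for x in STATES}          # V(x, N) = ell_f(x)
    for i in range(N - 1, -1, -1):                # proceed backwards in time
        for x in STATES:
            # one minimization over a single control, not over whole sequences
            u_best = min(CONTROLS, key=lambda u: ell(x, u) + V[i + 1][f(x, u)])
            policy[i][x] = u_best
            V[i][x] = ell(x, u_best) + V[i + 1][f(x, u_best)]
    return V, policy


V, policy = bellman_backward(3)
```

With $N=3$, starting from $x=2$ the optimal cost is $V(2,0)=6$ (step down twice, then hold). Tabulating $V$ for every state in this way is what trajectory optimization avoids.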

Differential dynamic programming

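DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward pass to compute and evaluate a new nominal trajectory. We begin with the backward pass. If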

\ell(\mathbf {x},\mathbf {u})+V(\mathbf {f}(\mathbf {x},\mathbf {u}),i+1)

is the argument of the $\min[]$ operator in Eq. 2, let $Q$ be the variation of this quantity around the $i$-th $(\mathbf {x} ,\mathbf {u} )$ pair:

{\begin{aligned}Q(\delta \mathbf {x} ,\delta \mathbf {u} )\equiv &\ell (\mathbf {x} +\delta \mathbf {x} ,\mathbf {u} +\delta \mathbf {u} )&&{}+V(\mathbf {f} (\mathbf {x} +\delta \mathbf {x} ,\mathbf {u} +\delta \mathbf {u} ),i+1)\\-&\ell (\mathbf {x} ,\mathbf {u} )&&{}-V(\mathbf {f} (\mathbf {x} ,\mathbf {u} ),i+1)\end{aligned}}

and expand to second order

\approx {\frac {1}{2}}{\begin{bmatrix}1\\\delta \mathbf {x} \\\delta \mathbf {u} \end{bmatrix}}^{\mathsf {T}}{\begin{bmatrix}0&Q_{\mathbf {x} }^{\mathsf {T}}&Q_{\mathbf {u} }^{\mathsf {T}}\\Q_{\mathbf {x} }&Q_{\mathbf {x} \mathbf {x} }&Q_{\mathbf {x} \mathbf {u} }\\Q_{\mathbf {u} }&Q_{\mathbf {u} \mathbf {x} }&Q_{\mathbf {u} \mathbf {u} }\end{bmatrix}}{\begin{bmatrix}1\\\delta \mathbf {x} \\\delta \mathbf {u} \end{bmatrix}}

(3)
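Multiplying out the block form of (3), and using $Q_{\mathbf {x} \mathbf {u} }=Q_{\mathbf {u} \mathbf {x} }^{\mathsf {T}}$ (which holds when the second derivatives are symmetric), gives the same approximation as an explicit quadratic:

Q(\delta \mathbf {x} ,\delta \mathbf {u} )\approx Q_{\mathbf {x} }^{\mathsf {T}}\delta \mathbf {x} +Q_{\mathbf {u} }^{\mathsf {T}}\delta \mathbf {u} +{\tfrac {1}{2}}\delta \mathbf {x} ^{\mathsf {T}}Q_{\mathbf {x} \mathbf {x} }\delta \mathbf {x} +\delta \mathbf {u} ^{\mathsf {T}}Q_{\mathbf {u} \mathbf {x} }\delta \mathbf {x} +{\tfrac {1}{2}}\delta \mathbf {u} ^{\mathsf {T}}Q_{\mathbf {u} \mathbf {u} }\delta \mathbf {u}

Its gradient with respect to $\delta \mathbf {u}$ is $Q_{\mathbf {u} }+Q_{\mathbf {u} \mathbf {x} }\delta \mathbf {x} +Q_{\mathbf {u} \mathbf {u} }\delta \mathbf {u}$, and setting it to zero is what yields the minimizer over $\delta \mathbf {u}$.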

The $Q$ notation used here is a variant of the notation of Morimoto, where subscripts denote differentiation in denominator layout.[5] Dropping the index $i$ for readability, and using primes to denote the next time step, $V'\equiv V(i+1)$, the expansion coefficients are

{\begin{aligned}Q_{\mathbf {x} }&=\ell _{\mathbf {x} }+\mathbf {f} _{\mathbf {x} }^{\mathsf {T}}V'_{\mathbf {x} }\\Q_{\mathbf {u} }&=\ell _{\mathbf {u} }+\mathbf {f} _{\mathbf {u} }^{\mathsf {T}}V'_{\mathbf {x} }\\Q_{\mathbf {x} \mathbf {x} }&=\ell _{\mathbf {x} \mathbf {x} }+\mathbf {f} _{\mathbf {x} }^{\mathsf {T}}V'_{\mathbf {x} \mathbf {x} }\mathbf {f} _{\mathbf {x} }+V'_{\mathbf {x} }\cdot \mathbf {f} _{\mathbf {x} \mathbf {x} }\\Q_{\mathbf {u} \mathbf {u} }&=\ell _{\mathbf {u} \mathbf {u} }+\mathbf {f} _{\mathbf {u} }^{\mathsf {T}}V'_{\mathbf {x} \mathbf {x} }\mathbf {f} _{\mathbf {u} }+V'_{\mathbf {x} }\cdot \mathbf {f} _{\mathbf {u} \mathbf {u} }\\Q_{\mathbf {u} \mathbf {x} }&=\ell _{\mathbf {u} \mathbf {x} }+\mathbf {f} _{\mathbf {u} }^{\mathsf {T}}V'_{\mathbf {x} \mathbf {x} }\mathbf {f} _{\mathbf {x} }+V'_{\mathbf {x} }\cdot \mathbf {f} _{\mathbf {u} \mathbf {x} }.\end{aligned}}

The last terms in the last three equations denote the contraction of a vector with a tensor. Minimizing the quadratic approximation (3) with respect to $\delta \mathbf {u}$, we have

{\delta \mathbf {u} }^{*}=\operatorname {argmin} \limits _{\delta \mathbf {u} }Q(\delta \mathbf {x} ,\delta \mathbf {u} )=-Q_{\mathbf {u} \mathbf {u} }^{-1}(Q_{\mathbf {u} }+Q_{\mathbf {u} \mathbf {x} }\delta \mathbf {x} ),

(4)

giving an open-loop term $\mathbf {k} =-Q_{\mathbf {u} \mathbf {u} }^{-1}Q_{\mathbf {u} }$ and a feedback gain term $\mathbf {K} =-Q_{\mathbf {u} \mathbf {u} }^{-1}Q_{\mathbf {u} \mathbf {x} }$. Plugging the result back into (3), we now have a quadratic model of the value at time $i$:

{\begin{aligned}\Delta V(i)&=-{\tfrac {1}{2}}Q_{\mathbf {u} }^{\mathsf {T}}Q_{\mathbf {u} \mathbf {u} }^{-1}Q_{\mathbf {u} }\\V_{\mathbf {x} }(i)&=Q_{\mathbf {x} }-Q_{\mathbf {x} \mathbf {u} }Q_{\mathbf {u} \mathbf {u} }^{-1}Q_{\mathbf {u} }\\V_{\mathbf {x} \mathbf {x} }(i)&=Q_{\mathbf {x} \mathbf {x} }-Q_{\mathbf {x} \mathbf {u} }Q_{\mathbf {u} \mathbf {u} }^{-1}Q_{\mathbf {u} \mathbf {x} }.\end{aligned}}
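For a scalar state and control, the expansion coefficients, Eq. 4 and the value update reduce to ordinary arithmetic, and the tensor contractions become products. The function below is a sketch of one backward-pass step under that 1-D assumption; the argument names (`lx` for $\ell_{\mathbf{x}}$, `Vx_next` for $V'_{\mathbf{x}}$, and so on) are hypothetical.

```python
from typing import NamedTuple


class StepResult(NamedTuple):
    k: float    # open-loop term
    K: float    # feedback gain
    dV: float   # expected cost change, Delta V(i)
    Vx: float   # V_x(i)
    Vxx: float  # V_xx(i)


def backward_step(lx: float, lu: float, lxx: float, luu: float, lux: float,
                  fx: float, fu: float, fxx: float, fuu: float, fux: float,
                  Vx_next: float, Vxx_next: float) -> StepResult:
    """One DDP backward step for a scalar state and a scalar control."""
    # Expansion coefficients; in 1-D the tensor contractions V'_x . f_.. are products
    Qx = lx + fx * Vx_next
    Qu = lu + fu * Vx_next
    Qxx = lxx + fx * Vxx_next * fx + Vx_next * fxx
    Quu = luu + fu * Vxx_next * fu + Vx_next * fuu
    Qux = lux + fu * Vxx_next * fx + Vx_next * fux
    if Quu <= 0.0:
        raise ValueError("Q_uu must be positive for the minimizer (4) to exist")
    k = -Qu / Quu                    # open-loop term
    K = -Qux / Quu                   # feedback gain
    dV = -0.5 * Qu * Qu / Quu        # Delta V(i)
    Vx = Qx - Qux * Qu / Quu         # Q_xu equals Q_ux in 1-D
    Vxx = Qxx - Qux * Qux / Quu
    return StepResult(k, K, dV, Vx, Vxx)
```

For $\ell=\tfrac12(x^2+u^2)$, $f=x+u$ and $V'=\tfrac12x^2$ at the nominal pair $(x,u)=(1,0)$, the step returns $k=K=-0.5$, $\Delta V=-0.25$ and $V_x=V_{xx}=1.5$, which matches the exact one-step value function $V(x)=0.75\,x^2$.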

Recursively computing the local quadratic models of $V(i)$ and the control modifications $\{\mathbf {k} (i),\mathbf {K} (i)\}$, from $i=N-1$ down to $i=0$, constitutes the backward pass. As above, the value function is initialized with $V(\mathbf {x} ,N)\equiv \ell _{f}(\mathbf {x} _{N})$. Once the backward pass is complete, a forward pass computes a new trajectory:

{\begin{aligned}{\hat {\mathbf {x} }}(0)&=\mathbf {x} (0)\\{\hat {\mathbf {u} }}(i)&=\mathbf {u} (i)+\mathbf {k} (i)+\mathbf {K} (i)({\hat {\mathbf {x} }}(i)-\mathbf {x} (i))\\{\hat {\mathbf {x} }}(i+1)&=\mathbf {f} ({\hat {\mathbf {x} }}(i),{\hat {\mathbf {u} }}(i))\end{aligned}}

The backward and forward passes are iterated until convergence.
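Putting the backward and forward passes together gives the basic iteration. The sketch below is a plain, unregularized DDP loop for a scalar problem; the test problem (dynamics $x+\Delta t\,(a\sin x+u)$ with quadratic costs), the class and function names, and the stopping rule on $|\mathbf{k}|$ are all assumptions made for illustration. Time indices are 0-based to match the control sequence $\mathbf{u}_0,\dots,\mathbf{u}_{N-1}$.

```python
import math
from typing import List, Tuple


class Problem:
    """Hypothetical scalar problem: f(x, u) = x + dt*(a*sin(x) + u),
    ell(x, u) = 0.5*(x^2 + u^2), ell_f(x) = 0.5*w*x^2."""

    def __init__(self, dt: float = 0.1, a: float = -1.0, w: float = 10.0) -> None:
        self.dt, self.a, self.w = dt, a, w

    def f(self, x: float, u: float) -> float:
        return x + self.dt * (self.a * math.sin(x) + u)

    def f_derivs(self, x: float) -> Tuple[float, float, float, float, float]:
        """f_x, f_u, f_xx, f_uu, f_ux; for these dynamics they do not depend on u."""
        dt, a = self.dt, self.a
        return 1.0 + dt * a * math.cos(x), dt, -dt * a * math.sin(x), 0.0, 0.0

    def cost(self, xs: List[float], us: List[float]) -> float:
        """J_0: running costs for i = 0..N-1 plus the final cost at x_N."""
        running = sum(0.5 * (x * x + u * u) for x, u in zip(xs, us))
        return running + 0.5 * self.w * xs[-1] ** 2


def rollout(p: Problem, x0: float, us: List[float]) -> List[float]:
    xs = [x0]
    for u in us:
        xs.append(p.f(xs[-1], u))
    return xs


def ddp(p: Problem, x0: float, us: List[float],
        max_iters: int = 100, tol: float = 1e-10) -> Tuple[List[float], float]:
    """Plain DDP with full steps: no regularization and no line search."""
    us = list(us)
    xs = rollout(p, x0, us)
    N = len(us)
    for _ in range(max_iters):
        # Backward pass, initialized from V(x, N) = ell_f(x_N)
        Vx, Vxx = p.w * xs[N], p.w
        ks, Ks = [0.0] * N, [0.0] * N
        for i in range(N - 1, -1, -1):
            x, u = xs[i], us[i]
            fx, fu, fxx, fuu, fux = p.f_derivs(x)
            # For this cost: ell_x = x, ell_u = u, ell_xx = ell_uu = 1, ell_ux = 0
            Qx = x + fx * Vx
            Qu = u + fu * Vx
            Qxx = 1.0 + fx * Vxx * fx + Vx * fxx
            Quu = 1.0 + fu * Vxx * fu + Vx * fuu
            Qux = fu * Vxx * fx + Vx * fux
            if Quu <= 0.0:
                raise ValueError("Q_uu is not positive; regularization would be needed")
            ks[i], Ks[i] = -Qu / Quu, -Qux / Quu
            Vx = Qx - Qux * Qu / Quu
            Vxx = Qxx - Qux * Qux / Quu
        # Forward pass: x_hat(0) = x(0), then apply the updated feedback policy
        new_xs, new_us = [x0], []
        for i in range(N):
            u_hat = us[i] + ks[i] + Ks[i] * (new_xs[i] - xs[i])
            new_us.append(u_hat)
            new_xs.append(p.f(new_xs[i], u_hat))
        xs, us = new_xs, new_us
        if max((abs(k) for k in ks), default=0.0) < tol:
            break
    return us, p.cost(xs, us)
```

On a linear-quadratic instance (`a=0`) a single iteration already reaches the optimum, as expected of a second-order method: `ddp(Problem(dt=1.0, a=0.0, w=1.0), 1.0, [0.0])` returns the control `[-0.5]` with cost 0.75. The loop raises an error instead of regularizing when $Q_{\mathbf{u}\mathbf{u}}$ is not positive.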

Regularization and line-search

Differential dynamic programming is a second-order algorithm like Newton's method. It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence.[7] Regularization in the DDP context means ensuring that the $Q_{\mathbf {u} \mathbf {u} }$ matrix in Eq. 4 is positive definite. Line-search in DDP amounts to scaling the open-loop control modification $\mathbf {k}$ by some $0<\alpha <1$.
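A sketch of both safeguards for the scalar case, under stated assumptions: regularization adds a shift $\mu$ to $Q_{\mathbf{u}\mathbf{u}}$ and increases it tenfold until the result is positive (one common choice, not the only one), and the line search tries $\alpha=1$ (no scaling) first, then halves $\alpha$ until the cost decreases. The function names, schedules and the one-step test problem are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple


def regularized_gains(Qu: float, Quu: float, Qux: float, mu: float = 0.0,
                      mu_factor: float = 10.0,
                      mu_min: float = 1e-6) -> Tuple[float, float, float]:
    """Scalar Eq. 4 with Q_uu replaced by Q_uu + mu, mu grown until it is positive.

    Returns (k, K, mu). The additive shift is an assumed, Levenberg-Marquardt-like choice.
    """
    while Quu + mu <= 0.0:
        mu = max(mu * mu_factor, mu_min)
    Quu_reg = Quu + mu
    return -Qu / Quu_reg, -Qux / Quu_reg, mu


def line_search(cost_of: Callable[[float], float], J_current: float,
                alphas: Sequence[float] = (1.0, 0.5, 0.25, 0.125, 0.0625)
                ) -> Tuple[float, float]:
    """Backtracking over alpha; cost_of(alpha) is assumed to run a forward pass with
    u_hat = u + alpha*k + K*(x_hat - x) and return the resulting total cost."""
    for alpha in alphas:
        J_new = cost_of(alpha)
        if J_new < J_current:      # accept the first alpha that lowers the cost
            return alpha, J_new
    return 0.0, J_current          # no decrease: keep the nominal trajectory


# Hypothetical one-step example: x1 = x0 + u, ell_f(x1) = sqrt(1 + x1^2),
# no running cost, nominal u = 0. Q_u, Q_uu (= Q_ux) are the derivatives at u = 0.
x0 = 2.0
Qu = x0 / math.sqrt(1.0 + x0 ** 2)
Quu = (1.0 + x0 ** 2) ** -1.5
k, K, mu = regularized_gains(Qu, Quu, Qux=Quu)


def cost_of(alpha: float) -> float:
    return math.sqrt(1.0 + (x0 + alpha * k) ** 2)


alpha, J_new = line_search(cost_of, cost_of(0.0))
```

In this example the full step $k\approx-10$ overshoots the minimum of $\sqrt{1+x_1^2}$ (the cost rises from about 2.24 to about 8.06), and so does $\alpha=0.5$; the search accepts $\alpha=0.25$, which lowers the cost to $\sqrt{1.25}\approx1.12$.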

Monte Carlo version

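Sampled differential dynamic programming (SaDDP)[8][9][10] is a Monte Carlo variant of differential dynamic programming. It is based on treating the quadratic cost of differential dynamic programming as the energy of a Boltzmann distribution. This way the quantities of DDP can be matched to the statistics of a multidimensional normal distribution. The statistics can be recomputed from sampled trajectories without differentiation.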
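SaDDP itself is not reproduced here. The sketch below only illustrates the correspondence described above: if a quadratic cost $q(u)$ is read as the energy of a Boltzmann distribution $p(u)\propto e^{-q(u)}$, that distribution is normal, with mean equal to the minimizer and variance equal to the inverse Hessian, so both can be estimated from samples using cost evaluations alone. The cost $q$, the proposal width, the sample size and the use of self-normalized importance sampling are hypothetical choices, not the method of the cited papers.

```python
import math
import random
from typing import Callable, Tuple


def boltzmann_moments(cost: Callable[[float], float], n: int = 100_000,
                      proposal_std: float = 1.5, seed: int = 0) -> Tuple[float, float]:
    """Estimate mean and variance of p(u) ~ exp(-cost(u)) by self-normalized
    importance sampling from N(0, proposal_std^2); no derivatives are used."""
    rng = random.Random(seed)
    us = [rng.gauss(0.0, proposal_std) for _ in range(n)]
    # log-weights: target log-density minus proposal log-density (constants dropped)
    logw = [-cost(u) + u * u / (2.0 * proposal_std ** 2) for u in us]
    shift = max(logw)                      # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in logw]
    total = sum(w)
    mean = sum(wi * u for wi, u in zip(w, us)) / total
    var = sum(wi * (u - mean) ** 2 for wi, u in zip(w, us)) / total
    return mean, var


def q(u: float) -> float:
    """Hypothetical quadratic cost: minimizer 0.5, curvature (Hessian) 4."""
    return 0.5 * 4.0 * (u - 0.5) ** 2 + 1.0


mean, var = boltzmann_moments(q)
minimizer_est, hessian_est = mean, 1.0 / var   # roughly 0.5 and 4.0
```

The additive constant in $q$ cancels in the normalization, so only the shape of the cost matters; the recovered mean and inverse variance approximate the minimizer and $Q_{\mathbf{u}\mathbf{u}}$ of the quadratic model.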

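Sampled differential dynamic programming has been extended to Path Integral Policy Improvement with Differential Dynamic Programming.[11] This creates a link between differential dynamic programming and path integral control, which is a framework of stochastic optimal control.[12]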

Constrained problems

Interior Point Differential Dynamic Programming (IPDDP) is an interior-point method generalization of DDP that can address the optimal control problem with nonlinear state and input constraints.[13]

See also

Optimal control

References

[1] Mayne, D. Q. (1966). "A second-order gradient method of optimizing non-linear discrete time systems". Int J Control. 3: 85–95. doi:10.1080/00207176608921369.
[2] Jacobson, David H.; Mayne, David Q. (1970). Differential dynamic programming. New York: American Elsevier Pub. Co. ISBN 978-0-444-00070-5.
[3] de O. Pantoja, J. F. A. (1988). "Differential dynamic programming and Newton's method". International Journal of Control. 47 (5): 1539–1553. doi:10.1080/00207178808906114. ISSN 0020-7179.
[4] Liao, L. Z.; Shoemaker, C. A. (1992). "Advantages of differential dynamic programming over Newton's method for discrete-time optimal control problems". Cornell University, Ithaca, NY. hdl:1813/5474.
[5] Morimoto, J.; Zeglin, G.; Atkeson, C. G. (2003). "Minimax differential dynamic programming: Application to a biped walking robot". Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003). Vol. 2. pp. 1927–1932.
[6] Liao, L. Z.; Shoemaker, C. A. (1991). "Convergence in unconstrained discrete-time differential dynamic programming". IEEE Transactions on Automatic Control. 36 (6): 692. doi:10.1109/9.86943.
[7] Tassa, Y. (2011). Theory and implementation of bio-mimetic motor controllers (PDF) (Thesis). Hebrew University. Archived from the original (PDF) on 2016-03-04. Retrieved 2012-02-27.
[8] "Sampled differential dynamic programming". IEEE Conference Publication. doi:10.1109/IROS.2016.7759229. S2CID 1338737.
[9] "Regularizing Sampled Differential Dynamic Programming". IEEE Conference Publication. ieeexplore.ieee.org. Retrieved 2018-10-19.
[10] Rajamäki, Joose (2018). Random Search Algorithms for Optimal Control. Aalto University. ISBN 9789526081564. ISSN 1799-4942.
[11] Lefebvre, Tom; Crevecoeur, Guillaume (July 2019). "Path Integral Policy Improvement with Differential Dynamic Programming". 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM): 739–745. doi:10.1109/AIM.2019.8868359. hdl:1854/LU-8623968. ISBN 978-1-7281-2493-3. S2CID 204816072.
[12] Theodorou, Evangelos; Buchli, Jonas; Schaal, Stefan (May 2010). "Reinforcement learning of motor skills in high dimensions: A path integral approach". 2010 IEEE International Conference on Robotics and Automation: 2397–2403. doi:10.1109/ROBOT.2010.5509336. ISBN 978-1-4244-5038-1. S2CID 15116370.
[13] Pavlov, Andrei; Shames, Iman; Manzie, Chris (2020). "Interior Point Differential Dynamic Programming". arXiv:2004.12710 [math.OC].
