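Backpropagation

In machine learning, backpropagation (backprop,^[1] BP) is a widely used algorithm for training feedforward neural networks. Generalizations of backpropagation exist for other artificial neural networks (ANNs), and for functions generally. These classes of algorithms are all referred to generically as "backpropagation".^[2] In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input-output example, and does so efficiently, unlike a naive direct computation of the gradient with respect to each weight individually. This efficiency makes it feasible to use gradient methods for training multilayer networks, updating weights to minimize loss; gradient descent, or variants such as stochastic gradient descent, are commonly used. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, and iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this is an example of dynamic programming.^[3]

The term backpropagation strictly refers only to the algorithm for computing the gradient, not how the gradient is used; however, the term is often used loosely to refer to the entire learning algorithm, including how the gradient is used, such as by stochastic gradient descent.^[4] Backpropagation generalizes the gradient computation in the delta rule, which is the single-layer version of backpropagation, and is in turn generalized by automatic differentiation, where backpropagation is a special case of reverse accumulation (or "reverse mode").^[5] The term backpropagation and its general use in neural networks was announced in Rumelhart, Hinton & Williams (1986a), then elaborated and popularized in Rumelhart, Hinton & Williams (1986b), but the technique was independently rediscovered many times, and had many predecessors dating to the 1960s; see § History.^[6] A modern overview is given in the deep learning textbook by Goodfellow, Bengio & Courville (2016).^[7]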

Overview

Backpropagation computes the gradient in weight space of a feedforward neural network, with respect to a loss function. Denote:

$x$: input (vector of features)
$y$: target output
For classification, the output will be a vector of class probabilities (e.g., $(0.1,0.7,0.2)$), and the target output is a specific class, encoded by a one-hot/dummy variable (e.g., $(0,1,0)$).
$C$: loss function or "cost function"^[a]
For classification, this is usually cross entropy (XC, log loss), while for regression it is usually squared error loss (SEL).
$L$: the number of layers
$W^{l}=(w_{jk}^{l})$: the weights between layer $l-1$ and layer $l$, where $w_{jk}^{l}$ is the weight between the $k$-th node in layer $l-1$ and the $j$-th node in layer $l$^[b]
$f^{l}$: activation functions at layer $l$
For classification, the last layer is usually the logistic function for binary classification and softmax (softargmax) for multi-class classification, while for the hidden layers this was traditionally a sigmoid function (logistic function or others) on each node (coordinate), but today is more varied, with the rectifier (ramp, ReLU) being common.

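Other intermediate quantities are used in the derivation of backpropagation; they are introduced as needed below. Bias terms are not treated specially, as they correspond to a weight with a fixed input of 1. For the purpose of backpropagation, the specific loss function and activation functions do not matter, as long as they and their derivatives can be evaluated efficiently. Traditional activation functions include but are not limited to sigmoid, tanh, and ReLU. Since then, swish,^[8] mish,^[9] and other activation functions have also been proposed.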
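As a rough illustration of the requirement that an activation function and its derivative both be cheap to evaluate, the following Python sketch pairs three traditional activation functions with their derivatives. The function names, the `ACTIVATIONS` table, and the convention that the ReLU derivative at 0 is taken to be 0 are assumptions made for this example only.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the logistic sigmoid, written in terms of its value."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    """Derivative of tanh."""
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    """Rectifier (ramp) function."""
    return max(0.0, z)

def relu_prime(z):
    # Assumption: the derivative at z == 0 is taken to be 0.
    return 1.0 if z > 0 else 0.0

# Each activation is stored next to its derivative so both can be looked up together.
ACTIVATIONS = {
    "sigmoid": (sigmoid, sigmoid_prime),
    "tanh": (math.tanh, tanh_prime),
    "relu": (relu, relu_prime),
}
```

Keeping the derivative beside the function is one possible design; it lets a backward pass look up $(f^{l})'$ without knowing which activation a layer uses.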

The overall network is a combination of function composition and matrix multiplication:

g(x):=f^{L}(W^{L}f^{L-1}(W^{L-1}\cdots f^{1}(W^{1}x)\cdots ))
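The composition above can be sketched in a few lines of plain Python. This is a minimal sketch, not a reference implementation: the helper names `matvec` and `g`, the logistic activation on every layer, and the 2-3-1 network with hand-picked weights are all assumptions chosen for illustration.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(z):
    """Logistic function applied to each coordinate of z."""
    return [1.0 / (1.0 + math.exp(-t)) for t in z]

def g(x, weights, activations):
    """Evaluate f^L(W^L f^{L-1}( ... f^1(W^1 x) ... ))."""
    a = x
    for W, f in zip(weights, activations):
        a = f(matvec(W, a))  # one layer: matrix multiplication, then activation
    return a

# Hypothetical 2-3-1 network with hand-picked weights.
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
W2 = [[0.3, -0.1, 0.2]]
output = g([1.0, 2.0], [W1, W2], [sigmoid, sigmoid])
```

Row $j$ of each matrix holds the weights $w_{jk}^{l}$ feeding node $j$ of layer $l$, matching the convention that left multiplication by $W^{l}$ maps the outputs of layer $l-1$ to the inputs of layer $l$.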

For a training set there will be a set of input-output pairs, $\left\{(x_{i},y_{i})\right\}$. For each input-output pair $(x_{i},y_{i})$ in the training set, the loss of the model on that pair is the cost of the difference between the predicted output $g(x_{i})$ and the target output $y_{i}$:

C(y_{i},g(x_{i}))
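For concreteness, the two losses named in the notation list can be evaluated on the example vectors given there (target $(0,1,0)$, predicted probabilities $(0.1,0.7,0.2)$). The sketch below is illustrative: the function names are made up, the target is passed first to mirror $C(y_{i},g(x_{i}))$, and the squared error is assumed to be the unscaled sum of squared differences.

```python
import math

def cross_entropy(y, p):
    """Cross entropy between a one-hot target y and predicted probabilities p."""
    return -sum(t * math.log(q) for t, q in zip(y, p) if t > 0)

def squared_error(y, p):
    """Sum of squared differences (assumed unscaled)."""
    return sum((t - q) ** 2 for t, q in zip(y, p))

target = (0, 1, 0)
predicted = (0.1, 0.7, 0.2)
xc = cross_entropy(target, predicted)  # equals -log(0.7)
se = squared_error(target, predicted)  # equals 0.1^2 + 0.3^2 + 0.2^2
```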

Note the distinction: during model evaluation, the weights are fixed, while the inputs vary (and the target output may be unknown), and the network ends with the output layer (it does not include the loss function). During model training, the input-output pair is fixed, while the weights vary, and the network ends with the loss function.

Backpropagation computes the gradient for a fixed input-output pair $(x_{i},y_{i})$, where the weights $w_{jk}^{l}$ can vary. Each individual component of the gradient, $\partial C/\partial w_{jk}^{l}$, can be computed by the chain rule; however, doing this separately for each weight is inefficient. Backpropagation computes the gradient efficiently by avoiding duplicate calculations and not computing unnecessary intermediate values: it computes the gradient of each layer (specifically, the gradient of the weighted input of each layer, denoted by $\delta ^{l}$) from back to front.

Informally, the key point is that since the only way a weight in $W^{l}$ affects the loss is through its effect on the next layer, and it does so linearly, $\delta ^{l}$ are the only data needed to compute the gradients of the weights at layer $l$; one can then compute the previous layer $\delta ^{l-1}$ and repeat recursively. This avoids inefficiency in two ways. Firstly, it avoids duplication, because when computing the gradient at layer $l$ there is no need to recompute all the derivatives on later layers $l+1,l+2,\ldots$ each time. Secondly, it avoids unnecessary intermediate calculations, because at each stage it directly computes the gradient of the weights with respect to the ultimate output (the loss), rather than unnecessarily computing the derivatives of the values of hidden layers with respect to changes in weights, $\partial a_{j'}^{l'}/\partial w_{jk}^{l}$.

Backpropagation can be expressed for simple feedforward networks in terms of matrix multiplication, or more generally in terms of the adjoint graph.

Matrix multiplication

For the basic case of a feedforward network, where nodes in each layer are connected only to nodes in the immediate next layer (without skipping any layers), and there is a loss function that computes a scalar loss for the final output, backpropagation can be understood simply by matrix multiplication.^[c] Essentially, backpropagation evaluates the expression for the derivative of the cost function as a product of derivatives between each layer from left to right ("backwards", since the leftmost factor belongs to the loss), with the gradient of the weights between each layer being a simple modification of the partial products (the "backwards propagated error").

Given an input-output pair $(x,y)$, the loss is:

C(y,f^{L}(W^{L}f^{L-1}(W^{L-1}\cdots f^{2}(W^{2}f^{1}(W^{1}x))\cdots )))

To compute this, one starts with the input $x$ and works forward; denote the weighted input of each layer as $z^{l}$ and the output of layer $l$ as the activation $a^{l}$. For backpropagation, the activation $a^{l}$ as well as the derivatives $(f^{l})'$ (evaluated at $z^{l}$) must be cached for use during the backwards pass.

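The derivative of the loss in terms of the inputs is given by the chain rule; note that each term is a total derivative, evaluated at the value of the network (at each node) on the input $x$:

{\frac {dC}{da^{L}}}\circ {\frac {da^{L}}{dz^{L}}}\cdot {\frac {dz^{L}}{da^{L-1}}}\circ {\frac {da^{L-1}}{dz^{L-1}}}\cdot {\frac {dz^{L-1}}{da^{L-2}}}\circ \ldots \circ {\frac {da^{1}}{dz^{1}}}\cdot {\frac {\partial z^{1}}{\partial x}},

where $\circ$ is a Hadamard product, that is, an element-wise product; the products are taken from left to right.

These terms are: the derivative of the loss function;^[d] the derivatives of the activation functions;^[e] and the matrices of weights:^[f]

{\frac {dC}{da^{L}}}\circ (f^{L})'\cdot W^{L}\circ (f^{L-1})'\cdot W^{L-1}\circ \cdots \circ (f^{1})'\cdot W^{1}.

The gradient $\nabla$ is the transpose of the derivative of the output in terms of the input, so the matrices are transposed and the order of multiplication is reversed, but the entries are the same:

\nabla _{x}C=(W^{1})^{T}\cdot (f^{1})'\circ \ldots \circ (W^{L-1})^{T}\cdot (f^{L-1})'\circ (W^{L})^{T}\cdot (f^{L})'\circ \nabla _{a^{L}}C.

Backpropagation then consists essentially of evaluating this expression from right to left (equivalently, multiplying the previous expression for the derivative from left to right), computing the gradient at each layer on the way. There is one added step, because the gradient of the weights is not just a subexpression: it requires an extra multiplication.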

Introduce the auxiliary quantity $\delta ^{l}$ for the partial products (multiplying from right to left), interpreted as the "error at level $l$" and defined as the gradient of the input values at level $l$:

\delta ^{l}:=(f^{l})'\circ (W^{l+1})^{T}\cdots \circ (W^{L-1})^{T}\cdot (f^{L-1})'\circ (W^{L})^{T}\cdot (f^{L})'\circ \nabla _{a^{L}}C.

Note that $\delta ^{l}$ is a vector, of length equal to the number of nodes in level $l$; each component is interpreted as the "cost attributable to (the value of) that node".

The gradient of the weights in layer $l$ is then:

\nabla _{W^{l}}C=\delta ^{l}(a^{l-1})^{T}.

The factor of $a^{l-1}$ is because the weights $W^{l}$ between level $l-1$ and level $l$ affect level $l$ proportionally to the inputs (activations): the inputs are fixed, the weights vary.

The $\delta ^{l}$ can easily be computed recursively, going from right to left, as:

\delta ^{l-1}:=(f^{l-1})'\circ (W^{l})^{T}\cdot \delta ^{l}.

The gradients of the weights can thus be computed using a few matrix multiplications for each level; this is backpropagation.
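The recursion for $\delta ^{l}$ and the weight gradient $\delta ^{l}(a^{l-1})^{T}$ can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: every layer uses the logistic activation (so $(f^{l})'$ equals $a^{l}(1-a^{l})$ and only the activations need caching), the loss is half the squared error, there are no bias terms, and the helper names and the 2-3-1 example network are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, weights):
    """Forward pass; returns the cached activations a^0 = x, a^1, ..., a^L."""
    acts = [x]
    for W in weights:
        z = matvec(W, acts[-1])               # weighted input z^l
        acts.append([sigmoid(t) for t in z])  # activation a^l
    return acts

def loss(y, a):
    """Assumed loss: half the squared error between target y and output a."""
    return 0.5 * sum((ai - yi) ** 2 for ai, yi in zip(a, y))

def backprop(x, y, weights):
    """Return the gradient of the loss with respect to each weight matrix."""
    acts = forward(x, weights)
    # delta^L = (f^L)' ∘ grad_{a^L} C, with f' = a (1 - a) for the logistic function.
    delta = [a * (1.0 - a) * (a - t) for a, t in zip(acts[-1], y)]
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):  # weights[l] plays the role of W^{l+1}
        a_prev = acts[l]
        # grad_{W} C = delta (a_prev)^T, an outer product.
        grads[l] = [[d * a for a in a_prev] for d in delta]
        if l > 0:
            # delta of the previous layer = (f')' ∘ W^T · delta.
            back = matvec(transpose(weights[l]), delta)
            delta = [a * (1.0 - a) * b for a, b in zip(a_prev, back)]
    return grads

# Hypothetical 2-3-1 network and a single training pair.
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
W2 = [[0.3, -0.1, 0.2]]
x, y = [1.0, 2.0], [1.0]
grads = backprop(x, y, [W1, W2])
```

A common sanity check for code like this is to compare each entry of `grads` against a central finite difference of `loss` with respect to the corresponding weight.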

Compared with naively computing forwards (using the $\delta ^{l}$ for illustration):

{\begin{aligned}\delta ^{1}&=(f^{1})'\circ (W^{2})^{T}\cdot (f^{2})'\circ \cdots \circ (W^{L-1})^{T}\cdot (f^{L-1})'\circ (W^{L})^{T}\cdot (f^{L})'\circ \nabla _{a^{L}}C\\\delta ^{2}&=(f^{2})'\circ \cdots \circ (W^{L-1})^{T}\cdot (f^{L-1})'\circ (W^{L})^{T}\cdot (f^{L})'\circ \nabla _{a^{L}}C\\&\vdots \\\delta ^{L-1}&=(f^{L-1})'\circ (W^{L})^{T}\cdot (f^{L})'\circ \nabla _{a^{L}}C\\\delta ^{L}&=(f^{L})'\circ \nabla _{a^{L}}C,\end{aligned}}

there are two key differences with backpropagation:

Computing $\delta ^{l-1}$ in terms of $\delta ^{l}$ avoids the obvious duplicate multiplication of layers $l$ and beyond.
Multiplying starting from $\nabla _{a^{L}}C$, propagating the error backwards, means that each step simply multiplies a vector ($\delta ^{l}$) by the matrices of weights $(W^{l})^{T}$ and derivatives of activations $(f^{l-1})'$. By contrast, multiplying forwards, starting from the changes at an earlier layer, means that each multiplication multiplies a matrix by a matrix. This is much more expensive, and corresponds to tracking every possible path of a change in one layer $l$ forward to changes in the layer $l+2$ (for multiplying $W^{l+1}$ by $W^{l+2}$, with additional multiplications for the derivatives of the activations), which unnecessarily computes the intermediate quantities of how weight changes affect the values of hidden nodes.
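The cost gap between the two multiplication orders can be made concrete by counting scalar multiplications. The sketch below is only a back-of-the-envelope count under simplifying assumptions: it ignores the element-wise activation derivatives, assumes at least two layers, and the function names and the layer widths in the example are hypothetical.

```python
def backward_mults(widths):
    """Multiplications to carry a vector from layer L back to layer 1.

    widths[l] is the number of nodes in layer l (widths[0] is the input size).
    Each step multiplies (W^l)^T, of shape widths[l-1] x widths[l], by a vector.
    """
    return sum(widths[l] * widths[l - 1] for l in range(2, len(widths)))

def forward_mults(widths):
    """Multiplications when the same product is accumulated matrix by matrix,
    starting from (W^2)^T and moving towards the output (assumes L >= 2)."""
    L = len(widths) - 1
    n1 = widths[1]
    total = 0
    for l in range(3, L + 1):
        # Accumulated (n1 x widths[l-1]) matrix times (W^l)^T of shape widths[l-1] x widths[l].
        total += n1 * widths[l - 1] * widths[l]
    total += n1 * widths[L]  # final product with the gradient vector at the output
    return total

widths = [100] * 5  # input layer plus four layers of 100 nodes each
cheap = backward_mults(widths)
costly = forward_mults(widths)
```

With these widths the backward order needs 30,000 multiplications and the forward order 2,010,000; the gap widens with layer width, since each matrix-matrix product costs an extra factor of the width.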

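Adjoint graph

For more general graphs, and other advanced variations, backpropagation can be understood in terms of automatic differentiation, where backpropagation is a special case of reverse accumulation (or "reverse mode").^[5]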
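To give a feel for reverse accumulation on a general graph, here is a toy scalar sketch in Python. It is not taken from any library: the `Var` class and the `backward` function are hypothetical helpers written for this example, and only addition and multiplication are supported.

```python
class Var:
    """A scalar node in a computation graph (hypothetical helper)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(output):
    """Accumulate d(output)/d(node) into node.grad, visiting nodes in
    reverse topological order so each node is finished before its parents."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad  # chain rule, summed over paths

x, w = Var(3.0), Var(2.0)
y = x * w + w * w  # y = x*w + w^2
backward(y)
# x.grad now holds dy/dx = w = 2, and w.grad holds dy/dw = x + 2w = 7
```

Because `w` feeds into `y` along several paths, its gradient is the sum of the contributions from each path, which is the graph analogue of the layer-by-layer recursion for feedforward networks.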

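Intuition

Motivation

The goal of any supervised learning algorithm is to find a function that best maps a set of inputs to their correct output. The motivation for backpropagation is to train a multi-layered neural network such that it can learn the appropriate internal representations to allow it to learn any arbitrary mapping of input to output.^[10]

Learning as an optimization problem

To understand the mathematical derivation of the backpropagation algorithm, it helps to first develop some intuition about the relationship between the actual output of a neuron and the correct output for a particular training example. Consider a simple neural network with two input units, one output unit and no hidden units, in which each neuron uses a linear output (unlike most work on neural networks, in which the mapping from inputs to outputs is non-linear)^[g] that is the weighted sum of its input.

A simple neural network with two input units (each with a single input) and one output unit (with two inputs)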

Initially, before training, the weights will be set randomly. Then the neuron learns from training examples, which in this case consist of a set of tuples $(x_{1},x_{2},t)$ where $x_{1}$ and $x_{2}$ are the inputs to the network and t is the correct output (the output the network should produce given those inputs, when it has been trained). The initial network, given $x_{1}$ and $x_{2}$, will compute an output y that likely differs from t (given random weights). A loss function $L(t,y)$ is used for measuring the discrepancy between the target output t and the computed output y. For regression analysis problems the squared error can be used as a loss function; for classification the categorical cross entropy can be used.

As an example, consider a regression problem using the square error as a loss:

L(t,y)=(t-y)^{2}=E,

where E is the discrepancy or error.

Consider the network on a single training case: $(1,1,0)$. Thus, the inputs $x_{1}$ and $x_{2}$ are 1 and 1 respectively and the correct output, t, is 0. Now if the relation is plotted between the network's output y on the horizontal axis and the error E on the vertical axis, the result is a parabola. The minimum of the parabola corresponds to the output y which minimizes the error E. For a single training case, the minimum also touches the horizontal axis, which means the error will be zero and the network can produce an output y that exactly matches the target output t. Therefore, the problem of mapping inputs to outputs can be reduced to an optimization problem of finding a function that will produce the minimal error.

Error surface of a linear neuron for a single training case

However, the output of a neuron depends on the weighted sum of all its inputs:

y=x_{1}w_{1}+x_{2}w_{2},

where $w_{1}$ and $w_{2}$ are the weights on the connections from the input units to the output unit. Therefore, the error also depends on the incoming weights to the neuron, which is ultimately what needs to be changed in the network to enable learning.

In this example, upon injecting the training data $(1,1,0)$, the loss function becomes

E=(t-y)^{2}=y^{2}=(x_{1}w_{1}+x_{2}w_{2})^{2}=(w_{1}+w_{2})^{2}.

Then, the loss function $E$ takes the form of a parabolic cylinder with its base directed along $w_{1}=-w_{2}$. Since all sets of weights that satisfy $w_{1}=-w_{2}$ minimize the loss function, in this case additional constraints are required to converge to a unique solution. Additional constraints could either be generated by setting specific conditions on the weights, or by injecting additional training data.

One commonly used algorithm to find the set of weights that minimizes the error is gradient descent. By backpropagation, the steepest descent direction of the loss function with respect to the present synaptic weights is calculated. Then, the weights can be modified along the steepest descent direction, and the error is minimized in an efficient way.
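The two-weight example can be run directly. The following Python sketch applies plain gradient descent to the linear neuron on the single training case $(1,1,0)$; the starting weights, the learning rate `eta`, and the number of steps are arbitrary assumptions for illustration.

```python
def error(w1, w2, x1=1.0, x2=1.0, t=0.0):
    """Squared error E = (t - y)^2 of the linear neuron y = x1*w1 + x2*w2."""
    y = x1 * w1 + x2 * w2
    return (t - y) ** 2

def gradient(w1, w2, x1=1.0, x2=1.0, t=0.0):
    """Partial derivatives of E with respect to w1 and w2."""
    y = x1 * w1 + x2 * w2
    common = -2.0 * (t - y)
    return common * x1, common * x2

w1, w2 = 0.8, 0.4  # hypothetical starting weights
eta = 0.1          # hypothetical learning rate
for _ in range(200):
    g1, g2 = gradient(w1, w2)
    # Step along the steepest descent direction.
    w1, w2 = w1 - eta * g1, w2 - eta * g2
```

From this start the iteration settles near $(0.2,-0.2)$, one of the infinitely many minimizers on the line $w_{1}=-w_{2}$. Both weights receive the same update, so their difference never changes, and a different start would reach a different point on that line.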

Derivation

The gradient descent method involves calculating the derivative of the loss function with respect to the weights of the network. This is normally done using backpropagation. Assuming one output neuron,^[h] the squared error function is

E=L(t,y)

where

$L$ is the loss for the output $y$ and target value $t$,
$t$ is the target output for a training sample, and
$y$ is the actual output of the output neuron.

For each neuron $j$, its output $o_{j}$ is defined as

o_{j}=\varphi ({\text{net}}_{j})=\varphi \left(\sum _{k=1}^{n}w_{kj}o_{k}\right),

where the activation function $\varphi$ is non-linear and differentiable (even if the ReLU is not differentiable at one point). A historically used activation function is the logistic function:

\varphi (z)={\frac {1}{1+e^{-z}}}

which has a convenient derivative of:

{\frac {d\varphi (z)}{dz}}=\varphi (z)(1-\varphi (z))
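The closed form for the derivative of the logistic function is easy to check numerically. The sketch below compares it with a central finite difference; the function names and the step size `h` are assumptions made for this example.

```python
import math

def logistic(z):
    """phi(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_prime(z):
    """Closed-form derivative phi(z) * (1 - phi(z))."""
    s = logistic(z)
    return s * (1.0 - s)

def central_difference(f, z, h=1e-6):
    """Numerical estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Largest disagreement between the two over a few sample points.
max_gap = max(abs(logistic_prime(z) - central_difference(logistic, z))
              for z in (-2.0, 0.0, 1.5))
```

The practical convenience is that the derivative is expressed through the value $\varphi (z)$ itself, so a backward pass can reuse the output already computed in the forward pass.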

The input ${\text{net}}_{j}$ to a neuron is the weighted sum of outputs $o_{k}$ of previous neurons. If the neuron is in the first layer after the input layer, the $o_{k}$ of the input layer are simply the inputs $x_{k}$ to the network. The number of input units to the neuron is $n$. The variable $w_{kj}$ denotes the weight between neuron $k$ of the previous layer and neuron $j$ of the current layer.

Finding the derivative of the error

Diagram of an artificial neural network to illustrate the notation used here

Calculating the partial derivative of the error with respect to a weight $w_{ij}$ is done using the chain rule twice:

{\frac {\partial E}{\partial w_{ij}}}={\frac {\partial E}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial w_{ij}}}={\frac {\partial E}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}}{\frac {\partial {\text{net}}_{j}}{\partial w_{ij}}}

(Eq. 1)

In the last factor of the right-hand side of the above, only one term in the sum ${\text{net}}_{j}$ depends on $w_{ij}$, so that

{\frac {\partial {\text{net}}_{j}}{\partial w_{ij}}}={\frac {\partial }{\partial w_{ij}}}\left(\sum _{k=1}^{n}w_{kj}o_{k}\right)={\frac {\partial }{\partial w_{ij}}}w_{ij}o_{i}=o_{i}.

(Eq. 2)

If the neuron is in the first layer after the input layer, $o_{i}$ is just $x_{i}$.

The derivative of the output of neuron $j$ with respect to its input is simply the partial derivative of the activation function:

{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}}={\frac {\partial \varphi ({\text{net}}_{j})}{\partial {\text{net}}_{j}}}

(Eq. 3)

which for the logistic activation function case is:

{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}}={\frac {\partial }{\partial {\text{net}}_{j}}}\varphi ({\text{net}}_{j})=\varphi ({\text{net}}_{j})(1-\varphi ({\text{net}}_{j}))=o_{j}(1-o_{j})

This is the reason why backpropagation requires the activation function to be differentiable. (Nevertheless, the ReLU activation function, which is non-differentiable at 0, has become quite popular, e.g. in AlexNet.)

The first factor is straightforward to evaluate if the neuron is in the output layer, because then $o_{j}=y$ and

{\frac {\partial E}{\partial o_{j}}}={\frac {\partial E}{\partial y}}

(Eq. 4)

If half of the square error is used as the loss function, this can be rewritten as

{\frac {\partial E}{\partial o_{j}}}={\frac {\partial E}{\partial y}}={\frac {\partial }{\partial y}}{\frac {1}{2}}(t-y)^{2}=y-t

However, if $j$ is in an arbitrary inner layer of the network, finding the derivative of $E$ with respect to $o_{j}$ is less obvious.

Considering $E$ as a function whose inputs are all neurons $L=\{u,v,\dots ,w\}$ receiving input from neuron $j$,

{\frac {\partial E(o_{j})}{\partial o_{j}}}={\frac {\partial E(\mathrm {net} _{u},{\text{net}}_{v},\dots ,\mathrm {net} _{w})}{\partial o_{j}}}

and taking the total derivative with respect to $o_{j}$, a recursive expression for the derivative is obtained:

{\frac {\partial E}{\partial o_{j}}}=\sum _{\ell \in L}\left({\frac {\partial E}{\partial {\text{net}}_{\ell }}}{\frac {\partial {\text{net}}_{\ell }}{\partial o_{j}}}\right)=\sum _{\ell \in L}\left({\frac {\partial E}{\partial o_{\ell }}}{\frac {\partial o_{\ell }}{\partial {\text{net}}_{\ell }}}{\frac {\partial {\text{net}}_{\ell }}{\partial o_{j}}}\right)=\sum _{\ell \in L}\left({\frac {\partial E}{\partial o_{\ell }}}{\frac {\partial o_{\ell }}{\partial {\text{net}}_{\ell }}}w_{j\ell }\right)

(Eq. 5)

Therefore, the derivative with respect to $o_{j}$ can be calculated if all the derivatives with respect to the outputs $o_{\ell }$ of the next layer (the ones closer to the output neuron) are known. (Note: if any of the neurons in set $L$ were not connected to neuron $j$, they would be independent of $w_{ij}$ and the corresponding partial derivative under the summation would vanish to 0.)

Substituting Eq. 2, Eq. 3, Eq. 4 and Eq. 5 in Eq. 1 we obtain:

{\frac {\partial E}{\partial w_{ij}}}={\frac {\partial E}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}}{\frac {\partial {\text{net}}_{j}}{\partial w_{ij}}}={\frac {\partial E}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}}o_{i}

{\frac {\partial E}{\partial w_{ij}}}=o_{i}\delta _{j}

with

\delta _{j}={\frac {\partial E}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}}={\begin{cases}{\frac {\partial L(o_{j},t)}{\partial o_{j}}}{\frac {d\varphi ({\text{net}}_{j})}{d{\text{net}}_{j}}}&{\text{if }}j{\text{ is an output neuron,}}\\(\sum _{\ell \in L}w_{j\ell }\delta _{\ell }){\frac {d\varphi ({\text{net}}_{j})}{d{\text{net}}_{j}}}&{\text{if }}j{\text{ is an inner neuron.}}\end{cases}}

If $\varphi$ is the logistic function, and the error is the square error:

\delta _{j}={\frac {\partial E}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}}={\begin{cases}(o_{j}-t_{j})o_{j}(1-o_{j})&{\text{if }}j{\text{ is an output neuron,}}\\(\sum _{\ell \in L}w_{j\ell }\delta _{\ell })o_{j}(1-o_{j})&{\text{if }}j{\text{ is an inner neuron.}}\end{cases}}
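The two cases of $\delta _{j}$ can be sketched for a small network in the per-neuron notation of this section. This is a minimal sketch under stated assumptions: a hypothetical network with two inputs, two hidden neurons and one output neuron, logistic activations throughout, half the squared error as the loss, no bias terms, and made-up weights and names.

```python
import math

def phi(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """w_hidden[i][j]: weight from input i to hidden neuron j;
    w_out[j]: weight from hidden neuron j to the single output neuron."""
    hidden = [phi(sum(x[i] * w_hidden[i][j] for i in range(len(x))))
              for j in range(len(w_out))]
    y = phi(sum(hidden[j] * w_out[j] for j in range(len(w_out))))
    return hidden, y

def gradients(x, t, w_hidden, w_out):
    """Return dE/dw for every weight, using dE/dw_ij = o_i * delta_j."""
    hidden, y = forward(x, w_hidden, w_out)
    # Output neuron: delta = (o - t) * o * (1 - o).
    delta_out = (y - t) * y * (1.0 - y)
    # Inner neuron j: delta_j = (sum over next-layer neurons of w_jl * delta_l) * o_j * (1 - o_j).
    # Here the only next-layer neuron is the output neuron.
    delta_hidden = [w_out[j] * delta_out * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(w_out))]
    grad_out = [hidden[j] * delta_out for j in range(len(w_out))]
    grad_hidden = [[x[i] * delta_hidden[j] for j in range(len(w_out))]
                   for i in range(len(x))]
    return grad_hidden, grad_out

# Hypothetical weights and a single training sample.
x, t = [1.0, 0.5], 1.0
w_hidden = [[0.2, -0.4], [0.7, 0.1]]
w_out = [0.5, -0.3]
grad_hidden, grad_out = gradients(x, t, w_hidden, w_out)
```

The output delta is computed once and reused by every hidden neuron, which is the per-neuron view of propagating the error backwards.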

To update the weight $w_{ij}$ using gradient descent, one must choose a learning rate, $\eta >0$ . The change in weight needs to reflect the impact on $E$ of an increase or decrease in $w_{ij}$ . If ${\frac {\partial E}{\partial w_{ij}}}>0$ , an increase in $w_{ij}$ increases $E$ ; conversely, if ${\frac {\partial E}{\partial w_{ij}}}<0$ , an increase in $w_{ij}$ decreases $E$ . The change $\Delta w_{ij}$ is added to the old weight. Taking it to be the product of the learning rate and the gradient, multiplied by $-1$ , moves $w_{ij}$ in the direction that decreases $E$ , provided the learning rate is small enough. In other words, in the equation immediately below, $-\eta {\frac {\partial E}{\partial w_{ij}}}$ changes $w_{ij}$ in such a way that $E$ is decreased to first order:

\Delta w_{ij}=-\eta {\frac {\partial E}{\partial w_{ij}}}=-\eta o_{i}\delta _{j}
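The update rule can be tried on the smallest possible case. The sketch below trains a single logistic output neuron on one sample with half the squared error as the loss; the initial weights, the sample, the learning rate `eta` and the number of steps are assumptions chosen for illustration.

```python
import math

def phi(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))

def error(w, x, t):
    """E = 1/2 (t - o)^2 for the neuron's output o."""
    o = phi(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (t - o) ** 2

def train_step(w, x, t, eta):
    """One update w_i <- w_i - eta * o_i * delta_j, where o_i is the input x_i."""
    o = phi(sum(wi * xi for wi, xi in zip(w, x)))
    delta = (o - t) * o * (1.0 - o)  # delta_j for an output neuron
    return [wi - eta * xi * delta for wi, xi in zip(w, x)]

w = [0.5, -0.3]          # hypothetical initial weights
x, t = [1.0, 2.0], 1.0   # hypothetical training sample
errors = [error(w, x, t)]
for _ in range(100):
    w = train_step(w, x, t, eta=0.5)
    errors.append(error(w, x, t))
```

In this run the recorded errors shrink at every step. With a much larger learning rate, or with several samples and layers, a single step is not guaranteed to reduce the error.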

Second-order gradient descent

Using a Hessian matrix of second-order derivatives of the error function, the Levenberg-Marquardt algorithm often converges faster than first-order gradient descent, especially when the topology of the error function is complicated.^[11]^[12] It may also find solutions in smaller node counts for which other methods might not converge.^[12] The Hessian can be approximated by the Fisher information matrix.^[13]

Loss function

The loss function is a function that maps values of one or more variables onto a real number intuitively representing some "cost" associated with those values. For backpropagation, the loss function calculates the difference between the network output and its expected output, after a training example has propagated through the network.

Assumptions

The mathematical expression of the loss function must fulfill two conditions for it to be usable in backpropagation.^[14] The first is that it can be written as an average ${\textstyle E={\frac {1}{n}}\sum _{x}E_{x}}$ over error functions ${\textstyle E_{x}}$ , for ${\textstyle n}$ individual training examples, ${\textstyle x}$ . The reason for this assumption is that the backpropagation algorithm calculates the gradient of the error function for a single training example, which needs to be generalized to the overall error function. The second assumption is that it can be written as a function of the outputs from the neural network.

Example loss function

Let $y,y'$ be vectors in $\mathbb {R} ^{n}$ .

Select an error function $E(y,y')$ measuring the difference between two outputs. The standard choice is the square of the Euclidean distance between the vectors $y$ and $y'$ :

E(y,y')={\tfrac {1}{2}}\lVert y-y'\rVert ^{2}

The error function over ${\textstyle n}$ training examples can then be written as an average of losses over individual examples:

E={\frac {1}{2n}}\sum _{x}\lVert (y(x)-y'(x))\rVert ^{2}
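As a small numerical illustration of the averaged error, the sketch below evaluates the per-example error and its average over two examples. The function names and the sample vectors are made up for this example.

```python
def example_error(y, y_prime):
    """E(y, y') = 1/2 * ||y - y'||^2 for a single example."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y, y_prime))

def total_error(outputs, targets):
    """Average of the per-example errors over n examples."""
    n = len(outputs)
    return sum(example_error(y, yp) for y, yp in zip(outputs, targets)) / n

outputs = [[0.0, 1.0], [1.0, 1.0]]  # hypothetical network outputs y(x)
targets = [[0.0, 0.0], [1.0, 3.0]]  # hypothetical expected outputs y'(x)
E = total_error(outputs, targets)   # (0.5 * 1 + 0.5 * 4) / 2
```

Because the total is an average of per-example terms, its gradient is the average of the per-example gradients, which is what allows backpropagation on single examples to be combined.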

Limitations

Gradient descent may find a local minimum instead of the global minimum.

Gradient descent with backpropagation is not guaranteed to find the global minimum of the error function, but only a local minimum; also, it has trouble crossing plateaus in the error function landscape. This issue, caused by the non-convexity of error functions in neural networks, was long thought to be a major drawback, but Yann LeCun et al. argue that in many practical problems, it is not.^[15]
Backpropagation learning does not require normalization of input vectors; however, normalization could improve performance.^[16]
Backpropagation requires the derivatives of activation functions to be known at network design time.

History

The term backpropagation and its general use in neural networks was announced in Rumelhart, Hinton & Williams (1986a), then elaborated and popularized in Rumelhart, Hinton & Williams (1986b), but the technique was independently rediscovered many times, and had many predecessors dating to the 1960s.^[6]^[17]

The basics of continuous backpropagation were derived in the context of control theory by Henry J. Kelley in 1960,^[18] and by Arthur E. Bryson in 1961.^[19]^[20]^[21]^[22]^[23] They used principles of dynamic programming. In 1962, Stuart Dreyfus published a simpler derivation based only on the chain rule.^[24] Bryson and Ho described it as a multi-stage dynamic system optimization method in 1969.^[25]^[26] Backpropagation was derived by multiple researchers in the early 1960s^[22] and implemented to run on computers as early as 1970 by Seppo Linnainmaa.^[27]^[28]^[29] Paul Werbos was the first in the US to propose that it could be used for neural nets after analyzing it in depth in his 1974 dissertation.^[30] While not applied to neural networks, in 1970 Linnainmaa published the general method for automatic differentiation (AD).^[28]^[29] Although very controversial, some scientists believe this was actually the first step toward developing a back-propagation algorithm.^[22]^[23]^[27]^[31] In 1973 Dreyfus adapted parameters of controllers in proportion to error gradients.^[32] In 1974 Werbos mentioned the possibility of applying this principle to artificial neural networks,^[30] and in 1982 he applied Linnainmaa's AD method to non-linear functions.^[23]^[33]

Later the Werbos method was rediscovered and described in 1985 by Parker,^[34]^[35] and in 1986 by Rumelhart, Hinton and Williams.^[17]^[35]^[36] Rumelhart, Hinton and Williams showed experimentally that this method can generate useful internal representations of incoming data in hidden layers of neural networks.^[10]^[37]^[38] Yann LeCun proposed the modern form of the back-propagation learning algorithm for neural networks in his PhD thesis in 1987. In 1993, Eric Wan won an international pattern recognition contest through backpropagation.^[22]^[39]

During the 2000s it fell out of favour, but returned in the 2010s, benefitting from cheap, powerful GPU-based computing systems. This has been especially so in speech recognition, machine vision, natural language processing, and language structure learning research (in which it has been used to explain a variety of phenomena related to first^[40] and second language learning^[41]).

Error backpropagation has been suggested to explain human brain ERP components like the N400 and P600.^[42]

Notes

^ Use $C$ for the loss function to allow $L$ to be used for the number of layers
^ This follows Nielsen (2015), and means (left) multiplication by the matrix $W^{l}$ corresponds to converting output values of layer $l-1$ to input values of layer $l$ : columns correspond to input coordinates, rows correspond to output coordinates.
^ This section largely follows and summarizes Nielsen (2015).
^ The derivative of the loss function is a covector, since the loss function is a scalar-valued function of several variables.
^ The activation function is applied to each node separately, so the derivative is just the diagonal matrix of the derivative on each node. This is often represented as the Hadamard product with the vector of derivatives, denoted by $(f^{l})'\odot$ , which is mathematically identical but better matches the internal representation of the derivatives as a vector, rather than a diagonal matrix.
^ Since matrix multiplication is linear, the derivative of multiplying by a matrix is just the matrix: $(Wx)'=W$ .
^ One may notice that multi-layer neural networks use non-linear activation functions, so an example with linear neurons seems obscure. However, even though the error surface of a multi-layer network is much more complicated, locally it can be approximated by a paraboloid. Therefore, linear neurons are used for simplicity and easier understanding.
^ There can be multiple output neurons, in which case the error is the squared norm of the difference vector.

References

^ Goodfellow, Bengio & Courville 2016, p. 200, "The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, ..."
^ Goodfellow, Bengio & Courville 2016, p. 200, "Furthermore, back-propagation is often misunderstood as being specific to multi-layer neural networks, but in principle it can compute derivatives of any function"
^ Goodfellow, Bengio & Courville 2016, p. 214, "This table-filling strategy is sometimes called dynamic programming."
^ Goodfellow, Bengio & Courville 2016, p. 200, "The term back-propagation is often misunderstood as meaning the whole learning algorithm for multilayer neural networks. Backpropagation refers only to the method for computing the gradient, while other algorithms, such as stochastic gradient descent, is used to perform learning using this gradient."
^ ^a ^b Goodfellow, Bengio & Courville (2016, p. 217–218), "The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called reverse mode accumulation."
^ ^a ^b Goodfellow, Bengio & Courville (2016, p. 221), "Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for sensitivity analysis (Linnainmaa, 1976). ... The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks."
^ Goodfellow, Bengio & Courville (2016, 6.5 Back-Propagation and Other Differentiation Algorithms, pp. 200–220)
^ Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. (2017-10-27). "Searching for Activation Functions". arXiv:1710.05941 [cs.NE].
^ Misra, Diganta (2019-08-23). "Mish: A Self Regularized Non-Monotonic Activation Function". arXiv:1908.08681 [cs.LG].
^ ^a ^b Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986a). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0. S2CID 205001834.
^ Tan, Hong Hui; Lim, King Han (2019). "Review of second-order optimization techniques in artificial neural networks backpropagation". IOP Conference Series: Materials Science and Engineering. 495 (1): 012003. Bibcode:2019MS&E..495a2003T. doi:10.1088/1757-899X/495/1/012003. S2CID 208124487.
^ ^a ^b Wilamowski, Bogdan; Yu, Hao (June 2010). "Improved Computation for Levenberg–Marquardt Training" (PDF). IEEE Transactions on Neural Networks and Learning Systems. 21 (6).
^ Martens, James (August 2020). "New Insights and Perspectives on the Natural Gradient Method" (PDF). Journal of Machine Learning Research (21). arXiv:1412.1193.
^ Nielsen (2015), "[W]hat assumptions do we need to make about our cost function ... in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average ... over cost functions ... for individual training examples ... The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network ..."
^ LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning". Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442. S2CID 3074096.
^ Buckland, Matt; Collins, Mark (2002). AI Techniques for Game Programming. Boston: Premier Press. ISBN 1-931841-08-X.
^ ^a ^b Rumelhart; Hinton; Williams (1986). "Learning representations by back-propagating errors" (PDF). Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0. S2CID 205001834.
^ Kelley, Henry J. (1960). "Gradient theory of optimal flight paths". ARS Journal. 30 (10): 947–954. doi:10.2514/8.5282.
^ Bryson, Arthur E. (1962). "A gradient method for optimizing multi-stage allocation processes". Proceedings of the Harvard Univ. Symposium on digital computers and their applications, 3–6 April 1961. Cambridge: Harvard University Press. OCLC 498866871.
^ Dreyfus, Stuart E. (1990). "Artificial Neural Networks, Back Propagation, and the Kelley-Bryson Gradient Procedure". Journal of Guidance, Control, and Dynamics. 13 (5): 926–928. Bibcode:1990JGCD...13..926D. doi:10.2514/3.25422.
^ Mizutani, Eiji; Dreyfus, Stuart; Nishio, Kenichi (July 2000). "On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application" (PDF). Proceedings of the IEEE International Joint Conference on Neural Networks.
^ ^a ^b ^c ^d Schmidhuber, Jürgen (2015). "Deep learning in neural networks: An overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
^ ^a ^b ^c Schmidhuber, Jürgen (2015). "Deep Learning". Scholarpedia. 10 (11): 32832. Bibcode:2015SchpJ..1032832S. doi:10.4249/scholarpedia.32832.
^ Dreyfus, Stuart (1962). "The numerical solution of variational problems". Journal of Mathematical Analysis and Applications. 5 (1): 30–45. doi:10.1016/0022-247x(62)90004-5.
^ Russell, Stuart; Norvig, Peter (1995). Artificial Intelligence : A Modern Approach. Englewood Cliffs: Prentice Hall. p. 578. ISBN 0-13-103805-2. The most popular method for learning in multilayer networks is called Back-propagation. It was first invented in 1969 by Bryson and Ho, but was more or less ignored until the mid-1980s.
^ Bryson, Arthur Earl; Ho, Yu-Chi (1969). Applied optimal control: optimization, estimation, and control. Waltham: Blaisdell. OCLC 3801.
^ ^a ^b Griewank, Andreas (2012). "Who Invented the Reverse Mode of Differentiation?". Optimization Stories. Documenta Mathematica, Extra Volume ISMP. pp. 389–400. S2CID 15568746.
^ ^a ^b Seppo Linnainmaa (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 6–7.
^ ^a ^b Linnainmaa, Seppo (1976). "Taylor expansion of the accumulated rounding error". BIT Numerical Mathematics. 16 (2): 146–160. doi:10.1007/bf01931367. S2CID 122357351.
^ ^a ^b The thesis, and some supplementary information, can be found in his book, Werbos, Paul J. (1994). The Roots of Backpropagation : From Ordered Derivatives to Neural Networks and Political Forecasting. New York: John Wiley & Sons. ISBN 0-471-59897-6.
^ Griewank, Andreas; Walther, Andrea (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition. SIAM. ISBN 978-0-89871-776-1.
^ Dreyfus, Stuart (1973). "The computational solution of optimal control problems with time lag". IEEE Transactions on Automatic Control. 18 (4): 383–385. doi:10.1109/tac.1973.1100330.
^ Werbos, Paul (1982). "Applications of advances in nonlinear sensitivity analysis" (PDF). System modeling and optimization. Springer. pp. 762–770.
^ Parker, D.B. (1985). "Learning Logic". Center for Computational Research in Economics and Management Science. Cambridge MA: Massachusetts Institute of Technology.
^ ^a ^b Hertz, John. (1991). Introduction to the theory of neural computation. Krogh, Anders., Palmer, Richard G. Redwood City, Calif.: Addison-Wesley Pub. Co. p. 8. ISBN 0-201-50395-6. OCLC 21522159.
^ Anderson, James Arthur, (1939- ...)., ed. Rosenfeld, Edward, ed. (1988). Neurocomputing Foundations of research. MIT Press. ISBN 0-262-01097-6. OCLC 489622044. {{cite book}}: last= has generic name (help)CS1 maint: multiple names: authors list (link)
^ Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986b). "8. Learning Internal Representations by Error Propagation". In Rumelhart, David E.; McClelland, James L. (eds.). Parallel Distributed Processing : Explorations in the Microstructure of Cognition. Vol. 1 : Foundations. Cambridge: MIT Press. ISBN 0-262-18120-7.
^ Alpaydin, Ethem (2010). Introduction to Machine Learning. MIT Press. ISBN 978-0-262-01243-0.
^ Wan, Eric A. (1994). "Time Series Prediction by Using a Connectionist Network with Internal Delay Lines". In Weigend, Andreas S.; Gershenfeld, Neil A. (eds.). Time Series Prediction : Forecasting the Future and Understanding the Past. Proceedings of the NATO Advanced Research Workshop on Comparative Time Series Analysis. Vol. 15. Reading: Addison-Wesley. pp. 195–217. ISBN 0-201-62601-2. S2CID 12652643.
^ Chang, Franklin; Dell, Gary S.; Bock, Kathryn (2006). "Becoming syntactic". Psychological Review. 113 (2): 234–272. doi:10.1037/0033-295x.113.2.234. PMID 16637761.
^ Janciauskas, Marius; Chang, Franklin (2018). "Input and Age-Dependent Variation in Second Language Learning: A Connectionist Account". Cognitive Science. 42: 519–554. doi:10.1111/cogs.12519. PMC 6001481. PMID 28744901.
^ Fitz, Hartmut; Chang, Franklin (2019). "Language ERPs reflect learning through prediction error propagation". Cognitive Psychology. 111: 15–52. doi:10.1016/j.cogpsych.2019.03.002. hdl:21.11116/0000-0003-474D-8. PMID 30921626. S2CID 85501792.

External links

Backpropagation neural network tutorial at the Wikiversity
Bernacki, Mariusz; Włodarczyk, Przemysław (2004). "Principles of training multi-layer neural network using backpropagation".
Karpathy, Andrej (2016). "Lecture 4: Backpropagation, Neural Networks 1". CS231n. Stanford University. Archived from the original on 2021-12-12 – via YouTube.
"What is Backpropagation Really Doing?". 3Blue1Brown. November 3, 2017. Archived from the original on 2021-12-12 – via YouTube.

[8] $C$ is used for the loss function so that $L$ can denote the number of layers.

[9] This follows Nielsen (2015), and means (left) multiplication by the matrix $W^{l}$ corresponds to converting output values of layer $l-1$ to input values of layer $l$ : columns correspond to input coordinates, rows correspond to output coordinates.

[12] This section largely follows and summarizes Nielsen (2015).

[13] The derivative of the loss function is a covector, since the loss function is a scalar-valued function of several variables.

[14] The activation function is applied to each node separately, so the derivative is just the diagonal matrix of the derivative on each node. This is often represented as the Hadamard product with the vector of derivatives, denoted by $(f^{l})'\odot$ , which is mathematically identical but better matches the internal representation of the derivatives as a vector, rather than a diagonal matrix.

[15] Since matrix multiplication is linear, the derivative of multiplying by a matrix is just the matrix: $(Wx)'=W$ .

[17] Multi-layer neural networks use non-linear activation functions, so an example with linear neurons may seem out of place. However, even though the error surface of a multi-layer network is much more complicated, locally it can be approximated by a paraboloid. Linear neurons are therefore used for simplicity and easier understanding.

[18] There can be multiple output neurons, in which case the error is the squared norm of the difference vector.

[1] Goodfellow, Bengio & Courville 2016, p. 200, "The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, ..."

[2] Goodfellow, Bengio & Courville 2016, p. 200, "Furthermore, back-propagation is often misunderstood as being specific to multi-layer neural networks, but in principle it can compute derivatives of any function"

[3] Goodfellow, Bengio & Courville 2016, p. 214, "This table-filling strategy is sometimes called dynamic programming."

[4] Goodfellow, Bengio & Courville 2016, p. 200, "The term back-propagation is often misunderstood as meaning the whole learning algorithm for multilayer neural networks. Backpropagation refers only to the method for computing the gradient, while other algorithms, such as stochastic gradient descent, is used to perform learning using this gradient."

[5] Goodfellow, Bengio & Courville 2016, pp. 217–218, "The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called reverse mode accumulation."

[6] Goodfellow, Bengio & Courville 2016, p. 221, "Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for sensitivity analysis (Linnainmaa, 1976). ... The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks."

[7] Goodfellow, Bengio & Courville 2016, 6.5 Back-Propagation and Other Differentiation Algorithms, pp. 200–220.

[10] Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. (2017-10-27). "Searching for Activation Functions". arXiv:1710.05941 [cs.NE].

[11] Misra, Diganta (2019-08-23). "Mish: A Self Regularized Non-Monotonic Activation Function". arXiv:1908.08681 [cs.LG].

[16] Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986a). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0. S2CID 205001834.

[19] Tan, Hong Hui; Lim, King Han (2019). "Review of second-order optimization techniques in artificial neural networks backpropagation". IOP Conference Series: Materials Science and Engineering. 495 (1): 012003. Bibcode:2019MS&E..495a2003T. doi:10.1088/1757-899X/495/1/012003. S2CID 208124487.

[20] Wilamowski, Bogdan; Yu, Hao (June 2010). "Improved Computation for Levenberg–Marquardt Training" (PDF). IEEE Transactions on Neural Networks and Learning Systems. 21 (6).

[21] Martens, James (August 2020). "New Insights and Perspectives on the Natural Gradient Method" (PDF). Journal of Machine Learning Research (21). arXiv:1412.1193.

[22] Nielsen (2015), "[W]hat assumptions do we need to make about our cost function ... in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average ... over cost functions ... for individual training examples ... The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network ..."

[23] LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning". Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442. S2CID 3074096.

[24] Buckland, Matt; Collins, Mark (2002). AI Techniques for Game Programming. Boston: Premier Press. ISBN 1-931841-08-X.

[25] Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986). "Learning representations by back-propagating errors" (PDF). Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0. S2CID 205001834.

[26] Kelley, Henry J. (1960). "Gradient theory of optimal flight paths". ARS Journal. 30 (10): 947–954. doi:10.2514/8.5282.

[27] Bryson, Arthur E. (1962). "A gradient method for optimizing multi-stage allocation processes". Proceedings of the Harvard Univ. Symposium on digital computers and their applications, 3–6 April 1961. Cambridge: Harvard University Press. OCLC 498866871.

[28] Dreyfus, Stuart E. (1990). "Artificial Neural Networks, Back Propagation, and the Kelley-Bryson Gradient Procedure". Journal of Guidance, Control, and Dynamics. 13 (5): 926–928. Bibcode:1990JGCD...13..926D. doi:10.2514/3.25422.

[29] Mizutani, Eiji; Dreyfus, Stuart; Nishio, Kenichi (July 2000). "On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application" (PDF). Proceedings of the IEEE International Joint Conference on Neural Networks.

[30] Schmidhuber, Jürgen (2015). "Deep learning in neural networks: An overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.

[31] Schmidhuber, Jürgen (2015). "Deep Learning". Scholarpedia. 10 (11): 32832. Bibcode:2015SchpJ..1032832S. doi:10.4249/scholarpedia.32832.

[32] Dreyfus, Stuart (1962). "The numerical solution of variational problems". Journal of Mathematical Analysis and Applications. 5 (1): 30–45. doi:10.1016/0022-247x(62)90004-5.

[33] Russell, Stuart; Norvig, Peter (1995). Artificial Intelligence : A Modern Approach. Englewood Cliffs: Prentice Hall. p. 578. ISBN 0-13-103805-2. The most popular method for learning in multilayer networks is called Back-propagation. It was first invented in 1969 by Bryson and Ho, but was more or less ignored until the mid-1980s.

[34] Bryson, Arthur Earl; Ho, Yu-Chi (1969). Applied optimal control: optimization, estimation, and control. Waltham: Blaisdell. OCLC 3801.

[35] Griewank, Andreas (2012). "Who Invented the Reverse Mode of Differentiation?". Optimization Stories. Documenta Matematica, Extra Volume ISMP. pp. 389–400. S2CID 15568746.

[36] Linnainmaa, Seppo (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 6–7.

[37] Linnainmaa, Seppo (1976). "Taylor expansion of the accumulated rounding error". BIT Numerical Mathematics. 16 (2): 146–160. doi:10.1007/bf01931367. S2CID 122357351.

[38] The thesis, and some supplementary information, can be found in his book, Werbos, Paul J. (1994). The Roots of Backpropagation : From Ordered Derivatives to Neural Networks and Political Forecasting. New York: John Wiley & Sons. ISBN 0-471-59897-6.

[39] Griewank, Andreas; Walther, Andrea (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition. SIAM. ISBN 978-0-89871-776-1.

[40] Dreyfus, Stuart (1973). "The computational solution of optimal control problems with time lag". IEEE Transactions on Automatic Control. 18 (4): 383–385. doi:10.1109/tac.1973.1100330.

[41] Werbos, Paul (1982). "Applications of advances in nonlinear sensitivity analysis" (PDF). System modeling and optimization. Springer. pp. 762–770.

[42] Parker, D.B. (1985). "Learning Logic". Center for Computational Research in Economics and Management Science. Cambridge MA: Massachusetts Institute of Technology.

[43] Hertz, John; Krogh, Anders; Palmer, Richard G. (1991). Introduction to the theory of neural computation. Redwood City, Calif.: Addison-Wesley Pub. Co. p. 8. ISBN 0-201-50395-6. OCLC 21522159.

[44] Anderson, James A.; Rosenfeld, Edward, eds. (1988). Neurocomputing: Foundations of Research. MIT Press. ISBN 0-262-01097-6. OCLC 489622044.

[45] Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986b). "8. Learning Internal Representations by Error Propagation". In Rumelhart, David E.; McClelland, James L. (eds.). Parallel Distributed Processing : Explorations in the Microstructure of Cognition. Vol. 1 : Foundations. Cambridge: MIT Press. ISBN 0-262-18120-7.

[46] Alpaydin, Ethem (2010). Introduction to Machine Learning. MIT Press. ISBN 978-0-262-01243-0.

[47] Wan, Eric A. (1994). "Time Series Prediction by Using a Connectionist Network with Internal Delay Lines". In Weigend, Andreas S.; Gershenfeld, Neil A. (eds.). Time Series Prediction : Forecasting the Future and Understanding the Past. Proceedings of the NATO Advanced Research Workshop on Comparative Time Series Analysis. Vol. 15. Reading: Addison-Wesley. pp. 195–217. ISBN 0-201-62601-2. S2CID 12652643.

[48] Chang, Franklin; Dell, Gary S.; Bock, Kathryn (2006). "Becoming syntactic". Psychological Review. 113 (2): 234–272. doi:10.1037/0033-295x.113.2.234. PMID 16637761.

[49] Janciauskas, Marius; Chang, Franklin (2018). "Input and Age-Dependent Variation in Second Language Learning: A Connectionist Account". Cognitive Science. 42: 519–554. doi:10.1111/cogs.12519. PMC 6001481. PMID 28744901.

[50] Fitz, Hartmut; Chang, Franklin (2019). "Language ERPs reflect learning through prediction error propagation". Cognitive Psychology. 111: 15–52. doi:10.1016/j.cogpsych.2019.03.002. hdl:21.11116/0000-0003-474D-8. PMID 30921626. S2CID 85501792.
