While we're familiar with backward mode autodiff thanks to the almighty backprop, it's not the only way to automatically differentiate.
Before getting to autodiff, we should also consider the classical differentiation procedures.
Symbolic Differentiation
It is too computationally intensive to obtain a simplified closed-form expression; the derivative expressions quickly blow up in size.
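As a rough illustration (assuming sympy is installed; the nested expression below is arbitrary and chosen only to show the blow-up), the symbolic derivative can be far larger than the expression it came from:

```python
# Sketch of "expression swell" in symbolic differentiation (requires sympy).
import sympy as sp

x = sp.symbols('x')

# Arbitrary nested expression, used only to illustrate the growth.
f = x
for _ in range(5):
    f = sp.sin(f) * sp.cos(f)

df = sp.diff(f, x)  # closed-form symbolic derivative

# The derivative contains far more operations than the original expression.
print(sp.count_ops(f), sp.count_ops(df))
```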
Finite Difference Method
The familiar definition $\frac{d}{dx}f(x) = \lim_{h \to 0}\frac{f(x + h) - f(x)}{h}$ suffers from numerical instability when approximated with a small but finite $h$: the subtraction $f(x + h) - f(x)$ loses precision, and dividing by a tiny $h$ amplifies the round-off error.
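A quick numerical sketch (the test function $e^x$ and the step sizes are made up for illustration) shows the instability: the forward-difference error shrinks as $h$ decreases, then grows again once round-off dominates.

```python
# Sketch: forward-difference error for f(x) = exp(x) at x = 1 (true derivative is exp(1)).
import math

f = math.exp
x, true_grad = 1.0, math.exp(1.0)

for h in (1e-2, 1e-6, 1e-10, 1e-14):
    approx = (f(x + h) - f(x)) / h
    print(f"h = {h:.0e}, error = {abs(approx - true_grad):.2e}")
```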
Both are slow when computing partial derivatives with respect to many inputs. A more efficient and stable approach is to multiply a chain of Jacobians obtained via the chain rule.
If $x$ and $y$ are vectors, and -
$y = h(g(f(x)))$
Then, by the chain rule,
\[\frac{\partial y}{\partial x} = \frac{\partial h(w_2)}{\partial w_2} \frac{\partial g(w_1)}{\partial w_1} \frac{\partial f(w_0)}{\partial x}\]
Where $w_0 = x, w_1 = f(w_0), w_2 = g(w_1)$.
Each term in this chain rule equation is a Jacobian for multivariate inputs. Now, while backprop evaluates this chain starting from the output side, like so -
\[\frac{\partial y}{\partial x} = \left( \frac{\partial h(w_2)}{\partial w_2} \frac{\partial g(w_1)}{\partial w_1} \right) \frac{\partial f(w_0)}{\partial x}\]
Forward accumulation would instead evaluate it from the input side -
\[\frac{\partial y}{\partial x} = \frac{\partial h(w_2)}{\partial w_2} \left( \frac{\partial g(w_1)}{\partial w_1} \frac{\partial f(w_0)}{\partial x} \right)\]
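Both groupings produce the same matrix; only the order of multiplication differs. A small sketch with random Jacobians (numpy assumed; the shapes are made up) makes the equivalence concrete:

```python
# Sketch: the two groupings of the Jacobian chain give the same result.
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 1000, 50, 1              # x in R^N, intermediates in R^K, y in R^M (made-up sizes)

Jf = rng.standard_normal((K, N))   # Jacobian of f(w0) w.r.t. x
Jg = rng.standard_normal((K, K))   # Jacobian of g(w1) w.r.t. w1
Jh = rng.standard_normal((M, K))   # Jacobian of h(w2) w.r.t. w2

reverse = (Jh @ Jg) @ Jf   # backprop-style: start from the output side
forward = Jh @ (Jg @ Jf)   # forward accumulation: start from the input side

print(np.allclose(reverse, forward))   # True; only the cost differs
```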
Now for complexity analysis, assume $x$ is a vector of size $N$ and $y$ is a vector of size $M$.
For a function $f: R^N \rightarrow R^M$, if $N > M$, backward accumulation is more efficient; in the other case, where $M > N$, forward accumulation is better. Neural networks (many inputs and parameters, a scalar loss) belong to the first class of functions, and hence backward propagation is preferred for efficiency.
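To make the comparison concrete (a sketch that additionally assumes both intermediates $w_1$ and $w_2$ have size $K$, which the text above does not fix), count the scalar multiplications in each grouping:

\[\underbrace{\left( \frac{\partial h(w_2)}{\partial w_2} \frac{\partial g(w_1)}{\partial w_1} \right) \frac{\partial f(w_0)}{\partial x}}_{MK^2 \,+\, MKN} \qquad \underbrace{\frac{\partial h(w_2)}{\partial w_2} \left( \frac{\partial g(w_1)}{\partial w_1} \frac{\partial f(w_0)}{\partial x} \right)}_{K^2N \,+\, MKN}\]

The $MKN$ term is shared, so the comparison reduces to $MK^2$ versus $K^2N$: reverse accumulation wins when $M < N$, forward accumulation when $N < M$.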
However, neither backward nor forward accumulation is optimal in general. Finding the optimal order in which to multiply these Jacobians is NP-complete.
Dual Numbers
Dual numbers are a class of numbers analogous to the complex numbers (where $i^2 = -1$). In the dual number system, we assume the existence of an $\epsilon \neq 0$ with $\epsilon^2 = 0$; all higher powers of $\epsilon$ are then also $0$.
A dual number is made up of a real and a dual part, and is represented as $a + b\epsilon$.
A neat property of dual numbers, arising from the Taylor series expansion, is that they let us calculate derivatives through simple function application.
For example, to calculate the value and derivative of some function $f$ at $a$, we evaluate it at $a + \epsilon$. This gives us -
$f(a + \epsilon) = f(a) + f'(a)\epsilon$
(as the higher powers of $\epsilon$ in the Taylor expansion vanish)
This makes forward mode autodifferentiation simple to compute and keeps its overhead low. Real numbers in the calculation are lifted to dual numbers with dual part $0$.
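A minimal sketch of this idea in Python (a hypothetical Dual class, not a library API): overload arithmetic so that evaluating a function on $a + 1\cdot\epsilon$ returns both the value and the derivative.

```python
# Minimal forward-mode autodiff sketch with dual numbers (illustrative only).
import math

class Dual:
    def __init__(self, real, dual=0.0):
        self.real, self.dual = real, dual   # represents real + dual * eps

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)  # lift reals
        return Dual(self.real + other.real, self.dual + other.dual)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps^2 = 0
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)

    __rmul__ = __mul__

def sin(x):
    # sin lifted to dual numbers: f(a + b*eps) = f(a) + f'(a)*b*eps
    return Dual(math.sin(x.real), math.cos(x.real) * x.dual)

def f(x):
    return x * sin(x) + 3 * x        # example function: x*sin(x) + 3x

y = f(Dual(2.0, 1.0))                # seed the dual part with 1
print(y.real, y.dual)                # f(2) and f'(2) = sin(2) + 2*cos(2) + 3
```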
While standard dual numbers are useful for first-order derivatives, we can extend them. Instead of requiring $\epsilon^2 = 0$, we can define $\epsilon^n = 0$ (with all lower powers nonzero), which lets us calculate derivatives up to order $n - 1$.
For example, setting $n = 3$, a dual number takes the form $\left(a + b\epsilon + c\epsilon^2\right)$. Evaluating a function at $a + \epsilon$ will give us -
$f(a + \epsilon) = f(a) + f'(a)\epsilon + \frac{f''(a)}{2!}\epsilon^2$
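Continuing the sketch, a second-order dual can be stored as a coefficient triple $(a, b, c)$ standing for $a + b\epsilon + c\epsilon^2$; multiplication simply drops every term of degree three or higher. The helper below is illustrative, not a standard API.

```python
# Sketch: truncated multiplication for a + b*eps + c*eps^2 with eps^3 = 0.
def mul(p, q):
    a1, b1, c1 = p
    a2, b2, c2 = q
    return (a1 * a2,
            a1 * b2 + b1 * a2,
            a1 * c2 + b1 * b2 + c1 * a2)   # eps^3 and higher vanish

# Evaluate f(x) = x^3 at a = 2 by cubing the seed (2 + eps).
x = (2.0, 1.0, 0.0)
y = mul(mul(x, x), x)

value, first, half_second = y
print(value, first, 2 * half_second)   # f(2) = 8, f'(2) = 12, f''(2) = 12
```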
Dual numbers can also be represented by matrices -
\[a + b\epsilon = \begin{pmatrix} a & b \newline 0 & a \end{pmatrix}\]
This representation satisfies all the dual number properties.
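A quick numerical check of this claim (numpy assumed): the matrix standing in for $\epsilon$ squares to zero, and matrix products reproduce dual-number multiplication.

```python
# Sketch: 2x2 matrix representation of dual numbers.
import numpy as np

def dual(a, b):
    return np.array([[a, b],
                     [0, a]], dtype=float)

eps = dual(0, 1)
print(np.allclose(eps @ eps, 0))                  # eps^2 = 0

# (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps
print(np.allclose(dual(2, 3) @ dual(5, 7), dual(10, 2 * 7 + 3 * 5)))
```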
Even the extensions of dual numbers for higher-order derivatives can be represented by a similar matrix (upper-triangular and diagonal-constant, i.e. upper-triangular Toeplitz).
\[a + b\epsilon + c\epsilon^2 = \begin{pmatrix} a & b & c \newline 0 & a & b \newline 0 & 0 & a \end{pmatrix} \quad \text{for } \epsilon^3 = 0\]
Through this method, the complexity of computing derivatives is quadratic in the order of the highest derivative required.
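As a final sketch (numpy again; the polynomial $x^3$ is chosen arbitrarily), evaluating a function at the $3 \times 3$ matrix for $a + \epsilon$ places $f(a)$, $f'(a)$, and $f''(a)/2!$ along the first row.

```python
# Sketch: higher-order derivatives from the 3x3 matrix form of a + eps (eps^3 = 0).
import numpy as np

a = 2.0
A = np.array([[a, 1.0, 0.0],
              [0.0, a, 1.0],
              [0.0, 0.0, a]])      # matrix standing in for a + eps

F = A @ A @ A                      # evaluate f(x) = x^3 at a + eps
print(F[0])                        # first row: [f(a), f'(a), f''(a)/2!] = [8, 12, 6]
```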