π_θ(a|s) = 𝒩( μ_θ(s), Σ_θ(s) )
The actor network outputs a Gaussian distribution over the continuous action space; sampling actions from this distribution provides exploration.
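A minimal sketch of such a Gaussian actor, assuming a PyTorch implementation with a diagonal, state-dependent covariance; the `GaussianActor` name and layer sizes are illustrative, not from the source:

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """pi_theta(a|s) = N(mu_theta(s), Sigma_theta(s)) with diagonal covariance."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, action_dim)       # mean mu_theta(s)
        self.log_std_head = nn.Linear(hidden, action_dim)  # log of diagonal std, Sigma_theta(s)

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        h = self.trunk(state)
        mu = self.mu_head(h)
        std = self.log_std_head(h).clamp(-5.0, 2.0).exp()  # bound std for numerical stability
        return torch.distributions.Normal(mu, std)
```

Sampling from the returned distribution yields both actions and their log-probabilities, e.g. `dist = actor(s); a = dist.sample(); logp = dist.log_prob(a).sum(-1)`.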
V_φ(s)
The critic network estimates the expected discounted return from a state, i.e. the agent's expected future stabilization capability.
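A matching critic sketch under the same assumptions (PyTorch; the `Critic` name and sizes are illustrative):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """V_phi(s): scalar estimate of the expected discounted return from state s."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)  # drop the trailing singleton dim
```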
J(θ) = E[ Σ_t γᵗ R_t ]
L^CLIP(θ) = E[ min( r_t(θ) A_t , clip( r_t(θ), 1−ε, 1+ε ) A_t ) ], where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio and A_t the advantage estimate.
Proximal Policy Optimization maximizes the long-term discounted reward J(θ) through the surrogate L^CLIP(θ), whose clipped ratio confines each update to an approximate trust region, preventing catastrophically large policy updates during volatile aerodynamic moments.
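A sketch of the clipped surrogate as a training loss, again assuming PyTorch; the `ppo_clip_loss` helper is illustrative, and ε = 0.2 is a common default from the PPO paper rather than a value given in this document:

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Negated L^CLIP, so minimizing this loss maximizes the clipped surrogate."""
    ratio = (logp_new - logp_old).exp()                            # r_t(theta)
    unclipped = ratio * advantages                                 # r_t(theta) * A_t
    clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Taking the elementwise minimum of the unclipped and clipped terms is what makes the objective pessimistic: the ratio is only allowed to move the objective within the [1−ε, 1+ε] band, which is how the update stays inside an approximate trust region.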