\section{Motivation and Scope}

Many optimization problems in machine learning, signal processing, and computational statistics are \emph{composite} in nature:
\begin{equation}\label{eq:composite_intro}
 \min_{x\in\mathbb{R}^n} \; F(x) \;=\; f(x) + h(x)\,,
\end{equation}
where $f$ is smooth (differentiable with $L$-Lipschitz gradient) and $h$ is proper, closed, and often non-smooth but \emph{proximable}. Classic examples include $\ell_1$ penalties for sparsity, constraints encoded as indicator functions, and robust losses coupled with structured regularization. For such problems, first-order methods dominate practice: they scale, exploit problem structure, and admit sharp iteration-complexity guarantees. Two algorithmic ideas underpin state-of-the-art performance on \eqref{eq:composite_intro}: (i) \emph{proximal splitting}, which decouples $f$ and $h$ by a gradient step on $f$ followed by a proximity step on $h$ (ISTA/proximal gradient); and (ii) \emph{acceleration}, which injects extrapolation (momentum) to achieve the optimal $\mathcal{O}(1/k^2)$ rate in the smooth convex regime (Nesterov acceleration; FISTA in the composite case).

In modern applications, the regularizer $h$ is sometimes only \emph{weakly convex} (e.g., SCAD), which invalidates the standard convex proximal framework unless one \emph{convexifies} the proximal part $h$ of the composite model. Moreover, empirical performance often hinges on how well the discrete method mirrors an underlying \emph{continuous-time} energy decay. This thesis develops a principled and unified treatment of these themes: we give complete, self-contained Lyapunov proofs for GD, NAG, ISTA, and FISTA; we derive and analyze a new accelerated scheme, \emph{SQ2FISTA} (proposed by PhD student Mr.\ Ushiyama), from a carefully designed time-varying inertial ODE via a \emph{weak discrete gradient} (wDG) discretization, and prove its $\mathcal{O}(1/k^2)$ and linear rates; we extend FISTA to weakly convex proximals through a \emph{convexified} variant, denoted \emph{FISTA($\delta$)}, leveraging a prox-shift identity that restores convexity and stability; and we provide PyTorch-style code listings that exactly implement the analyzed algorithms, with \emph{safety clamps} for SCAD. A central empirical finding is that \emph{SQ2FISTA} strictly improves over \emph{standard} FISTA and closely matches (often ties) the performance of \emph{FISTA($\delta$)} on realistic synthetic benchmarks, validating the design principle: when $h$ is weakly convex, either adopt an ODE-inspired accelerated discretization (SQ2FISTA) or convexify the proximal part (FISTA($\delta$)); both are principled and high-performing, while \emph{plain} FISTA is typically suboptimal.

\section{Problem Class, Assumptions, and Notation}

We consider \eqref{eq:composite_intro} with the following standing assumptions unless stated otherwise. The smooth part $f:\mathbb{R}^n\to\mathbb{R}$ is differentiable and $L$-smooth: $\|\nabla f(x)-\nabla f(y)\|\le L\|x-y\|$ for all $x,y$. In the strongly convex regime, $f$ is $\mu_m$-strongly convex with $\mu_m>0$ (the \emph{model} curvature, referred to later as the Model-$\mu$). The regularizer $h:\mathbb{R}^n\to(-\infty,+\infty]$ is proper, closed, and \emph{proximable}, with proximity operator
\[
\operatorname{prox}_{\lambda h}(v) \;=\; \arg\min_x \Big\{\, h(x) + \tfrac{1}{2\lambda}\|x-v\|^2 \Big\}\,.
\]
When $h$ is \emph{weakly convex} (e.g.\ SCAD), there exists $\rho\ge 0$ such that $h(x)+\tfrac{\rho}{2}\|x\|^2$ is convex; equivalently, $h$ has curvature $\mu_p=-\rho\le 0$ (the \emph{proximal} curvature, referred to later as the Proximal-$\mu$).
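As a concrete instance of a proximable regularizer mentioned above, the $\ell_1$ norm $h(x)=\|x\|_1$ admits the closed-form soft-thresholding prox $\operatorname{prox}_{\lambda h}(v)_i=\operatorname{sign}(v_i)\max(|v_i|-\lambda,0)$. The following minimal NumPy sketch evaluates it componentwise; it is illustrative only and not one of the thesis listings, and the function name \texttt{prox\_l1} is ours.
\begin{verbatim}
import numpy as np

def prox_l1(v, lam):
    """Soft-thresholding: prox of lam * ||.||_1 evaluated at v."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Example: entrywise shrinkage with threshold 0.5.
v = np.array([1.2, -0.3, 0.7])
print(prox_l1(v, 0.5))  # approximately [0.7, 0.0, 0.2]
\end{verbatim}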
The \emph{total} strong convexity we use throughout is
\[
\mu_{\text{tot}} \;=\; \mu_m + \mu_p\,.
\]
We denote by $x^\star$ a global minimizer of $F$, and write $F^\star=F(x^\star)$. A standard residual is the \emph{prox-gradient mapping} at stepsize $\eta>0$:
\begin{equation}\label{eq:PG-map}
 G_\eta(x)\;:=\;\frac{1}{\eta}\Big(x - \operatorname{prox}_{\eta h}\big(x - \eta\,\nabla f(x)\big)\Big),
 \qquad \|G_\eta(x)\|=0 \;\Leftrightarrow\; 0\in \nabla f(x)+\partial h(x)\,.
\end{equation}

\paragraph{Convexification by prox shift.}
When $h$ is weakly convex with $\mu_p<0$, we use the identity
\begin{equation}\label{eq:prox-shift}
 \underbrace{\operatorname{prox}_{\eta\,(h+\tfrac{\delta}{2}\|\cdot\|^2)}}_{\text{convex for }\delta\ge -\mu_p}(v)\; =\; \operatorname{prox}_{\tfrac{\eta}{1+\eta\delta}\,h}\!\Big(\tfrac{1}{1+\eta\delta}\,v\Big)\,,
\end{equation}
and set $\delta=-\mu_p$ so that $h+\tfrac{\delta}{2}\|\cdot\|^2$ is convex. Algebraically, this transfers the negative curvature of $h$ into the smooth part: the smooth term becomes $f-\tfrac{\delta}{2}\|\cdot\|^2$, replacing $(L,\mu_m)$ by $(L+\delta,\ \mu_m-\delta)$, while the proximal term $h+\tfrac{\delta}{2}\|\cdot\|^2$ becomes convex. We exploit \eqref{eq:prox-shift} to define \emph{FISTA($\delta$)}.

\section{Algorithms at a Glance}

We study six methods, in the following order and with notation consistent with the rest of the thesis. Gradient Descent (GD) uses the update $x_{k+1}=x_k-\tfrac{1}{L}\nabla f(x_k)$. Nesterov’s Accelerated Gradient (NAG) evaluates the gradient at an extrapolated point, achieving $\mathcal{O}(1/k^2)$ in convex problems and a linear rate with factor $(1-\sqrt{\mu_{\text{m}}/L})^k$ in strongly convex settings. ISTA (proximal gradient) attains $\mathcal{O}(1/k)$ in the convex case (and linear convergence $(1-c\,\mu_{\text{m}}/L)^k$ with suitable tuning under strong convexity). FISTA (accelerated ISTA) achieves $\mathcal{O}(1/k^2)$ in the convex case and $(1-\sqrt{\mu_{\text{m}}/L})^k$ when strongly convex with tuned momentum. SQ2FISTA is our ODE-informed accelerated proximal method designed via wDG, attaining $\mathcal{O}(1/k^2)$ in convex problems and a linear factor of the form $(1-\sqrt{2\mu_{\text{tot}}/L_{\text{eff}}})^k$ in the strongly convex composite case, where $L_{\text{eff}}$ is the effective smoothness constant arising in the SQ2FISTA analysis. Finally, as the convexified counterpart against which SQ2FISTA is compared, FISTA($\delta$) applies \eqref{eq:prox-shift}; its linear factor is $(1-\sqrt{\mu_{\text{tot}}/(L+\delta)})^k$ when $\mu_{\text{tot}}>0$.

\begin{table}[t]
\centering
\caption{Canonical worst-case iteration complexity (deterministic, exact gradients). Strongly convex entries give the per-iteration contraction factor $\rho\in(0,1)$.}
\label{tab:rates}
\begin{tabular}{lcc}
\hline
Method & Convex case & Strongly convex case ($\mu_{\text{tot}}>0$) \\
\hline
GD & $\mathcal{O}(1/k)$ & $(1-\mu_{\text{m}}/L)^k$ \\
ISTA & $\mathcal{O}(1/k)$ & $(1-c\,\mu_{\text{m}}/L)^k$ \\
NAG & $\mathcal{O}(1/k^2)$ & $(1-\sqrt{\mu_{\text{m}}/L})^k$ \\
FISTA & $\mathcal{O}(1/k^2)$ & $(1-\sqrt{\mu_{\text{m}}/L})^k$ \\
FISTA($\delta$) & $\mathcal{O}(1/k^2)$ & $(1-\sqrt{\mu_{\text{tot}}/(L+\delta)})^k$ \\
SQ2FISTA & $\mathcal{O}(1/k^2)$ & $(1-\sqrt{2\mu_{\text{tot}}/L_{\text{eff}}})^k$ \\
\hline
\end{tabular}
\end{table}

\section{Contributions}

\paragraph{C1. Unified Lyapunov proofs for GD, NAG, ISTA, and FISTA.}
We present complete, concise proofs of the classical rates using a common Lyapunov perspective. For GD, a descent lemma combined with a strong-convexity sandwich yields $\mathcal{O}(1/k)$ (convex) and $(1-\mu_{\text{m}}/L)^k$ (strongly convex); a minimal code sketch of this update is given below.
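The GD iteration itself is a single line; the following sketch (illustrative only, with a caller-supplied gradient oracle \texttt{grad\_f}; it is not one of the thesis listings) runs it with the theoretical step size $1/L$.
\begin{verbatim}
import numpy as np

def gradient_descent(grad_f, x0, L, num_iters):
    """Constant-step GD: x_{k+1} = x_k - (1/L) * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - (1.0 / L) * grad_f(x)
    return x

# Example on f(x) = 0.5*(x1^2 + 4*x2^2), for which L = 4 and the minimizer is 0.
grad = lambda x: np.array([x[0], 4.0 * x[1]])
x_out = gradient_descent(grad, np.array([2.0, -1.0]), L=4.0, num_iters=100)
\end{verbatim}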
For NAG, a quadratic-in-$k$ potential establishes $\mathcal{O}(1/k^2)$ (convex) and the factor $1-\sqrt{\mu_{\text{m}}/L}$ (strongly convex). For ISTA/FISTA, three-point inequalities and an estimate-sequence potential yield $\mathcal{O}(1/k)$ and $\mathcal{O}(1/k^2)$, respectively; the strongly convex FISTA variant attains the accelerated linear factor as above.

\paragraph{C2. SQ2FISTA via an inertial ODE and weak discrete gradients.}
We design a time-varying inertial flow with \emph{hyperbolic} damping and construct a discrete scheme using a weak discrete gradient (wDG) identity so that a discrete energy $E_k$ provably decreases. Solving a scalar recurrence for the weight sequence $A_k$ yields an optimal schedule, leading to $F(x_k)-F^\star=\mathcal{O}(1/k^2)$ in the convex case and $F(x_k)-F^\star\le C\,\rho^k$ in the strongly convex case with $\rho \approx 1-\sqrt{2\mu_{\text{tot}}/L_{\text{eff}}}$. The proof mirrors the continuous Lyapunov decay and is fully discrete, requiring only smoothness, (total) strong convexity when present, and a proximal step.

\paragraph{C3. Convexified FISTA for weakly convex proximals.}
We introduce FISTA($\delta$) by convexifying the proximal part via \eqref{eq:prox-shift} with $\delta=-\mu_p$. This preserves the standard FISTA structure but replaces gradients of $f$ by those of $f_\delta:=f-\tfrac{\delta}{2}\|\cdot\|^2$ (hence $L\mapsto L+\delta$, $\mu_m\mapsto\mu_m-\delta$). We prove the same $\mathcal{O}(1/k^2)$ rate in the convex case and a linear rate governed by $\mu_{\text{tot}}$ in the strongly convex case. For SCAD we implement \emph{safety clamps} to keep the closed-form prox well-defined.

\paragraph{C4. Pseudocode.}
We provide pseudocode for all optimizers (GD, NAG, ISTA/FISTA, FISTA($\delta$), SQ2FISTA) and proximal operators (including SCAD with safety clamps), as well as three model families: \emph{Smoothed Hinge SVM}, \emph{Saturated Nonlinear Regression (tanh link)}, and \emph{Ill-conditioned quadratic}. The listings follow a uniform interface, expose the exact $L$ bounds used by the theory, and record both objective gaps and prox-gradient norms.

\section{Expectations}

\paragraph{Validity.}
All analyses are deterministic and assume exact (or high-precision) gradients, exact prox steps (with safe closed forms for SCAD), and step sizes based on global $L$ bounds (or their safe overestimates). The linear-rate guarantees use the \emph{total} strong convexity $\mu_{\text{tot}}$ when present.

\paragraph{Limitations.}
We do not analyze stochastic gradients, line searches, or adaptive step sizes. With weakly convex $h$, the \emph{original} nonconvex composite need not be globally convex; our guarantees either (i) target the convexified problem (FISTA($\delta$)) or (ii) control a discrete Lyapunov function for SQ2FISTA that ensures objective decrease and stationarity, but not necessarily avoidance of all nonconvex stationary points beyond our constructed setting. SCAD’s region-2 proximal formula requires a positive denominator; we enforce this via \emph{safety clamps}, which are benign in practice but slightly perturbative.

\paragraph{Takeaway.}
In regimes where $h$ is weakly convex (a common practical case), \emph{standard} FISTA is not the right baseline. Either convexify the proximal part (FISTA($\delta$)) or discretize a well-chosen inertial flow (SQ2FISTA); via \eqref{eq:prox-shift}, the convexification amounts to rescaling the prox call, as sketched below.
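The following minimal sketch implements the shifted prox of \eqref{eq:prox-shift}; it is illustrative only (not the thesis implementation), the helper name \texttt{prox\_shifted} is ours, and an exact prox oracle for $h$ is assumed.
\begin{verbatim}
import numpy as np

def prox_shifted(prox_h, v, eta, delta):
    """Prox of h + (delta/2)||.||^2 with parameter eta, via the prox of h.

    prox_h(w, t) must return prox_{t h}(w); see the prox-shift identity.
    """
    scale = 1.0 + eta * delta
    return prox_h(v / scale, eta / scale)

# Example with h = ||.||_1 (soft-thresholding prox), eta = 0.1, delta = 2.0.
soft = lambda w, t: np.sign(w) * np.maximum(np.abs(w) - t, 0.0)
y = prox_shifted(soft, np.array([1.0, -0.2]), eta=0.1, delta=2.0)
\end{verbatim}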
Both approaches are principled; empirically they are neck and neck, with SQ2FISTA offering a physics-consistent derivation and FISTA($\delta$) offering a minimal patch on a standard workhorse.

\section{Reading Guide and Chapter Outline}

Chapter~2 revisits Gradient Descent, its Lyapunov analysis, and continuous gradient flow, providing full proofs in both convex and strongly convex regimes. Chapter~3 develops Nesterov’s Accelerated Gradient (NAG), including discrete proofs of $\mathcal{O}(1/k^2)$ and linear rates, alongside their ODE counterparts. Chapter~4 introduces proximal operators, ISTA, FISTA, and a strongly convex accelerated variant; the proofs rely on three-point inequalities and estimate sequences. Chapter~5 bridges continuous-time inertial dynamics and discrete methods via \emph{weak discrete gradients} (wDG), yielding a unifying template for Lyapunov-based design, and derives \emph{SQ2FISTA} from a hyperbolically damped inertial ODE; we prove energy decrease, derive the optimal weight recurrence, and establish convergence rates. Chapter~6 reports an empirical study on three model families (\emph{Smoothed Hinge SVM}, \emph{Saturated Nonlinear Regression (tanh link)}, and \emph{Ill-conditioned quadratic}) with SCAD, comparing \emph{plain} FISTA, \emph{FISTA($\delta$)}, and \emph{SQ2FISTA}. Chapter~7 consolidates conclusions, offers nuanced comparisons among \emph{standard} FISTA, \emph{convexified} FISTA($\delta$), and SQ2FISTA, and discusses limitations.

\begin{table}[t]
\centering
\caption{Main symbols and conventions.}
\label{tab:notation}
\begin{tabular}{ll}
\hline
Symbol & Meaning \\
\hline
$x^\star$ & a global minimizer of $F=f+h$; $F^\star=F(x^\star)$ \\
$L$ & Lipschitz smoothness constant of $f$ \\
$\mu_m$ & strong convexity constant of $f$ (the model part) \\
$\mu_p$ & curvature of $h$ (weak convexity: $\mu_p<0$; convex: $\mu_p\ge 0$) \\
$\mu_{\text{tot}}$ & total strong convexity: $\mu_{\text{tot}}=\mu_m+\mu_p$ \\
$\operatorname{prox}_{\lambda h}$ & proximity operator of $h$ with parameter $\lambda>0$ \\
$G_\eta(x)$ & prox-gradient mapping at stepsize $\eta$; see \eqref{eq:PG-map} \\
FISTA($\delta$) & FISTA with prox shift $\delta=-\mu_p$; cf.\ \eqref{eq:prox-shift} \\
SQ2FISTA & ODE-informed accelerated proximal method designed via wDG \\
$L_{\text{eff}}$ & effective Lipschitz constant (in the SQ2FISTA analysis) \\
\hline
\end{tabular}
\end{table}

\newpage
\section{Summary}

\bigskip\noindent
In summary, this thesis contributes a unified analysis and implementation playbook for accelerated composite optimization in the presence of weakly convex proximals. The new method \emph{SQ2FISTA}, developed from first principles by PhD student Mr.\ Ushiyama of the University of Tokyo, and the pragmatic \emph{FISTA($\delta$)} baseline together form a robust toolkit that is theoretically sound and empirically validated; \emph{plain} FISTA remains a useful point of reference but is often dominated in the regimes that motivate contemporary applications.