Probability Inequalities
Theorem 1: Gaussian Tail Inequality
Let \(X\sim N\left( 0,1 \right)\). Then \(P\left( \left| X \right|>\varepsilon \right)\le \frac{2{{e}^{-{{\varepsilon }^{2}}/2}}}{\varepsilon }\); and if \({{X}_{1}},\ldots ,{{X}_{n}}\sim N\left( 0,1 \right)\) are i.i.d. with sample mean \({{\bar{X}}_{n}}\), then \(P\left( \left| {{{\bar{X}}}_{n}} \right|>\varepsilon \right)\le \frac{2}{\sqrt{n}\varepsilon }{{e}^{-n{{\varepsilon }^{2}}/2}}\overset{\text{large }n}{\mathop{\le }}\,{{e}^{-n{{\varepsilon }^{2}}/2}}\).
Proof of Gaussian Tail Inequality
Consider \(X\sim N\left( 0,1 \right)\) with probability density function \(\phi \left( x \right)=\frac{1}{\sqrt{2\pi }}{{e}^{-\frac{{{x}^{2}}}{2}}}\). Taking the derivative w.r.t. \(x\) we get: \[\frac{d\phi \left( x \right)}{dx}={\phi }'\left( x \right)=\frac{1}{\sqrt{2\pi }}\frac{d\left( {{e}^{-\frac{{{x}^{2}}}{2}}} \right)}{d\left( -\frac{{{x}^{2}}}{2} \right)}\frac{d\left( -\frac{{{x}^{2}}}{2} \right)}{dx}=\frac{1}{\sqrt{2\pi }}{{e}^{-\frac{{{x}^{2}}}{2}}}\left( -x \right)=-x\phi \left( x \right)\] so that \(s\phi \left( s \right)=-{\phi }'\left( s \right)\). For the upper tail, since \(s/\varepsilon \ge 1\) on the region of integration, \[P\left( X>\varepsilon \right)=\int_{\varepsilon }^{\infty }{\phi \left( s \right)ds}\le {{\varepsilon }^{-1}}\int_{\varepsilon }^{\infty }{s\phi \left( s \right)ds}=-{{\varepsilon }^{-1}}\int_{\varepsilon }^{\infty }{{\phi }'\left( s \right)ds}=-{{\varepsilon }^{-1}}\left. \phi \left( s \right) \right|_{\varepsilon }^{\infty }=-{{\varepsilon }^{-1}}\left[ \phi \left( \infty  \right)-\phi \left( \varepsilon  \right) \right]\] and since \(\phi \left( \infty  \right)=0\), \[P\left( X>\varepsilon  \right)\le \frac{\phi \left( \varepsilon  \right)}{\varepsilon }=\frac{1}{\varepsilon \sqrt{2\pi }}{{e}^{-\frac{{{\varepsilon }^{2}}}{2}}}\le \frac{{{e}^{-{{\varepsilon }^{2}}/2}}}{\varepsilon }\] Now, by the symmetry of the distribution, \[P\left( \left| X \right|>\varepsilon  \right)\le \frac{2{{e}^{-{{\varepsilon }^{2}}/2}}}{\varepsilon }\]
Proof of the Gaussian Tail Inequality for the sample mean
Now consider i.i.d. \({{X}_{1}},\ldots ,{{X}_{n}}\sim N\left( 0,1 \right)\) with \({{\bar{X}}_{n}}={{n}^{-1}}\sum\limits_{i=1}^{n}{{{X}_{i}}}\sim N\left( 0,{{n}^{-1}} \right)\); therefore \({{\bar{X}}_{n}}\overset{d}{\mathop{=}}\,{{n}^{-1/2}}Z\) where \(Z\sim N\left( 0,1 \right)\), and by the Gaussian tail inequality \[P\left( \left| {{{\bar{X}}}_{n}} \right|>\varepsilon  \right)=P\left( {{n}^{-1/2}}\left| Z \right|>\varepsilon  \right)=P\left( \left| Z \right|>\sqrt{n}\varepsilon  \right)\le \frac{2}{\sqrt{n}\varepsilon }{{e}^{-n{{\varepsilon }^{2}}/2}}\]
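As a quick sanity check, below is a minimal Monte Carlo sketch (an illustration, not part of the proof; the values of \(\varepsilon\), \(n\), and the number of trials are arbitrary choices) comparing the empirical tail probabilities with the two bounds above.

```python
# Monte Carlo check of the Gaussian tail bounds (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000

# Single X ~ N(0, 1): P(|X| > eps) vs 2*exp(-eps^2/2)/eps
eps = 2.0
x = rng.standard_normal(trials)
print(np.mean(np.abs(x) > eps), "<=", 2 * np.exp(-eps**2 / 2) / eps)

# Sample mean of n i.i.d. N(0, 1): P(|Xbar_n| > eps) vs (2/(sqrt(n)*eps))*exp(-n*eps^2/2)
eps, n = 0.5, 10
xbar = rng.standard_normal((trials, n)).mean(axis=1)
print(np.mean(np.abs(xbar) > eps), "<=", 2 / (np.sqrt(n) * eps) * np.exp(-n * eps**2 / 2))
```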
Exercise:
Suppose \(X\sim N\left( 0,{{\sigma }^{2}} \right)\) and prove the Gaussian tail inequality \[P\left( \left| X \right|>\varepsilon  \right)\le \frac{{{\sigma }^{2}}}{\varepsilon }\frac{2}{\sqrt{2\pi {{\sigma }^{2}}}}{{e}^{-{{\varepsilon }^{2}}/\left( 2{{\sigma }^{2}} \right)}}\]
Theorem 2: Markov’s Inequality
Let \(X\) be a non-negative random variable whose expectation \(E\left( X \right)\) exists. For any \(t>0\): \(P\left( X>t \right)\le \frac{E\left( X \right)}{t}\).
Proof of Markov’s Inequality
For \(X\ge 0\) we can write the expectation of \(X\) as: \[E\left( X \right)=\int\limits_{0}^{\infty }{xp\left( x \right)dx}=\int\limits_{0}^{t}{xp\left( x \right)dx}+\int\limits_{t}^{\infty }{xp\left( x \right)dx}\ge \int\limits_{t}^{\infty }{xp\left( x \right)dx}\] and since \(x\ge t\) on the remaining region of integration, \[E\left( X \right)\ge \int\limits_{t}^{\infty }{xp\left( x \right)dx}\ge t\int\limits_{t}^{\infty }{p\left( x \right)dx}=tP\left( X>t \right)\] Rearranging gives \[P\left( X>t \right)\le \frac{E\left( X \right)}{t}\]
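A small sketch (illustrative only; the Exponential(1) distribution and the values of \(t\) are arbitrary choices) comparing \(P\left( X>t \right)\) with \(E\left( X \right)/t\) for a non-negative random variable:

```python
# Monte Carlo check of Markov's inequality for a non-negative random variable.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=200_000)  # non-negative, E(X) = 1

for t in (1.0, 2.0, 5.0):
    print(f"t={t}: P(X > t) ~ {np.mean(x > t):.4f}  <=  E(X)/t ~ {x.mean() / t:.4f}")
```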
Theorem 3: Chebyshev’s Inequality
Let \(\mu =E\left( X \right)\) and \(Var\left( X \right)={{\sigma }^{2}}\). Then \(P\left( \left| X-\mu  \right|\ge t \right)\le \frac{{{\sigma }^{2}}}{{{t}^{2}}}\) and \(P\left( \left| Z \right|\ge k \right)\le \frac{1}{{{k}^{2}}}\) where \(Z=\frac{X-\mu }{\sigma }\); in particular, \(P\left( \left| Z \right|>2 \right)\le \frac{1}{4}\) and \(P\left( \left| Z \right|>3 \right)\le \frac{1}{9}\).
Proof of Chebyshev’s Inequality
By Markov's inequality applied to \({{\left( X-\mu  \right)}^{2}}\), \[P\left( \left| X-\mu  \right|\ge t \right)=P\left( {{\left( X-\mu  \right)}^{2}}\ge {{t}^{2}} \right)\le \frac{E{{\left( X-\mu  \right)}^{2}}}{{{t}^{2}}}=\frac{{{\sigma }^{2}}}{{{t}^{2}}}\] For the standardized version, set \(t=\sigma k\): \[P\left( \left| Z \right|\ge k \right)=P\left( \left| X-\mu  \right|\ge \sigma k \right)=P\left( {{\left( X-\mu  \right)}^{2}}\ge {{\sigma }^{2}}{{k}^{2}} \right)\le \frac{E{{\left( X-\mu  \right)}^{2}}}{{{\sigma }^{2}}{{k}^{2}}}=\frac{{{\sigma }^{2}}}{{{\sigma }^{2}}{{k}^{2}}}=\frac{1}{{{k}^{2}}}\]
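Again as an illustration (the Gamma distribution and the values of \(k\) are arbitrary choices), the sketch below standardizes a skewed random variable and compares \(P\left( \left| Z \right|\ge k \right)\) with \(1/{{k}^{2}}\).

```python
# Monte Carlo check of Chebyshev's inequality on a standardized variable.
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=2.0, size=200_000)
z = (x - x.mean()) / x.std()  # standardize: mean ~0, variance ~1

for k in (2, 3, 4):
    print(f"k={k}: P(|Z| >= k) ~ {np.mean(np.abs(z) >= k):.4f}  <=  1/k^2 = {1 / k**2:.4f}")
```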
Theorem 4: Hoeffding Inequality
In probability theory, Hoeffding’s inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. It was proven by Wassily Hoeffding in 1963. This inequality is sharper than Markov’s inequality, and the bound can be obtained without knowing the variance.
If \(a<X<b\) and \(\mu =E\left( X \right)\) then \(P\left( \left| X-\mu \right|>\varepsilon \right)\le 2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\).
Proof (part-A)
Assume \(\mu =0\); if the data do not have \(\mu =0\), we can always center them. Assume also \(a<X<b\), which together with \(E\left( X \right)=0\) implies \(a\le 0\le b\). Now
\[X=\gamma b+\left( 1-\gamma  \right)a\] where \(0<\gamma <1\) and \(\gamma =\frac{X-a}{b-a}\). Since \({{e}^{tx}}\) is convex in \(x\), we can write: \[{{e}^{tX}}\le \gamma {{e}^{tb}}+\left( 1-\gamma  \right){{e}^{ta}}=\frac{X-a}{b-a}{{e}^{tb}}+\left( 1-\frac{X-a}{b-a} \right){{e}^{ta}}=\frac{X-a}{b-a}{{e}^{tb}}+\frac{b-X}{b-a}{{e}^{ta}}\] \[{{e}^{tX}}\le \frac{X{{e}^{tb}}-a{{e}^{tb}}+b{{e}^{ta}}-X{{e}^{ta}}}{b-a}=\left( \frac{-a{{e}^{tb}}+b{{e}^{ta}}}{b-a} \right)+\frac{X\left( {{e}^{tb}}-{{e}^{ta}} \right)}{b-a}\] Taking the expectation of both sides: \[E\left( {{e}^{tX}} \right)\le \left( \frac{-a{{e}^{tb}}+b{{e}^{ta}}}{b-a} \right)+\frac{\left( {{e}^{tb}}-{{e}^{ta}} \right)}{b-a}E\left( X \right)\] Since \(\mu =E\left( X \right)=0\), the second term vanishes, so \[E\left( {{e}^{tX}} \right)\le \left( \frac{-a{{e}^{tb}}+b{{e}^{ta}}}{b-a} \right)=\frac{{{e}^{ta}}\left( b-a{{e}^{t\left( b-a \right)}} \right)}{b-a}\]
Let’s define \[{{e}^{g\left( t \right)}}=\frac{{{e}^{ta}}\left( b-a{{e}^{t\left( b-a \right)}} \right)}{b-a}\]
Taking the logarithm of both sides:
\[\log \left( {{e}^{g\left( t \right)}} \right)=\log \left( \frac{{{e}^{ta}}\left( b-a{{e}^{t\left( b-a \right)}} \right)}{b-a} \right)\]
\[g\left( t \right)=\log \left( {{e}^{ta}}\left( b-a{{e}^{t\left( b-a \right)}} \right) \right)-\log \left( b-a \right)\]
\[g\left( t \right)=\log \left( {{e}^{ta}} \right)+\log \left( b-a{{e}^{t\left( b-a \right)}} \right)-\log \left( b-a \right)\]
\[g\left( t \right)=ta+\log \left( b-a{{e}^{t\left( b-a \right)}} \right)-\log \left( b-a \right)\]
Taylor series expansion
For a univariate function \(g(x)\) expanded about \({{x}_{0}}\), the Taylor series is: \[g(x)=g({{x}_{0}})+{{g}^{(1)}}({{x}_{0}})(x-{{x}_{0}})+\frac{1}{2!}{{g}^{(2)}}({{x}_{0}}){{(x-{{x}_{0}})}^{2}}+\cdots +\frac{1}{(m-1)!}{{g}^{(m-1)}}({{x}_{0}}){{(x-{{x}_{0}})}^{m-1}}+\frac{1}{m!}{{g}^{(m)}}({{x}_{0}}){{(x-{{x}_{0}})}^{m}}+\cdots \] If \(g(x)\) is \(m\) times differentiable, the expansion can also be written exactly with the Lagrange remainder: \[g(x)=g({{x}_{0}})+{{g}^{(1)}}({{x}_{0}})(x-{{x}_{0}})+\frac{1}{2!}{{g}^{(2)}}({{x}_{0}}){{(x-{{x}_{0}})}^{2}}+\cdots +\frac{1}{(m-1)!}{{g}^{(m-1)}}({{x}_{0}}){{(x-{{x}_{0}})}^{m-1}}+\frac{1}{m!}{{g}^{(m)}}(\xi ){{(x-{{x}_{0}})}^{m}}\] where \({{g}^{(s)}}\left( {{x}_{0}} \right)={{\left. \frac{{{d}^{s}}g(x)}{d{{x}^{s}}} \right|}_{x={{x}_{0}}}}\) and \(\xi\) lies between \(x\) and \({{x}_{0}}\).
Proof (part-B)
Now, evaluating \(g\left( t \right)\) at \(t=0\) we get: \[g\left( 0 \right)=0+\log \left( b-a \right)-\log \left( b-a \right)=0\]
Let’s evaluate \({g}'\left( t=0 \right)\), but first find \({g}'\left( t \right)\). \[\frac{dg\left( t \right)}{dt}={g}'\left( t \right)=\frac{d\left( ta+\log \left( b-a{{e}^{t\left( b-a \right)}} \right)-\log \left( b-a \right) \right)}{dt}\] \[{g}'\left( t \right)=a+\frac{d\log \left( b-a{{e}^{t\left( b-a \right)}} \right)}{d\left( b-a{{e}^{t\left( b-a \right)}} \right)}\frac{d\left( b-a{{e}^{t\left( b-a \right)}} \right)}{dt}+\underbrace{\frac{d\left( -\log \left( b-a \right) \right)}{dt}}_{0}\] \[{g}'\left( t \right)=a+\frac{-a\left( b-a \right){{e}^{t\left( b-a \right)}}}{b-a{{e}^{t\left( b-a \right)}}}\] Now evaluating \({g}'\left( t \right)\) at \(t=0\), we get \[{g}'\left( 0 \right)=a+\frac{-a\left( b-a \right){{e}^{0}}}{b-a{{e}^{0}}}=a+\frac{-a\left( b-a \right)}{b-a}=a-a=0\]
Now let’s compute \({g}''\left( t \right)\) by differentiating \({g}'\left( t \right)\) with the quotient rule.
\[{g}''\left( t \right)=\frac{d}{dt}\left( \frac{-a\left( b-a \right){{e}^{t\left( b-a \right)}}}{b-a{{e}^{t\left( b-a \right)}}} \right)=\frac{-a{{\left( b-a \right)}^{2}}{{e}^{t\left( b-a \right)}}\left( b-a{{e}^{t\left( b-a \right)}} \right)-\left( -a\left( b-a \right){{e}^{t\left( b-a \right)}} \right)\left( -a\left( b-a \right){{e}^{t\left( b-a \right)}} \right)}{{{\left( b-a{{e}^{t\left( b-a \right)}} \right)}^{2}}}\]
The \({{a}^{2}}{{\left( b-a \right)}^{2}}{{e}^{2t\left( b-a \right)}}\) terms cancel, leaving
\[{g}''\left( t \right)=\frac{-ab{{\left( b-a \right)}^{2}}{{e}^{t\left( b-a \right)}}}{{{\left( b-a{{e}^{t\left( b-a \right)}} \right)}^{2}}}\]
Now we bound \({g}''\left( t \right)\). Recall that \(E\left( X \right)=0\) and \(a<X<b\) imply \(a\le 0\le b\), so both \(b\ge 0\) and \(-a{{e}^{t\left( b-a \right)}}\ge 0\). For any \(u,v\ge 0\),
\[{{\left( u+v \right)}^{2}}\ge 4uv\]
Taking \(u=b\) and \(v=-a{{e}^{t\left( b-a \right)}}\) gives
\[{{\left( b-a{{e}^{t\left( b-a \right)}} \right)}^{2}}\ge -4ab{{e}^{t\left( b-a \right)}}\]
i.e.
\[\frac{-ab{{e}^{t\left( b-a \right)}}}{{{\left( b-a{{e}^{t\left( b-a \right)}} \right)}^{2}}}\le \frac{1}{4}\]
From this inequality, we can write:
\[{g}''\left( t \right)=\frac{-ab{{\left( b-a \right)}^{2}}{{e}^{t\left( b-a \right)}}}{{{\left( b-a{{e}^{t\left( b-a \right)}} \right)}^{2}}}\le \frac{{{\left( b-a \right)}^{2}}}{4}\]
Now, with Taylor series expansion we have:
\[g\left( t \right)=g\left( 0 \right)+t{g}'\left( 0 \right)+\frac{1}{2!}{{t}^{2}}{{g}'}'\left( 0 \right)+\cdots\]
Truncating the Taylor expansion at second order with the Lagrange remainder, we can write (note this is exact, not an approximation, for some \(\xi\) between \(0\) and \(t\)):
\[g\left( t \right)=g\left( 0 \right)+t{g}'\left( 0 \right)+\frac{1}{2!}{{t}^{2}}{{g}'}'\left( \xi \right)=\frac{1}{2!}{{t}^{2}}{{g}'}'\left( \xi \right)\le \frac{1}{2!}{{t}^{2}}\frac{{{\left( b-a \right)}^{2}}}{4}\]
\[g\left( t \right)\le \frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}\]
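The conclusion so far is \(E\left( {{e}^{tX}} \right)\le {{e}^{g\left( t \right)}}\le {{e}^{{{t}^{2}}{{\left( b-a \right)}^{2}}/8}}\) for a bounded, zero-mean \(X\). The sketch below (illustrative only; the discrete distribution and the grid of \(t\) values are arbitrary choices) checks this numerically by comparing \(\log E\left( {{e}^{tX}} \right)\) with \({{t}^{2}}{{\left( b-a \right)}^{2}}/8\).

```python
# Numerical check of Hoeffding's lemma: log E[e^{tX}] <= t^2 (b-a)^2 / 8
# for a bounded, zero-mean discrete variable (values/probabilities arbitrary).
import numpy as np

values = np.array([-1.0, 0.5, 2.0])
probs = np.array([0.5, 0.3, 0.2])
values = values - probs @ values          # center so that E(X) = 0
a, b = values.min(), values.max()

t_grid = np.linspace(-3, 3, 13)
log_mgf = np.log(np.exp(np.outer(t_grid, values)) @ probs)
bound = t_grid**2 * (b - a) ** 2 / 8
print("max of log E[e^{tX}] - bound (should be <= 0):", np.max(log_mgf - bound))
```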
Proof (part-C)
We have the bounds \(E\left[ {{e}^{tX}} \right]\le {{e}^{g\left( t \right)}}\) and \({{e}^{g\left( t \right)}}\le {{e}^{\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}\).
Consider, for any \(t>0\), \[P\left( X>\varepsilon  \right)=P\left( {{e}^{X}}>{{e}^{\varepsilon }} \right)=P\left( {{e}^{tX}}>{{e}^{t\varepsilon }} \right)\]
And now with Markov’s inequality:
\[P\left( {{e}^{tX}}>{{e}^{t\varepsilon }} \right)\le \frac{E\left( {{e}^{tX}} \right)}{{{e}^{t\varepsilon }}}={{e}^{-t\varepsilon }}E\left( {{e}^{tX}} \right)\le {{e}^{-t\varepsilon }}{{e}^{\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}={{e}^{^{-t\varepsilon +\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}}\]
Since this bound holds for every \(t\ge 0\), we can make it sharper by taking the infimum over \(t\):
\[P\left( X>\varepsilon  \right)\le \underset{t\ge 0}{\mathop{\inf }}\,\frac{E\left( {{e}^{tX}} \right)}{{{e}^{t\varepsilon }}}\le \underset{t\ge 0}{\mathop{\inf }}\,{{e}^{-t\varepsilon +\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}\]
Proof for sharper version with Chernoff’s method
Let’s define \(u\left( t \right)=t\varepsilon -\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}\); since the bound is \({{e}^{-u\left( t \right)}}\), minimizing the bound amounts to maximizing \(u\left( t \right)\). The first-order condition \({u}'\left( t \right)=\varepsilon -\frac{2t{{\left( b-a \right)}^{2}}}{8}\overset{set}{\mathop{=}}\,0\) gives \({{t}^{*}}=\frac{4\varepsilon }{{{\left( b-a \right)}^{2}}}\), and substituting:
\[u\left( {{t}^{*}} \right)={{t}^{*}}\varepsilon -\frac{{{t}^{*2}}{{\left( b-a \right)}^{2}}}{8}=\varepsilon \frac{4\varepsilon }{{{\left( b-a \right)}^{2}}}-{{\left( \frac{4\varepsilon }{{{\left( b-a \right)}^{2}}} \right)}^{2}}\frac{{{\left( b-a \right)}^{2}}}{8}\]
\[u\left( {{t}^{*}} \right)=\frac{4{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}=\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}\]
The reason we maximize \(u\left( t \right)\) is to make the inequality as sharp as possible: we bound the probability by the smallest available exponent. Now substituting:
\[P\left( X>\varepsilon \right)=P\left( {{e}^{X}}>{{e}^{\varepsilon }} \right)=P\left( {{e}^{tX}}>{{e}^{t\varepsilon }} \right)\le {{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\]
Applying the same argument to \(-X\) and combining the two tails,
\[P\left( \left| X \right|>\varepsilon  \right)\le 2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\]
This is a very important result because the bound involves neither the mean nor the variance. It holds for any bounded random variable, even when the exact form of its distribution is unknown.
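As a numerical aside (a sketch with arbitrary values of \(\varepsilon\), \(a\), \(b\)), one can check that the closed-form exponent obtained at \({{t}^{*}}=4\varepsilon /{{\left( b-a \right)}^{2}}\) matches the infimum of \({{e}^{-t\varepsilon +{{t}^{2}}{{\left( b-a \right)}^{2}}/8}}\) over a fine grid of \(t\).

```python
# Check that the Chernoff-optimized bound matches a grid search over t.
import numpy as np

eps, a, b = 0.3, -1.0, 1.0
t_grid = np.linspace(0.0, 10.0, 100_001)
grid_min = np.min(np.exp(-t_grid * eps + t_grid**2 * (b - a) ** 2 / 8))
closed_form = np.exp(-2 * eps**2 / (b - a) ** 2)
print(f"grid minimum {grid_min:.6f}  vs  closed form {closed_form:.6f}")
```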
Proof for a random variable with non-zero mean
Now consider a random variable with mean \(\mu =E\left( X \right)\), and let \(Y=X-\mu\), so that \(E\left( Y \right)=0\) and \(a-\mu <Y<b-\mu\). Then:
\[P\left( \left| Y \right|>\varepsilon \right)=P\left( \left| X-\mu \right|>\varepsilon \right)\le 2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-\mu -a+\mu \right)}^{2}}}}}=2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\]
So \(P\left( \left| X-\mu  \right|>\varepsilon  \right)\le 2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\), which is known as Hoeffding’s inequality. It shows that the probability that the random variable deviates from its mean by more than \(\varepsilon\) is upper bounded by \(2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\). This holds for any bounded random variable, so it is a very powerful generalization.
Proof of the bound for the sample mean
Let’s define \({{\bar{Y}}_{n}}={{n}^{-1}}\sum\limits_{i=1}^{n}{{{Y}_{i}}}\), where the \({{Y}_{i}}\) are i.i.d., zero-mean, and bounded, \(a<{{Y}_{i}}<b\). Let’s bound it as:
\[P\left( {{{\bar{Y}}}_{n}}>\varepsilon \right)=P\left( {{n}^{-1}}\sum\limits_{i=1}^{n}{{{Y}_{i}}}>\varepsilon \right)=P\left( \sum\limits_{i=1}^{n}{{{Y}_{i}}}>n\varepsilon \right)=P\left( {{e}^{\sum\limits_{i=1}^{n}{{{Y}_{i}}}}}>{{e}^{n\varepsilon }} \right)=P\left( {{e}^{t\sum\limits_{i=1}^{n}{{{Y}_{i}}}}}>{{e}^{tn\varepsilon }} \right)\]
Note that we introduce \(t\) here for flexibility: we will choose its value later. Now, with Markov’s inequality and the i.i.d. assumption on the \({{Y}_{i}}\), we can write
\[P\left( {{{\bar{Y}}}_{n}}>\varepsilon \right)=P\left( {{e}^{t\sum\limits_{i=1}^{n}{{{Y}_{i}}}}}>{{e}^{tn\varepsilon }} \right)\le {{e}^{-tn\varepsilon }}E\left[ {{e}^{t\sum\limits_{i=1}^{n}{{{Y}_{i}}}}} \right]={{e}^{-tn\varepsilon }}{{\left( E{{e}^{t{{Y}_{i}}}} \right)}^{n}}\]
Since we have the bound \(E\left[ {{e}^{t{{Y}_{i}}}} \right]\le {{e}^{g\left( t \right)}}\le {{e}^{\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}\), therefore
\[P\left( {{{\bar{Y}}}_{n}}>\varepsilon \right)\le {{e}^{-tn\varepsilon }}{{\left( E{{e}^{t{{Y}_{i}}}} \right)}^{n}}\le {{e}^{-tn\varepsilon }}{{e}^{n\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}\]
To obtain the sharpest bound, define \(u\left( t \right)=tn\varepsilon -n\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}\) and maximize it. The first-order condition is \({u}'\left( t \right)=n\varepsilon -n\frac{2t{{\left( b-a \right)}^{2}}}{8}\overset{set}{\mathop{=}}\,0\), and solving we get \[{{t}^{*}}=\frac{4\varepsilon }{{{\left( b-a \right)}^{2}}}\] Plugging the value of \({{t}^{*}}\) into \(u\left( t \right)\) gives:
\[u\left( {{t}^{*}} \right)={{t}^{*}}n\varepsilon -n\frac{{{t}^{*}}^{2}{{\left( b-a \right)}^{2}}}{8}=\frac{4\varepsilon }{{{\left( b-a \right)}^{2}}}n\varepsilon -n\frac{{{\left( 4\varepsilon  \right)}^{2}}}{{{\left( b-a \right)}^{4}}}\frac{{{\left( b-a \right)}^{2}}}{8}=\frac{4n{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}-\frac{2n{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}=\frac{2n{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}\]
Then,
\[P\left( {{{\bar{Y}}}_{n}}>\varepsilon \right)\le \underset{t\ge 0}{\mathop{\inf }}\,{{e}^{-tn\varepsilon }}{{e}^{n\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}={{e}^{\frac{-2n{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\]
Then,
\[P\left( \left| {{{\bar{Y}}}_{n}} \right|>\varepsilon  \right)\le 2{{e}^{-\frac{2n{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\]
Hence, this gives the bound on the mean.
Proof for the Bernoulli case
Consider Hoeffding’s inequality for i.i.d. \({{X}_{1}},\ldots ,{{X}_{n}}\sim Ber\left( p \right)\). Each \({{X}_{i}}\) is bounded below by \(0\) and above by \(1\), so \({{\left( b-a \right)}^{2}}=1\), and Hoeffding’s inequality gives \[P\left( \left| {{{\bar{X}}}_{n}}-p \right|>\varepsilon  \right)\le 2{{e}^{-2n{{\varepsilon }^{2}}}}\]
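A minimal Monte Carlo sketch of this Bernoulli case (the values of \(p\), \(n\), \(\varepsilon\), and the number of trials are arbitrary choices):

```python
# Monte Carlo check of Hoeffding's inequality for Bernoulli(p) sample means.
import numpy as np

rng = np.random.default_rng(3)
p, n, eps, trials = 0.3, 100, 0.1, 100_000

xbar = rng.binomial(1, p, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(xbar - p) > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)
print(f"P(|Xbar_n - p| > {eps}) ~ {empirical:.4f}  <=  {hoeffding:.4f}")
```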
Theorem 5: Kullback Leibler Distance
Proof that the distance between densities is greater than or equal to zero
We show that the distance between any two densities \(p\) and \(q\), where the random variable satisfies \(X\sim p\) (\(p\) is some distribution), is always greater than or equal to zero.
Before answering this, let’s quickly note two inequalities, namely the Cauchy–Schwarz inequality and Jensen’s inequality.
Cauchy–Schwarz Inequality
\(\left| EXY \right|\le E\left| XY \right|\le \sqrt{E\left( {{X}^{2}} \right)}\sqrt{E\left( {{Y}^{2}} \right)}\).
Jensen’s inequality
If \(g\) is convex, then \(Eg\left( X \right)\ge g\left( EX \right)\). If \(g\) is concave, then \(Eg\left( X \right)\le g\left( EX \right)\).
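A tiny numerical illustration of Jensen’s inequality (the Exponential distribution and the choices \(g\left( x \right)={{x}^{2}}\) and \(g\left( x \right)=\log x\) are arbitrary):

```python
# Jensen's inequality: E[g(X)] >= g(E[X]) for convex g, reversed for concave g.
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=200_000)  # positive, so log(x) is defined

print(np.mean(x**2), ">=", np.mean(x) ** 2)            # g(x) = x^2 is convex
print(np.mean(np.log(x)), "<=", np.log(np.mean(x)))    # g(x) = log(x) is concave
```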
Kullback Leibler Distance
The distance between two densities \(p\) and \(q\) is defined by the Kullback Leibler distance, given as:
\[D\left( p,q \right)=\int{p\left( x \right)\log \left( \frac{p\left( x \right)}{q\left( x \right)} \right)}dx\]
Before moving ahead, note that the self-distance from density \(p\) to \(p\) is zero, \(D\left( p,p \right)=0\), and by definition a distance should always be greater than or equal to zero, so one thing we must confirm is that \(D\left( p,q \right)\ge 0\) for any two densities \(p\) and \(q\). We will use Jensen’s inequality to prove \(D\left( p,q \right)\ge 0\). Since the \(\log\) function is concave, Jensen’s inequality gives:
\[-D\left( p,q \right)=E\left[ \log \left( \frac{q\left( x \right)}{p\left( x \right)} \right) \right]\le \log \left[ E\left( \frac{q\left( x \right)}{p\left( x \right)} \right) \right]=\log \int{p\left( x \right)\frac{q\left( x \right)}{p\left( x \right)}dx}=\log \int{q\left( x \right)dx}=\log \left( 1 \right)=0\] where the expectation is taken with respect to \(p\).
i.e
\[-D\left( p,q \right)\le 0\] i.e. \[D\left( p,q \right)\ge 0\]
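A short sketch (two arbitrary discrete densities) illustrating both \(D\left( p,q \right)\ge 0\) and \(D\left( p,p \right)=0\):

```python
# Kullback-Leibler distance between two discrete densities is non-negative.
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

def kl(p, q):
    # D(p, q) = sum_x p(x) * log(p(x) / q(x))
    return np.sum(p * np.log(p / q))

print("D(p, q) =", kl(p, q))  # strictly positive since p != q
print("D(p, p) =", kl(p, p))  # zero
```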
Theorem 6: Maximum of a random variable
Let \({{X}_{1}},\ldots ,{{X}_{n}}\) be random variables. Suppose there exists \(\sigma >0\) such that \(E\left( {{e}^{t{{X}_{i}}}} \right)\le {{e}^{\frac{{{t}^{2}}{{\sigma }^{2}}}{2}}}\) for all \(t>0\). Then \(E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \sigma \sqrt{2\log n}\).
The question here is how to bound the maximum of a collection of random variables. Say the random variables are \({{X}_{1}},\ldots ,{{X}_{n}}\), arranged in ascending order so that \({{X}_{\left( 1 \right)}}\le {{X}_{\left( 2 \right)}}\le \ldots \le {{X}_{\left( n \right)}}\) with \({{X}_{\left( n \right)}}=\max \left\{ {{X}_{1}},\cdots ,{{X}_{n}} \right\}\); how can we control this maximum? The interesting thing is that we may not know the exact distribution of \(X\), so can we say something in general, without knowing the distribution, about the maximum of the random variables?
Let’s start with the moment generating function \(E{{e}^{t{{X}_{i}}}}\); by assumption it is bounded as \(E{{e}^{t{{X}_{i}}}}\le {{e}^{\frac{{{t}^{2}}{{\sigma }^{2}}}{2}}}\). Can we use this to bound \(E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\)?
Let’s start with \(E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\), pre-multiply it by \(t>0\), and exponentiate, i.e. consider \(\exp \left\{ tE\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}} \right\}\); bounding this is the same as bounding \(E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\). Now with Jensen’s inequality we can write:
\[{{e}^{tE\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}}}\le E{{e}^{t\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}}}=E\underset{1\le i\le n}{\mathop{\max }}\,{{e}^{t{{X}_{i}}}}\le \sum\limits_{i=1}^{n}{E{{e}^{t{{X}_{i}}}}}\le n{{e}^{\frac{{{t}^{2}}{{\sigma }^{2}}}{2}}}\] \[{{e}^{tE\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}}}\le n{{e}^{\frac{{{t}^{2}}{{\sigma }^{2}}}{2}}}\] Taking \(\log\) of both sides \[tE\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \log n+\log {{e}^{\frac{{{t}^{2}}{{\sigma }^{2}}}{2}}}\] \[tE\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \log n+\frac{{{t}^{2}}{{\sigma }^{2}}}{2}\] Dividing both sides by \(t\), we get \[E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \frac{\log n}{t}+\frac{t{{\sigma }^{2}}}{2}\] Now, minimizing \(\frac{\log n}{t}+\frac{t{{\sigma }^{2}}}{2}\) w.r.t. \(t\) gives \(\log n=\frac{{{t}^{2}}{{\sigma }^{2}}}{2}\), i.e. \(t={{\sigma }^{-1}}\sqrt{2\log n}\). Plugging in this value, we get:
\[E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \frac{\log n}{t}+\frac{t{{\sigma }^{2}}}{2}=\frac{2\log n+{{t}^{2}}{{\sigma }^{2}}}{2t}=\frac{2\log n+{{\left( {{\sigma }^{-1}}\sqrt{2\log n} \right)}^{2}}{{\sigma }^{2}}}{2{{\sigma }^{-1}}\sqrt{2\log n}}\] \[E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \frac{2\log n+{{\sigma }^{-2}}2\log n{{\sigma }^{2}}}{2{{\sigma }^{-1}}\sqrt{2\log n}}=\frac{2\sqrt{2}\sqrt{2}\log n}{2{{\sigma }^{-1}}\sqrt{2}\sqrt{\log n}}=\sigma \sqrt{2\log n}\] Hence \[E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \sigma \sqrt{2\log n}\]
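A closing sketch (illustrative only; \(\sigma\), the values of \(n\), and the number of trials are arbitrary choices): i.i.d. \(N\left( 0,{{\sigma }^{2}} \right)\) variables satisfy \(E\left( {{e}^{t{{X}_{i}}}} \right)={{e}^{{{t}^{2}}{{\sigma }^{2}}/2}}\), so the theorem applies, and the simulated \(E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\) stays below \(\sigma \sqrt{2\log n}\).

```python
# Simulated E[max X_i] for i.i.d. N(0, sigma^2) versus the bound sigma*sqrt(2 log n).
import numpy as np

rng = np.random.default_rng(4)
sigma, trials = 2.0, 5_000

for n in (10, 100, 1000):
    max_samples = (sigma * rng.standard_normal((trials, n))).max(axis=1)
    print(f"n={n}: E[max X_i] ~ {max_samples.mean():.3f}  <=  "
          f"sigma*sqrt(2 log n) = {sigma * np.sqrt(2 * np.log(n)):.3f}")
```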