Probability Inequalities

Theorem 1: Gaussian Tail Inequality

Let \(X\sim N\left( 0,1 \right)\) and \({{x}_{1}},\cdots ,{{x}_{n}}\sim N\left( 0,1 \right)\) i.i.d. with \({{\bar{X}}_{n}}={{n}^{-1}}\sum_{i=1}^{n}{{x}_{i}}\). Then \(P\left( \left| X \right|>\varepsilon \right)\le \frac{2{{e}^{-{{\varepsilon }^{2}}/2}}}{\varepsilon }\) and \(P\left( \left| {{{\bar{X}}}_{n}} \right|>\varepsilon \right)\le \frac{2}{\sqrt{n}\varepsilon }{{e}^{-n{{\varepsilon }^{2}}/2}}\overset{\text{large }n}{\le }{{e}^{-n{{\varepsilon }^{2}}/2}}\).

Proof of Gaussian Tail Inequality

Consider a univariate \(X\sim N\left( 0,1 \right)\) with probability density function \(\phi \left( x \right)=\frac{1}{\sqrt{2\pi }}{{e}^{-\frac{{{x}^{2}}}{2}}}\). Taking the derivative w.r.t. \(x\) we get: \[{\phi }'\left( x \right)=\frac{d}{dx}\left( \frac{1}{\sqrt{2\pi }}{{e}^{-\frac{{{x}^{2}}}{2}}} \right)=\frac{1}{\sqrt{2\pi }}\frac{d\left( {{e}^{-\frac{{{x}^{2}}}{2}}} \right)}{d\left( -\frac{{{x}^{2}}}{2} \right)}\frac{d\left( -\frac{{{x}^{2}}}{2} \right)}{dx}=\frac{1}{\sqrt{2\pi }}{{e}^{-\frac{{{x}^{2}}}{2}}}\left( -x \right)=-x\,\phi \left( x \right)\] Now bound the tail probability. For \(s\ge \varepsilon >0\) we have \(s/\varepsilon \ge 1\), so \[P\left( X>\varepsilon \right)=\int_{\varepsilon }^{\infty }{\phi \left( s \right)ds}\le {{\varepsilon }^{-1}}\int_{\varepsilon }^{\infty }{s\,\phi \left( s \right)ds}\] Since \(s\,\phi \left( s \right)=-{\phi }'\left( s \right)\), \[P\left( X>\varepsilon \right)\le -{{\varepsilon }^{-1}}\int_{\varepsilon }^{\infty }{{\phi }'\left( s \right)ds}=-{{\varepsilon }^{-1}}\left. \phi \left( s \right) \right|_{\varepsilon }^{\infty }=-{{\varepsilon }^{-1}}\left[ \phi \left( \infty \right)-\phi \left( \varepsilon \right) \right]=\frac{\phi \left( \varepsilon \right)}{\varepsilon }=\frac{1}{\varepsilon \sqrt{2\pi }}{{e}^{-\frac{{{\varepsilon }^{2}}}{2}}}\le \frac{{{e}^{-{{\varepsilon }^{2}}/2}}}{\varepsilon }\] Now, by the symmetry of the distribution, \[P\left( \left| X \right|>\varepsilon \right)\le \frac{2{{e}^{-{{\varepsilon }^{2}}/2}}}{\varepsilon }\]
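As a quick sanity check (not part of the proof), here is a small Monte Carlo sketch, assuming Python with numpy; the sample size and \(\varepsilon\) values are arbitrary illustrations.

```python
# Monte Carlo check (illustration only) of P(|X| > eps) <= 2*exp(-eps^2/2)/eps for X ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
x = rng.standard_normal(1_000_000)      # draws from N(0, 1)

for eps in [1.0, 1.5, 2.0, 3.0]:
    empirical = np.mean(np.abs(x) > eps)
    bound = 2 * np.exp(-eps**2 / 2) / eps
    print(f"eps={eps}: P(|X|>eps) ~ {empirical:.5f} <= bound {bound:.5f}")
```

The empirical tail probabilities should sit below the bound for every \(\varepsilon\); the bound is loose for small \(\varepsilon\) (it can exceed 1) and tightens as \(\varepsilon\) grows.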

Proof of the Gaussian Tail Inequality for the sample mean

Now consider \({{x}_{1}},\cdots ,{{x}_{n}}\sim N\left( 0,1 \right)\) i.i.d. and \({{\bar{X}}_{n}}={{n}^{-1}}\sum\limits_{i=1}^{n}{{{x}_{i}}}\sim N\left( 0,{{n}^{-1}} \right)\); therefore \({{\bar{X}}_{n}}\overset{d}{=}{{n}^{-1/2}}Z\) where \(Z\sim N\left( 0,1 \right)\), and by the Gaussian tail inequality \[P\left( \left| {{{\bar{X}}}_{n}} \right|>\varepsilon \right)=P\left( {{n}^{-1/2}}\left| Z \right|>\varepsilon \right)=P\left( \left| Z \right|>\sqrt{n}\varepsilon \right)\le \frac{2}{\sqrt{n}\varepsilon }{{e}^{-n{{\varepsilon }^{2}}/2}}\]
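Similarly, a short Monte Carlo sketch (again an illustration, not a proof) can check the bound for the sample mean; \(n\), \(\varepsilon\), and the number of repetitions below are arbitrary choices.

```python
# Monte Carlo check (illustration only) of P(|Xbar_n| > eps) <= (2/(sqrt(n)*eps)) * exp(-n*eps^2/2).
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed
n, reps, eps = 25, 200_000, 0.5         # arbitrary illustrative values
xbar = rng.standard_normal((reps, n)).mean(axis=1)   # sample means of n i.i.d. N(0,1) draws

empirical = np.mean(np.abs(xbar) > eps)
bound = (2 / (np.sqrt(n) * eps)) * np.exp(-n * eps**2 / 2)
print(f"P(|Xbar_n| > {eps}) ~ {empirical:.5f} <= bound {bound:.5f}")
```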

Exercise:

Suppose \({{x}_{1}},\cdots ,{{x}_{n}}\sim N\left( 0,{{\sigma }^{2}} \right)\) and let \(X\sim N\left( 0,{{\sigma }^{2}} \right)\). Prove the Gaussian tail inequality \[P\left( \left| X \right|>\varepsilon \right)\le \frac{2{{\sigma }^{2}}}{\varepsilon }\frac{1}{\sqrt{2\pi {{\sigma }^{2}}}}{{e}^{-{{\varepsilon }^{2}}/\left( 2{{\sigma }^{2}} \right)}}\]
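A numerical sanity check of the claimed bound (this does not replace the proof asked for in the exercise) might look like the following sketch; \(\sigma\) and \(\varepsilon\) are arbitrary illustrative values.

```python
# Monte Carlo check (illustration only) of the claimed bound for X ~ N(0, sigma^2):
# P(|X| > eps) <= 2 * (sigma^2/eps) * (1/sqrt(2*pi*sigma^2)) * exp(-eps^2/(2*sigma^2)).
import numpy as np

rng = np.random.default_rng(2)          # arbitrary seed
sigma, eps = 2.0, 3.0                   # arbitrary illustrative values
x = sigma * rng.standard_normal(1_000_000)

empirical = np.mean(np.abs(x) > eps)
bound = 2 * (sigma**2 / eps) / np.sqrt(2 * np.pi * sigma**2) * np.exp(-eps**2 / (2 * sigma**2))
print(f"P(|X|>{eps}) ~ {empirical:.5f} <= bound {bound:.5f}")
```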

Theorem 2: Markov’s Inequality

Let \(X\) be a non-negative random variable such that \(E\left( X \right)\) exists. For any \(t>0\), \(P\left( X>t \right)\le \frac{E\left( X \right)}{t}\).

Proof of Markov’s Inequality

For non-negative \(X\) with density \(p\left( x \right)\) we can write the expectation of \(X\) as: \[E\left( X \right)=\int\limits_{0}^{\infty }{xp\left( x \right)dx}=\int\limits_{0}^{t}{xp\left( x \right)dx}+\int\limits_{t}^{\infty }{xp\left( x \right)dx}\ge \int\limits_{t}^{\infty }{xp\left( x \right)dx}\] Since \(x\ge t\) on the remaining region, \[E\left( X \right)\ge \int\limits_{t}^{\infty }{xp\left( x \right)dx}\ge t\int\limits_{t}^{\infty }{p\left( x \right)dx}=tP\left( X>t \right)\] Dividing both sides by \(t\) gives \[P\left( X>t \right)\le \frac{E\left( X \right)}{t}\]
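As an illustration (not part of the proof), here is a small Monte Carlo sketch of Markov's inequality; the Exponential(1) distribution is an arbitrary non-negative example.

```python
# Monte Carlo check (illustration only) of Markov's inequality P(X > t) <= E(X)/t
# for a non-negative random variable, here X ~ Exponential(1) (an arbitrary choice, E(X) = 1).
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=1_000_000)

for t in [1.0, 2.0, 5.0]:
    empirical = np.mean(x > t)
    bound = x.mean() / t
    print(f"t={t}: P(X>t) ~ {empirical:.5f} <= E(X)/t ~ {bound:.5f}")
```

Markov's bound is typically loose, but it only needs non-negativity and a finite mean.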

Theorem 3: Chebyshev’s Inequality

Let \(\mu =E\left( X \right)\) and \(Var\left( X \right)={{\sigma }^{2}}\). Then \(P\left( \left| X-\mu \right|\ge t \right)\le \frac{{{\sigma }^{2}}}{{{t}^{2}}}\) and \(P\left( \left| Z \right|\ge k \right)\le \frac{1}{{{k}^{2}}}\), where \(Z=\frac{X-\mu }{\sigma }\); in particular \(P\left( \left| Z \right|>2 \right)\le \frac{1}{4}\) and \(P\left( \left| Z \right|>3 \right)\le \frac{1}{9}\).

Proof of Chebyshev’s Inequality

Apply Markov's inequality to \({{\left( X-\mu \right)}^{2}}\): \[P\left( \left| X-\mu \right|>t \right)=P\left( {{\left| X-\mu \right|}^{2}}>{{t}^{2}} \right)\le \frac{E{{\left( X-\mu \right)}^{2}}}{{{t}^{2}}}=\frac{{{\sigma }^{2}}}{{{t}^{2}}}\] For the standardized version, take \(t=\sigma k\): \[P\left( \left| \frac{X-\mu }{\sigma } \right|>k \right)=P\left( \left| X-\mu \right|>\sigma k \right)\le \frac{E{{\left( X-\mu \right)}^{2}}}{{{\sigma }^{2}}{{k}^{2}}}=\frac{{{\sigma }^{2}}}{{{\sigma }^{2}}{{k}^{2}}}=\frac{1}{{{k}^{2}}}\]
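Again as an illustration (not a proof), a quick Monte Carlo sketch of Chebyshev's inequality; Uniform(0, 1) is an arbitrary example with \(\mu=1/2\) and \(\sigma^2=1/12\).

```python
# Monte Carlo check (illustration only) of Chebyshev's inequality P(|X - mu| >= t) <= sigma^2/t^2,
# here for X ~ Uniform(0, 1), an arbitrary choice with mu = 0.5 and sigma^2 = 1/12.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=1_000_000)
mu, var = x.mean(), x.var()

for t in [0.3, 0.4, 0.45]:
    empirical = np.mean(np.abs(x - mu) >= t)
    bound = var / t**2
    print(f"t={t}: P(|X-mu|>=t) ~ {empirical:.5f} <= sigma^2/t^2 ~ {bound:.5f}")
```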

Theorem 4: Hoeffding Inequality

In probability theory, Hoeffding’s inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. It was proven by Wassily Hoeffding in 1963. The inequality is sharper than Markov’s inequality, and it gives an upper bound without requiring knowledge of the variance.

If \(a<X<b\) and \(\mu =E\left( X \right)\) then \(P\left( \left| X-\mu \right|>\varepsilon \right)\le 2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\).

Proof (part-A)

Let’s assume \(\mu =0\); if the data do not have \(\mu =0\), we can always center them, and \(a<X<b\). Now

\[X=\gamma b+\left( 1-\gamma \right)a\] where \(0<\gamma <1\) and \(\gamma =\frac{X-a}{b-a}\). Since \({{e}^{tx}}\) is convex, we can write: \[{{e}^{tX}}\le \gamma {{e}^{tb}}+\left( 1-\gamma \right){{e}^{ta}}=\frac{X-a}{b-a}{{e}^{tb}}+\left( 1-\frac{X-a}{b-a} \right){{e}^{ta}}=\frac{X-a}{b-a}{{e}^{tb}}+\frac{b-X}{b-a}{{e}^{ta}}\] \[{{e}^{tX}}\le \frac{X{{e}^{tb}}-a{{e}^{tb}}+b{{e}^{ta}}-X{{e}^{ta}}}{b-a}=\left( \frac{-a{{e}^{tb}}+b{{e}^{ta}}}{b-a} \right)+\frac{X\left( {{e}^{tb}}-{{e}^{ta}} \right)}{b-a}\] Taking the expectation on both sides: \[E\left( {{e}^{tX}} \right)\le \left( \frac{-a{{e}^{tb}}+b{{e}^{ta}}}{b-a} \right)+\frac{\left( {{e}^{tb}}-{{e}^{ta}} \right)}{b-a}E\left( X \right)\] Since \(\mu =E\left( X \right)=0\), therefore \[E\left( {{e}^{tX}} \right)\le \left( \frac{-a{{e}^{tb}}+b{{e}^{ta}}}{b-a} \right)=\frac{{{e}^{ta}}\left( b-a{{e}^{t\left( b-a \right)}} \right)}{b-a}\]

Let’s define \[{{e}^{g\left( t \right)}}=\frac{{{e}^{ta}}\left( b-a{{e}^{t\left( b-a \right)}} \right)}{b-a}\]

Taking \(\log\) on both sides:

\[\log \left( {{e}^{g\left( t \right)}} \right)=\log \left( \frac{{{e}^{ta}}\left( b-a{{e}^{t\left( b-a \right)}} \right)}{b-a} \right)\]

\[g\left( t \right)=\log \left( {{e}^{ta}}\left( b-a{{e}^{t\left( b-a \right)}} \right) \right)-\log \left( b-a \right)\]

\[g\left( t \right)=\log \left( {{e}^{ta}} \right)+\log \left( b-a{{e}^{t\left( b-a \right)}} \right)-\log \left( b-a \right)\]

\[g\left( t \right)=ta+\log \left( b-a{{e}^{t\left( b-a \right)}} \right)-\log \left( b-a \right)\]
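Before moving on, here is a quick numerical check (an illustration under an assumed example, not a proof) that \(E\left( e^{tX} \right)\le e^{g(t)}\) for a zero-mean bounded variable; \(X\sim\) Uniform\((-1,1)\) is an arbitrary choice with \(a=-1\), \(b=1\).

```python
# Numerical check (illustration only) of E[e^{tX}] <= e^{g(t)} = e^{ta} (b - a e^{t(b-a)}) / (b - a)
# for a zero-mean X bounded in (a, b); here X ~ Uniform(-1, 1) is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(5)
a, b = -1.0, 1.0
x = rng.uniform(a, b, size=1_000_000)   # E(X) = 0

for t in [0.5, 1.0, 2.0]:
    mgf = np.mean(np.exp(t * x))
    e_g = np.exp(t * a) * (b - a * np.exp(t * (b - a))) / (b - a)
    print(f"t={t}: E[e^(tX)] ~ {mgf:.5f} <= e^(g(t)) = {e_g:.5f}")
```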

Taylor series expansion

For a univariate function \(g(x)\) that is infinitely differentiable, the Taylor series expansion about \({{x}_{0}}\) is: \[g(x)=g({{x}_{0}})+{{g}^{(1)}}({{x}_{0}})(x-{{x}_{0}})+\frac{1}{2!}{{g}^{(2)}}({{x}_{0}}){{(x-{{x}_{0}})}^{2}}+\cdots +\frac{1}{(m-1)!}{{g}^{(m-1)}}({{x}_{0}}){{(x-{{x}_{0}})}^{m-1}}+\frac{1}{m!}{{g}^{(m)}}({{x}_{0}}){{(x-{{x}_{0}})}^{m}}+\cdots \] For a univariate function \(g(x)\) that is \(m\) times differentiable, Taylor's theorem with the Lagrange remainder gives the exact expression: \[g(x)=g({{x}_{0}})+{{g}^{(1)}}({{x}_{0}})(x-{{x}_{0}})+\frac{1}{2!}{{g}^{(2)}}({{x}_{0}}){{(x-{{x}_{0}})}^{2}}+\cdots +\frac{1}{(m-1)!}{{g}^{(m-1)}}({{x}_{0}}){{(x-{{x}_{0}})}^{m-1}}+\frac{1}{m!}{{g}^{(m)}}(\xi ){{(x-{{x}_{0}})}^{m}}\] where \({{g}^{(s)}}={{\left. \frac{{{d}^{s}}g(x)}{d{{x}^{s}}} \right|}_{x={{x}_{0}}}}\) (with the last derivative \({{g}^{(m)}}\) evaluated at \(\xi\)) and \(\xi\) lies between \(x\) and \({{x}_{0}}\).
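As a small worked illustration of the Lagrange-remainder form (an assumed example, not from the original notes): take \(g(x)=e^{x}\), \({{x}_{0}}=0\), and \(m=3\); the remainder is \(g'''(\xi)x^3/3!\) for some \(\xi\in(0,x)\), so for \(x>0\) the truncation error of the degree-2 polynomial is at most \(e^{x}x^3/6\).

```python
# Numerical illustration (assumed example): degree-2 Taylor polynomial of exp(x) at x0 = 0,
# with the Lagrange remainder bound |exp(x) - P2(x)| <= max_{xi in [0,x]} |g'''(xi)| * x^3/3! = e^x * x^3/6.
import numpy as np

for x in [0.1, 0.5, 1.0]:
    p2 = 1 + x + x**2 / 2                       # g(0) + g'(0)*x + g''(0)*x^2/2!
    error = abs(np.exp(x) - p2)
    remainder_bound = np.exp(x) * x**3 / 6
    print(f"x={x}: |exp(x) - P2(x)| = {error:.6f} <= bound {remainder_bound:.6f}")
```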

Proof (part-B)

Now, let’s evaluate \(g\left( t \right)\) at \(t=0\): \[g\left( 0 \right)=0+\log \left( b-a{{e}^{0}} \right)-\log \left( b-a \right)=\log \left( b-a \right)-\log \left( b-a \right)=0\]

Let’s evaluate \({g}'\left( t=0 \right)\), but first find \({g}'\left( t \right)\). \[{g}'\left( t \right)=\frac{d}{dt}\left( ta+\log \left( b-a{{e}^{t\left( b-a \right)}} \right)-\log \left( b-a \right) \right)=a+\frac{d\log \left( b-a{{e}^{t\left( b-a \right)}} \right)}{d\left( b-a{{e}^{t\left( b-a \right)}} \right)}\frac{d\left( b-a{{e}^{t\left( b-a \right)}} \right)}{dt}+\underbrace{\frac{d\left( -\log \left( b-a \right) \right)}{dt}}_{0}\] \[{g}'\left( t \right)=a+\frac{-a\left( b-a \right){{e}^{t\left( b-a \right)}}}{b-a{{e}^{t\left( b-a \right)}}}\] Consider the second term; multiplying numerator and denominator by \({{e}^{-t\left( b-a \right)}}\): \[\frac{-a\left( b-a \right){{e}^{t\left( b-a \right)}}}{b-a{{e}^{t\left( b-a \right)}}}=\frac{-a\left( b-a \right)}{\left( b-a{{e}^{t\left( b-a \right)}} \right){{e}^{-t\left( b-a \right)}}}=\frac{-a\left( b-a \right)}{b{{e}^{-t\left( b-a \right)}}-a}=\frac{a\left( b-a \right)}{a-b{{e}^{-t\left( b-a \right)}}}\] \[{g}'\left( t \right)=a+\frac{a\left( b-a \right)}{a-b{{e}^{-t\left( b-a \right)}}}\] Now let’s evaluate \({g}'\left( t \right)\) at \(t=0\): \[{g}'\left( 0 \right)=a+\frac{a\left( b-a \right)}{a-b}=a-a=0\]

Now let’s compute \({g}''\left( t \right)\).

\[{g}''\left( t \right)=\frac{d}{dt}\left( a+\frac{a\left( b-a \right)}{a-b{{e}^{-t\left( b-a \right)}}} \right)=\frac{d}{dt}\left( \frac{a\left( b-a \right)}{a-b{{e}^{-t\left( b-a \right)}}} \right)\]

\[{g}''\left( t \right)=\frac{-a\left( b-a \right)\,b\left( b-a \right){{e}^{-t\left( b-a \right)}}}{{{\left( a-b{{e}^{-t\left( b-a \right)}} \right)}^{2}}}=\frac{-ab{{\left( b-a \right)}^{2}}{{e}^{-t\left( b-a \right)}}}{{{\left( a-b{{e}^{-t\left( b-a \right)}} \right)}^{2}}}=\frac{-ab{{\left( b-a \right)}^{2}}{{e}^{t\left( b-a \right)}}}{{{\left( b-a{{e}^{t\left( b-a \right)}} \right)}^{2}}}\] where the last step multiplies numerator and denominator by \({{e}^{2t\left( b-a \right)}}\).

Since \(\mu =E\left( X \right)=0\) lies in \(\left( a,b \right)\), we have \(a\le 0\le b\). Now we can compare two terms using the fact that \({{\left( x+y \right)}^{2}}\ge 4xy\) for any real \(x,y\) (because \({{\left( x-y \right)}^{2}}\ge 0\)), applied with \(x=b\) and \(y=-a{{e}^{t\left( b-a \right)}}\):

\[{{\left( b-a{{e}^{t\left( b-a \right)}} \right)}^{2}}={{\left( b+\left( -a \right){{e}^{t\left( b-a \right)}} \right)}^{2}}\ge 4b\left( -a \right){{e}^{t\left( b-a \right)}}=-4ab\,{{e}^{t\left( b-a \right)}}\]

If \(ab=0\) then \({g}''\left( t \right)=0\) and the bound below holds trivially; otherwise \(-4ab>0\) and

\[\frac{1}{{{\left( b-a{{e}^{t\left( b-a \right)}} \right)}^{2}}}\le \frac{1}{-4ab\,{{e}^{t\left( b-a \right)}}}\]

From the above inequality, and since the numerator \(-ab{{\left( b-a \right)}^{2}}{{e}^{t\left( b-a \right)}}\ge 0\), we can write:

\[{g}''\left( t \right)=\frac{-ab{{\left( b-a \right)}^{2}}{{e}^{t\left( b-a \right)}}}{{{\left( b-a{{e}^{t\left( b-a \right)}} \right)}^{2}}}\le \frac{-ab{{\left( b-a \right)}^{2}}{{e}^{t\left( b-a \right)}}}{-4ab\,{{e}^{t\left( b-a \right)}}}\]

\[{g}''\left( t \right)\le \frac{{{\left( b-a \right)}^{2}}}{4}\]

Now, with Taylor series expansion we have:

\[g\left( t \right)=g\left( 0 \right)+t{g}'\left( 0 \right)+\frac{1}{2!}{{t}^{2}}{{g}'}'\left( 0 \right)+\cdots\]

Using Taylor's theorem with the Lagrange remainder (this is exact, not an approximation), for some \(\xi\) between \(0\) and \(t\) we can write:

\[g\left( t \right)=g\left( 0 \right)+t{g}'\left( 0 \right)+\frac{1}{2!}{{t}^{2}}{{g}'}'\left( \xi \right)=\frac{1}{2!}{{t}^{2}}{{g}'}'\left( \xi \right)\le \frac{1}{2!}{{t}^{2}}\frac{{{\left( b-a \right)}^{2}}}{4}\]

\[g\left( t \right)\le \frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}\]
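As a quick numerical check of this inequality (an illustration for one assumed choice of \(a\) and \(b\), not a proof), we can evaluate \(g(t)\) on a grid and compare it with \(t^2(b-a)^2/8\).

```python
# Numerical check (illustration only) that g(t) = t*a + log(b - a*e^{t(b-a)}) - log(b - a)
# stays below t^2 * (b - a)^2 / 8 on a grid, for an arbitrary choice a = -1, b = 2 (so a <= 0 <= b).
import numpy as np

a, b = -1.0, 2.0
t = np.linspace(0, 3, 301)
g = t * a + np.log(b - a * np.exp(t * (b - a))) - np.log(b - a)
bound = t**2 * (b - a)**2 / 8

print("max of g(t) - bound over the grid:", np.max(g - bound))   # should be <= 0
```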

Proof (part-C)

We have the bound \(E\left[ {{e}^{tX}} \right]\le {{e}^{g\left( t \right)}}\) and \({{e}^{g\left( t \right)}}\le {{e}^{\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}\).

Consider, for any \(t>0\) (the map \(x\mapsto {{e}^{tx}}\) is increasing): \[P\left( X>\varepsilon \right)=P\left( {{e}^{tX}}>{{e}^{t\varepsilon }} \right)\]

And now with Markov's inequality:

\[P\left( {{e}^{tX}}>{{e}^{t\varepsilon }} \right)\le \frac{E\left( {{e}^{tX}} \right)}{{{e}^{t\varepsilon }}}={{e}^{-t\varepsilon }}E\left( {{e}^{tX}} \right)\le {{e}^{-t\varepsilon }}{{e}^{\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}={{e}^{-t\varepsilon +\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}\]

Now we can make the bound sharper by optimizing over \(t\): since the inequality holds for every \(t\ge 0\),

\[P\left( X>\varepsilon \right)\le \underset{t\ge 0}{\mathop{\inf }}\,\frac{E\left( {{e}^{tX}} \right)}{{{e}^{t\varepsilon }}}\le \underset{t\ge 0}{\mathop{\inf }}\,{{e}^{-t\varepsilon +\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}\]

Proof of the sharper version with Chernoff’s method

Let’s define \(u\left( t \right)=t\varepsilon -\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}\), so the bound is \({{e}^{-u\left( t \right)}}\) and minimizing the bound means maximizing \(u\left( t \right)\). Setting the first-order condition \({u}'\left( t \right)=\varepsilon -\frac{2t{{\left( b-a \right)}^{2}}}{8}\overset{set}{\mathop{=}}\,0\) gives \({{t}^{*}}=\frac{4\varepsilon }{{{\left( b-a \right)}^{2}}}\); substituting:

\[{{u}_{\max }}={{t}^{*}}\varepsilon -\frac{{{t}^{*2}}{{\left( b-a \right)}^{2}}}{8}=\varepsilon \frac{4\varepsilon }{{{\left( b-a \right)}^{2}}}-{{\left( \frac{4\varepsilon }{{{\left( b-a \right)}^{2}}} \right)}^{2}}\frac{{{\left( b-a \right)}^{2}}}{8}\]

\[{{u}_{\max }}=\frac{4{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}=\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}\]

The reason we maximize \(u\left( t \right)\) is to obtain the sharpest version of the inequality: the bound \({{e}^{-u\left( t \right)}}\) holds for every \(t\ge 0\), so we may use the smallest such bound. Now substituting:

\[P\left( X>\varepsilon \right)=P\left( {{e}^{X}}>{{e}^{\varepsilon }} \right)=P\left( {{e}^{tX}}>{{e}^{t\varepsilon }} \right)\le {{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\]

\[P\left( \left| X \right|>\varepsilon \right)\le 2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\]
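To see the optimization step concretely, here is a tiny numeric sketch (an assumed example, not part of the derivation) confirming that \(t^{*}=4\varepsilon/(b-a)^2\) maximizes \(u(t)\) and that \(u_{\max}=2\varepsilon^2/(b-a)^2\).

```python
# Numerical illustration (assumed example) that t* = 4*eps/(b-a)^2 maximizes
# u(t) = t*eps - t^2*(b-a)^2/8, i.e. it minimizes the bound e^{-u(t)}.
import numpy as np

a, b, eps = -1.0, 1.0, 0.5              # arbitrary illustrative values
t = np.linspace(0, 5, 100_001)
u = t * eps - t**2 * (b - a)**2 / 8

t_star_grid = t[np.argmax(u)]
t_star_formula = 4 * eps / (b - a)**2
print(f"grid maximizer ~ {t_star_grid:.4f}, formula 4*eps/(b-a)^2 = {t_star_formula:.4f}")
print(f"u_max ~ {u.max():.4f}, formula 2*eps^2/(b-a)^2 = {2 * eps**2 / (b - a)**2:.4f}")
```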

This is a very important result, because the bound involves neither the mean nor the variance of \(X\). It holds for any bounded random variable, even when the exact form of the distribution is unknown.

Proof for a random variable with nonzero mean

Now we can allow a nonzero mean: let \(\mu =E\left( X \right)\) and \(Y=X-\mu\), so \(a-\mu <Y<b-\mu\) and \(E\left( Y \right)=0\). Then:

\[P\left( \left| Y \right|>\varepsilon \right)=P\left( \left| X-\mu \right|>\varepsilon \right)\le 2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-\mu -a+\mu \right)}^{2}}}}}=2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\]

So \(P\left( \left| X-\mu \right|>\varepsilon \right)\le 2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\), which is known as Hoeffding’s inequality. It shows that the probability that the random variable deviates from its mean by more than \(\varepsilon\) is upper bounded by \(2{{e}^{-\frac{2{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\). This holds for any bounded random variable, so it is a very powerful generalization.
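As a quick Monte Carlo illustration (not a proof), we can check the inequality for a single bounded variable; Uniform(0, 1) is an arbitrary choice with \(a=0\), \(b=1\). For a single observation the bound is quite loose (here it even exceeds 1); it becomes powerful for sample means, as shown in the next part.

```python
# Monte Carlo check (illustration only) of Hoeffding's inequality
# P(|X - mu| > eps) <= 2*exp(-2*eps^2/(b-a)^2), here for X ~ Uniform(0, 1) (a = 0, b = 1).
import numpy as np

rng = np.random.default_rng(6)
a, b = 0.0, 1.0
x = rng.uniform(a, b, size=1_000_000)
mu = x.mean()

for eps in [0.2, 0.3, 0.4]:
    empirical = np.mean(np.abs(x - mu) > eps)
    bound = 2 * np.exp(-2 * eps**2 / (b - a)**2)
    print(f"eps={eps}: P(|X-mu|>eps) ~ {empirical:.5f} <= bound {bound:.5f}")
```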

Proof of the bound for the sample mean

Let’s define \({{\bar{Y}}_{n}}={{n}^{-1}}\sum\limits_{i=1}^{n}{{{Y}_{i}}}\), where the \({{Y}_{i}}\) are i.i.d., bounded in \(\left( a,b \right)\), and have \(E\left( {{Y}_{i}} \right)=0\). Let’s bound its tail as:

\[P\left( {{{\bar{Y}}}_{n}}>\varepsilon \right)=P\left( {{n}^{-1}}\sum\limits_{i=1}^{n}{{{Y}_{i}}}>\varepsilon \right)=P\left( \sum\limits_{i=1}^{n}{{{Y}_{i}}}>n\varepsilon \right)=P\left( {{e}^{\sum\limits_{i=1}^{n}{{{Y}_{i}}}}}>{{e}^{n\varepsilon }} \right)=P\left( {{e}^{t\sum\limits_{i=1}^{n}{{{Y}_{i}}}}}>{{e}^{tn\varepsilon }} \right)\]

Note that we introduce \(t>0\) here for flexibility, so that later we can choose \(t\) to tighten the bound. Now, with Markov's inequality and the i.i.d. assumption on the \({{Y}_{i}}\), we can write:

\[P\left( {{{\bar{Y}}}_{n}}>\varepsilon \right)=P\left( {{e}^{t\sum\limits_{i=1}^{n}{{{Y}_{i}}}}}>{{e}^{tn\varepsilon }} \right)\le {{e}^{-tn\varepsilon }}E\left[ {{e}^{t\sum\limits_{i=1}^{n}{{{Y}_{i}}}}} \right]={{e}^{-tn\varepsilon }}{{\left( E{{e}^{t{{Y}_{i}}}} \right)}^{n}}\]

Since we have the bound \(E\left[ {{e}^{t{{Y}_{i}}}} \right]\le {{e}^{g\left( t \right)}}\le {{e}^{\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}\) (each \({{Y}_{i}}\) is bounded in \(\left( a,b \right)\) with mean zero), therefore,

\[P\left( {{{\bar{Y}}}_{n}}>\varepsilon \right)\le {{e}^{-tn\varepsilon }}{{\left( E{{e}^{t{{Y}_{i}}}} \right)}^{n}}\le {{e}^{-tn\varepsilon }}{{e}^{n\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}\]

Let’s again find the sharpest bound: take \(u\left( t \right)=tn\varepsilon -n\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}\). The first-order condition is \({u}'\left( t \right)=n\varepsilon -n\frac{2t{{\left( b-a \right)}^{2}}}{8}\overset{set}{\mathop{=}}\,0\), and solving we get \[{{t}^{*}}=\frac{4\varepsilon }{{{\left( b-a \right)}^{2}}}\] Plugging the value of \({{t}^{*}}\) into \(u\left( t \right)\) gives:

\[{{u}_{\max }}={{t}^{*}}n\varepsilon -n\frac{{{t}^{*2}}{{\left( b-a \right)}^{2}}}{8}=\frac{4\varepsilon }{{{\left( b-a \right)}^{2}}}n\varepsilon -n\frac{{{\left( 4\varepsilon \right)}^{2}}}{{{\left( b-a \right)}^{4}}}\frac{{{\left( b-a \right)}^{2}}}{8}=\frac{4n{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}-\frac{2n{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}=\frac{2n{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}\]

Then,

\[P\left( {{{\bar{Y}}}_{n}}>\varepsilon \right)\le \underset{t\ge 0}{\mathop{\inf }}\,{{e}^{-tn\varepsilon }}{{e}^{n\frac{{{t}^{2}}{{\left( b-a \right)}^{2}}}{8}}}={{e}^{\frac{-2n{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\]

Then,

\[P\left( \left| {{{\bar{Y}}}_{n}} \right|>\varepsilon \right)\le 2{{e}^{-\frac{2n{{\varepsilon }^{2}}}{{{\left( b-a \right)}^{2}}}}}\]

Hence, this gives the bound for the sample mean.
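A short Monte Carlo sketch (illustration only; the distribution, \(n\), and \(\varepsilon\) are arbitrary choices) shows how much sharper the bound becomes for the sample mean.

```python
# Monte Carlo check (illustration only) of P(|Ybar_n| > eps) <= 2*exp(-2*n*eps^2/(b-a)^2),
# with Y_i i.i.d. Uniform(-0.5, 0.5), so a = -0.5, b = 0.5 and E(Y_i) = 0.
import numpy as np

rng = np.random.default_rng(7)
a, b, n, reps, eps = -0.5, 0.5, 100, 100_000, 0.1
ybar = rng.uniform(a, b, size=(reps, n)).mean(axis=1)

empirical = np.mean(np.abs(ybar) > eps)
bound = 2 * np.exp(-2 * n * eps**2 / (b - a)**2)
print(f"P(|Ybar_n|>{eps}) ~ {empirical:.5f} <= bound {bound:.5f}")
```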

Proof for the Bernoulli case

Consider Hoeffding’s inequality for \({{X}_{1}},\cdots ,{{X}_{n}}\sim Ber\left( p \right)\) i.i.d. Each \({{X}_{i}}\) is bounded below by \(0\) and above by \(1\), so \({{\left( b-a \right)}^{2}}=1\); applying the bound to \({{Y}_{i}}={{X}_{i}}-p\) (which has mean zero), Hoeffding’s inequality gives \[P\left( \left| {{{\bar{X}}}_{n}}-p \right|>\varepsilon \right)\le 2{{e}^{-2n{{\varepsilon }^{2}}}}\]
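For instance, a coin-flip sketch (illustration only; \(p\), \(n\), and \(\varepsilon\) are arbitrary values) can confirm the Bernoulli bound.

```python
# Monte Carlo check (illustration only) of P(|Xbar_n - p| > eps) <= 2*exp(-2*n*eps^2)
# for X_i ~ Ber(p) i.i.d.; a Binomial count divided by n is the sample mean of n Bernoulli draws.
import numpy as np

rng = np.random.default_rng(8)
p, n, reps, eps = 0.3, 200, 500_000, 0.05   # arbitrary illustrative values
xbar = rng.binomial(n, p, size=reps) / n

empirical = np.mean(np.abs(xbar - p) > eps)
bound = 2 * np.exp(-2 * n * eps**2)
print(f"P(|Xbar_n - p|>{eps}) ~ {empirical:.5f} <= bound {bound:.5f}")
```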

Theorem 5: Kullback-Leibler Distance

Proof that the distance between two densities is greater than or equal to zero

We prove that the distance between any two densities \(p\) and \(q\), where the random variable is \(X\sim p\) (\(p\) is some distribution), is always greater than or equal to zero.

Before answering this, let’s quickly note two inequalities, namely the Cauchy-Schwarz inequality and Jensen’s inequality.

Cauchy-Schwarz Inequality

\(\left| EXY \right|\le E\left| XY \right|\le \sqrt{E\left( {{X}^{2}} \right)}\sqrt{E\left( {{Y}^{2}} \right)}\).

Jensen’s inequality

If \(g\) is convex, then \(Eg\left( X \right)\ge g\left( EX \right)\). If \(g\) is concave, then \(Eg\left( X \right)\le g\left( EX \right)\).

Kullback-Leibler Distance

The Kullback-Leibler distance between two densities \(p\) and \(q\) is defined as:

\[D\left( p,q \right)=\int{p\left( x \right)\log \left( \frac{p\left( x \right)}{q\left( x \right)} \right)}dx\]

Before I move ahead, note that the self-distance of a density \(p\) to itself is zero: \(D\left( p,p \right)=0\). By definition a distance is always greater than or equal to zero, so one thing we have to confirm is that the distance between two densities \(p\) and \(q\) satisfies \(D\left( p,q \right)\ge 0\). We will use Jensen’s inequality to prove \(D\left( p,q \right)\ge 0\). Since the \(\log\) function is concave, Jensen’s inequality gives:

\[-D\left( p,q \right)=\int{p\left( x \right)\log \left( \frac{q\left( x \right)}{p\left( x \right)} \right)}dx=E\left[ \log \left( \frac{q\left( X \right)}{p\left( X \right)} \right) \right]\le \log \left[ E\left( \frac{q\left( X \right)}{p\left( X \right)} \right) \right]=\log \int{p\left( x \right)\frac{q\left( x \right)}{p\left( x \right)}dx}=\log \int{q\left( x \right)dx}=\log \left( 1 \right)=0\]

i.e

\[-D\left( p,q \right)\le 0\] i.e. \[D\left( p,q \right)\ge 0\]
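As a small numerical illustration (an assumed discrete example, not part of the proof), we can compute \(D(p,q)\) for two distributions on a finite support and confirm that it is non-negative, zero for \(D(p,p)\), and not symmetric.

```python
# Numerical illustration (assumed example) that D(p, q) >= 0, D(p, p) = 0, and D(p, q) != D(q, p),
# for two arbitrary discrete distributions on a finite support.
import numpy as np

def kl(p, q):
    """Kullback-Leibler distance D(p, q) = sum_x p(x) * log(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

print("D(p, q) =", kl(p, q))   # >= 0
print("D(q, p) =", kl(q, p))   # >= 0, generally different from D(p, q)
print("D(p, p) =", kl(p, p))   # == 0
```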

Theorem 6: Maximum of a random variable

Let \({{X}_{1}},\ldots ,{{X}_{n}}\) be random variables. Suppose there exists \(\sigma >0\) such that \(E\left( {{e}^{t{{X}_{i}}}} \right)\le {{e}^{\frac{{{t}^{2}}{{\sigma }^{2}}}{2}}}\) for all \(t>0\). Then \(E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \sigma \sqrt{2\log n}\).

How do we bound the maximum value of a collection of random variables? Say the random variables \({{X}_{1}},\ldots ,{{X}_{n}}\) are arranged in ascending order as \({{X}_{\left( 1 \right)}}\le {{X}_{\left( 2 \right)}}\le \ldots \le {{X}_{\left( n \right)}}\), so that \({{X}_{\left( n \right)}}=\max \left\{ {{X}_{1}},\cdots ,{{X}_{n}} \right\}\). How do we compute, or at least bound, this maximum? The interesting thing is that, even if we don't know the exact distribution of the \({{X}_{i}}\), we can still say something in general about the expected maximum.

Start with the moment generating function \(E{{e}^{t{{X}_{i}}}}\); by assumption it is bounded as \(E{{e}^{t{{X}_{i}}}}\le {{e}^{\frac{{{t}^{2}}{{\sigma }^{2}}}{2}}}\). Can we then bound \(E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\)?

Let’s start with \(E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\), multiply it by \(t>0\), and exponentiate, i.e. consider \(\exp \left\{ tE\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}} \right\}\); bounding this is equivalent to bounding \(E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\). Now, with Jensen’s inequality (since \({{e}^{x}}\) is convex), we can write:

\[{{e}^{tE\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}}}\le E{{e}^{t\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}}}=E\underset{1\le i\le n}{\mathop{\max }}\,{{e}^{t{{X}_{i}}}}\le \sum\limits_{i=1}^{n}{E{{e}^{t{{X}_{i}}}}}\le n{{e}^{\frac{{{t}^{2}}{{\sigma }^{2}}}{2}}}\] \[{{e}^{tE\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}}}\le n{{e}^{\frac{{{t}^{2}}{{\sigma }^{2}}}{2}}}\] Taking \(\log\) on both sides, \[tE\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \log n+\frac{{{t}^{2}}{{\sigma }^{2}}}{2}\] Dividing both sides by \(t>0\), we get \[E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \frac{\log n}{t}+\frac{t{{\sigma }^{2}}}{2}\] Now let’s optimize \(\frac{\log n}{t}+\frac{t{{\sigma }^{2}}}{2}\) w.r.t. \(t\): setting the derivative to zero gives \(\log n=\frac{{{t}^{2}}{{\sigma }^{2}}}{2}\), i.e. \(t={{\sigma }^{-1}}\sqrt{2\log n}\). Plugging in this value, we get:

\[E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \frac{\log n}{t}+\frac{t{{\sigma }^{2}}}{2}=\frac{2\log n+{{t}^{2}}{{\sigma }^{2}}}{2t}=\frac{2\log n+{{\left( {{\sigma }^{-1}}\sqrt{2\log n} \right)}^{2}}{{\sigma }^{2}}}{2{{\sigma }^{-1}}\sqrt{2\log n}}\] \[E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \frac{2\log n+{{\sigma }^{-2}}2\log n{{\sigma }^{2}}}{2{{\sigma }^{-1}}\sqrt{2\log n}}=\frac{2\sqrt{2}\sqrt{2}\log n}{2{{\sigma }^{-1}}\sqrt{2}\sqrt{\log n}}=\sigma \sqrt{2\log n}\] Hence \[E\underset{1\le i\le n}{\mathop{\max }}\,{{X}_{i}}\le \sigma \sqrt{2\log n}\]
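As a closing illustration (not part of the proof), i.i.d. \(N(0,\sigma^2)\) variables satisfy \(E\left( e^{tX_i} \right)=e^{t^2\sigma^2/2}\), so we can compare the empirical expected maximum with \(\sigma\sqrt{2\log n}\); the theorem does not require independence, the Gaussian i.i.d. example is just a convenient assumed check.

```python
# Monte Carlo illustration (assumed example): for X_1,...,X_n i.i.d. N(0, sigma^2),
# which satisfy E[e^{tX_i}] = e^{t^2 sigma^2 / 2}, compare E[max_i X_i] with sigma*sqrt(2*log n).
import numpy as np

rng = np.random.default_rng(9)
sigma, reps = 1.0, 10_000

for n in [10, 100, 1000]:
    x = sigma * rng.standard_normal((reps, n))
    emp_max = x.max(axis=1).mean()        # Monte Carlo estimate of E[max_i X_i]
    bound = sigma * np.sqrt(2 * np.log(n))
    print(f"n={n}: E[max X_i] ~ {emp_max:.4f} <= sigma*sqrt(2*log n) = {bound:.4f}")
```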
