Probability and Statistics
Here we present the AI-relevant mathematical background in Probability and Statistics.
- Abstract
- Events and Random Variables
- Distributions
- Statistics and distributions
- Conditional probability
- Expectation and Parameter estimation
- Sampling methods
- Information theory
- Summary
Abstract
- There are many online books and courses on probability; for example, here.
Events and Random Variables
Events
- Simple event = one possible outcome of an experiment; Complex event = a group of possible outcomes of an experiment.
Distributions
- A more fundamental discussion of randomness can be found here.
- The pdf and the cdf are related via $f(x)=\frac{d}{dx}F(x)$, and they are used, for example, in Sampling.
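A minimal numerical sketch of the pdf/cdf relation above (assuming NumPy and SciPy are available; the standard normal is just an illustration):

```python
# Sketch: verify f(x) = d/dx F(x) numerically for the standard normal distribution.
import numpy as np
from scipy.stats import norm

x = np.linspace(-4, 4, 1001)
h = 1e-5
cdf_derivative = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)  # central difference of F
pdf = norm.pdf(x)                                               # f(x)
print(np.max(np.abs(cdf_derivative - pdf)))                     # tiny: f(x) ≈ dF/dx
```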
Statistics and distributions
- The range can be defined either as the interval [minimum value, maximum value] or as a single value describing it, e.g. the mean of that interval.
- Here, for example, is a source about the Poisson distribution (which fits a specific type of variable: counts of events).
- The example with dice was taken from here.
- Unlike the dice example (which goes from a sum of 1 variable to a sum of 2 variables, and so on), we can alternatively calculate the distribution of $X+Y$ by convolving the distributions of $X$ and $Y$. Performing this convolution over and over derives the Gaussian curve, as the CLT states; see this here and the sketch after this list.
- More about the Weak vs. Strong Law of Large Numbers can be found here.
- Some sources for the “Bernoulli $\rightarrow$ Binomial $\rightarrow$ Normal distribution” derivation: here, here, here, and here. There is also the t-distribution, which generalizes the standard normal distribution to the case where the sample size is small.
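A small sketch tying the convolution and “Bernoulli $\rightarrow$ Binomial $\rightarrow$ Normal” bullets above together (assuming NumPy and SciPy; $n$ and $p$ are arbitrary illustration values): convolving a Bernoulli pmf with itself $n$ times yields the Binomial pmf, which approaches the normal curve as the CLT predicts.

```python
# Sketch: repeated convolution of a Bernoulli pmf gives the Binomial pmf,
# which approaches the normal density (de Moivre-Laplace / CLT).
import numpy as np
from scipy.stats import binom, norm

p, n = 0.3, 60
bernoulli_pmf = np.array([1 - p, p])          # pmf over {0, 1}

pmf = np.array([1.0])                         # distribution of an "empty" sum
for _ in range(n):
    pmf = np.convolve(pmf, bernoulli_pmf)     # add one more Bernoulli variable

k = np.arange(n + 1)
print(np.allclose(pmf, binom.pmf(k, n, p)))   # True: repeated convolution = Binomial
mu, sigma = n * p, np.sqrt(n * p * (1 - p))
print(np.max(np.abs(pmf - norm.pdf(k, mu, sigma))))  # small: Binomial ≈ Normal
```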
Conditional probability
- If $A\perp B$ then their complements, and any combination of them, are also independent: $\overline{A}\perp B$ ; $A\perp \overline{B}$ ; $\overline{A}\perp \overline{B}$.
- If $A$ and $B$ are mutually exclusive events (and $P(A),P(B)>0$), then $P(A\cap B) = 0 \ne P(A)P(B)$, which means $A,B$ are not independent (i.e. they are dependent). And vice versa: if they are independent, then $P(A\cap B) = P(A)P(B) \ne 0$, so they cannot be mutually exclusive.
- Note that everything discussed in this video applies both to events and to random variables, since they are equivalent, and both to discrete probabilities ($p$) and to continuous densities ($f$).
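A small simulation sketch of the two bullets above (assuming NumPy; the chosen events are illustrative only): independent events satisfy $P(A\cap B)\approx P(A)P(B)$, while mutually exclusive events do not.

```python
# Sketch: empirical check of independence vs. mutual exclusivity with two dice.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
die1 = rng.integers(1, 7, n)
die2 = rng.integers(1, 7, n)

# Independent events: A = "die1 is even", B = "die2 > 4"
A, B = (die1 % 2 == 0), (die2 > 4)
print((A & B).mean(), A.mean() * B.mean())   # both ~ 1/6: P(A∩B) = P(A)P(B)

# Mutually exclusive events on the same die: C = "die1 = 1", D = "die1 = 2"
C, D = (die1 == 1), (die1 == 2)
print((C & D).mean(), C.mean() * D.mean())   # 0.0 vs ~1/36: exclusive => dependent
```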
Expectation and Parameter estimation
Expectation
- A “Moment generating function” is defined as $M_X(t) = E[e^{tX}]$, whose derivatives at $t=0$ are the moments: $M_X^{(n)}(0) = E[X^n]$. This is similar to the relation between the cdf and the pdf. A symbolic check appears after this list.
- Another result on expectation is $E[E[X|Y]]=E[X]$, also referred to as the “Law of total expectation”. Writing $g(Y)=E[X|Y]$: $E[X] = \sum\limits_{y} \underbrace{E[X|Y=y]}_{g(y)}P(Y=y) = E[g(Y)] = E[E[X|Y]]$. Also, $E[h(X)Y|X] = h(X)E[Y|X]$.
- We will see more derivations later:
- $Var(X) = E[(X-E[X])^2] = E[X^2]-(E[X])^2$.
- For $X,Y$ independent variables: $P(X,Y)=P(X)P(Y)$ or $f(X,Y)=f(X)f(Y)$, and $E[X\cdot Y]=E[X]E[Y]$. See this independence also in the COVARIANCE definition in the linear regression slide.
- Also, for a new variable $Z=X+Y$: $E[Z]=E[X]+E[Y]$ and $Var(Z)=Var(X)+Var(Y)+2\cdot Cov(X,Y)$ (the covariance term vanishes when $X\perp Y$). For independent $X,Y$, $Z$ ~ convolution of $p(X)$ and $p(Y)$, or of $f(X)$ and $f(Y)$. See a proof here, and a simulation sketch after this list.
- Note that the expectation and (co)variance rules and characteristics are similar also for conditional probability, since the set of all possibilities simply shrinks from $\Omega$ to some other known event, i.e. a constraint over all possibilities.
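A symbolic sketch of the moment-generating-function bullet above (assuming SymPy is available; the standard normal, with $M_X(t)=e^{t^2/2}$, is just an illustration):

```python
# Sketch: derivatives of the MGF at t = 0 give the moments E[X^n]
# (standard normal example, M_X(t) = exp(t^2 / 2)).
import sympy as sp

t = sp.symbols('t')
M = sp.exp(t**2 / 2)                                       # MGF of N(0, 1)
moments = [sp.diff(M, t, n).subs(t, 0) for n in range(1, 5)]
print(moments)                                             # [0, 1, 0, 3] = E[X], ..., E[X^4]
```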
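A simulation sketch of the $Z=X+Y$ rules above (assuming NumPy; the chosen distributions are arbitrary illustrations):

```python
# Sketch: check E[Z] = E[X] + E[Y], Var(Z) = Var(X) + Var(Y) + 2 Cov(X, Y),
# and E[XY] = E[X]E[Y] for independent X and Y.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X = rng.exponential(2.0, n)          # arbitrary illustrative distributions
Y = rng.normal(1.0, 3.0, n)
Z = X + Y

print(Z.mean(), X.mean() + Y.mean())                  # E[Z] = E[X] + E[Y]
cov = np.cov(X, Y, ddof=0)[0, 1]
print(Z.var(), X.var() + Y.var() + 2 * cov)           # variance rule for a sum
print((X * Y).mean(), X.mean() * Y.mean())            # ≈ equal under independence
```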
Empirical versus Theoretical models
- Note that the theoretical parameters can also be calculated for the continuous case, by integrating against the pdf instead of summing (see the sketch after this list).
- See here why the empirical STD shown here is biased (a small simulation of this bias appears after this list).
- See here how estimation by relative occurrence (empirical frequency) is a solution of the MLE, under a normal-distribution assumption/prior and i.i.d. data samples.
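A symbolic sketch of the continuous-case computation mentioned in the first bullet above (assuming SymPy; the exponential distribution is just an illustration):

```python
# Sketch: theoretical mean and variance in the continuous case, E[X] = ∫ x f(x) dx.
import sympy as sp

x, lam = sp.symbols('x lambda', positive=True)
f = lam * sp.exp(-lam * x)                               # exponential pdf on [0, ∞)
mean = sp.integrate(x * f, (x, 0, sp.oo))                # 1/λ
var = sp.integrate((x - mean)**2 * f, (x, 0, sp.oo))     # 1/λ²
print(sp.simplify(mean), sp.simplify(var))
```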
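A simulation sketch of the bias mentioned above (assuming NumPy): averaging the plain $1/n$ variance estimator over many small samples underestimates the true variance, while Bessel's correction ($1/(n-1)$) does not (for the variance; the STD itself remains slightly biased).

```python
# Sketch: bias of the empirical (1/n) variance estimator on small samples.
import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0                                      # samples from N(0, 2^2)
samples = rng.normal(0.0, 2.0, size=(100_000, 5))   # many samples of size n = 5

biased = samples.var(axis=1, ddof=0).mean()     # divides by n      -> ~ (n-1)/n * 4 = 3.2
unbiased = samples.var(axis=1, ddof=1).mean()   # divides by n - 1  -> ~ 4.0
print(biased, unbiased, true_var)
```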
Parameter estimation
- More about the comparison of the three classification methods can be found here and here.
- Naive Bayes and Logistic regression can be compared via their graphical representations (Bayesian or Causal Networks); see the State Space section.