This summer I decided to get better at Statistics. My relationship with statistics was the kind where I could stare at it and maybe get an intuition of what was going on, but not much more than that. Anything that involved, for example, understanding how a Pearson correlation matrix is calculated and what the math behind it means gave me the chills, since my foundations were really weak.
So, this summer I decided to join the Master in Analysis and Engineering of Big Data at FCT NOVA - at least partially, since working at Feedzai still takes up most of my time. It started in mid-September and will end at the beginning of January.
The courses I’m taking are:
Multivariate Stats
You can find my online book of this course here: Multivariate Stats
The goal of this course is to familiarize students with inference on multivariate means and covariance matrices, as well as linear models for Gaussian populations and dimensionality reduction techniques. That knowledge is then applied to data discrimination and classification.
At this point, it has made me much more comfortable with matrix operations, as well as with methods' assumptions of normality.
For example, let’s talk a little bit about the determinant of a matrix.
Matrix Determinant
It was while studying this subject that I finally got a grasp of what the determinant of a matrix means, semantically - kudos to 3blue1brown for his amazing job explaining it!
Let me try to summarize this in a few lines of code and plots.
Assume that we have the following data with the following co-variance matrix:
# Libraries used throughout this post; plotly also provides the %>% pipe
library(ggplot2)
library(plotly)
set.seed(1)
# Two (mostly) unrelated variables: a Gaussian one and a chi-squared one
df <- data.frame(
  v1 = rnorm(20, 4, 2),
  v2 = rchisq(20, 2)
)
plot <- ggplot(df, aes(v1, v2))
plotly::ggplotly(plot + stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) + geom_point(colour = "white"))
dt.cov <- cov(df)
kableExtra::kable(round(dt.cov,2)) %>% kableExtra::kable_styling(position = "center")
|    | v1   | v2   |
|----|------|------|
| v1 | 3.34 | 0.08 |
| v2 | 0.08 | 2.86 |
We can interpret this matrix as:
- The first variable (\(v1\)) has a variance of 3.34
- The second variable (\(v2\)) has a variance of 2.86
- The first and second variable have a co-variance of 0.08
So, \(v1\) and \(v2\) do not vary much together. Another way to see this is to normalize the covariance by the square root of the product of the two variances - also known as the Pearson correlation:
\[ r(i,j) = \frac{S_{ij}}{\sqrt{S_{ii} \times S_{jj}}} \]
where:
- \(r(i,j)\) represents the correlation between variables \(i\) and \(j\),
- \(S_{ij}\) represents the covariance between variables \(i\) and \(j\),
- \(S_{ii}\) represents the covariance between variable \(i\) and itself - also known as the variance of variable \(i\) (the same goes for \(S_{jj}\))
You can think of this normalization as dividing by the combined spread of the two variables in standard-deviation units (hence the square roots of the variances) across both dimensions (hence the product).
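To make the formula concrete, here is the same normalization done by hand on the covariance matrix we just computed (a small sanity check; it should match the cor() output below):

# Pearson correlation between v1 and v2, straight from the covariance matrix
r_manual <- dt.cov["v1", "v2"] / sqrt(dt.cov["v1", "v1"] * dt.cov["v2", "v2"])
round(r_manual, 2)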
dt.cor <- cor(df)
kableExtra::kable(round(dt.cor,2)) %>% kableExtra::kable_styling(position = "center")
|    | v1   | v2   |
|----|------|------|
| v1 | 1.00 | 0.03 |
| v2 | 0.03 | 1.00 |
So we can confirm that, at least linearly, they are pretty much uncorrelated. Another way to say this is that these two variables together carry more information than either one alone.
Now we can visualize the determinant of this matrix as follows:
# drawMatrixWithDet() is a custom helper (not shown here) that draws the
# matrix's column vectors and the area/volume they span, i.e. its determinant
drawMatrixWithDet(dt.cor, dim(dt.cor)[1])
You can imagine the determinant (this area) as the "area of information" that the matrix contains.
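For a 2×2 correlation matrix this area has a simple closed form:

\[ \det \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix} = 1 \times 1 - r \times r = 1 - r^2 \]

With \(r = 0.03\) from the table above, the area is roughly \(1 - 0.03^2 \approx 1\): the two column vectors span (almost) the whole unit square, so together the two variables keep (almost) two full dimensions worth of information.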
Let’s see another example, now with variables a little more correlated:
set.seed(1)
v <- rnorm(20, 0, 2)
df2 <- data.frame(
  v1 = v,
  v2 = v * 0.3 + rnorm(20, 0, 0.3)  # v2 is a scaled copy of v1 plus some noise
)
plot <- ggplot(df2, aes(v1, v2))
plotly::ggplotly(plot + stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) + geom_point(colour = "white"))
Yup, that linear pattern really indicates some correlation!
dt2.cov <- cov(df2)
kableExtra::kable(round(dt2.cov,2)) %>% kableExtra::kable_styling(position = "center")
|    | v1   | v2   |
|----|------|------|
| v1 | 3.34 | 0.90 |
| v2 | 0.90 | 0.31 |
Hm… this covariance matrix is not very expressive about that. Let's check the correlation matrix, which will tell us right away!
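(The chunk below simply mirrors the earlier correlation chunk, this time on df2, to produce the table that follows.)

dt2.cor <- cor(df2)
kableExtra::kable(round(dt2.cor, 2)) %>% kableExtra::kable_styling(position = "center")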
|    | v1   | v2   |
|----|------|------|
| v1 | 1.00 | 0.89 |
| v2 | 0.89 | 1.00 |
And here it is! They seem to be pretty correlated!
So, how does the area of information of this matrix compare to the previous one?
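We can answer that with base R's det(). The values below are approximate, derived from the two correlation tables above:

det(dt.cor)    # ~ 1 - 0.03^2, essentially 1: the uncorrelated pair keeps almost the full area
det(cor(df2))  # ~ 1 - 0.89^2, roughly 0.21: the correlated pair spans a much smaller area

The more the two variables move together, the more their column vectors point in the same direction and the smaller the area they span - at perfect correlation it collapses onto a line and the determinant hits 0.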
And how is this useful?
Well, let's take an example in 3D now!
The data:
set.seed(1)
v <- rnorm(20, 0, 2)
df3 <- data.frame(
  v1 = v,
  v2 = v * 0.3 + abs(rnorm(20, 0, 0.3)),  # again a noisy linear function of v1
  v3 = rchisq(20, 4)                      # unrelated to the other two
)
a <- list(title = "v1 VS v2")
b <- list(title = "v3 VS v2")
c <- list(title = "v3 VS v1")
p1 <- plotly::ggplotly(ggplot(df3, aes(v1, v2)) + stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) + geom_point(colour = "white")) %>% layout(xaxis = a)
p2 <- plotly::ggplotly(ggplot(df3, aes(v3, v2)) + stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) + geom_point(colour = "white")) %>% layout(xaxis = b)
p3 <- plotly::ggplotly(ggplot(df3, aes(v3, v1)) + stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) + geom_point(colour = "white")) %>% layout(xaxis = c)
plotly::subplot(p1, p2, p3, nrows = 1, titleX = TRUE)
Now we have seen that there is a correlation between \(v1\) and \(v2\), but not so much between \(v3\) and \(v2\) or between \(v3\) and \(v1\):
dt3.cor <- cor(df3)
kableExtra::kable(round(dt3.cor,2)) %>% kableExtra::kable_styling(position = "center")
|    | v1    | v2    | v3    |
|----|-------|-------|-------|
| v1 | 1.00  | 0.98  | -0.29 |
| v2 | 0.98  | 1.00  | -0.34 |
| v3 | -0.29 | -0.34 | 1.00  |
And now let’s observe the determinant of this matrix:
drawMatrixWithDet(dt3.cor,dim(dt3.cor)[1])
(Please note that you can rotate the image above, as well as reset the axes, using the controls at the top right corner of the image.)
We can see that the volume of information of this matrix is almost a plane instead of a space! This is also pretty noticeable when you take into consideration the value of the determinant, which is 0.03.
What does this mean?
It means that the matrix has a column that is (almost) a linear combination of the others - or, seen another way, the matrix has more dimensions than necessary, and the same information could be represented with fewer of them.
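A quick way to check this numerically is through the eigenvalues of the correlation matrix, using base R's eigen(); the determinant is their product, and with the strong \(v1\)/\(v2\) correlation above, one of the eigenvalues should come out very close to zero:

ev <- eigen(dt3.cor)$values
ev        # one of the eigenvalues should be close to 0
prod(ev)  # the product of the eigenvalues equals det(dt3.cor), i.e. roughly 0.03

This is exactly the intuition behind dimensionality reduction techniques such as PCA: directions with (near-)zero eigenvalues add (almost) no information and can be dropped.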
Pretty cool, don’t you think?
Computational Stats
You can find my online book of this course here: Computational Stats
The goal of this course is to familiarize students with algorithms such as Newton-Raphson, Monte Carlo methods, resampling techniques (Bootstrap and Jackknife), sampling-resampling techniques, and iterative simulation (Markov Chain Monte Carlo, also known as the MCMC method).
At this point, for example, I'm much more comfortable with understanding the need for interval estimation and how confidence intervals are calculated!
Let's go through an example of using bootstrapping in machine learning to assess model performance.
Bootstrapping in Machine Learning
In machine learning it is usual to have a dataset that can be interpreted as a sample from a population. Whatever we do with a model, there will always be the following question:
If we had drawn another sample from the same population, how would my model behave?
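Here is a minimal sketch of the idea in R, reusing df2 from above with a plain linear model and RMSE as the metric (both choices are purely illustrative): repeatedly resample the rows with replacement, fit on each bootstrap sample, evaluate on the rows left out of that sample, and look at the spread of the scores.

set.seed(1)
n      <- nrow(df2)
n_boot <- 1000
rmse   <- rep(NA_real_, n_boot)
for (b in seq_len(n_boot)) {
  # Bootstrap sample: n rows drawn with replacement from the original data
  idx   <- sample(seq_len(n), size = n, replace = TRUE)
  train <- df2[idx, ]
  # Out-of-bag rows (never drawn in this iteration) act as a small test set
  oob   <- df2[-unique(idx), ]
  if (nrow(oob) == 0) next
  fit     <- lm(v2 ~ v1, data = train)
  pred    <- predict(fit, newdata = oob)
  rmse[b] <- sqrt(mean((oob$v2 - pred)^2))
}
# The spread of the bootstrap scores approximates how the model's performance
# would vary over other samples from the same population, e.g. a 95% interval:
quantile(rmse, probs = c(0.025, 0.5, 0.975), na.rm = TRUE)

On average each row is left out of roughly a third of the bootstrap samples, which is what makes this out-of-bag style of evaluation possible.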
All in all
All in all, I believe this is being really useful and I'm learning a lot - more importantly, I'm getting familiar with a lot of terms and concepts in both multivariate stats and computational stats, which is awesome! Both courses will end at the beginning of January 2019, so I'm really excited to keep learning while this lasts!