bilha analytics

Multicollinearity and VIF

The general setup for your regression problem may look something like below. The model coefficients ($\beta_i$) can then be interpreted as the amount of change in your dependent variable $y$ that results from a unit change in the corresponding predictor variable ($X_i$). A problem arises when there are significant correlations between your predictor variables, so that a change in one such variable affects not only $y$ but also the other correlated predictors. This distorts the estimated coefficients and makes their interpretation difficult. This is the problem of multicollinearity.

\[y = \beta_0 + \sum_{i=1}^{p} \beta_i X_{i} + \epsilon\]
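To see the effect, here is a minimal simulation (not from the original article) in which one predictor is almost a copy of the other; the fit itself is fine, but the individual coefficients come with large standard errors.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# x2 is nearly a copy of x1, so the two predictors are highly correlated
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 2 * x1 + 3 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# Predictions and R^2 look good, but the coefficient standard errors are inflated
print(fit.params)
print(fit.bse)
```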

Implications of multicollinearity: Multicollinearity may not affect model accuracy much; it is mainly a concern when interpreting model coefficients. If you need to speak to the importance of a feature in a linear model, then multicollinearity is something to watch out for. Affected models include linear regression and SVMs with a linear kernel.

Variance Inflation Factor (VIF) is one way to quantify multicollinearity. It measures how much the variance (and thus the standard error) of a model coefficient is inflated. The VIF of the $i$th coefficient is computed as below, where $R^2_i$ is the $R^2$ of the model obtained by regressing the $i$th predictor variable on the remaining predictor variables.

\[VIF_{i} = \frac{1}{1 - R^2_i}\]
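In practice you rarely fit the $p$ auxiliary regressions by hand. Below is a minimal sketch using statsmodels' `variance_inflation_factor`, assuming your predictors live in a pandas DataFrame `X`; the `vif_table` helper name is just for illustration.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Return the VIF of each predictor column in X."""
    # statsmodels expects the full design matrix, including the intercept
    design = sm.add_constant(X)
    vifs = {
        col: variance_inflation_factor(design.values, i)
        for i, col in enumerate(design.columns)
        if col != "const"
    }
    return pd.Series(vifs, name="VIF")
```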

How to use it

The ideal VIF value is 1, indicating no inflation of the standard errors ($SE$) and therefore no multicollinearity for that predictor variable. Another way to think about it is that VIF is a multiplier on the variance, so $\sqrt{VIF}$ is the multiplier on the standard errors of the model coefficients. When $VIF = 1$, the standard error is $1 \times SE$, i.e. no inflation. If $VIF = 4$, then $\sqrt{4} = 2$, meaning the standard error of that coefficient is twice as large as it would be if there were no multicollinearity with the other predictor variables.
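To make that arithmetic concrete, the snippet below turns some made-up VIF values into standard-error inflation factors and flags the larger ones; the cut-off of 5 (some authors use 10) is a commonly cited rule of thumb, not a hard rule.

```python
import numpy as np
import pandas as pd

# Hypothetical VIF values for three predictors
vifs = pd.Series({"x1": 1.0, "x2": 4.0, "x3": 12.0}, name="VIF")

report = pd.DataFrame({
    "VIF": vifs,
    # sqrt(VIF) is the inflation factor on the coefficient's standard error
    "SE_inflation": np.sqrt(vifs),
    # VIF > 5 (or 10) is a common rule-of-thumb warning level
    "flag": vifs > 5,
})
print(report)
```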

read more

Matplotlib styles

This entry assumes that you’re already familiar with Python, matplotlib and seaborn and are looking to be more productive when using these tools for your research work. If you’re looking for introductory coding material, there are a few links at the end of the article to get you started. All the same, this entry should still frame things well enough that you can then go into the specific coding tutorials.

The idea here is to set up a reusable theme/style and find suitable settings for publication-quality plots. That way, you have consistent styling in your plots and, of course, by scripting your process, it is easier to update your report as your experiments or your output media change.
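As one possible starting point (the specific values here are illustrative, not the article's recommended settings), a shared style can be set once at the top of a script or notebook and every subsequent plot picks it up.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One possible set of publication-oriented defaults; adjust to your venue
sns.set_theme(context="paper", style="whitegrid", font="serif")
plt.rcParams.update({
    "figure.figsize": (3.5, 2.5),   # roughly single-column width in inches
    "figure.dpi": 300,
    "font.size": 9,
    "axes.labelsize": 9,
    "legend.fontsize": 8,
    "savefig.bbox": "tight",
})

# Any plot created after this point uses the shared styling
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("example.png")
```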

read more

Retrieval-based Chatbot

Chatbots automatically answer common or well-known questions in a manner that simulates conversational interaction. In this project, we build a retrieval-based chatbot that uses cosine similarity over a database of frequently asked questions about COVID-19, as of 31-Mar-2020.
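The project's own pipeline isn't reproduced here, but the core retrieval step can be sketched as below: embed the FAQ questions and the user query with TF-IDF, then return the answer whose question is most similar by cosine similarity. The two-entry FAQ list is a made-up stand-in for the actual database.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the FAQ database described in the project
faq = [
    ("What are the symptoms of COVID-19?",
     "Common symptoms include fever, cough and fatigue."),
    ("How does COVID-19 spread?",
     "Mainly through respiratory droplets from an infected person."),
]

questions = [q for q, _ in faq]
vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform(questions)

def answer(user_query: str) -> str:
    # Embed the query in the same TF-IDF space and pick the closest FAQ question
    query_vector = vectorizer.transform([user_query])
    scores = cosine_similarity(query_vector, question_vectors)[0]
    return faq[scores.argmax()][1]

print(answer("how is the virus transmitted"))
```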

read more