Data Science the smart way: Regression — Part II

Photo by Andreas Weiland on Unsplash

We are continuing our series on Data Science the smart way. This series helps you master data science through questions and answers, which is useful both for clearing up concepts and for interview preparation.

You can visit the first part of this blog here.

In the first part, we discussed introductory questions about regression. We have collected this brilliant set of questions from here.

Today we will discuss the following questions in detail. I will rate the questions as per their difficulty level.

  • What methods for solving linear regression do you know? [Medium]
  • What is the normal equation? [Medium]
  • What is gradient descent? How does it work? [Medium]
  • What is SGD — stochastic gradient descent? What's the difference with the usual gradient descent? [Medium]
  • Which metrics for evaluating regression models do you know? [Easy]
  • What are MSE and RMSE? [Easy]

Let’s get started.

1. What methods for solving linear regression do you know?

There are many different methods used to solve the linear regression problem. Gradient descent (ii) and the normal equation (iii) are covered in detail under questions 2 to 4 below, so in this section we focus on (i) and (iv).

i. Sklearn’s Linear Regression

ii. Gradient Descent

iii. Least Square Method/Normal Equation Method

iv. Singular Value Decomposition (SVD).

i. Sklearn’s Linear Regression

Linear Regression is a regression technique that falls in the category of supervised learning in machine learning. It is a predictive analysis technique that finds the relationship between a dependent variable and one or more independent variables. For a single feature, we can visualize it by plotting the independent variable (x-axis) against the dependent variable (y-axis) in a 2D graph.


The implementation of Linear Regression using Sklearn is quite simple.
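The snippet below assumes a feature matrix X, a target vector y, and some new inputs X_new are already defined. As a minimal, hypothetical setup (a noisy linear toy dataset that yields intercept and coefficient values close to the ones shown), you could use:

import numpy as np

# hypothetical toy data: y = 4 + 3x + Gaussian noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_new = np.array([[0.0], [2.0]])  # two new inputs to predict on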

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X, y)                 # fit the model on the training data
>>> lr.intercept_, lr.coef_  # learned bias term and feature weight
(array([4.21509616]), array([[2.77011339]]))
>>> lr.predict(X_new)        # predictions for the new inputs
array([[4.21509616],
       [9.75532293]])

iv. Singular Value Decomposition (SVD)

Scikit-Learn's LinearRegression class is based on the scipy.linalg.lstsq() function (the name stands for "least squares"), which we can call directly via its NumPy equivalent, np.linalg.lstsq():

# X_b is the training matrix X with a bias column of 1s prepended
theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
>>> theta_best_svd
array([[4.21509616],
       [2.77011339]])

The function computes θ̂ = X⁺y, where X⁺ is the pseudoinverse of X. The pseudoinverse is computed using a standard matrix factorization technique called Singular Value Decomposition (SVD), which decomposes the training set matrix X into the matrix multiplication of three matrices, U Σ V⊺. This decomposition lets us compute the pseudoinverse even when X⊺X is not invertible.

This approach is more efficient than computing the normal equation, plus it handles edge cases nicely: the normal equation may not work if the matrix X⊺X is not invertible (i.e., singular), such as when m < n or when some features are redundant, but the pseudoinverse is always defined.

Let's now see the implementation using the pseudoinverse directly:

>>> np.linalg.pinv(X_b).dot(y)
array([[4.21509616],
       [2.77011339]])
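
To make the connection to the SVD explicit, here is a short sketch (assuming the same X_b and y as above) that builds the pseudoinverse from the U, Σ, V⊺ factors, discarding near-zero singular values before inverting:

import numpy as np

U, s, Vt = np.linalg.svd(X_b, full_matrices=False)              # X_b = U @ diag(s) @ Vt
s_inv = np.array([1.0 / sv if sv > 1e-6 else 0.0 for sv in s])  # invert only the non-negligible singular values
X_pinv = Vt.T @ np.diag(s_inv) @ U.T                            # the pseudoinverse X⁺
theta_best = X_pinv @ y                                         # same result as np.linalg.pinv(X_b).dot(y)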

2. What is the normal equation?

Before explaining the normal equation, let's briefly understand how linear regression works.

A linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term):

ŷ = θ0 + θ1x1 + θ2x2 + ⋯ + θnxn

In this equation:
  • ŷ is the predicted value.
  • n is the number of features.
  • xi is the ith feature value.
  • θj is the jth model parameter (including the bias term θ0 and the feature weights θ1, θ2, ⋯, θn).

This can be written in the vectorized form ŷ = hθ(x) = θ · x (a short code sketch follows the list below), where:

  • θ is the model's parameter vector, containing the bias term θ0 and the feature weights θ1 to θn.
  • x is the instance's feature vector, containing x0 to xn, with x0 always equal to 1.
  • θ · x is the dot product of the vectors θ and x, which is of course equal to θ0x0 + θ1x1 + θ2x2 + … + θnxn.
  • hθ is the hypothesis function, using the model parameters θ.
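
To make the vectorized form concrete, here is a tiny sketch with made-up numbers: once x0 = 1 is prepended to every instance, the prediction is just a matrix product.

import numpy as np

theta = np.array([[4.0], [3.0]])  # hypothetical parameters: bias term θ0 and weight θ1
X_b = np.array([[1.0, 0.0],       # each row is one instance: x0 = 1, then the feature value
                [1.0, 2.0]])
y_pred = X_b.dot(theta)           # hθ(x) = θ · x for every instance at once
# y_pred is [[4.], [10.]]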

We know that to train the model, we look for the value of θ that minimizes a performance measure for the regression problem, such as the MSE (or its square root, the RMSE).

The MSE of a Linear Regression hypothesis hθ on a training set X is calculated using:

MSE(X, hθ) = (1/m) Σi=1..m (θ⊺x(i) − y(i))²

Normal Equation:

To find the value of θ that minimizes the cost function, there is a closed-form equation that gives the result directly. This is called the normal equation (a NumPy sketch follows the list below):

θ̂ = (X⊺X)⁻¹ X⊺ y

In this equation:
  • θ̂ is the value of θ that minimizes the cost function.
  • y is the vector of target values, containing y(1) to y(m).
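
As a minimal NumPy sketch, assuming X_b is the training matrix with a bias column of 1s (as in the earlier snippets) and y is the target vector:

import numpy as np

theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)  # θ̂ = (X⊺X)⁻¹ X⊺ y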

3. What is gradient descent? How does it work?

Gradient Descent is a generic optimization algorithm that is used to find an optimal solution to a wide range of problems. In Gradient Descent we tweak parameters iteratively in order to minimize a cost function.

Suppose it is raining on a hill. The water flows down the slope towards the ground below. That is exactly what gradient descent does: it measures the local gradient of the error function with regard to the parameter vector θ, and it moves in the direction of the descending gradient. Once the gradient is zero, we have reached a minimum.

src: O'Reilly

In the above picture of gradient descent, the model parameters are initialized randomly and get tweaked repeatedly to minimize the cost function; the learning step size is proportional to the slope of the cost function. The steps gradually get smaller and smaller as the parameters approach the minimum.

The important parameter in GD is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time. On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than before, which can make the algorithm diverge.
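
As a tiny, made-up illustration of this trade-off, here is gradient descent on the one-dimensional cost function f(θ) = θ², whose minimum is at θ = 0 and whose gradient is 2θ:

def gd_steps(eta, theta=5.0, n_steps=10):
    # repeatedly step against the gradient of f(theta) = theta**2
    for _ in range(n_steps):
        theta = theta - eta * 2 * theta
    return theta

print(gd_steps(eta=0.01))  # too small: theta is still around 4.1 after 10 steps
print(gd_steps(eta=0.1))   # reasonable: theta shrinks towards the minimum at 0
print(gd_steps(eta=1.1))   # too large: each step overshoots and theta grows in magnitude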

4. What is SGD — stochastic gradient descent? What’s the difference with the usual gradient descent?

First we will understand what Batch Gradient Descent is, and then we will learn about Stochastic Gradient Descent.

To implement gradient descent we need to calculate how much the cost function will change if we change θj just a little bit. This quantity is called the partial derivative.

The partial derivative of the cost function with respect to a single parameter θj is given by the following equation (note this is for a single parameter):

∂MSE(θ)/∂θj = (2/m) Σi=1..m (θ⊺x(i) − y(i)) xj(i)

Instead of computing these partial derivatives individually, we can compute all of them in one go. The following gradient vector contains all the partial derivatives of the cost function (one for each model parameter):

∇θMSE(θ) = (2/m) X⊺(Xθ − y)

Once we have the gradient vector, we subtract it from θ to get the next step. This is where the learning rate η comes into play: we multiply the gradient vector by η to determine the size of the downhill step, i.e. θ(next step) = θ − η ∇θMSE(θ).
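
Putting this together, here is a minimal Batch Gradient Descent sketch, assuming the same training matrix X_b (with the bias column) and targets y as in the earlier snippets:

eta = 0.1                      # learning rate η
n_iterations = 1000
m = len(X_b)                   # number of training instances
theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)  # ∇θMSE(θ) over the full training set
    theta = theta - eta * gradients                    # θ ← θ − η∇θMSE(θ)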

Stochastic Gradient Descent

The main problem with Batch Gradient Descent is that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large.

Stochastic Gradient Descent, on the other hand, picks a random instance in the training set at every step and computes the gradients based only on that single instance. This makes every step much faster to compute.

This algorithm is much less regular than Batch Gradient Descent: the cost function bounces up and down, decreasing only on average. Over time it ends up very close to the minimum.

src: Researchgate.net

A code implementation of SGD (assuming the same training matrix X_b and targets y as before) is shown below:

import numpy as np

n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

m = len(X_b)                   # number of training instances
theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)           # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # gradient based on a single instance
        eta = learning_schedule(epoch * m + i)        # gradually decrease the learning rate
        theta = theta - eta * gradients

>>> theta
array([[4.21076011],
       [2.74856079]])

5. Which metrics for evaluating regression models do you know?

The various metrics used to evaluate the results of a regression prediction are:

  1. Mean Squared Error (MSE)
  2. Root Mean Squared Error (RMSE)
  3. Mean Absolute Error (MAE), etc.

There are many other evaluation metrics; however, the above three are the most commonly used.

Mean Squared Error (MSE)

MSE, or Mean Squared Error, is one of the most preferred metrics for regression tasks. It is simply the average of the squared differences between the target values and the values predicted by the regression model.

MSE = (1/m) Σi=1..m (y(i) − ŷ(i))²

Root Mean Squared Error

RMSE is the square root of the average squared difference between the target value and the value predicted by the model. It is preferred when we want to penalize large errors: squaring makes all errors positive and weights large errors heavily, while taking the root brings the result back to the same scale as the target variable.

RMSE = √MSE = √( (1/m) Σi=1..m (y(i) − ŷ(i))² )

Mean Absolute Error

MAE is the average of the absolute differences between the target values and the predicted values. MAE is more robust to outliers and doesn't penalize errors as extremely as MSE, so it may not be suitable for applications where large errors are especially undesirable.

MAE = (1/m) Σi=1..m |y(i) − ŷ(i)|
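
As a quick sketch, all three metrics can be computed in a few lines of NumPy on hypothetical arrays y_true and y_pred (scikit-learn also provides mean_squared_error and mean_absolute_error in sklearn.metrics):

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # hypothetical target values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # hypothetical model predictions

mse = np.mean((y_true - y_pred) ** 2)     # Mean Squared Error
rmse = np.sqrt(mse)                       # Root Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))    # Mean Absolute Error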

That's pretty much it. I hope both parts of this blog have given you a good understanding of regression and the type of questions asked in interviews. Thank you.

Don’t forget to give us your 👏 !

