Logistic regression in simple terms

In this post, we will discuss the basics of logistic regression. It is part of a series where I explain certain topics from machine learning in simple terms.

There are many classifiers in machine learning, but often the simplest is the best. Furthermore, in highly regulated industries, you are limited in which models you can use. Today we will focus on logistic regression. There are many articles on logistic regression, but they tend to be either too technical or too hand-wavy. In this relatively short post, I hope to provide an informative and intuitive understanding of logistic regression. I will avoid technical terminology whenever possible – I’ll even try to avoid the word probability.

Let’s say we have a table of training data consisting of three columns: (T, pH, R). Here, T stands for the temperature in Celsius, pH for the acidity of the water, and R for risk, which is either 1 or 0. An R value of 1 suggests that the aquaponics system is at risk of failing, while an R value of 0 means the system is stable. Hence, in this classification problem, we have two feature columns and one label column.

To be as concrete as possible, we suppose that we have the following data points:

(24, 6.5, 0), (25, 6.4, 0), (24.5, 7, 0), (26, 6, 1), (30, 6, 1)
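If you prefer to see things in code, here is one way to store these points (a minimal sketch using NumPy; the array layout is just a convention for the later examples):

```python
import numpy as np

# Feature columns (T, pH) and the label column R for the five training points.
X = np.array([[24.0, 6.5],
              [25.0, 6.4],
              [24.5, 7.0],
              [26.0, 6.0],
              [30.0, 6.0]])
R = np.array([0, 0, 0, 1, 1])
```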

We would like to find a function f such that f(T, pH) = R is satisfied for each of our points above. For example, we would like f such that f(24, 6.5) = 0. Sure, you could always find a polynomial that goes through all of the above points using Lagrange interpolation; however, it would be ineffective against unseen data points. What we want instead is a function such that, for each point (T, pH, R), the value f(T, pH) is as close to R as possible, so that f can still be effective against unseen data points. Furthermore, we want our function f to be simple and interpretable. Consider the following function:

f(T,pH) = \dfrac{1}{1+e^{-(\alpha \cdot T + \beta \cdot pH)}}
(T,pH) \rightarrow z = \alpha \cdot T + \beta \cdot pH \rightarrow \dfrac{1}{1+e^{-z}} 

Yikes! We’ve gotten technical. As can be seen above, the function f consists of two parts: a linear function and a non-linear function. We have total control over the linear part by solving for alpha and beta.

So, what’s really going on? Imagine that you are shooting a laser down a path (pew pew), but the target is around a corner. Of course, you can’t blast a curved path, so what you would do instead is use a mirror to change the path to the target. You have control over the placement of the mirror – that is equivalent to solving for alpha and beta in our case – and the mirror is the sigmoid (logistic) function given by:

\sigma(z) = \dfrac{1}{1+e^{-z}}
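If it helps, here is the whole of f in code (a minimal sketch; the parameter values shown are placeholders, not fitted values):

```python
import numpy as np

def sigmoid(z):
    # The "mirror": squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def f(T, pH, alpha, beta):
    # The linear part we control, followed by the non-linear sigmoid.
    z = alpha * T + beta * pH
    return sigmoid(z)

# Placeholder parameter values, just to show the call.
print(f(24.0, 6.5, alpha=0.1, beta=-0.3))  # prints a value between 0 and 1
```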

As you can see above, the function f is simple and interpretable! The sigmoid makes sure the output is always between 0 and 1. Great, now how do we solve for alpha and beta? Well, in machine learning, we always have to minimize or maximize something. Note that maximizing and minimizing are equivalent: maximizing a function g is the same as minimizing -g.

If we have a point of the form (T, pH, 1), we want to maximize f(T, pH), and if we have a point of the form (T, pH, 0), we want to minimize f(T, pH). Instead of minimizing f(T, pH), we could maximize 1 – f(T, pH). So, for each data point, we want to maximize the following:

g(T, pH) = f(T, pH)^R \, (1 - f(T, pH))^{1-R}
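Writing out the two cases makes the exponent trick explicit:

g(T, pH) = \begin{cases} f(T, pH) & \text{if } R = 1,\\ 1 - f(T, pH) & \text{if } R = 0. \end{cases}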

In either case, we want the value of g(T, pH) to be as close to 1 as possible for each data point, so we want to maximize the product over all the points:

\displaystyle \prod  f(T, pH)^R \, (1 - f(T, pH))^{1-R}

However, maximizing a product is hard, so let’s turn it into a sum by taking the logarithm, multiplying by -1, and dividing by 5.

\displaystyle L(\alpha, \beta) = -\dfrac{1}{5}\sum \left[ R \log(f(T, pH)) + (1-R) \log(1 - f(T, pH)) \right]

We multiply by -1 to turn a maximizing problem into a minimizing problem. We divide by 5, the number of data points, to normalize the loss function L. We normalize the loss so that we can compare loss values across any number of points; if we don’t normalize, having more data points means a higher loss, which doesn’t make sense. Of course, if we have N data points, we would divide by N. Finally, to solve for alpha and beta, we use gradient descent!
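To make that last step concrete, here is a minimal gradient-descent sketch for this loss (the learning rate, the number of steps, and the absence of an intercept term are all simplifying assumptions made to keep the example short):

```python
import numpy as np

# The five training points from above.
X = np.array([[24.0, 6.5], [25.0, 6.4], [24.5, 7.0], [26.0, 6.0], [30.0, 6.0]])
R = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(alpha, beta):
    # The normalized loss L(alpha, beta) from the formula above.
    p = sigmoid(alpha * X[:, 0] + beta * X[:, 1])
    return -np.mean(R * np.log(p) + (1 - R) * np.log(1 - p))

alpha, beta = 0.0, 0.0  # starting guesses
lr = 0.001              # assumed learning rate

for step in range(50000):
    p = sigmoid(alpha * X[:, 0] + beta * X[:, 1])
    # Gradients of the loss with respect to alpha and beta.
    grad_alpha = np.mean((p - R) * X[:, 0])
    grad_beta = np.mean((p - R) * X[:, 1])
    alpha -= lr * grad_alpha
    beta -= lr * grad_beta

print(alpha, beta, loss(alpha, beta))
```

In practice you would also add an intercept term and scale the features, which speeds up convergence and removes the constraint that the decision boundary pass through the origin, but the core loop stays the same.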

The output of f is a value between 0 and 1, and we can choose a threshold for classifying a point (T, pH) as either 0 or 1. A natural default threshold is 0.5, but in a more serious scenario, such as screening for cancer, we would want this threshold to be really low. For medical classifications, it’s better to have a false positive than a false negative.
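As a tiny illustration, thresholding the output might look like this (the numbers are made up):

```python
def classify(p, threshold=0.5):
    # p is the output of f(T, pH); flag the system as at risk when p reaches the threshold.
    return 1 if p >= threshold else 0

print(classify(0.35))                 # 0 with the default threshold of 0.5
print(classify(0.35, threshold=0.2))  # 1 with a lower, more cautious threshold
```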
