Data Science

Logistic Regression in Python

Logistic regression is a machine learning classification algorithm. It is similar to linear regression, but the main difference is that logistic regression predicts a binary outcome (0 or 1) rather than a continuous numeric value. Logistic regression models the relationship between one or more independent variables and a dependent variable. The dependent variable is a binary variable with two cases:

  • 1 for true or
  • 0 for false

Key assumptions of logistic regression:

  1. The independent variables should not be multicollinear; if there is any correlation among them, it should be very small.
  2. The dataset for logistic regression should be large enough to give reliable results.
  3. Only attributes that carry some meaning should be included in the dataset.
  4. The independent variables should be linearly related to the log odds.

To build the logistic regression model, we use the scikit-learn library. The process of logistic regression in Python is given below:

  1. Import all the required packages for the logistic regression and other libraries.
  2. Load the dataset.
  3. Identify the independent and dependent variables in the dataset.
  4. Split the dataset into training and test data.
  5. Initialize the logistic regression model.
  6. Fit the model with the training dataset.
  7. Predict the model using the test data and calculate the accuracy of the model.

Problem: The first step is to collect the dataset on which we want to apply logistic regression. The dataset we are going to use here is an MS admission dataset. It has four variables, of which three are independent (GRE, GPA, work_experience) and one is dependent (admitted). The dataset tells whether a candidate will get admission to a prestigious university based on their GPA, GRE, and work_experience.

Step 1: We import all the libraries required for the Python program.
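A minimal sketch of the imports used in this walkthrough, assuming pandas for data handling and scikit-learn for the model and metrics:

# pandas for loading and handling the dataset
import pandas as pd
# scikit-learn for splitting the data, building the model, and evaluating it
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix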

Step 2: Now, we load our MS admission dataset using the pandas read_csv function.
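A sketch of the loading step; the CSV filename here is an assumption (the actual file is in the repository linked at the end):

# Load the MS admission dataset (filename assumed; see the linked repository)
df = pd.read_csv('ms_admission.csv')
# Preview the first five rows
print(df.head())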

Step 3: Previewing the first few rows of the dataset shows the three independent variables (GRE, GPA, work_experience) and the dependent variable (admitted).

Step 4: We check all the columns available in the dataset and then assign the independent variables to X and the dependent variable to y.

Step 5: After setting the independent variables to X and the dependent variable to y, we print both using the pandas head function to cross-check them.
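A sketch of steps 4 and 5; the lowercase column names are an assumption based on the variables described above:

# List all columns in the dataset
print(df.columns)
# Independent variables (column names assumed) and the binary dependent variable
X = df[['gre', 'gpa', 'work_experience']]
y = df['admitted']
# Cross-check X and y
print(X.head())
print(y.head())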

Step 6: Now, we divide the whole dataset into training and test sets. For this, we use the train_test_split method of sklearn, giving 25% of the dataset to the test set and the remaining 75% to the training set.
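The 75/25 split might look like this; the random_state is an assumption added for reproducibility:

# Hold out 25% of the data for testing and keep 75% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)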

Step 7: Next, we create the logistic regression model and fit it to the training data.
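Creating and fitting the model might look like this, assuming scikit-learn's default settings:

# Create the logistic regression model and fit it to the training data
model = LogisticRegression()
model.fit(X_train, y_train)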

Step 8: Now our model is ready for prediction, so we pass the test data (X_test) to the model and get the results. The predictions (y_predictions) contain the values 1 (admitted) and 0 (not admitted).
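A sketch of the prediction step, using the y_predictions name from the text above:

# Predict admission (1) or non-admission (0) for the unseen test data
y_predictions = model.predict(X_test)
print(y_predictions)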

Step 9: Now, we print the classification report and the confusion matrix.
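Both can be printed with the scikit-learn metrics imported earlier:

# Compare the predictions against the true test labels
print(classification_report(y_test, y_predictions))
print(confusion_matrix(y_test, y_predictions))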

The classification_report shows that the model can predict the results with an accuracy of 69%.
The confusion matrix summarizes the predictions on the X_test data as:
TP = True Positives = 8
TN = True Negatives = 61
FP = False Positives = 4
FN = False Negatives = 27

So, the total accuracy according to the confusion_matrix is:

Accuracy = (TP+TN)/Total = (8+61)/100 = 0.69

Step 10: Now, we cross-check the results by printing them. We print the top 5 rows of X_test and y_test (the actual true values) using the pandas head function, and then we also print the corresponding top 5 predictions.

We combine all three results in a single sheet to understand the predictions. We can see that the predictions match the actual values except for X_test row 341, which was actually true (1) but predicted false (0). So, our model predictions achieve the 69% accuracy we have already shown above.
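One way to build such a combined sheet is sketched below; the column names 'actual' and 'predicted' are assumptions for illustration:

# Line up the test inputs, the actual labels, and the predictions in one table
results = X_test.copy()
results['actual'] = y_test
results['predicted'] = y_predictions
print(results.head())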

Step 11: Now that we understand how the model makes predictions on an unseen dataset like X_test, we create a new dataset with random values using a pandas DataFrame, pass it to the trained model, and get the result.
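A sketch of this step; the candidate values below are made up for illustration and are not from the original dataset:

# Build a small new dataset of hypothetical candidates (values are made up)
new_candidates = pd.DataFrame({
    'gre': [330, 290],
    'gpa': [3.8, 2.5],
    'work_experience': [4, 1],
})
# 1 = admitted, 0 = not admitted
print(model.predict(new_candidates))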

The complete Python code is given below:
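Since the full listing is not reproduced here, the following consolidated sketch pulls the steps together; the filename, column names, and random_state are assumptions, and the exact code is available in the repository linked below:

# Logistic regression on the MS admission dataset (consolidated sketch)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Load the dataset (filename assumed; see the linked repository)
df = pd.read_csv('ms_admission.csv')

# Independent variables (column names assumed) and the binary dependent variable
X = df[['gre', 'gpa', 'work_experience']]
y = df['admitted']

# 75% of the data for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the model and predict on the unseen test data
model = LogisticRegression()
model.fit(X_train, y_train)
y_predictions = model.predict(X_test)

# Evaluate the predictions
print(classification_report(y_test, y_predictions))
print(confusion_matrix(y_test, y_predictions))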

The code for this blog, along with the dataset, is available at the following link
https://github.com/shekharpandey89/logistic-regression

About the author

Shekhar Pandey