Data Science

Machine Learning – Stroke Prediction

Stroke is one of the most dangerous diseases; it severely impacts health and is a major cause of death. If we can predict in advance whether a stroke will occur, lives can be saved. One way to predict stroke is through Machine Learning.

In this guide, we will predict whether a person is affected by stroke by considering his/her work type, living area, and health conditions such as heart disease, hypertension, average glucose level, and Body Mass Index (BMI). We utilize the Voting Classifier model, passing the Random Forest and Decision Tree classifiers to it, to train on our data. We will also look at the two types of Voting Classifiers by building a Machine Learning model for each of them.

Factors of Stroke

The following are some of the factors that can cause stroke in a person. All these are considered in our dataset:

  • A person suffering from hypertension
  • A person suffering from heart disease
  • Smoking and alcohol consumption
  • High glucose levels
  • The environment in which a person works

Classification

In Machine Learning, if you are working with categorical data, you need to understand Classification. Classification is a Supervised Learning technique in Machine Learning that is used to identify the category of new observations.

Ensembling Technique

Ensembling means combining multiple models. We train each model on samples of the dataset and combine the outcomes of all the models. Bagging is one of the Ensembling techniques.

Bagging is also known as Bootstrap Aggregation, in which we combine multiple models known as Weak Learners. For each of them, we provide a sample (D1, D2, …, Dn) drawn from the existing dataset (D) such that each sample is smaller than D (D1 < D, D2 < D, etc.). This is known as "Row Sampling with replacement". Finally, we combine the outcomes from all the models; this is known as "Aggregation" and can be done using a Voting Classifier. The majority outcome becomes the final outcome.
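The following is a minimal sketch of this idea, assuming a toy dataset created with scikit-learn's make_classification and a DecisionTreeClassifier as the weak learner (these choices, and names such as n_models and sample_size, are illustrative assumptions, not part of the original guide):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for the stroke data (purely for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

n_models = 5
sample_size = int(0.8 * len(X))   # each sample Di is smaller than the full dataset D
rng = np.random.default_rng(0)
models = []

# Row Sampling with replacement: each weak learner gets its own bootstrap sample
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=sample_size)
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregation: collect every model's prediction and take the majority vote
all_preds = np.array([m.predict(X) for m in models])   # shape: (n_models, n_samples)
majority_vote = (all_preds.sum(axis=0) > n_models / 2).astype(int)

print("Accuracy of the aggregated prediction:", (majority_vote == y).mean())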

Voting Classifier

Basically, a Voting Classifier takes multiple models, combines the outcome returned by each model, and returns a final outcome. The final outcome depends on the type of voting.

Hard Voting Classifier

As we discussed, the Voting Classifier combines all the models and gives the outcome. Hard voting returns the outcome that is predicted by the maximum number of models, i.e. the majority vote.

For example: There are 10 models that are passed to the Voting Classifier to predict stroke. The first six models return "Yes" and the remaining four return "No". Here, the majority is "Yes", so the final outcome will be "Yes".
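A quick sketch of this majority count, assuming the ten predictions from the example above are simply collected in a Python list:

from collections import Counter

# Hypothetical predictions returned by the 10 models in the example above
predictions = ["Yes"] * 6 + ["No"] * 4

# Hard voting: the most frequent prediction becomes the final outcome
final_outcome = Counter(predictions).most_common(1)[0][0]
print(final_outcome)   # Yes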

Soft Voting Classifier

Instead of each model giving a single outcome, each model gives class probabilities. Soft voting averages the probabilities of each class across all the models and returns the class with the maximum average.

For example: Two models are passed to the Soft Voting Classifier. The first model gives "Yes" a probability of 0.8 and "No" a probability of 0.2, and the second model gives "Yes" a probability of 0.4 and "No" a probability of 0.6. We can write these as Yes: {0.8, 0.4} and No: {0.2, 0.6}. The classifier calculates the average for each class and returns the class with the maximum average as the final outcome. Here, the average of "Yes" is 0.6 and the average of "No" is 0.4, so the final outcome will be "Yes".
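The same averaging can be sketched in a few lines; the probability values are the ones from the example above:

# Per-class probabilities from the two models in the example above
yes_probs = [0.8, 0.4]
no_probs = [0.2, 0.6]

# Soft voting: average each class's probabilities and pick the larger average
avg_yes = sum(yes_probs) / len(yes_probs)   # 0.6
avg_no = sum(no_probs) / len(no_probs)      # 0.4
print("Yes" if avg_yes > avg_no else "No")  # Yes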

The Voting Classifier is available in the "sklearn.ensemble" module. The following is the model with its parameters. We will use only some of the parameters to build our model, so we discuss only those parameters.

Syntax:
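As a reference, the constructor in recent scikit-learn versions looks roughly like the following (defaults may differ slightly between versions):

from sklearn.ensemble import VotingClassifier

# VotingClassifier(estimators, *, voting='hard', weights=None,
#                  n_jobs=None, flatten_transform=True, verbose=False)
#
# estimators : list of (name, model) tuples
# voting     : 'hard' for majority voting, 'soft' for averaged probabilities
# weights    : optional sequence of per-model weights
# n_jobs     : number of jobs to run in parallel when fitting the estimators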

  • The estimators parameter takes a list of tuples in which the first value is the model name (a string) and the second value is the model itself. The tuples are separated by commas.
  • The Voting Classifier is considered hard if we pass "hard" to the voting parameter and soft if we pass "soft" to the voting parameter.

Implementation

Let's predict whether a person is affected by stroke based on health-related factors like heart disease, Body Mass Index, smoking status, hypertension, and average glucose level, along with factors from the person's surroundings (work type and living area).

Loading the Data

Download this dataset (StrokeData.csv) from here.

The read_csv() function, which is available in the Pandas module, loads CSV data into a variable. It takes the file name as a parameter. After loading the data, we use the shape attribute to get the total number of rows and columns in the DataFrame. It returns a tuple in which the first value is the total number of rows and the second value is the total number of columns/attributes. We also display all the columns using the "pandas.DataFrame.columns" attribute.

import pandas

# Load the StrokeData.csv into the train_data variable
train_data=pandas.read_csv('StrokeData.csv')

# Get the dimensionality of the train_data
print(train_data.shape,"\n")

# Get the column names
print(train_data.columns)

 

Output:

Our dataset holds 5110 records/rows and 12 columns.

Data Cleaning

In this stage, we can get rid of missing values if they exist. Use the pandas.DataFrame.info() function to get the count of non-null values in each column along with the data type.

# Column data types and non-null counts
train_data.info()

 

Output:

Only one column, "bmi", has 121 NaN (missing) values; it is of type float64. The DataFrame['column_name'].fillna(DataFrame['column_name'].mean()) expression replaces the missing values in a specific column with that column's mean. Also, we don't need the "id" column, so let's remove it from "train_data" using the drop() function.

# Fill the missing values in the bmi column with mean value
# of the existing column values
train_data['bmi']= train_data['bmi'].fillna(train_data['bmi'].mean())

# Remove the id column from the DataFrame
train_data= train_data.drop(['id'],axis=1)

train_data.info()

 

Output:

Now, the “id” column doesn’t exist and the missing values are replaced.

Data Visualization and Analysis

We analyze the data by grouping the different health factors with stroke across all the people. First, we will look at the different categorical values present in the "object"-type columns one by one.

Use the hist() function to view the Histograms for the numeric type (int64, float64) columns.

# Histogram
train_data.hist(color='green',figsize=(6,8))

 

Output:

Histograms are generated for the age, hypertension, heart_disease, avg_glucose_level, bmi, and stroke columns.

Create a pie chart to see all the categories present in the smoking_status column. We can create it using pyplot. After that, use the value_counts() function to get the count of values in each category.

from matplotlib import pyplot

# Count of each category in the smoking_status column
smoking_counts = train_data['smoking_status'].value_counts()

# Create Pie chart
pyplot.pie(smoking_counts.values, labels=smoking_counts.index, autopct='%1.2f%%', shadow=True, startangle=90)

# Set the title to the Pie chart
pyplot.title('SMOKING STATUS')

# Display the Pie chart
pyplot.show()

# Get the count for each category
print(train_data['smoking_status'].value_counts())

 

Output:

There are four categories in the smoking_status column.

  • Never Smoked – 1892
  • Unknown – 1544
  • Formerly Smoked – 885
  • Smokes – 789

Create a pie chart to get all the categories that are present in the work_type column and get the count of values that are present in each category.

from matplotlib import pyplot

# Count of each category in the work_type column
work_counts = train_data['work_type'].value_counts()

# Create Pie chart
pyplot.pie(work_counts.values, labels=work_counts.index, autopct='%1.2f%%')

# Set the title to the Pie chart
pyplot.title('WORK TYPE')

# Display the Pie chart
pyplot.show()

# Get the count for each category
print(train_data['work_type'].value_counts())

 

Output:

There are five categories in the work_type column.  See the total values that are present in each category in the output.

Similarly, create a pie chart for the Residence_type column.

# Count of each category in the Residence_type column
residence_counts = train_data['Residence_type'].value_counts()

# Create Pie chart
pyplot.pie(residence_counts.values, labels=residence_counts.index, autopct='%1.2f%%', shadow=True, startangle=90)

# Set the title to the Pie chart
pyplot.title('RESIDENCE TYPE')

# Display the Pie chart
pyplot.show()

# Get the count for each category
print(train_data['Residence_type'].value_counts())

 

Output:

There are two categories in the Residence_type column. See the total values that are present in each category in the output.

Create a bar chart for the gender column. Use the value_counts() function to get the count in each category.

# Get the count for each category
print(train_data['gender'].value_counts())

# Barplot for the gender column
train_data['gender'].value_counts().plot(kind='bar', xlabel='Category', ylabel='Count',figsize=(3,3))

 

Output:

There are three categories: Female – 2994, Male – 2115, and Other – 1.

Create a bar chart for the ever_married column. Use the value_counts() function to get the count in each category.

# Get the count for each category
print(train_data['ever_married'].value_counts())

# Barplot for the ever_married column
train_data['ever_married'].value_counts().plot(kind='bar', xlabel='Married Status', ylabel='Count',figsize=(3,3))

 

Output:

Total categories – 2. Yes – 3353 and No – 1757.

Return the count of people with and without hypertension for each work type. We group by "hypertension" and "work_type" and return the count.

# Count of hypertension for each worktype
train_data.groupby(['hypertension','work_type']).count()[['gender']].unstack()

 

Output:

  • Under the "Govt_job" work_type, there are 584 people without hypertension and 73 people suffering from hypertension.
  • Under the "Never_worked" work_type, there are 22 people without hypertension and no one suffering from hypertension.
  • Under the "Private" work_type, there are 2644 people without hypertension and 281 people suffering from hypertension.
  • Under the "Self-employed" work_type, there are 675 people without hypertension and 144 people suffering from hypertension.
  • Under the "children" work_type, there are 687 people without hypertension and no one suffering from hypertension.

Return the count of people with and without stroke, grouped by hypertension. The grouping columns are "hypertension" and "stroke".

# People count with & without hypertension
train_data.groupby(['hypertension','stroke']).count()[['gender']].unstack()

 

Output:

  • There are 4429 people who suffer from neither hypertension nor stroke.
  • There are 183 people who do not suffer from hypertension but have had a stroke.
  • There are 432 people who suffer from hypertension but have not had a stroke.
  • There are 66 people who suffer from hypertension and have also had a stroke.

Return the count of people with and without stroke, grouped by heart disease. The grouping columns are "heart_disease" and "stroke".

# People count with & without heart_disease
train_data.groupby(['heart_disease','stroke']).count()[['gender']].unstack()

 

Output:

  • There are 4632 people who suffer from neither heart disease nor stroke.
  • There are 202 people who do not suffer from heart disease but have had a stroke.
  • There are 229 people who suffer from heart disease but have not had a stroke.
  • There are 47 people who suffer from heart disease and have also had a stroke.

Data Transformation

Let's convert all the categorical elements in each column to categorical numeric values using pandas.DataFrame['column_name'].replace({'existing': new, …}). It takes a dictionary in which each key is an existing categorical element and the corresponding value is the numeric value that replaces it. Here, we will do this for five columns: "ever_married", "Residence_type", "gender", "smoking_status", and "work_type".

# Convert Categorical features into Categorical Numeric values

train_data['ever_married']=train_data['ever_married'].replace({'No':0,'Yes':1})
train_data['Residence_type']=train_data['Residence_type'].replace({'Urban':0,'Rural':1})
train_data['gender']=train_data['gender'].replace({'Other':0,'Male':1,'Female':2})
train_data['smoking_status']=train_data['smoking_status'].replace({'never smoked':0,'formerly smoked':1,'smokes':2,'Unknown':3})
train_data['work_type']=train_data['work_type'].replace({'Private':0,'Self-employed':1,'children':2,'Govt_job':3,'Never_worked':4})

train_data.info()

 

Output:

Now, you can see that all the column types are now numeric.

It is possible to improve the prediction accuracy by binning the values present in the age column. Using the pandas.cut() function, we create nine bins with labels:

  • Bin 1: age in (0, 10] – numeric category 1
  • Bin 2: age in (10, 20] – numeric category 2
  • Bin 3: age in (20, 30] – numeric category 3
  • Bin 4: age in (30, 40] – numeric category 4
  • Bin 5: age in (40, 50] – numeric category 5
  • Bin 6: age in (50, 60] – numeric category 6
  • Bin 7: age in (60, 70] – numeric category 7
  • Bin 8: age in (70, 80] – numeric category 8
  • Bin 9: age in (80, 90] – numeric category 9

print("Maximum age in the dataset: ",train_data['age'].max())
print("Minimum age in the dataset: ",train_data['age'].min())

# Bins for the age column: (0, 10] - 1, (10, 20] - 2, (20, 30] - 3, (30, 40] - 4,
# (40, 50] - 5, (50, 60] - 6, (60, 70] - 7, (70, 80] - 8, (80, 90] - 9
train_data['age'] = pandas.cut(train_data['age'], bins=[0,10,20,30,40,50,60,70,80,90],labels=[1,2,3,4,5,6,7,8,9])

train_data.info()

train_data['age'].value_counts()

 

Output:

Preparing the Train and Test Data

It is a good practice to split the given dataset into separate training and testing portions. For example, if the dataset holds 100 records, 70 records are used to train the Machine Learning model and 30 records are used to test it (prediction). The train_test_split() method, which is available in the sklearn.model_selection module, automatically splits the dataset into train data and test data. Basically, train_test_split() takes five parameters.

  • The Training_set is the actual dataset with independent attributes.
  • The target represents the class label (dependent attribute).
  • The test_size takes a decimal value which represents the fraction of the Training_set that is used for testing. For example, if it is 0.3, 30% of the Training_set is used for testing.
  • The random_state controls which records are fetched for testing. If it is set to a fixed value (e.g., 0), the same records are taken for testing every time. This parameter is optional.
  • The shuffle parameter can be set to False if you don't want to shuffle the Training_set before splitting.

It returns the training independent attributes, the testing independent attributes, the training target, and the testing target, in that order (X_train, X_test, y_train, y_test).

In our case, “stroke” is the class label/target. Let’s store it in the target variable and drop it from the train_data. Then, we split the data – Testing: 30% and Training: 70%.

from sklearn.model_selection import train_test_split

# Store the stroke column values in a target variable
target = train_data['stroke']

# Include the independent attributes by dropping the target attribute
train_data = train_data.drop(['stroke'], axis=1)

# Split the train_data into the train and test sets
X_train, X_test, y_train, y_test = train_test_split(train_data,target,test_size=0.3)

print(X_train.shape)
print(X_test.shape)

 

Output:

After splitting the train_data, out of 5110 records, 3577 records are used for training and 1533 records are used for testing the model.

Model Fitting and Evaluation

 

  • Import the model from the specific module.
  • Define the model.
  • Fit the model using the fit() method. It takes the Training – Independent data as the first parameter and the Training – Target data as the second parameter.
  • Predict the model for the records that are used for testing (test the independent attributes). The predict() function takes the array of records to be predicted.
  • Finally, use the score() function to get the model accuracy. It takes the Testing – Independent data as the first parameter and the Testing – Target data as the second parameter.

RandomForestClassifier

 

Let’s build a model with the RandomForestClassifier and display the score of this model.

from sklearn.ensemble import RandomForestClassifier

# Define the model with 100 trees
model1 = RandomForestClassifier(n_estimators=100,criterion="gini",bootstrap=True)

# Fit the model
model1.fit(X_train, y_train)

# model1 score
print(model1.score(X_test,y_test) * 100)

 

Output:

DecisionTreeClassifier

Let's build a model with the DecisionTreeClassifier and display the score of this model.

from sklearn.tree import DecisionTreeClassifier

# Define the DecisionTreeClassifier model
model2 = DecisionTreeClassifier()

# Fit the model
model2.fit(X_train, y_train)

# model2 score
print(model2.score(X_test,y_test) * 100,"\n")

 

Output:

Soft Voting Classifier

Now, pass these two models to the Voting Classifier with a soft voting parameter.

from sklearn.ensemble import VotingClassifier

# Define the VotingClassifier model
voting_model = VotingClassifier(estimators=[('RandomForest', model1), ('DecisionTree', model2)], voting='soft')

# Fit the model
voting_model.fit(X_train, y_train)

# Predict the data and get the model score
y_pred = voting_model.predict(X_test)
print(voting_model.score(X_test,y_test) * 100,"\n")
print(y_pred)

 

Output:

The accuracy is 91.19%.

Hard Voting Classifier

Now, pass these two models to the Voting Classifier with a hard voting parameter.

from sklearn.ensemble import VotingClassifier

# Define the VotingClassifier model
voting_model2 = VotingClassifier(estimators=[('RandomForest', model1), ('DecisionTree', model2)], voting='hard')

# Fit the model
voting_model2.fit(X_train, y_train)

# Predict the data and get the model score
y_pred = voting_model2.predict(X_test)
print(voting_model2.score(X_test,y_test) * 100,"\n")
print(y_pred)

 

Output:

The accuracy is 94.97%. This model performs best for this dataset. Let's use this model to predict one sample record.

# Sample record: [gender, age bin, hypertension, heart_disease, ever_married,
#                 work_type, Residence_type, avg_glucose_level, bmi, smoking_status]
print(voting_model2.predict([[1, 4, 0, 0, 1, 3, 0, 82.09, 35.7, 2]]))

 

Output:

[0]

 

The model predicts no stroke for this record.
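As a side note, since the soft Voting Classifier works with averaged probabilities, you can also inspect the class probabilities behind a prediction. The following small sketch uses the soft model (voting_model) built earlier and the same sample record:

# Averaged class probabilities from the soft Voting Classifier for the same record
sample = [[1, 4, 0, 0, 1, 3, 0, 82.09, 35.7, 2]]
print(voting_model.predict_proba(sample))   # [[probability of no stroke, probability of stroke]]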

Conclusion

By analyzing the dataset, we see that the main risk factors for stroke are hypertension, heart disease, and high glucose levels. The majority of the people suffering from hypertension belong to the Private work type. In this guide, we learned how to analyze these factors and visualized the columns using pie and bar charts. We utilized the Voting Classifier by passing the Random Forest and Decision Tree classifiers to it as estimators. Hard voting achieved better accuracy than soft voting.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain.