Exploring the Trends Within Heart Disease and Predicting Diagnosis

Introduction and Motivation

Heart disease is the leading cause of death in many developed nations. As a result, the research, awareness, and prevention of heart disease garner a lot of attention in both the medical community and the general public. Although the fear of heart disease occurring to oneself resides almost exclusively in the older population, the death of a loved one can greatly impact affected families as a whole.

Created by Yuan Qi on 05/16/2022

First, here are the libraries and imports used in this project.
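
A minimal set of imports covering everything used below (the exact list in the original notebook may differ slightly):

```python
# Core data handling and numerical libraries
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics and machine learning
from scipy.stats import ttest_ind
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
```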

The Process

1. The Dataset Used

The dataset I used comes from UC Irvine (UCI)'s Machine Learning Repository and is downloadable from this link. It includes the processed Cleveland heart disease dataset, created by the Cleveland Clinic Foundation, which is the one I used. The original dataset includes 76 features, but the processed one only includes 14 of them. There are a lot of technical terms involved in this dataset, so I've included links below if you'd like to learn more about the different features. The 14 features used in the processed database are as follows:

  1. age: the age of the patient
  2. sex: the sex of the patient
  3. cp : type of chest pain, takes 4 values
    • 1: typical angina
    • 2: atypical angina
    • 3: non-anginal pain (i.e. pain not caused by heart disease)
    • 4: asymptomatic (i.e. no symptoms of chest pain)
  4. trestbps: resting blood pressure (in mmHg)
  5. chol: serum cholesterol level (in mg/dl)
  6. fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
  7. restecg: resting electrocardiographic results, takes 3 values
    • 0: normal
    • 1: having ST-T wave abnormality
    • 2: probable left ventricular hypertrophy
  8. thalach: maximum attained heart rate
  9. exang: exercise-induced angina (1 = yes, 0 = no) (i.e. does exercise cause angina?)
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment, takes 3 values
    • 1: upsloping
    • 2: flat
    • 3: downsloping
  12. ca: number of major vessels (0 - 3) colored by fluoroscopy
  13. thal: status of the blood disorder thalassemia, takes 3 values
    • 3: normal
    • 6: fixed defect
    • 7: reversible defect
  14. diagnosis: angiographic heart disease status, takes 2 values
    • 0: < 50% artery diameter narrowing
    • 1: > 50% artery diameter narrowing

2. Data Loading and Pre-processing

The dataset can be accessed by clicking on "Data Folder" at the top, and "processed.cleveland.data" should be listed. The data is comma-separated, so reading it in is very easy. While doing so, let's give each column a meaningful name so its attribute is easier to interpret.
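
As a rough sketch of that step (the filename assumes a local download from the Data Folder, and the column names are my own illustrative choices, reused in later snippets):

```python
# Column names matching the 14 attributes described above
columns = ["age", "sex", "chest_pain_type", "resting_bp", "cholesterol",
           "fasting_blood_sugar", "resting_ecg", "max_heart_rate",
           "exercise_angina", "st_depression", "st_slope",
           "num_major_vessels", "thalassemia", "diagnosis"]

# The file has no header row, so we supply the names ourselves
df = pd.read_csv("processed.cleveland.data", header=None, names=columns)
df.head()
```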

Next, let's see how many rows we are working with. 303! Not bad, but pre-processing in the next step may reduce this number.
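
Checking this is a one-liner:

```python
# Number of rows (patients) in the raw dataset
print(len(df))  # 303
```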

Alright, let's now check for any missing data. This step is crucial because bad data can mess with our analysis and needs to be addressed. Notice that every valid value in the dataset can be represented as a floating-point number, so any row containing a value that can't be parsed as one must be missing some data. Let's see which rows, if any, fall into that category.
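
Following that idea, one way to flag the problem rows is a sketch like this (continuing with the df from the loading step above):

```python
# Try converting every column to a float; anything that can't be parsed
# (such as placeholder characters) becomes NaN
numeric_df = df.apply(pd.to_numeric, errors="coerce")

# Rows containing at least one value that failed to convert
missing_rows = df[numeric_df.isna().any(axis=1)]
print(len(missing_rows))  # 6 rows with invalid data
```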

It turns out that 6 rows have invalid data. We have a few options to deal with this, but the simplest one is to just remove those 6 rows, since 6 rows is not a lot. An alternative approach is imputation, in which we replace the missing values with the median or mean of their respective columns. Both approaches are fine, with their own benefits and drawbacks. Deleting the rows can be bad because we are throwing away data that could be useful, as the more data we have, the better. Imputing could be bad because we could be introducing bias into our data, since we are merely estimating the missing values. But for our purposes, either should work fine, as only a small number of rows are affected.

Additionally, the description given to the diagnosis column says it should take a value of 0 or 1, but the dataset actually contains values between 0 and 4. Fortunately, most of the rows are 0 or 1, so to be safe, let's only keep the rows with a diagnosis of 0 or 1. We don't actually know what a diagnosis value of 2, 3, or 4 represents!
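
A minimal sketch of both steps, continuing from the numeric_df computed above:

```python
# Keep only rows where every value parsed as a number, and work with the
# numeric version of the dataframe from here on
df = numeric_df.dropna()

# Keep only the rows whose diagnosis is 0 or 1
df = df[df["diagnosis"].isin([0, 1])]
print(len(df))  # 214 rows remain
```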

This gives us a total of 214 rows after pre-processing.

3. Exploratory Data Analysis

After preprocessing and tidying up our dataset, we can move on to visualizing and discovering trends in our data, while using statistical approaches to reinforce those trends. This will also allow us to gain a better understanding of our data as a whole, which is important if we want to make meaningful conclusions from it!

To start off, let's make a correlation heatmap for our dataset to get a general picture of what to expect. This heatmap shows the Pearson (r) correlation coefficient between each pair of attributes in our dataset. The Pearson correlation coefficient takes a value between -1 and 1 and tells us the strength of the linear correlation between two variables. A value of exactly -1 or 1 means the two variables form an exact negative- or positive-sloped line, respectively. Realistically, this will never happen, but a value close to either endpoint means there is a strong linear correlation present.

Note: I also included a "mask" which removes the upper-right half of the heatmap, as it is redundant (it mirrors the lower-left half). Other heatmaps you encounter might not do this, and they might also use different color schemes, which can be confusing at first. The important thing is to look at the values the colors represent. If you'd like to learn more about heatmaps and how to create your own, I'd highly recommend seaborn's heatmap documentation.
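
A sketch of how such a masked heatmap can be produced with seaborn (the figure size and color map here are arbitrary choices of mine):

```python
# Pearson correlation between every pair of columns
corr = df.corr()

# Mask the upper triangle, which mirrors the lower triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(10, 8))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1)
plt.title("Pearson correlation between attributes")
plt.show()
```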

Looking at the heatmap, we can infer a lot of information. We can see which attributes have a strong linear correlation; for instance, age and max heart rate do, since their Pearson coefficient is -0.45. ST segment depression and the slope of the peak ST segment are also linearly correlated, which is unsurprising given their definitions and how they relate.

To dive deeper and gain more insight, we should further expand on our observations. Focusing on age vs max heart rate, let's see what else we can find.

Let's try plotting the distribution of max heart rates for different age groups. First, we need to group the ages into bins, since age is numerical data. Grouping data is very useful because it allows us to make observations about a specific group and compare it to others. Looking at the unique ages, we can see that there is a range of 48 years, so 4 bins of 12 years each would work nicely. We can accomplish this with pd.groupby() to group our data and pd.cut() to separate the ages into the bins. More information about using groupby() and cut() can be found here. Let's make our bins and group our data by age to visualize the distribution of max heart rate for each age group.
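
One way to do this, sticking with the illustrative column names used earlier:

```python
# Split the ages into 4 equal-width bins (roughly 12 years each)
age_groups = pd.cut(df["age"], bins=4)

# One histogram of max heart rate per age group
fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
for ax, (group, subset) in zip(axes, df.groupby(age_groups)):
    ax.hist(subset["max_heart_rate"], bins=10)
    ax.set_title(f"Ages {group}")
    ax.set_xlabel("Max heart rate")
axes[0].set_ylabel("Count")
plt.show()
```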

Now, before plotting, we should be able to "predict" what the distributions should look like. Since the r coefficient is negative, max heart rate should decrease as the age groups increase. Let's see if this is the case!

As predicted, we can see from the distributions that the average max heart rate goes down as age increases, which is consistent with the r coefficient we found for the values.

Heart rate can be a telltale sign of cardiovascular health (though not always). An abnormally low heart rate can be a sign of heart disease or a heart attack, while a high one can be a sign of other cardiovascular problems, such as reduced pumping function of the heart. Visualizing this relationship between age and max heart rate firsthand helps us better understand heart health and what normal heart rates look like at different ages.

Apart from age, a person's sex is also relevant when it comes to heart disease. For one, numerous studies have shown that men are much more likely (around twice as likely!!) to have heart attacks and heart disease than women. Let's see if we can conclude something similar from our data.

Like with age, we will be using pd.groupby(), but with sex this time. This separates the men's and women's data into their own dataframes, so we can make comparisons between them. First, let's compare the proportion of women diagnosed with angiographic heart disease to that of men using a double bar plot.
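
A sketch of that comparison, assuming the dataset's usual coding of sex (0 = female, 1 = male); since diagnosis is 0 or 1, its mean within each group is the proportion diagnosed:

```python
# Separate subsets we will reuse in later snippets
women = df[df["sex"] == 0]
men = df[df["sex"] == 1]

# Proportion of each sex diagnosed with angiographic heart disease
proportions = df.groupby("sex")["diagnosis"].mean()
proportions.index = ["Women", "Men"]
proportions.plot(kind="bar", rot=0)
plt.ylabel("Proportion diagnosed with heart disease")
plt.show()
```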

As we can see, 11% of women in our dataset were diagnosed with angiographic heart disease, which is far less than the 33% of men. This means that, according to our dataset, men are three times more likely to be diagnosed with heart disease than women, which is higher than the roughly two-times figure from the research studies. This difference could be because we need more data to get an accurate estimate, since we only have 80 women and 134 men in our dataset. However, there is clearly a higher proportion of men with heart disease than women, as expected.

As we have already grouped by sex, let's take this opportunity to make further inferences from our data. Let's take a look at the distribution of cholesterol levels between men and women, as our correlation heatmap indicates somewhat of a linear correlation between sex and cholesterol. To do this, we can graph both histograms on the same plot, since there are only two of them. To visualize their overlap clearly, we can set the transparency level of the colors to 50% (alpha=0.5). Let's see our results!
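
Reusing the women and men subsets from the previous snippet, the overlapping histograms might be drawn like this:

```python
# Overlay the two histograms at 50% transparency so both stay visible
plt.hist(women["cholesterol"], bins=20, alpha=0.5, label="Women")
plt.hist(men["cholesterol"], bins=20, alpha=0.5, label="Men")
plt.xlabel("Serum cholesterol (mg/dl)")
plt.ylabel("Count")
plt.legend()
plt.show()
```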

Looking at the two distributions, it seems that women generally have a higher cholesterol level than men, though it's pretty close. It's also been shown that a high cholesterol level increases the chances of heart disease; however, women generally have higher cholesterol but a lower chance of heart disease compared to men, which seems contradictory. This is likely because women's cholesterol levels aren't that much higher than men's, as shown by our plot, but more importantly, cholesterol is not the only determining factor for the risk of heart disease!

Next, going back to age vs max heart rate, we saw that as age increased, max heart rate decreased. But perhaps the rate of decrease differs between men and women? Let's find out by plotting a scatterplot of age vs max heart rate for each sex and calculating a regression line for each using np.polyfit().
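
A sketch of the scatterplot and the two fitted lines, again reusing the women and men subsets:

```python
plt.figure(figsize=(8, 6))
for subset, label in [(women, "Women"), (men, "Men")]:
    x = subset["age"]
    y = subset["max_heart_rate"]
    plt.scatter(x, y, alpha=0.6, label=label)

    # Fit a degree-1 polynomial (a straight line) and plot it
    slope, intercept = np.polyfit(x, y, deg=1)
    xs = np.linspace(x.min(), x.max(), 100)
    plt.plot(xs, slope * xs + intercept,
             label=f"{label} fit: slope = {slope:.3f}")

plt.xlabel("Age")
plt.ylabel("Max heart rate")
plt.legend()
plt.show()
```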

Each regression line minimizes the squared error on its respective subset, and we can use that line to estimate max heart rate given an age. To better understand how linear regression works, check out this article. The slopes of the regression lines also give the rate at which max heart rate decreases for every year of age. In our case, the slope for men is a bit steeper than that for women (-1.138 vs -0.859), but not by much. Let's examine the lines further by calculating their r coefficient values.
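
The r values (and the r-squared values discussed below) can be computed directly with np.corrcoef, for example:

```python
# Pearson r between age and max heart rate for each group
for subset, label in [(women, "Women"), (men, "Men")]:
    r = np.corrcoef(subset["age"], subset["max_heart_rate"])[0, 1]
    print(f"{label}: r = {r:.3f}, r-squared = {r ** 2:.3f}")
```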

As expected, the two values are similar since the slopes of the lines are similar. The coefficients are both negative, which indicates a negative relationship between age and max heart rate (i.e. as age increases, max heart rate decreases).

Using the calculated r coefficient, we can also obtain the r-squared value by squaring it. This r-squared value represents the proportion of variance that can be explained by our linear model (y = mx + b, from np.polyfit()). In basically all cases, some variance will be left unexplained, because the data points won't fall exactly on the regression line. The more closely they do, though, the higher the r-squared value, meaning more of the variance in the data is captured by the model. In our case, the r-squared value is not that high, which means our model only explains part of the total variance.

4. Hypothesis Testing using T-test

Lastly, looking at the graph of our data, it seems that the average max heart rate is about the same for men and women. However, according to research, men's max heart rate tends to be higher than women's. To see if our data follows this trend as well, we can use a t-test! To perform one, let's use ttest_ind() from the scipy.stats library. Note that in order to do this, the two samples have to be independent. In our case they are, since the men's and women's samples are independent. To learn more about hypothesis testing and t-tests, I encourage reading this article.
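
A sketch of the test call (note that the alternative argument requires a reasonably recent version of SciPy, 1.6 or later):

```python
from scipy.stats import ttest_ind

# One-sided test: alternative="less" asks whether the first sample's mean
# (women) is less than the second sample's mean (men)
t_stat, p_value = ttest_ind(women["max_heart_rate"], men["max_heart_rate"],
                            alternative="less")
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```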

We pass our two samples into ttest_ind() and use "less" as our alternative hypothesis, since we are testing whether the mean max heart rate for women is less than that of men. We get a p-value of ~0.29, which is greater than our chosen significance level of 0.05. Hence, there is not enough evidence to reject the null hypothesis, so we cannot conclude that the average max heart rate for women is lower than that of men. This is not exactly the result we were expecting, so let's print out the sample means to see what's going on.

So, it seems that the men in our dataset only had a slightly higher mean max heart rate. This slight difference, however, is not large enough to conclude that the men's mean is definitely higher than the women's, which is why our t-test failed to reject the null hypothesis.

5. Using Logistic Regression to Predict Angiographic Diagnosis

Finally, after visualizing our data and gaining a much better understanding of it, we can use some of the insights we gained to create a machine learning model that predicts the angiographic diagnosis of a new patient, given the other attributes. In our case, diagnosis is a classification problem with two classes, 0 and 1. For classification problems, a good algorithm to try is logistic regression, so let's give that a shot! Before we do, though, there's just one thing we should change about our data, which is applying one-hot encoding to some of our columns.

One-hot encoding aims to solve the issue of assigning categorical variables to different integers. The reason this is a problem is that many machine learning algorithms treat numbers as ordered magnitudes; in other words, 2 is bigger than 1 and therefore carries more weight. But for a categorical variable, say pets, 2 might mean "cat" and 1 might mean "dog", and that doesn't mean cats matter more than dogs in our model. So, if we leave the encoding like that without one-hot encoding, our machine learning model might incorrectly interpret cats as being more important than dogs.

More specifically, one-hot encoding transforms an integer-coded categorical variable into a boolean vector of size n, where n is the number of categories for that variable. To do this with a pandas dataframe, we call the pd.get_dummies() function, which creates n new columns, one for each categorical value, and drops the old column. In our dataset, notice that we have three categorical variables whose numerical values carry little meaning if left unchanged. For example, one of them is chest pain type, which takes an integer from 1 to 4. However, those integers don't relate to one another at all, so it's better to use one-hot encoding instead. This results in 4 new columns being created, since there are 4 different values for chest pain type, and the old chest pain type column being dropped. The same is done with the resting ECG and thalassemia variables, each with their own number of categories.
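
With the illustrative column names used earlier, the encoding step might look like this:

```python
# One-hot encode the three categorical columns; get_dummies creates one new
# column per category value and drops the original columns
df = pd.get_dummies(df, columns=["chest_pain_type", "resting_ecg", "thalassemia"])
```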

For transparency's sake, one-hot encoding can have some drawbacks. For one, since we are splitting a single variable into several columns, it can introduce multicollinearity, as the new columns are correlated with one another. Increasing the number of columns can also lead to longer training times on very large datasets with many features, but for our small dataset this won't be an issue.

Now, we can train our logistic regression model on our dataset. To test the accuracy of our model, we will use K-Folds cross validation, which trains and tests the model multiple times on different slices of the dataset so that every row is used for both training and testing. This gives a better representation of the accuracy of our model than a single training/testing split.
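
A sketch of the training and evaluation step; the fold count (5 here), shuffling, and the max_iter setting are my own choices, so exact numbers may differ from the original run:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]

# K-Folds cross validation: each fold takes a turn as the test set while
# the remaining folds are used for training
kf = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=kf)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```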

Overall, we got an accuracy of around 80%, which is pretty good. We can also use our model to predict the diagnosis of a new patient given their attributes, as shown below. Obviously this can't be used for medical purposes, but it's interesting to be able to do so with our dataset. For further work, one could train a similar model on a different dataset with different attributes, or perhaps one from a different region (other than Cleveland). Also, the model I used for classification was logistic regression, but there are many other classification models I could've used, like SVMs and decision trees. For a list of all possible models (including classification models), check out sklearn's wonderful documentation.
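
As a sketch of what such a prediction looks like: the "new patient" below is just an existing row of X reused as a stand-in, since in practice you would build a one-row dataframe with the same one-hot encoded columns.

```python
# Refit the model on the full dataset, then predict a diagnosis for a
# stand-in "new patient" (here simply the first row of X)
model.fit(X, y)

new_patient = X.iloc[[0]]            # one-row DataFrame with the same columns as X
print(model.predict(new_patient))    # predicted class: 0 or 1
```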

And just a refresher, the predicted class corresponds to the angiographic disease status diagnosis: a value of 0 means less than 50% artery diameter narrowing, while a value of 1 means more than 50% narrowing.

I've also included a link to more information about what this means near the beginning of the tutorial, where I went over the different attributes of the dataset.

Summary

As mentioned, heart disease is an important topic which requires lots of research and investigation. This tutorial aims to provide a solid understanding of the trends and factors related to heart disease, such as age, sex, and cholesterol levels, using data science techniques and statistical analysis (Pearson coefficients, t-tests) to support those findings. To close it off, we trained a logistic regression classifier to predict the diagnosis of angiographic heart disease for new patients. I hope that readers of this tutorial gained a better understanding of heart disease and data science, and are able to extend this knowledge to new datasets and different classification models, with an understanding of how and why it works.