Goal

This analysis will investigate potential factors towards a patient’s survival of breast cancer.

Such as:

Understanding what may influence survival could help more people struggling with breast cancer.

Understanding the Data

Data was obtained from Kaggle

Libraries

# Loading libraries necessary for analysis 
library(tidyverse) 
library(janitor)
library(ggpubr) # Obtain function needed for Correlation Coefficient

View the Data

  • View the first few rows to see how the data was organized
cancer_data <- read_csv('Breast_Cancer_Data.csv')
TABLE 1.1: Glimpse of the Data
Age Gender Protein1 Protein2 Protein3 Protein4 Tumour_Stage Histology Surgery_type Surgery_Date Last_Visit_Date Patient_Status Surgery_Year
36 FEMALE 0.08 0.43 0.55 0.27 III Infiltrating Ductal Carcinoma Modified Radical Mastectomy 2017-01-15 2017-06-19 Alive 2017
56 FEMALE 0.35 -0.21 -0.19 0.12 II Infiltrating Ductal Carcinoma Modified Radical Mastectomy 2017-01-25 2017-07-12 Alive 2017
53 FEMALE -0.07 1.42 -0.36 0.39 II Infiltrating Ductal Carcinoma Simple Mastectomy 2017-02-04 2018-02-07 Alive 2017
40 FEMALE -0.57 1.27 -0.29 0.19 II Infiltrating Lobular Carcinoma Other 2017-02-14 2017-12-15 Alive 2017
75 FEMALE 0.17 0.03 -0.16 0.67 I Infiltrating Lobular Carcinoma Simple Mastectomy 2017-02-24 2017-04-05 Alive 2017
41 FEMALE 0.14 1.08 0.21 0.97 I Infiltrating Lobular Carcinoma Simple Mastectomy 2017-03-06 2019-08-09 Alive 2017

Data’s Structure

  • View the column names
colnames(cancer_data)
##  [1] "Age"             "Gender"          "Protein1"        "Protein2"       
##  [5] "Protein3"        "Protein4"        "Tumour_Stage"    "Histology"      
##  [9] "Surgery_type"    "Surgery_Date"    "Last_Visit_Date" "Patient_Status" 
## [13] "Surgery_Year"
  • There are 334 different patients with 13 columns of patient information
glimpse(cancer_data)
## Rows: 334
## Columns: 13
## $ Age             <dbl> 36, 56, 53, 40, 75, 41, 62, 45, 62, 50, 61, 43, 74, 56…
## $ Gender          <chr> "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEM…
## $ Protein1        <dbl> 0.080353, 0.345090, -0.069535, -0.565700, 0.171640, 0.…
## $ Protein2        <dbl> 0.426380, -0.211470, 1.418300, 1.266800, 0.029656, 1.0…
## $ Protein3        <dbl> 0.547150, -0.193040, -0.361050, -0.293460, -0.158900, …
## $ Protein4        <dbl> 0.273680, 0.124270, 0.391580, 0.193950, 0.674710, 0.97…
## $ Tumour_Stage    <chr> "III", "II", "II", "II", "I", "I", "II", "I", "II", "I…
## $ Histology       <chr> "Infiltrating Ductal Carcinoma", "Infiltrating Ductal …
## $ Surgery_type    <chr> "Modified Radical Mastectomy", "Modified Radical Maste…
## $ Surgery_Date    <date> 2017-01-15, 2017-01-25, 2017-02-04, 2017-02-14, 2017-…
## $ Last_Visit_Date <date> 2017-06-19, 2017-07-12, 2018-02-07, 2017-12-15, 2017-…
## $ Patient_Status  <chr> "Alive", "Alive", "Alive", "Alive", "Alive", "Alive", …
## $ Surgery_Year    <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …

Comparing Data Statistics

The following tables highlight:

TABLE 2.1: Data Set Statistics
Patients Protein 1 Protein 2 Protein 3 Protein 4 Survival Rate Unknown Stage I Stage II Stage III
334 -0.03 0.95 -0.09 0.01 76.35% 13 19% 57% 24%
TABLE 2.2: Yearly Statistics
Surgery Year Patients Protein 1 Protein 2 Protein 3 Protein 4 Survival Rate Unknown Stage I Stage II Stage III
2017 61 0.08 0.94 -0.09 0.16 75.41% 3 25% 57% 18%
2018 152 -0.07 0.91 -0.14 -0.10 77.63% 4 18% 56% 26%
2019 121 -0.03 1.00 -0.02 0.07 75.21% 6 17% 57% 26%

Result:

Future analysis needed to grasp the difference between the Protein expression levels throughout the years.

Comparing Protein Expression Levels

The following graphs’ axis are the patient’s expression levels for two different proteins

Protein 1 vs 2

Chart

  • Protein 1 and Protein 2 have no significant positive correlation
    • R Value: 0.24
      • The closer to 1 or -1 , the stronger the correlation
    • The p value indicates this is a significantly insignificant result
  • Although there are no deceased patients when both the proteins are low, the lack of a significant correlation makes these findings unreliable.

Code

ggplot(data = cancer_data,mapping= aes(x = protein1,y = protein2)) +
  geom_point(aes(color = patient_status)) +
  geom_smooth(method = lm, se = FALSE) +
  stat_cor(method = "pearson", label.x = -1, label.y = 3) +
  labs(title = "Expression Levels Between Protein 1 & Protein 2",
       x = "Protein 1",
       y = "Protein 2", 
       color = "Patient Status") +
  theme(plot.background = element_rect(fill = '#FBD7CD'),
        legend.background = element_rect(fill = "#FBD7CD"))

Protein 1 vs 3

Chart

  • Protein 1 and Protein 3 have no significant negative correlation
    • R Value: -0.13
    • The p value indicates this is a significantly insignificant result
  • Although there are no deceased patients when protein 1 is low and protein 3 is high, the lack of a significant correlation makes these findings unreliable.

Code

ggplot(data = cancer_data,mapping = aes(x = protein1,y=protein3)) +
  geom_point(aes(color = patient_status)) +
  geom_smooth(method = lm, se = FALSE) +
  stat_cor(method = "pearson", label.x = -1, label.y = 3) +
  labs(title = "Expression Levels Between Protein 1 & Protein 3",
       x = "Protein 1",
       y = "Protein 3", 
       color = "Patient Status") +
  theme(plot.background = element_rect(fill = '#FBD7CD'),
        legend.background = element_rect(fill = "#FBD7CD"))

Protein 1 vs 4

Chart

  • Protein 1 and Protein 3 have an insignificant positive correlation
    • R Value: 0.28
    • The p value indicates this is a significantly insignificant result
  • Although there are no deceased patients when both proteins are low, the lack of a significant correlation makes these findings unreliable.

Code

ggplot(data = cancer_data, mapping = aes(x = protein1,y=protein4)) +
  geom_point(aes(color = patient_status)) +
  geom_smooth(method = lm, se = FALSE) +
  stat_cor(method = "pearson", label.x = -1, label.y = 3) +
  labs(title = "Expression Levels Between Protein 1 & Protein 4",
       x = "Protein 1",
       y = "Protein 4", 
       color = "Patient Status") +
  theme(plot.background = element_rect(fill = '#FBD7CD'),
        legend.background = element_rect(fill = "#FBD7CD"))

Protein 2 vs 3

Chart

  • Protein 2 and Protein 3 have an insignificant negative correlation
    • R Value: -0.42
    • The p value indicates this is a significantly insignificant result
  • The Patient Status is scattered throughout the plot

Code

ggplot(data = cancer_data, mapping = aes(x = protein2,y=protein3)) +  
  geom_point(aes(color = patient_status)) +
  geom_smooth(method = lm, se = FALSE) +
  stat_cor(method = "pearson", label.x = -1, label.y = 3) +
  labs(title = "Expression Levels Between Protein 2 & Protein 3",
       x = "Protein 2",
       y = "Protein 3", 
       color = "Patient Status") +
  theme(plot.background = element_rect(fill = '#FBD7CD'),
        legend.background = element_rect(fill = "#FBD7CD"))

Protein 2 vs 4

Chart

  • Protein 2 and Protein 4 have an insignificant positive correlation
    • R Value: 0.088
  • The p value indicates this is unreliable result
    • Some form of bias in data

Code

ggplot(data = cancer_data, mapping = aes(x = protein2,y=protein4)) +
  geom_point(aes(color = patient_status)) +
  geom_smooth(method = lm, se = FALSE) +
  stat_cor(method = "pearson", label.x = -1, label.y = 3) +
  labs(title = "Expression Levels Between Protein 2 & Protein 4",
       x = "Protein 2",
       y = "Protein 4", 
       color = "Patient Status") +
  theme(plot.background = element_rect(fill = '#FBD7CD'),
        legend.background = element_rect(fill = "#FBD7CD"))

Protein 3 vs 4

Chart

  • Protein 3 and Protein 4 have an insignificant positive correlation
    • R Value: 0.065
  • The p value indicates this is unreliable result
    • Some form of bias in data

Code

ggplot(data = cancer_data, mapping = aes(x = protein3,y=protein4)) +
  geom_point(aes(color = patient_status)) +
  geom_smooth(method = lm, se = FALSE) +
  stat_cor(method = "pearson", label.x = -1, label.y = 3) +
  labs(title = "Expression Levels Between Protein 3 & Protein 4",
       x = "Protein 3",
       y = "Protein 4", 
       color = "Patient Status") +
  theme(plot.background = element_rect(fill = '#FBD7CD'),
        legend.background = element_rect(fill = "#FBD7CD"))

Conclusion

  • The Protein Expression levels hold no strong correlation with each other
    • Cannot be used to predict a patient’s expression levels
    • Unreliable patterns such as:
      • No deceased patients when protein 1 and protein 2 are low
      • Could be sample bias
    • No discernible pattern with patient survival

Surgery Type

Next is an attempt to identify whether the type of surgery patient their survival later in life

Chart

Code

# Grouped  bar chart showing the patient status for each surgery type divided by cancer stage
ggplot(data = cancer_data, aes(x = tumour_stage))+
  geom_bar(aes(fill = patient_status),position = 'dodge') + 
  geom_text(aes(label = after_stat(count),group = patient_status), stat ="count", 
            size = 3, position = position_dodge(0.8), vjust=-0.5) +
  facet_wrap(~surgery_type,scales = 'free')  + 
  ylim(0,45) +
  labs(title = "Surgeries Performed at Different Stages of Cancer",
       subtitle = "Split By Patient Survival",
       x = "Tumor Stage",
       y = "Amount of Patients", 
       fill = "Patient Status") +
  theme(plot.background = element_rect(fill = '#FBD7CD'),
        legend.background = element_rect(fill = "#FBD7CD"))

Stage 1:

Stage 2:

Stage 3:

Conclusion:

The type of surgery seems to correlate with the likelihood a patient will live years after surgery.

Some surgeries perform far better in different stages:

Limitation:

The data set doesn’t specify if the patient’s cause of death is related to the cancer or the result of something else.

Cancer Stages Spread Across Ages

The goal is to identify how patient’s ages spread in the data set.

Chart

Code

cancer_data %>% 
  filter(patient_status != 'Unknown') %>% 
  ggplot() +
    geom_histogram(aes(x = age, fill = tumour_stage),position = position_dodge(4),
                    binwidth = 5) +
    labs(title = "Age Range of Patients",
         subtitle = "Split By Tumor Stage",
         x = "Age",
         y = "Amount of Patients", 
         fill = "Tumor Stage",
         caption = "Bin Range of 5") +
    theme(plot.background = element_rect(fill = '#FBD7CD'),
          legend.background = element_rect(fill = "#FBD7CD"),
          plot.caption.position = 'plot') +
    facet_wrap(~patient_status)

There is a notable bell curve in the data:

Patients that are alive:

Patients that are deceased:

Patient Survival By Surgery Year

Next, it is to investigate whether the year a patient had their surgery would affect their chances of a future.

Chart

Although 2018 seemed to perform the best, there are a few things to consider:

  1. There are far more patients in the data set whose surgery took place in 2018
    • Table 2.2:
      • 2017 : 61 Patients
      • 2018 : 152 Patients
      • 2019 : 121 Patients
  2. The survival rate is almost the same, regardless of the amount of patients
    • Table 2.2:
      • 2018 : 75.41%
      • 2018 : 77.63%
      • 2018 : 75.21%

It is unlikely the year a patient had their surgery would affect their chances of survival.

Code

cancer_data %>% 
  group_by(surgery_year,patient_status) %>% 
  summarize(count = n()) %>% 
  ggplot(mapping =aes(x = surgery_year, y = patient_status,fill = count)) +
    geom_tile(color = 'black', lwd = 1.5,) +
    geom_text(aes(label = count), color = 'black',size = 6) +
    coord_fixed() +
  scale_fill_distiller(palette = 'OrRd',direction = +1) +
  labs(title = "Current Patient Survival Split By Year",
       y = "Patient Status",
       x = "Year of Surgery", 
       fill = "Number of Patients",
       caption = 'Some patients whereabouts are unknown') +
  theme(plot.background = element_rect(fill = '#FBD7CD'),
        panel.background = element_rect(fill = '#FBD7CD'),
        legend.background = element_rect(fill = "#FBD7CD"),
        plot.caption.position = 'plot')

Breast Cancer Dashboard

Summarizes key points discussed:

Additionally, filters allow for a more in-depth look at the data

Allows for user to specify:

The Histology filter helps further identify any patterns among the different surgeries

Stage 1

  • Lumpectomy had two patients deceased for Stage 1
    • Both are from an Infiltrating Ductal Carcinoma
    • 3-0 for Infiltrating Lobular Carcinoma & 2-0 Mucinous Carcinoma
  • Modified Radical Masectomy performance dependent on type of cancer:
    • Infiltrating Ductal Carcinoma: 4-2
    • Mucinous Carcinoma : 1-0

Stage 2

  • Previously stated Simple Mastectomy had 38 alive patients, while having the least deceased (5)
    • Filter showed it performs best against Infiltrating Ductal Carcinoma (21-4) but not as impressive against Infiltrating Lobular Carcinoma
      • Simple Mastectomy : 7-1
      • Modified Radical Mastectomy : 13-1
      • Lumpectomy : 7-1
  • Lumpectomy was shown to have the least patients alive (29), while having concerning amount of deaths (7)
    • Yet filters highlight its good performance against Infiltrating Ductal Carcinoma (21-4) & Infiltrating Lobular Carcinoma (7-1)
    • Performed poorly against Mucinous Carcinoma (1-2)

Stage 3

  • Lumpectomy had no deaths against Infiltrating Ductal Carcinoma & Infiltrating Lobular Carcinoma
    • No patients had Mucinous Carcinoma for Stage 3
  • Simple Mastectomy is still unreliable against Infiltrating Ductal Carcinoma (4-4)
    • Although against Infiltrating Lobular Carcinoma it had no deaths (1-0)
  • Modified Radical Mastectomy is great against Infiltrating Ductal Carcinoma (23-4)
    • Conscerning against Infiltrating Lobular Carcinoma (6-5)

Conclusion

Utilizing filters allowed for a deeper analysis on the performances of different surgeries

  • Highlights which surgeries were better for which type of breast cancer

These factors seem to influence a patient’s chances of survival post-surgery:

  • Different stages of cancer
  • The type of Breast Cancer
  • The type of Surgery

Final Thoughts

  • It is unlikely the year a patient had their surgery would affect their chances of survival.
    • The survival rate is consistent
      • About 75%-77%
  • The Protein Expression levels hold no strong correlation with each other
    • Cannot be used to predict a patient’s expression levels
    • No discernible pattern with patient survival
  • Those with cancer are likely to be between ages 40-80
    • Stage 2 is the most common
    • Cases rise as patients grow closer to 60, but decrease the older patients are
    • Very rarely did a patient die from Stage 1, regardless of age.
  • Factors seemingly influence a patient’s chances of survival post-surgery:
    • Different stages of cancer
    • The type of Breast Cancer
    • The type of Surgery

Limitations:

  • Unspecific cause of death
  • Lack of information on the “Unknown” Patients
  • Possible sample bias in data set
  • Unequal spread of type of cancer patients had

Further analysis needed to ensure these limitations didn’t impact the results.