Breast Cancer Analysis
This analysis will investigate potential factors towards a patient’s survival of breast cancer.
Such as:
Understanding what may influence survival could help more people struggling with breast cancer.
Data was obtained from Kaggle
# Loading libraries necessary for analysis
library(tidyverse)
library(janitor)
library(ggpubr) # Obtain function needed for Correlation Coefficient
cancer_data <- read_csv('Breast_Cancer_Data.csv')
| Age | Gender | Protein1 | Protein2 | Protein3 | Protein4 | Tumour_Stage | Histology | Surgery_type | Surgery_Date | Last_Visit_Date | Patient_Status | Surgery_Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36 | FEMALE | 0.08 | 0.43 | 0.55 | 0.27 | III | Infiltrating Ductal Carcinoma | Modified Radical Mastectomy | 2017-01-15 | 2017-06-19 | Alive | 2017 |
| 56 | FEMALE | 0.35 | -0.21 | -0.19 | 0.12 | II | Infiltrating Ductal Carcinoma | Modified Radical Mastectomy | 2017-01-25 | 2017-07-12 | Alive | 2017 |
| 53 | FEMALE | -0.07 | 1.42 | -0.36 | 0.39 | II | Infiltrating Ductal Carcinoma | Simple Mastectomy | 2017-02-04 | 2018-02-07 | Alive | 2017 |
| 40 | FEMALE | -0.57 | 1.27 | -0.29 | 0.19 | II | Infiltrating Lobular Carcinoma | Other | 2017-02-14 | 2017-12-15 | Alive | 2017 |
| 75 | FEMALE | 0.17 | 0.03 | -0.16 | 0.67 | I | Infiltrating Lobular Carcinoma | Simple Mastectomy | 2017-02-24 | 2017-04-05 | Alive | 2017 |
| 41 | FEMALE | 0.14 | 1.08 | 0.21 | 0.97 | I | Infiltrating Lobular Carcinoma | Simple Mastectomy | 2017-03-06 | 2019-08-09 | Alive | 2017 |
colnames(cancer_data)
## [1] "Age" "Gender" "Protein1" "Protein2"
## [5] "Protein3" "Protein4" "Tumour_Stage" "Histology"
## [9] "Surgery_type" "Surgery_Date" "Last_Visit_Date" "Patient_Status"
## [13] "Surgery_Year"
glimpse(cancer_data)
## Rows: 334
## Columns: 13
## $ Age <dbl> 36, 56, 53, 40, 75, 41, 62, 45, 62, 50, 61, 43, 74, 56…
## $ Gender <chr> "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEM…
## $ Protein1 <dbl> 0.080353, 0.345090, -0.069535, -0.565700, 0.171640, 0.…
## $ Protein2 <dbl> 0.426380, -0.211470, 1.418300, 1.266800, 0.029656, 1.0…
## $ Protein3 <dbl> 0.547150, -0.193040, -0.361050, -0.293460, -0.158900, …
## $ Protein4 <dbl> 0.273680, 0.124270, 0.391580, 0.193950, 0.674710, 0.97…
## $ Tumour_Stage <chr> "III", "II", "II", "II", "I", "I", "II", "I", "II", "I…
## $ Histology <chr> "Infiltrating Ductal Carcinoma", "Infiltrating Ductal …
## $ Surgery_type <chr> "Modified Radical Mastectomy", "Modified Radical Maste…
## $ Surgery_Date <date> 2017-01-15, 2017-01-25, 2017-02-04, 2017-02-14, 2017-…
## $ Last_Visit_Date <date> 2017-06-19, 2017-07-12, 2018-02-07, 2017-12-15, 2017-…
## $ Patient_Status <chr> "Alive", "Alive", "Alive", "Alive", "Alive", "Alive", …
## $ Surgery_Year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
The following tables highlight:
The amount of patients
Average expression levels of Proteins 1 - 4
Percentage of Patients alive
Amount of patients whose survival status is unknown
Percentage highlighting the amount of cases of stages 1-3
The following table shows the statistics between all 334 patients:
| Patients | Protein 1 | Protein 2 | Protein 3 | Protein 4 | Survival Rate | Unknown | Stage I | Stage II | Stage III |
|---|---|---|---|---|---|---|---|---|---|
| 334 | -0.03 | 0.95 | -0.09 | 0.01 | 76.35% | 13 | 19% | 57% | 24% |
| Surgery Year | Patients | Protein 1 | Protein 2 | Protein 3 | Protein 4 | Survival Rate | Unknown | Stage I | Stage II | Stage III |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017 | 61 | 0.08 | 0.94 | -0.09 | 0.16 | 75.41% | 3 | 25% | 57% | 18% |
| 2018 | 152 | -0.07 | 0.91 | -0.14 | -0.10 | 77.63% | 4 | 18% | 56% | 26% |
| 2019 | 121 | -0.03 | 1.00 | -0.02 | 0.07 | 75.21% | 6 | 17% | 57% | 26% |
Result:
Future analysis needed to grasp the difference between the Protein expression levels throughout the years.
The following graphs’ axis are the patient’s expression levels for two different proteins
ggplot(data = cancer_data,mapping= aes(x = protein1,y = protein2)) +
geom_point(aes(color = patient_status)) +
geom_smooth(method = lm, se = FALSE) +
stat_cor(method = "pearson", label.x = -1, label.y = 3) +
labs(title = "Expression Levels Between Protein 1 & Protein 2",
x = "Protein 1",
y = "Protein 2",
color = "Patient Status") +
theme(plot.background = element_rect(fill = '#FBD7CD'),
legend.background = element_rect(fill = "#FBD7CD"))
ggplot(data = cancer_data,mapping = aes(x = protein1,y=protein3)) +
geom_point(aes(color = patient_status)) +
geom_smooth(method = lm, se = FALSE) +
stat_cor(method = "pearson", label.x = -1, label.y = 3) +
labs(title = "Expression Levels Between Protein 1 & Protein 3",
x = "Protein 1",
y = "Protein 3",
color = "Patient Status") +
theme(plot.background = element_rect(fill = '#FBD7CD'),
legend.background = element_rect(fill = "#FBD7CD"))
ggplot(data = cancer_data, mapping = aes(x = protein1,y=protein4)) +
geom_point(aes(color = patient_status)) +
geom_smooth(method = lm, se = FALSE) +
stat_cor(method = "pearson", label.x = -1, label.y = 3) +
labs(title = "Expression Levels Between Protein 1 & Protein 4",
x = "Protein 1",
y = "Protein 4",
color = "Patient Status") +
theme(plot.background = element_rect(fill = '#FBD7CD'),
legend.background = element_rect(fill = "#FBD7CD"))
ggplot(data = cancer_data, mapping = aes(x = protein2,y=protein3)) +
geom_point(aes(color = patient_status)) +
geom_smooth(method = lm, se = FALSE) +
stat_cor(method = "pearson", label.x = -1, label.y = 3) +
labs(title = "Expression Levels Between Protein 2 & Protein 3",
x = "Protein 2",
y = "Protein 3",
color = "Patient Status") +
theme(plot.background = element_rect(fill = '#FBD7CD'),
legend.background = element_rect(fill = "#FBD7CD"))
ggplot(data = cancer_data, mapping = aes(x = protein2,y=protein4)) +
geom_point(aes(color = patient_status)) +
geom_smooth(method = lm, se = FALSE) +
stat_cor(method = "pearson", label.x = -1, label.y = 3) +
labs(title = "Expression Levels Between Protein 2 & Protein 4",
x = "Protein 2",
y = "Protein 4",
color = "Patient Status") +
theme(plot.background = element_rect(fill = '#FBD7CD'),
legend.background = element_rect(fill = "#FBD7CD"))
ggplot(data = cancer_data, mapping = aes(x = protein3,y=protein4)) +
geom_point(aes(color = patient_status)) +
geom_smooth(method = lm, se = FALSE) +
stat_cor(method = "pearson", label.x = -1, label.y = 3) +
labs(title = "Expression Levels Between Protein 3 & Protein 4",
x = "Protein 3",
y = "Protein 4",
color = "Patient Status") +
theme(plot.background = element_rect(fill = '#FBD7CD'),
legend.background = element_rect(fill = "#FBD7CD"))
Next is an attempt to identify whether the type of surgery patient their survival later in life
# Grouped bar chart showing the patient status for each surgery type divided by cancer stage
ggplot(data = cancer_data, aes(x = tumour_stage))+
geom_bar(aes(fill = patient_status),position = 'dodge') +
geom_text(aes(label = after_stat(count),group = patient_status), stat ="count",
size = 3, position = position_dodge(0.8), vjust=-0.5) +
facet_wrap(~surgery_type,scales = 'free') +
ylim(0,45) +
labs(title = "Surgeries Performed at Different Stages of Cancer",
subtitle = "Split By Patient Survival",
x = "Tumor Stage",
y = "Amount of Patients",
fill = "Patient Status") +
theme(plot.background = element_rect(fill = '#FBD7CD'),
legend.background = element_rect(fill = "#FBD7CD"))
Stage 1:
Stage 2:
Stage 3:
Conclusion:
The type of surgery seems to correlate with the likelihood a patient will live years after surgery.
Some surgeries perform far better in different stages:
Limitation:
The data set doesn’t specify if the patient’s cause of death is related to the cancer or the result of something else.
The goal is to identify how patient’s ages spread in the data set.
cancer_data %>%
filter(patient_status != 'Unknown') %>%
ggplot() +
geom_histogram(aes(x = age, fill = tumour_stage),position = position_dodge(4),
binwidth = 5) +
labs(title = "Age Range of Patients",
subtitle = "Split By Tumor Stage",
x = "Age",
y = "Amount of Patients",
fill = "Tumor Stage",
caption = "Bin Range of 5") +
theme(plot.background = element_rect(fill = '#FBD7CD'),
legend.background = element_rect(fill = "#FBD7CD"),
plot.caption.position = 'plot') +
facet_wrap(~patient_status)
There is a notable bell curve in the data:
Patients that are alive:
Patients that are deceased:
Next, it is to investigate whether the year a patient had their surgery would affect their chances of a future.
Although 2018 seemed to perform the best, there are a few things to consider:
It is unlikely the year a patient had their surgery would affect their chances of survival.
cancer_data %>%
group_by(surgery_year,patient_status) %>%
summarize(count = n()) %>%
ggplot(mapping =aes(x = surgery_year, y = patient_status,fill = count)) +
geom_tile(color = 'black', lwd = 1.5,) +
geom_text(aes(label = count), color = 'black',size = 6) +
coord_fixed() +
scale_fill_distiller(palette = 'OrRd',direction = +1) +
labs(title = "Current Patient Survival Split By Year",
y = "Patient Status",
x = "Year of Surgery",
fill = "Number of Patients",
caption = 'Some patients whereabouts are unknown') +
theme(plot.background = element_rect(fill = '#FBD7CD'),
panel.background = element_rect(fill = '#FBD7CD'),
legend.background = element_rect(fill = "#FBD7CD"),
plot.caption.position = 'plot')
Summarizes key points discussed:
Additionally, filters allow for a more in-depth look at the data
Allows for user to specify:
The Histology filter helps further identify any patterns among the different surgeries
Utilizing filters allowed for a deeper analysis on the performances of different surgeries
These factors seem to influence a patient’s chances of survival post-surgery:
Limitations:
Further analysis needed to ensure these limitations didn’t impact the results.