predictors-of-cancer-recurrence-1
October 23, 2023
[1]: # Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns
import collections
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              AdaBoostClassifier, VotingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelBinarizer
from ucimlrepo import fetch_ucirepo
import warnings
warnings.filterwarnings('ignore')
print('Libraries successfully imported!')
Libraries successfully imported!
[3]: pip install ucimlrepo
Collecting ucimlrepo
Obtaining dependency information for ucimlrepo from https://files.pythonhosted
.org/packages/85/8b/aab8a1c1344af158feb0b7f13d15ae184bc1e93625cea98d9c783b2e29d4
/ucimlrepo-0.0.2-py3-none-any.whl.metadata
Downloading ucimlrepo-0.0.2-py3-none-any.whl.metadata (5.3 kB)
Downloading ucimlrepo-0.0.2-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.2
Note: you may need to restart the kernel to use updated packages.
[4]: import urllib.request
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data"
urllib.request.urlretrieve(url, "breast-cancer.data")
[4]: ('breast-cancer.data', )
[5]: # Downloading the data directly from the repository: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer
!wget https://archive.ics.uci.edu/dataset/14/breast+cancer
‘wget’ is not recognized as an internal or external command,
operable program or batch file.
[6]: # Naming the feature columns, originally unnamed
# The meaning of each feature is detailed on pg 2 of https://www.causeweb.org/usproc/sites/default/files/usclap/2018-1/Predictors_for_Breast_Cancer_Recurrence.pdf
df = pd.read_csv('breast-cancer.data', sep=',',
                 names=['RecClass', 'Age', 'Menopause', 'TumorSize', 'InvNodes',
                        'NodeCaps', 'DegMalig', 'Breast', 'Quadrant', 'Radiation'])
# Shuffle data
df = df.sample(frac = 1).reset_index(drop = True)
# Make a copy of the data
data = df.copy()
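The raw file marks missing values with '?', which `read_csv` otherwise keeps as an ordinary string category. The sketch below shows one way to surface them at load time with the `na_values` argument. It is only an illustration: the three rows in `sample` are invented for the example and `demo` is a hypothetical name, not part of the notebook.

```python
import io
import pandas as pd

# Three illustrative rows in the same comma-separated layout as breast-cancer.data
# (the rows are made up for this sketch; '?' marks a missing value).
sample = io.StringIO(
    "no-recurrence-events,50-59,ge40,30-34,0-2,no,1,right,right_up,no\n"
    "recurrence-events,40-49,premeno,20-24,3-5,?,3,left,?,yes\n"
    "no-recurrence-events,60-69,ge40,15-19,0-2,no,2,left,left_low,no\n"
)
columns = ['RecClass', 'Age', 'Menopause', 'TumorSize', 'InvNodes',
           'NodeCaps', 'DegMalig', 'Breast', 'Quadrant', 'Radiation']

# na_values='?' turns the placeholder into NaN so it can be counted explicitly
demo = pd.read_csv(sample, names=columns, na_values='?')
missing_per_column = demo.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

Counting the NaN values this way makes it easier to decide whether to drop, impute, or keep the missing entries as their own category before encoding.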
[7]: # Exploring the first few observations
data.head()
[7]:
               RecClass    Age Menopause TumorSize InvNodes NodeCaps  DegMalig Breast   Quadrant Radiation
0  no-recurrence-events  50-59      ge40     30-34      0-2       no         1  right   right_up        no
1  no-recurrence-events  60-69      ge40     15-19      0-2       no         1   left  right_low        no
2  no-recurrence-events  50-59   premeno     10-14      0-2       no         1   left   left_low        no
3  no-recurrence-events  40-49   premeno     20-24      0-2       no         2   left  right_low        no
4  no-recurrence-events  50-59   premeno     40-44      0-2       no         2   left    left_up        no
[8]: # Objective
"""To predict breast cancer recurrence based on selected features"""
[8]: 'To predict breast cancer recurrence based on selected features'
[9]: # Checking the number of observations and features
data.shape
[9]: (286, 10)
[10]: # Checking the data type, existence of null cases within the observations
data.info()
RangeIndex: 286 entries, 0 to 285
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   RecClass   286 non-null    object
 1   Age        286 non-null    object
 2   Menopause  286 non-null    object
 3   TumorSize  286 non-null    object
 4   InvNodes   286 non-null    object
 5   NodeCaps   286 non-null    object
 6   DegMalig   286 non-null    int64
 7   Breast     286 non-null    object
 8   Quadrant   286 non-null    object
 9   Radiation  286 non-null    object
dtypes: int64(1), object(9)
memory usage: 22.5+ KB
[11]: # Listing out the feature labels
data.keys()
[11]: Index(['RecClass', 'Age', 'Menopause', 'TumorSize', 'InvNodes', 'NodeCaps',
       'DegMalig', 'Breast', 'Quadrant', 'Radiation'],
      dtype='object')
[12]: # Exploring the unique classes within each feature
# Un-comment each one to check the classes
data['Age'].unique()
# data['Breast'].unique()
# data['DegMalig'].unique()
# data['InvNodes'].unique()
# data['Menopause'].unique()
# data['NodeCaps'].unique()
# data['Quadrant'].unique()
# data['Radiation'].unique()
# data['TumorSize'].unique()
# data['RecClass'].unique()
[12]: array(['50-59', '60-69', '40-49', '30-39', '70-79', '20-29'], dtype=object)
[13]: # Creating dummy variables for the categorical features while avoiding the dummy variable trap (multicollinearity)
# Converting the Age variable into dummy variables and dropping the first column of the category
# Original unique Age categories: '20-29', '30-39', '40-49', '50-59', '60-69', '70-79' ('20-29' is dropped as the reference level)
Age = pd.get_dummies(data['Age'], drop_first=True)
data = data.drop('Age', axis=1)
Age = Age.add_prefix('AgeGroup ')
data = pd.concat([data, Age], axis=1)
[14]: # Converting the Menopause variable into dummy variables and dropping the first column of the category
# Original unique Menopause categories: 'ge40', 'lt40', 'premeno' ('ge40' is dropped)
Menopause = pd.get_dummies(data['Menopause'], drop_first=True)
data = data.drop('Menopause', axis=1)
Menopause = Menopause.add_prefix('Menopause ')
data = pd.concat([data, Menopause], axis=1)
[15]: # Converting the TumorSize variable into dummy variables and dropping the first column
# Original unique TumorSize categories: '0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54' ('0-4' is dropped)
TumorSize = pd.get_dummies(data['TumorSize'], drop_first=True)
data = data.drop('TumorSize', axis=1)
TumorSize = TumorSize.add_prefix('TumorSize ')
data = pd.concat([data, TumorSize], axis=1)
[16]: # Converting the InvNodes variable into dummy variables and dropping the first column
# Original unique InvNodes categories: '0-2', '3-5', '6-8', '9-11', '12-14', '15-17', '24-26' ('0-2' is dropped)
InvNodes = pd.get_dummies(data['InvNodes'], drop_first=True)
data = data.drop('InvNodes', axis=1)
InvNodes = InvNodes.add_prefix('InvNodes ')
data = pd.concat([data, InvNodes], axis=1)
[17]: # Converting the NodeCaps variable into dummy variables and dropping the first column
# Original unique NodeCaps categories: '?', 'no', 'yes' ('?' marks a missing value; it sorts first, so it is the level that gets dropped)
NodeCaps = pd.get_dummies(data['NodeCaps'], drop_first=True)
data = data.drop('NodeCaps', axis=1)
NodeCaps = NodeCaps.add_prefix('NodeCaps ')
data = pd.concat([data, NodeCaps], axis=1)
[18]: # Converting the Breast variable into dummy variables and dropping the first column
# Original unique Breast categories: 'left', 'right' ('left' is dropped)
Breast = pd.get_dummies(data['Breast'], drop_first=True)
data = data.drop('Breast', axis=1)
Breast = Breast.add_prefix('Breast ')
data = pd.concat([data, Breast], axis=1)
[19]: # Converting the Quadrant variable into dummy variables and dropping the first column
# Original unique Quadrant categories: '?', 'central', 'left_low', 'left_up', 'right_low', 'right_up' ('?' marks a missing value and is again the dropped level)
Quadrant = pd.get_dummies(data['Quadrant'], drop_first=True)
data = data.drop('Quadrant', axis=1)
Quadrant = Quadrant.add_prefix('Quadrant ')
data = pd.concat([data, Quadrant], axis=1)
[20]: # Converting the Radiation variable into dummy variables and dropping the first column
# Original unique Radiation categories: 'no', 'yes' ('no' is dropped)
Radiation = pd.get_dummies(data['Radiation'], drop_first=True)
data = data.drop('Radiation', axis=1)
Radiation = Radiation.add_prefix('Radiation ')
data = pd.concat([data, Radiation], axis=1)
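The eight encoding cells above repeat the same four steps per column. As a possible shortcut, `pd.get_dummies` can encode several columns in one call through its `columns` argument. The sketch below uses a small invented frame (`toy` is a hypothetical name) with only two of the categorical columns; it is not the notebook's own code.

```python
import pandas as pd

# Toy frame with two of the categorical columns (values invented for illustration)
toy = pd.DataFrame({
    'RecClass': ['no-recurrence-events', 'recurrence-events', 'no-recurrence-events'],
    'DegMalig': [1, 3, 2],
    'Breast': ['left', 'right', 'left'],
    'Radiation': ['no', 'yes', 'no'],
})

# One call encodes every listed column, drops the first level of each,
# and names the new columns "<column><prefix_sep><level>"
encoded = pd.get_dummies(toy, columns=['Breast', 'Radiation'],
                         prefix_sep=' ', drop_first=True)
print(list(encoded.columns))
```

To reproduce a custom prefix such as 'AgeGroup ' for the Age column, the `prefix` argument accepts a dictionary mapping column names to prefixes.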
[21]: # Checking a sample of 10 observations from the dataset
data.sample(10)
[21]:
                 RecClass  DegMalig  AgeGroup 30-39  AgeGroup 40-49  AgeGroup 50-59  ...
228  no-recurrence-events         2               0               0               1  ...
252  no-recurrence-events         1               0               1               0  ...
10      recurrence-events         3               0               1               0  ...
42      recurrence-events         1               0               1               0  ...
39   no-recurrence-events         2               1               0               0  ...
91      recurrence-events         2               0               1               0  ...
111     recurrence-events         3               0               0               1  ...
138     recurrence-events         3               0               1               0  ...
269  no-recurrence-events         3               0               0               1  ...
180     recurrence-events         3               0               1               0  ...
[10 rows x 34 columns]
[20]: data.info()
RangeIndex: 286 entries, 0 to 285
Data columns (total 34 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   RecClass            286 non-null    object
 1   DegMalig            286 non-null    int64
 2   AgeGroup 30-39      286 non-null    uint8
 3   AgeGroup 40-49      286 non-null    uint8
 4   AgeGroup 50-59      286 non-null    uint8
 5   AgeGroup 60-69      286 non-null    uint8
 6   AgeGroup 70-79      286 non-null    uint8
 7   Menopause lt40      286 non-null    uint8
 8   Menopause premeno   286 non-null    uint8
 9   TumorSize 10-14     286 non-null    uint8
 10  TumorSize 15-19     286 non-null    uint8
 11  TumorSize 20-24     286 non-null    uint8
 12  TumorSize 25-29     286 non-null    uint8
 13  TumorSize 30-34     286 non-null    uint8
 14  TumorSize 35-39     286 non-null    uint8
 15  TumorSize 40-44     286 non-null    uint8
 16  TumorSize 45-49     286 non-null    uint8
 17  TumorSize 5-9       286 non-null    uint8
 18  TumorSize 50-54     286 non-null    uint8
 19  InvNodes 12-14      286 non-null    uint8
 20  InvNodes 15-17      286 non-null    uint8
 21  InvNodes 24-26      286 non-null    uint8
 22  InvNodes 3-5        286 non-null    uint8
 23  InvNodes 6-8        286 non-null    uint8
 24  InvNodes 9-11       286 non-null    uint8
 25  NodeCaps no         286 non-null    uint8
 26  NodeCaps yes        286 non-null    uint8
 27  Breast right        286 non-null    uint8
 28  Quadrant central    286 non-null    uint8
 29  Quadrant left_low   286 non-null    uint8
 30  Quadrant left_up    286 non-null    uint8
 31  Quadrant right_low  286 non-null    uint8
 32  Quadrant right_up   286 non-null    uint8
 33  Radiation yes       286 non-null    uint8
dtypes: int64(1), object(1), uint8(32)
memory usage: 13.5+ KB
[22]: # Checking the final shape of the dataset
data.shape
[22]: (286, 34)
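[23]: # Mapping the input features into the X dataframe and the labels into y
# Every column except the label is numeric, so dropping RecClass leaves the 33 input features
X = data.drop(columns='RecClass')
y = data['RecClass']
# Creating the train and test data splits (stratified so both splits keep the class ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1919)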
[24]: # Checking the size and dimensions of the data splits
print(f'X_train: {X_train.shape}, y_train: {y_train.shape}')
print(f'X_test: {X_test.shape}, y_test: {y_test.shape}')
X_train: (214, 33), y_train: (214,)
X_test: (72, 33), y_test: (72,)
[25]: # Exploring the overall class distribution
# To see if the observations and the categories are fairly evenly distributed between the data splits
class_count = dict(collections.Counter(y))
train_class_count = dict(collections.Counter(y_train))
test_class_count = dict(collections.Counter(y_test))
print(f'classes: {class_count}')
print(f"no-rec:rec = {class_count['no-recurrence-events']/class_count['recurrence-events']:.2f}")
print(f'train classes: {train_class_count}')
print(f"train no-rec:rec = {train_class_count['no-recurrence-events']/train_class_count['recurrence-events']:.2f}")
print(f'test classes: {test_class_count}')
print(f"test no-rec:rec = {test_class_count['no-recurrence-events']/test_class_count['recurrence-events']:.2f}")
classes: {'no-recurrence-events': 201, 'recurrence-events': 85}
no-rec:rec = 2.36
train classes: {'no-recurrence-events': 150, 'recurrence-events': 64}
train no-rec:rec = 2.34
test classes: {'no-recurrence-events': 51, 'recurrence-events': 21}
test no-rec:rec = 2.43
[26]: # Standardising the data (mean = 0, std = 1), fitting the scaler on the training data only
X_scaler = StandardScaler().fit(X_train)
# Strictly, standardisation is optional here because every feature already lies within a single-digit range
[27]: # Applying the scaler on training and test data (not necessary to standardise outputs for classification)
X_train = X_scaler.transform(X_train)
X_test = X_scaler.transform(X_test)
[28]: # Check (mean should be approx 0 and std approx 1 on the training split;
# the test split deviates slightly because the scaler was fitted on the training data)
print(f'X_train[0]: mean: {np.mean(X_train[:, 0], axis=0):.1f}, std: {np.std(X_train[:, 0], axis=0):.1f}')
print(f'X_test[1]: mean: {np.mean(X_test[:, 1], axis=0):.1f}, std: {np.std(X_test[:, 1], axis=0):.1f}')
X_train[0]: mean: -0.0, std: 1.0
X_test[1]: mean: 0.2, std: 1.2
[29]: # Base Model
y.value_counts(normalize=True)
# A base model that always predicts no-recurrence-events would be correct about 70% of the time
[29]: no-recurrence-events    0.702797
recurrence-events       0.297203
Name: RecClass, dtype: float64
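The majority-class baseline can also be expressed as an estimator, which makes it easy to score it with the same calls as the other models. The sketch below uses scikit-learn's `DummyClassifier`; the label array is rebuilt from the class counts reported in this notebook (201 and 85), and `X_demo` is a placeholder feature matrix introduced only for the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts in the notebook (201 vs 85)
y_demo = np.array(['no-recurrence-events'] * 201 + ['recurrence-events'] * 85)
# Placeholder features: the 'most_frequent' strategy ignores them
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
print(f'baseline accuracy: {baseline_accuracy:.3f}')
```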
[30]: # Logistic Regression
# Running a pipeline of logistic regression
pipe = Pipeline(steps=[('lr', LogisticRegression())])
# Setting the parameters
params = {'lr__penalty': ['l1'],
          'lr__C': [1],
          'lr__solver': ['liblinear']}
gs_lr = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy', n_jobs=-2)
# Fitting the logistic regression model on the training data split
gs_lr.fit(X_train, y_train)
gs_lr.best_estimator_
[30]: Pipeline(steps=[('lr',
                 LogisticRegression(C=1, penalty='l1', solver='liblinear'))])
[31]: # Predicting the first 5 observations
log_reg_y_pred = gs_lr.best_estimator_.predict(X_test[:5])
log_reg_y_pred
[31]: array(['recurrence-events', 'no-recurrence-events',
       'no-recurrence-events', 'recurrence-events',
       'no-recurrence-events'], dtype=object)
[32]: # Checking the actual labels for the first 5 observations
print(f'Actual labels: {y_test[:5]}')
Actual labels: 177    no-recurrence-events
39     no-recurrence-events
230    no-recurrence-events
131       recurrence-events
147       recurrence-events
Name: RecClass, dtype: object
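[33]: # Scoring the model (training score, cross validation and test score)
# Note: the cross validation score below is computed on the full, unscaled X and y,
# so it is not directly comparable with the scores on the scaled splits
print(f'training score: {gs_lr.score(X_train, y_train)}')
print(f"cross validation score: {cross_val_score(gs_lr.best_estimator_, X, y, cv=5).mean()}")
print(f"test score: {gs_lr.score(X_test, y_test)}")
# Inference: the test accuracy (0.736) is slightly above the ~0.70 majority-class base model,
# and the small train/test gap suggests little overfitting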
training score: 0.7523364485981309
cross validation score: 0.7165154264972776
test score: 0.7361111111111112
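The cross validation call above scores the estimator on the unscaled `X`, while the model itself was fitted on scaled data. One way to keep the two consistent is to put the scaler inside the pipeline, so that each fold is standardised using only its own training portion. The sketch below illustrates this on synthetic data from `make_classification` (an assumption made so the block is self-contained); the names `X_demo`, `y_demo` and `scaled_lr` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape and a similar class balance as the notebook data
X_demo, y_demo = make_classification(n_samples=286, n_features=33,
                                     weights=[0.7, 0.3], random_state=0)

# The scaler is a pipeline step, so it is re-fitted on the training part of every fold
scaled_lr = Pipeline(steps=[
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(penalty='l1', C=1, solver='liblinear')),
])

cv_scores = cross_val_score(scaled_lr, X_demo, y_demo, cv=5, scoring='accuracy')
print(f'mean cross validation accuracy: {cv_scores.mean():.3f}')
```

With this layout the same pipeline object can be passed to `GridSearchCV` and to `cross_val_score` without scaling the data beforehand.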
[34]: # K-Nearest Neighbors Classifier
# Running a pipeline of KNN classifiers
pipe = Pipeline(steps=[('sc', StandardScaler()), ('knn', KNeighborsClassifier())])
# Setting the parameters
params = {'knn__n_neighbors': [21], 'knn__p': [1]}
gs_knn = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
# Fitting the KNN model on the training data split
gs_knn.fit(X_train, y_train)
gs_knn.best_estimator_
[34]: Pipeline(steps=[('sc', StandardScaler()),
                ('knn', KNeighborsClassifier(n_neighbors=21, p=1))])
[35]: # Scoring the model (training score, cross validation and test score)
print(f'training score: {gs_knn.score(X_train, y_train)}')
print(f"cross validation score: {cross_val_score(gs_knn.best_estimator_, X, y, cv=5).mean()}")
print(f"test score: {gs_knn.score(X_test, y_test)}")
# Inference: the test accuracy (0.722) is marginally above the base model
# but below the logistic regression model (0.736)
File “C:\Users\Sellesta\anaconda3\Lib\sitepackages\joblib\externals\loky\backend\context.py”, line 199, in
_count_physical_cores
cpu_info = subprocess.run(
^^^^^^^^^^^^^^^
File “C:\Users\Sellesta\anaconda3\Lib\subprocess.py”, line 548, in run
with Popen(*popenargs, **kwargs) as process:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “C:\Users\Sellesta\anaconda3\Lib\subprocess.py”, line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File “C:\Users\Sellesta\anaconda3\Lib\subprocess.py”, line 1538, in
_execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
training score: 0.7009345794392523
cross validation score: 0.6957652752571082
test score: 0.7222222222222222
[36]: # Decision Tree Classifier
# Running a pipeline of Decision Tree Classifiers
pipe = Pipeline(steps=[('tree', DecisionTreeClassifier())])
# Setting the parameters
params = {'tree__max_depth': [6, 8]}
gs_tree = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
# Fitting the Decision Tree model on the training data split
gs_tree.fit(X_train, y_train)
gs_tree.best_estimator_
[36]: Pipeline(steps=[('tree', DecisionTreeClassifier(max_depth=6))])
[37]: # Scoring the model (training score, cross validation and test score)
print(f'training score: {gs_tree.score(X_train, y_train)}')
print(f"cross validation score: {cross_val_score(gs_tree.best_estimator_, X, y, cv=5).mean()}")
print(f"test score: {gs_tree.score(X_test, y_test)}")
# Inference: this model overfits the training data split (0.879 vs 0.694 on test) and performs
# worse than the base model, the logistic regression and the KNN models on the test data split
training score: 0.8785046728971962
cross validation score: 0.6994555353901997
test score: 0.6944444444444444
[38]: # Bagging Classifier Model
# Running a pipeline of Bagging Classifiers
pipe = Pipeline(steps=[('bag', BaggingClassifier())])
# Setting the parameters
params = {'bag__n_estimators': [200]}
gs_bag = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
# Fitting the Bagging Classifier model on the training data split
gs_bag.fit(X_train, y_train)
gs_bag.best_estimator_
[38]: Pipeline(steps=[('bag', BaggingClassifier(n_estimators=200))])
[39]: # Scoring the model (training score, cross validation and test score)
print(f'training score: {gs_bag.score(X_train, y_train)}')
print(f"cross validation score: {cross_val_score(gs_bag.best_estimator_, X, y, cv=5).mean()}")
print(f"test score: {gs_bag.score(X_test, y_test)}")
# Inference: the model overfits the training data split heavily (0.986 vs 0.694 on test);
# on the test split it matches the decision tree and falls below the base model and logistic regression
training score: 0.985981308411215
cross validation score: 0.7098608590441622
test score: 0.6944444444444444
[40]: # Random Forest Model
# Running a pipeline of Random Forest Classifiers
pipe = Pipeline(steps=[('forest', RandomForestClassifier())])
# Setting the parameters
params = {'forest__n_estimators': [150], 'forest__max_depth': [15]}
gs_forest = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
# Fitting the Random Forest Classifier to the training data split
gs_forest.fit(X_train, y_train)
gs_forest.best_estimator_
[40]: Pipeline(steps=[('forest',
                 RandomForestClassifier(max_depth=15, n_estimators=150))])
[41]: # Scoring the model (training score, cross validation and test score)
print(f'training score: {gs_forest.score(X_train, y_train)}')
print(f"cross validation score: {cross_val_score(gs_forest.best_estimator_, X, y, cv=5).mean()}")
print(f"test score: {gs_forest.score(X_test, y_test)}")
# Inference: the model overfits the training data split (0.977) but matches logistic regression
# on the test split (0.736), ahead of the base model, the KNN and the Bagging Classifier
training score: 0.9766355140186916
cross validation score: 0.727223230490018
test score: 0.7361111111111112
[42]: # Extra Trees Model
# Running a pipeline of Extra Trees Classifier
pipe = Pipeline(steps=[('extra', ExtraTreesClassifier())])
# Setting the parameters
params = {'extra__n_estimators': [600], 'extra__max_depth': [None]}
gs_extra = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
# Fitting the Extra Trees Classifier to the training data split
gs_extra.fit(X_train, y_train)
gs_extra.best_estimator_
[42]: Pipeline(steps=[('extra', ExtraTreesClassifier(n_estimators=600))])
[43]: # Scoring the model (training score, cross validation and test score)
print(f'training score: {gs_extra.score(X_train, y_train)}')
print(f"cross validation score: {cross_val_score(gs_extra.best_estimator_, X, y, cv=5).mean()}")
print(f"test score: {gs_extra.score(X_test, y_test)}")
# Inference: the model overfits the training data split (0.986); its test accuracy (0.708) is on par
# with the base model, above the decision tree (0.694) and below the logistic regression (0.736)
training score: 0.985981308411215
cross validation score: 0.7202661826981245
test score: 0.7083333333333334
[44]: # AdaBoost Model
# Running a pipeline of AdaBoost Classifier
pipe = Pipeline(steps=[('ada', AdaBoostClassifier())])
# Setting the parameters
params = {'ada__n_estimators': [10]}
gs_ada = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
# Fitting the AdaBoost Classifier model to the training data split
gs_ada.fit(X_train, y_train)
gs_ada.best_estimator_
[44]: Pipeline(steps=[('ada', AdaBoostClassifier(n_estimators=10))])
[45]: # Scoring the model (training score, cross validation and test score)
print(f'training score: {gs_ada.score(X_train, y_train)}')
print(f"cross validation score: {cross_val_score(gs_ada.best_estimator_, X, y, cv=5).mean()}")
print(f"test score: {gs_ada.score(X_test, y_test)}")
# Inference: the model shows little overfitting (0.752 on the training data split); its test accuracy
# (0.722) is above the base model and the decision tree but below the logistic regression
training score: 0.7523364485981309
cross validation score: 0.7063520871143376
test score: 0.7222222222222222
[46]: # Gradient Boosting Classifier Model
# Running a pipeline of Gradient Boosting Classifier
pipe = Pipeline(steps=[('grad', GradientBoostingClassifier())])
# Setting the parameters
params = {'grad__n_estimators': [300], 'grad__max_depth': [3]}
gs_grad = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
# Fitting the Gradient Boosting Classifier model to the training data split
gs_grad.fit(X_train, y_train)
gs_grad.best_estimator_
[46]: Pipeline(steps=[('grad', GradientBoostingClassifier(n_estimators=300))])
[47]: # Scoring the model (training score, cross validation and test score)
print(f'training score: {gs_grad.score(X_train, y_train)}')
print(f"cross validation score: {cross_val_score(gs_grad.best_estimator_, X, y, cv=5).mean()}")
print(f"test score: {gs_grad.score(X_test, y_test)}")
# Inference: the model overfits the training data (0.972); its test accuracy (0.694) matches the
# decision tree and is below the base model, the logistic regression and the AdaBoost classifier
training score: 0.9719626168224299
cross validation score: 0.7134301270417424
test score: 0.6944444444444444
[48]: # Support Vector Classifier
# Setting up the model
pipe = Pipeline(steps=[("svc", SVC())])
# Setting the model parameters
params = {"svc__C": [3]}
gs_svc = GridSearchCV(pipe, param_grid=params, cv=5, scoring="accuracy")
# Fitting the SVC model to the training dataset
gs_svc.fit(X_train, y_train)
gs_svc.best_estimator_
[48]: Pipeline(steps=[('svc', SVC(C=3))])
[49]: # Scoring the model (training score, cross validation and test score)
print(f'training score: {gs_svc.score(X_train, y_train)}')
print(f"cross validation score: {cross_val_score(gs_svc.best_estimator_, X, y, cv=5).mean()}")
print(f"test score: {gs_svc.score(X_test, y_test)}")
# Inference: the model fits the training data split closely (0.935) and has the highest
# cross validation score (0.734), but its test accuracy (0.694) is among the lowest of the models used
training score: 0.9345794392523364
cross validation score: 0.7342407743496672
test score: 0.6944444444444444
[50]: # Voting Classifier Model
knn_pipe = Pipeline([('ss', StandardScaler()), ('knn', KNeighborsClassifier())])
# Running the pipeline for the Voting Classifier
vote = VotingClassifier([('rand', RandomForestClassifier()),
                         ('grad', GradientBoostingClassifier()),
                         ('lr', LogisticRegression()),
                         ('tree', DecisionTreeClassifier()),
                         ('bag', BaggingClassifier()),
                         ('ada', AdaBoostClassifier()),
                         ('extra', ExtraTreesClassifier()),
                         ('knn_pipe', knn_pipe)], voting='soft')
# Setting the parameters
vote_params = {'rand__n_estimators': [150],
               'rand__max_depth': [15],
               'grad__n_estimators': [300],
               'tree__max_depth': [8],
               'bag__n_estimators': [200],
               'ada__n_estimators': [10],
               'extra__n_estimators': [600],
               'knn_pipe__knn__n_neighbors': [21],
               'lr__penalty': ['l1'],
               'lr__C': [1],
               'lr__solver': ['liblinear']}
gs_vc = GridSearchCV(vote, param_grid=vote_params, cv=5, scoring='accuracy')
# Fitting the voting classifier to the training data split
gs_vc.fit(X_train, y_train)
gs_vc.best_estimator_
[50]: VotingClassifier(estimators=[('rand',
                              RandomForestClassifier(max_depth=15,
                                                     n_estimators=150)),
                             ('grad',
                              GradientBoostingClassifier(n_estimators=300)),
                             ('lr',
                              LogisticRegression(C=1, penalty='l1',
                                                 solver='liblinear')),
                             ('tree', DecisionTreeClassifier(max_depth=8)),
                             ('bag', BaggingClassifier(n_estimators=200)),
                             ('ada', AdaBoostClassifier(n_estimators=10)),
                             ('extra', ExtraTreesClassifier(n_estimators=600)),
                             ('knn_pipe',
                              Pipeline(steps=[('ss', StandardScaler()),
                                              ('knn',
                                               KNeighborsClassifier(n_neighbors=21))]))],
                 voting='soft')
[51]: # Scoring the model (training score, cross validation and test score)
print(f'training score: {gs_vc.score(X_train, y_train)}')
print(f"cross validation score: {cross_val_score(gs_vc.best_estimator_, X, y, cv=5).mean()}")
print(f"test score: {gs_vc.score(X_test, y_test)}")
# Inference: the voting classifier performs slightly better than the SVC on the test data
# (0.708 vs 0.694), on par with the base model, and comes close to memorising the training data (0.963)
# Also, as voting classifiers are ensembles, they are difficult to interpret
training score: 0.9626168224299065
cross validation score: 0.7237749546279492
test score: 0.7083333333333334
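With ten models scored, a side-by-side table makes the comparisons in the inference comments easier to check. The sketch below collects the accuracies printed in the cells above (rounded to four decimals) into one DataFrame and flags which models beat the majority-class baseline on the test split (51 of 72 observations). The frame and column names (`scores`, `train_minus_test`, `beats_baseline`) are introduced here for illustration only.

```python
import pandas as pd

# Accuracy scores copied from the cell outputs above, rounded to four decimals
scores = pd.DataFrame(
    {
        'train': [0.7523, 0.7009, 0.8785, 0.9860, 0.9766,
                  0.9860, 0.7523, 0.9720, 0.9346, 0.9626],
        'cv':    [0.7165, 0.6958, 0.6995, 0.7099, 0.7272,
                  0.7203, 0.7064, 0.7134, 0.7342, 0.7238],
        'test':  [0.7361, 0.7222, 0.6944, 0.6944, 0.7361,
                  0.7083, 0.7222, 0.6944, 0.6944, 0.7083],
    },
    index=['Logistic regression', 'KNN', 'Decision tree', 'Bagging', 'Random forest',
           'Extra trees', 'AdaBoost', 'Gradient boosting', 'SVC', 'Voting'],
)

# Gap between training and test accuracy as a rough overfitting indicator
scores['train_minus_test'] = scores['train'] - scores['test']

# Majority-class baseline on the test split: 51 of 72 observations
baseline_test = 51 / 72
scores['beats_baseline'] = scores['test'] > baseline_test

print(scores.sort_values('test', ascending=False))
```

Because the test split holds only 72 observations, one extra correct prediction moves the accuracy by about 0.014, so small differences between models in this table should be read with caution.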
[52]: # Model evaluation
# Getting the predictions from all the models
lr_preds = gs_lr.best_estimator_.predict(X_test)
knn_preds = gs_knn.best_estimator_.predict(X_test)
tree_preds = gs_tree.best_estimator_.predict(X_test)
bag_preds = gs_bag.best_estimator_.predict(X_test)  # evaluated in detail below
forest_preds = gs_forest.best_estimator_.predict(X_test)
extra_preds = gs_extra.best_estimator_.predict(X_test)
ada_preds = gs_ada.best_estimator_.predict(X_test)
svc_preds = gs_svc.best_estimator_.predict(X_test)  # evaluated in detail below
vc_preds = gs_vc.best_estimator_.predict(X_test)
grad_preds = gs_grad.best_estimator_.predict(X_test)

def pretty_confusion_matrix(y_true, y_pred):
    """Plot the confusion matrix as an annotated heatmap."""
    # Handling the data (sorted labels match the row/column order of confusion_matrix)
    cm = confusion_matrix(y_true, y_pred)
    labels = sorted(y_true.unique())
    # Plotting
    sns.set(font_scale=1)
    plt.figure(figsize=(6, 5))
    chart = sns.heatmap(cm, annot=True, fmt='g', cmap='coolwarm',
                        xticklabels=labels, yticklabels=labels)
    chart.set_yticklabels(chart.get_yticklabels(), rotation=0)
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted Class')
    plt.ylabel('True Class')

pretty_confusion_matrix(y_test, svc_preds)  # Substitute the predictions based on the model of choice
[53]: # Computing the per-class recall from the confusion matrix of the selected model (SVC)
# The counts are hard-coded from the plotted matrix and differ from the SVC matrix printed in cell [56]
print(f'No-recurrence-events: {47/(47+4)}')
print(f'Recurrence-events: {8/(13+8)}')
No-recurrence-events: 0.9215686274509803
Recurrence-events: 0.38095238095238093
[54]: # Evaluating the Bagging Classifier, which turns out to be one of the reasonable models based on the confusion matrix
print(f'confusion matrix:\n {confusion_matrix(y_test, bag_preds)}')
print(f'classification report:\n {classification_report(y_test, bag_preds)}')
print(f'accuracy score:\n {accuracy_score(y_test, bag_preds)}')
confusion matrix:
 [[41 10]
 [12  9]]
classification report:
                       precision    recall  f1-score   support

no-recurrence-events       0.77      0.80      0.79        51
   recurrence-events       0.47      0.43      0.45        21

            accuracy                           0.69        72
           macro avg       0.62      0.62      0.62        72
        weighted avg       0.69      0.69      0.69        72

accuracy score:
 0.6944444444444444
[55]: # Evaluating the Logistic Regression
print(f'confusion matrix:\n {confusion_matrix(y_test, lr_preds)}')
print(f'classification report:\n {classification_report(y_test, lr_preds)}')
print(f'accuracy score:\n {accuracy_score(y_test, lr_preds)}')
confusion matrix:
 [[44  7]
 [12  9]]
classification report:
                       precision    recall  f1-score   support

no-recurrence-events       0.79      0.86      0.82        51
   recurrence-events       0.56      0.43      0.49        21

            accuracy                           0.74        72
           macro avg       0.67      0.65      0.65        72
        weighted avg       0.72      0.74      0.72        72

accuracy score:
 0.7361111111111112
[56]: # Evaluating the Support Vector Classifier; its recall on recurrence-events (0.29, 6 of 21)
# is lower than that of the Bagging Classifier and the Logistic Regression (0.43 each)
print(f'confusion matrix:\n {confusion_matrix(y_test, svc_preds)}')
print(f'classification report:\n {classification_report(y_test, svc_preds)}')
print(f'accuracy score:\n {accuracy_score(y_test, svc_preds)}')
confusion matrix:
 [[44  7]
 [15  6]]
classification report:
                       precision    recall  f1-score   support

no-recurrence-events       0.75      0.86      0.80        51
   recurrence-events       0.46      0.29      0.35        21

            accuracy                           0.69        72
           macro avg       0.60      0.57      0.58        72
        weighted avg       0.66      0.69      0.67        72

accuracy score:
 0.6944444444444444
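The per-class precision and recall in these reports follow directly from the confusion matrix: rows are true classes and columns are predicted classes, so recall divides the diagonal by the row sums and precision divides it by the column sums. The sketch below recomputes both for the SVC matrix printed above; the variable names are chosen for the example.

```python
import numpy as np

# SVC confusion matrix from the output above:
# rows = true class, columns = predicted class,
# order = ['no-recurrence-events', 'recurrence-events']
cm = np.array([[44, 7],
               [15, 6]])

# Recall: correctly predicted observations of a class / all true observations of that class
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correctly predicted observations of a class / all predictions of that class
precision = cm.diagonal() / cm.sum(axis=0)
# Accuracy: all correct predictions / all observations
accuracy = cm.diagonal().sum() / cm.sum()

print('recall:   ', recall.round(2))
print('precision:', precision.round(2))
print('accuracy: ', round(accuracy, 4))
```

Computing the values from the matrix itself avoids hard-coding counts, which can go stale when the models are refitted.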
[57]: # Listing the column labels of the encoded dataframe
features = data.keys()
features
[57]: Index(['RecClass', 'DegMalig', 'AgeGroup 30-39', 'AgeGroup 40-49',
       'AgeGroup 50-59', 'AgeGroup 60-69', 'AgeGroup 70-79', 'Menopause lt40',
       'Menopause premeno', 'TumorSize 10-14', 'TumorSize 15-19',
       'TumorSize 20-24', 'TumorSize 25-29', 'TumorSize 30-34',
       'TumorSize 35-39', 'TumorSize 40-44', 'TumorSize 45-49',
       'TumorSize 5-9', 'TumorSize 50-54', 'InvNodes 12-14', 'InvNodes 15-17',
       'InvNodes 24-26', 'InvNodes 3-5', 'InvNodes 6-8', 'InvNodes 9-11',
       'NodeCaps no', 'NodeCaps yes', 'Breast right', 'Quadrant central',
       'Quadrant left_low', 'Quadrant left_up', 'Quadrant right_low',
       'Quadrant right_up', 'Radiation yes'],
      dtype='object')
[58]: # Collecting the names of the input features in a list
keys = list(X.keys())
keys
[58]: [‘DegMalig’,
‘AgeGroup 30-39’,
‘AgeGroup 40-49’,
‘AgeGroup 50-59’,
‘AgeGroup 60-69’,
‘AgeGroup 70-79’,
‘Menopause lt40’,
‘Menopause premeno’,
‘TumorSize 10-14’,
‘TumorSize 15-19’,
‘TumorSize 20-24’,
‘TumorSize 25-29’,
‘TumorSize 30-34’,
‘TumorSize 35-39’,
‘TumorSize 40-44’,
‘TumorSize 45-49’,
‘TumorSize 5-9’,
‘TumorSize 50-54’,
‘InvNodes 12-14’,
‘InvNodes 15-17’,
‘InvNodes 24-26’,
‘InvNodes 3-5’,
‘InvNodes 6-8’,
‘InvNodes 9-11’,
‘NodeCaps no’,
‘NodeCaps yes’,
‘Breast right’,
‘Quadrant central’,
‘Quadrant left_low’,
‘Quadrant left_up’,
‘Quadrant right_low’,
‘Quadrant right_up’,
‘Radiation yes’]
[59]: # Checking the unique output classes for the logistic regression model
print(gs_lr.classes_)
print(gs_lr.best_estimator_.steps[0][1].classes_)
# Checking the coefficients of the logistic regression
print(gs_lr.best_estimator_.steps[0][1].coef_)
['no-recurrence-events' 'recurrence-events']
['no-recurrence-events' 'recurrence-events']
[[ 0.34040697  0.27983221  0.          0.          0.21733102  0.
   0.09150529  0.06801665 -0.49905636  0.          0.09134133  0.06800892
   0.07725386  0.         -0.06037721 -0.0545067  -0.22548181  0.07800986
   0.09457485  0.05055718  0.          0.          0.18496753  0.0474066
  -0.27830247  0.00273593 -0.1264589   0.          0.         -0.01226542
  -0.11154848  0.2244928   0.22011191]]
[60]: # Putting the coefficients of the logistic regression in a dataframe
# Unique classes of the output (y)
classes = gs_lr.best_estimator_.steps[0][1].classes_
# Coefficients of the best estimator
coefs = gs_lr.best_estimator_.steps[0][1].coef_
# Names of the input features
Feature = X.columns
# Combining the coefficients and the feature names into the coefs_df dataframe, sorted by coefficient
coefs_df = pd.DataFrame(coefs[0], Feature, columns=['coef']).sort_values(by='coef', ascending=False)
# Resetting the coefs_df index (the features were originally the index)
coefs_df = coefs_df.reset_index()
# Renaming the index column to features
coefs_df.rename(columns={'index': 'features'}, inplace=True)
# Checking the coefs_df dataframe
coefs_df
[60]:
              features      coef
0             DegMalig  0.340407
1       AgeGroup 30-39  0.279832
2    Quadrant right_up  0.224493
3        Radiation yes  0.220112
4       AgeGroup 60-69  0.217331
5         InvNodes 6-8  0.184968
6       InvNodes 12-14  0.094575
7       Menopause lt40  0.091505
8      TumorSize 20-24  0.091341
9      TumorSize 50-54  0.078010
10     TumorSize 30-34  0.077254
11   Menopause premeno  0.068017
12     TumorSize 25-29  0.068009
13      InvNodes 15-17  0.050557
14       InvNodes 9-11  0.047407
15        NodeCaps yes  0.002736
16        InvNodes 3-5  0.000000
17      AgeGroup 40-49  0.000000
18   Quadrant left_low  0.000000
19    Quadrant central  0.000000
20      InvNodes 24-26  0.000000
21      AgeGroup 70-79  0.000000
22      AgeGroup 50-59  0.000000
23     TumorSize 35-39  0.000000
24     TumorSize 15-19  0.000000
25    Quadrant left_up -0.012265
26     TumorSize 45-49 -0.054507
27     TumorSize 40-44 -0.060377
28  Quadrant right_low -0.111548
29        Breast right -0.126459
30       TumorSize 5-9 -0.225482
31         NodeCaps no -0.278302
32     TumorSize 10-14 -0.499056
[61]: # Bar plot of the features and their coefficients
plt.figure(figsize=(18,8))
sns.barplot(x='features', y='coef', data=coefs_df)
plt.xticks(rotation=90)
[61]: (array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]),
[Text(0, 0, ‘DegMalig’),
Text(1, 0, ‘AgeGroup 30-39’),
Text(2, 0, ‘Quadrant right_up’),
Text(3, 0, ‘Radiation yes’),
Text(4, 0, ‘AgeGroup 60-69’),
Text(5, 0, ‘InvNodes 6-8’),
Text(6, 0, ‘InvNodes 12-14’),
Text(7, 0, ‘Menopause lt40’),
Text(8, 0, ‘TumorSize 20-24’),
Text(9, 0, ‘TumorSize 50-54’),
Text(10, 0, ‘TumorSize 30-34’),
Text(11, 0, ‘Menopause premeno’),
Text(12, 0, ‘TumorSize 25-29’),
Text(13, 0, ‘InvNodes 15-17’),
Text(14, 0, ‘InvNodes 9-11’),
Text(15, 0, ‘NodeCaps yes’),
Text(16, 0, ‘InvNodes 3-5’),
Text(17, 0, ‘AgeGroup 40-49’),
Text(18, 0, ‘Quadrant left_low’),
Text(19, 0, ‘Quadrant central’),
Text(20, 0, ‘InvNodes 24-26’),
Text(21, 0, ‘AgeGroup 70-79’),
Text(22, 0, ‘AgeGroup 50-59’),
Text(23, 0, ‘TumorSize 35-39’),
Text(24, 0, ‘TumorSize 15-19’),
Text(25, 0, ‘Quadrant left_up’),
Text(26, 0, ‘TumorSize 45-49’),
Text(27, 0, ‘TumorSize 40-44’),
Text(28, 0, ‘Quadrant right_low’),
Text(29, 0, ‘Breast right’),
Text(30, 0, ‘TumorSize 5-9’),
Text(31, 0, ‘NodeCaps no’),
Text(32, 0, ‘TumorSize 10-14’)])
[62]: # Plotting the top and bottom coefficients of breast cancer predictors
def coef_plot(category):
    # Note: category is only used in the plot title
    # Getting the top 10 coefficients
    coefs_1 = coefs_df.sort_values(by='coef', ascending=False).head(10)
    # Getting the bottom 10 coefficients
    coefs_2 = coefs_df.sort_values(by='coef', ascending=False).tail(10)
    # Merging both dataframes into 1
    coefs = pd.concat([coefs_1, coefs_2], axis=0)
    # Plotting the coefficients
    plt.figure(figsize=(10, 10))
    plt.title(f'Feature Coefficient for {category.title()} Predictors',
              fontsize=10)
    sns.set_style('darkgrid')
    sns.barplot(data=coefs,
                x='coef',
                y='features',
                orient='h',
                palette='mako_r')
    plt.xlabel('coefficient', fontsize=10)
    plt.ylabel('features', fontsize=10)
    plt.tick_params(labelsize=10)
[63]: coef_plot('recurrence-events')
[64]: coef_plot('no-recurrence-events')
[65]: classes = gs_lr.best_estimator_.steps[0][1].classes_
odds = np.exp(gs_lr.best_estimator_.steps[0][1].coef_)
coefs_df = pd.DataFrame(odds[0], X.columns, columns=['coef']).sort_values(
    by='coef', ascending=False)
coefs_df
[65]:
                      coef
DegMalig              1.405519
AgeGroup 30-39        1.322908
Quadrant right_up     1.251688
Radiation yes         1.246216
AgeGroup 60-69        1.242755
InvNodes 6-8          1.203179
InvNodes 12-14        1.099191
Menopause lt40        1.095823
TumorSize 20-24       1.095643
TumorSize 50-54       1.081133
TumorSize 30-34       1.080316
Menopause premeno     1.070383
TumorSize 25-29       1.070375
InvNodes 15-17        1.051857
InvNodes 9-11         1.048548
NodeCaps yes          1.002740
InvNodes 3-5          1.000000
AgeGroup 40-49        1.000000
Quadrant left_low     1.000000
Quadrant central      1.000000
InvNodes 24-26        1.000000
AgeGroup 70-79        1.000000
AgeGroup 50-59        1.000000
TumorSize 35-39       1.000000
TumorSize 15-19       1.000000
Quadrant left_up      0.987809
TumorSize 45-49       0.946952
TumorSize 40-44       0.941409
Quadrant right_low    0.894448
Breast right          0.881210
TumorSize 5-9         0.798132
NodeCaps no           0.757068
TumorSize 10-14       0.607103
Prediction of Cancer Recurrence in Breast
Cancer Survivors
First A. Author, Second B. Author, Jr., and Third C. Author, Member, IEEE

Abstract—This report examines the use of various statistical
methods on a dataset of breast cancer survivors that includes
relevant information about their cancer and treatment processes,
divided into those who have experienced a recurrence and those
who have not. The methods utilized include logistic regression,
bagging, random forest, boosting, neural network, and VSCC
classification, with a focus on comparing their prediction
accuracy. The results show that the random forest model has the
lowest false negative rate (40%) and comparable overall accuracy
(75%) to other methods. These findings can aid in predicting the
likelihood of cancer recurrence for individual patients.
I. INTRODUCTION
Breast cancer affects millions of people worldwide and is the most
frequently diagnosed cancer in women [1]. The disease also poses
a significant threat to women's lives, as it is the second leading
cause of cancer-related deaths. With the availability of specific
data related to women, it is possible to use various
classification techniques to predict the chances of breast cancer
diagnosis, prognosis, and recurrence in female patients. This
report analyzes a dataset containing information on female
patients with breast cancer, using a range of classification
methods to identify the most accurate approach and to gather
insights that can aid in predicting the likelihood of cancer
recurrence.
The data set used in this analysis was obtained from the UCI
Machine Learning Repository [2], a publicly available
database. It consists of follow-up data for patients with invasive
breast cancer and no distant metastases at the time of
diagnosis. These patients have been seen by Dr. William H. Wolberg
at the University of Wisconsin Clinical Sciences Center since
1984. The data set includes characteristics extracted from
digitized images of fine needle aspirates (FNA) of breast
masses, specifically features of the cell nuclei
present in the image. The cases in this data set are divided into
two groups: cases where cancer recurred and cases
where it did not. To focus on our specific question
of interest, the cases where cancer did not recur were excluded.
As a result, the final data set consisted of 47 independent cases,
reduced from the original 198 cases. Although this data set has
been referenced in numerous papers, the units used in the data
are not clearly specified. Our analysis presents values in
the original units of the data set, which remain
unknown except for the time until recurrence. Where mean values
are needed for a variable, our analysis uses the mean values
provided in the data set.
II. DATASET
The dataset contains 286 entries with 9 different input variables
and one classification output variable. These input variables
provide insight into a patient’s past experience with breast
cancer, including details about the tumor location, size, and
type, as well as information about the treatment administered
and biological characteristics of the patient. The output variable
classifies the patient as either having no recurrence or a
recurrence of breast cancer, with a split of 201 and 85
observations, respectively [2].
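As a quick check on the class split reported above, the short sketch below computes the share of each class and the accuracy that a trivial classifier, one that always predicts the most frequent class, would reach. The counts 201 and 85 come from the dataset description; the variable names are illustrative only.

```python
# Class counts as reported for the dataset (201 no-recurrence, 85 recurrence).
class_counts = {"no-recurrence-events": 201, "recurrence-events": 85}

total = sum(class_counts.values())

# Share of each class in the full dataset.
class_shares = {label: count / total for label, count in class_counts.items()}

# A classifier that always predicts the most frequent class reaches this accuracy.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_shares[majority_label]

for label, share in class_shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Model accuracies reported later are easier to judge when read against this majority-class reference of roughly 70%.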
III. EXPLORATORY DATA ANALYSIS
The first quantitative attribute in our dataset is the patient’s age
at the time of cancer diagnosis. This data is divided into
decades, with a range of 20-79 years. The average age of
diagnosis is approximately 47 years, with a standard deviation
of 10 years, as shown in Figure 1. The ages are approximately
normally distributed, with a majority of patients falling within
the 40-49 or 50-59 age range. The other variable in our dataset
is a categorical one that is not directly related to the cancer
itself. This attribute indicates if the patient has entered
menopause before the age of 40, after the age of 40, or if the
patient is still premenopausal. The majority of patients fall into
the “menopause after 40” or “premenopausal” categories. It is
uncommon for menopause to occur before the age of 40, so it is
worth investigating any potential correlations between an early
menopause and other factors [2].
The following variables are related to the cancerous tumour and
its behavior. The first two are linear variables. The first
measures the size of the tumour in millimeters, using 5 mm
intervals, and ranges from 0 mm to 54 mm. The second is the
number of lymph nodes with cancerous cells, categorized in
groups of 3 nodes and ranging from 0 to 26 nodes. This is
closely tied to the third variable, a binary variable
indicating whether the cancer has penetrated the lymph node
capsule, from where it can spread to other parts of the body.
These variables are all related to the level of malignancy of the
cancer, which is captured by the fourth variable: the grade of
the cancer on a scale of 1 to 3. A grade 1 cancer is the least
aggressive and has the highest survival rate, while a grade 3
cancer is the most aggressive and has the lowest survival rate [3].
The degree of malignancy is influenced by the three
aforementioned variables, and is used to categorize the cancer.
For example, a cancer with positive node capsule penetration is
likely to have a higher grade (no grade 1 cancers were observed
with node capsule penetration) and a greater risk of recurrence.
However, when building a generalized linear model, it will be
crucial to determine whether using the degree of malignancy as
the sole predictor is sufficient or if a combination of the other
linear variables yields more accurate results.
The next two characteristics provide data regarding the precise
position of the cancer on the patient’s breasts. This includes a
binary variable that indicates whether the cancer is present on
the left or right breast, as well as a categorical variable that
specifies the “quadrant” of the breast where the cancer is
located. The potential quadrants are upper left, lower left, upper
right, lower right, and central. It should be noted that cancer on
the left breast is more prevalent, potentially due to the slightly
larger size of the left breast, resulting in more tissue for cancer
to potentially develop [4]. Moreover, cancer is predominantly
found in the upper or lower quadrants of both breasts, which is
believed to be due to the higher concentration of breast tissue
in these areas [5].
Figure 1: Common involved lymph nodes in close proximity to the breast
(source: cancer.ca)
The last input variable is a binary indicator for radiation
treatment as part of the patient’s treatment plan. This variable is
complex and highly dependent on the size and location of the
tumor within the patient’s breast. In cases where the tumor is
small in comparison to the breast, a lumpectomy is typically
performed followed by radiation treatment. This helps to
eliminate any remaining cancer cells that may have been missed
during the lumpectomy. However, if the tumor is large or
invasive, a mastectomy is usually necessary and radiation
treatment may not be required unless the tumor was near the
chest wall. Generally, more invasive and larger tumors are
associated with higher grades of cancer, making this variable a
potentially valuable predictor for lower grades of cancer.
The classification of the output variable determines whether a
patient is at risk of cancer recurrence. For those who do
experience recurrence, it can have a profound impact on their
physical, emotional, and financial well-being, as well as that of
their loved ones. Therefore, accurately predicting the likelihood
of recurrence is of great significance. This allows patients to
take proactive steps to prevent recurrence or prepare for it if the
chances are high. Additionally, the medical industry can use
this information to personalize treatment plans and logistics for
patients based on their risk of recurrence, ensuring that they
receive the most effective treatment.
Our objective is to tackle the issue of accurately forecasting the
probability of cancer recurrence in a patient, as discussed
earlier. To achieve this, the dataset will be subjected to various
statistical techniques using the R programming language. These
techniques include logistic regression, classification tree
analysis, as well as neural network classification and variable
selection for classification, which are not commonly discussed
methods. By applying a supervised approach, the performance
of these methods will be compared, and their prediction
accuracy will be evaluated.
IV. METHODS AND MODELS
i. Logistic Regression
Logistic regression is a statistical technique used to develop a
predictive model using a set of continuous input variables and
a binary output variable. The resulting model is typically
represented in the form of a mathematical equation. It is often
employed in situations where the outcome of interest is a
categorical or binary variable, such as yes or no, pass or fail, or
presence or absence.
A generalized linear model (GLM) was used to analyze the
dataset, with the assumption that the error follows a distribution
with a mean of 0 and a variance of π(X)[1 − π(X)], where π(X)
represents the probability of recurrence based on the linear
input variables (age, number of involved nodes, tumour size,
and degree of malignancy). This method is suitable for the
dataset, as it takes into account the multiple linear input
variables and binary output recurrence variable. Initially, the χ²
value was calculated for the full model, and then the model was
reduced by removing the predictor variables with the largest
p-values. The resulting reduced model was compared to the full
model using deviance calculations to assess the improvement in
model fit. In addition, odds ratios were used to provide some
insight into which biological parameters may be significant
predictors of cancer recurrence. An odds ratio compares the odds
of recurrence when a particular predictor is present (or one unit
higher) with the odds when it is absent, holding the other
predictors fixed. Odds ratios above 1 indicate a positive
association with recurrence, and ratios below 1 a negative one. By
analyzing the data using a GLM and examining the χ² value, p-values,
deviance, and odds ratios, we can determine which biological
factors are most strongly associated with cancer recurrence.
This information can be useful in understanding the underlying
mechanisms of cancer recurrence and in developing more
effective treatment strategies to improve patient outcomes.
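To make the odds-ratio step concrete, the sketch below exponentiates a few logistic regression coefficients. The three coefficient values are copied from the notebook output earlier in this document; the helper name `odds_ratio` is hypothetical and used only for illustration.

```python
import math

# Coefficients copied from the fitted model's output shown in the notebook.
coefficients = {
    "DegMalig": 0.340407,
    "AgeGroup 60-69": 0.217331,
    "TumorSize 10-14": -0.499056,
}


def odds_ratio(coefficient):
    """Return exp(coefficient), the multiplicative change in the odds."""
    return math.exp(coefficient)


for feature, coef in coefficients.items():
    ratio = odds_ratio(coef)
    if ratio >= 1:
        print(f"{feature}: odds of recurrence multiplied by {ratio:.3f}")
    else:
        # For ratios below 1 the reciprocal is often easier to read.
        print(f"{feature}: odds of recurrence multiplied by {ratio:.3f} "
              f"(odds of no recurrence multiplied by {1 / ratio:.3f})")
```

A coefficient of exactly zero, as produced for several features by the l1 penalty, corresponds to an odds ratio of 1, meaning no change in the odds.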
For our classification task, we implemented a logistic
regression model using a pipeline approach. The model used the
l1 penalty for regularization and the liblinear solver for
optimization. To find the best estimator, we performed a grid
search with 5-fold cross validation (cv = 5), accuracy as the
scoring metric, and 2 parallel jobs. After fitting the model on
the training data, we obtained a training score of 78%. Cross
validation gave a score of 72%, and the model achieved a score
of 69% on unseen test data. Overall, the l1-penalized logistic
regression, tuned with the pipeline and grid search approach,
showed decent performance with an accuracy of 69% on the test
data.
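The setup described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the exact code behind the reported scores: the data are synthetic stand-ins generated with `make_classification`, the 25% test split and the grid of `C` values are hypothetical, and the step name `lr` is chosen only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption); the report used the encoded
# breast cancer features instead.
X, y = make_classification(n_samples=286, n_features=33, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Single-step pipeline with an l1-penalized logistic regression.
pipe = Pipeline([
    ("lr", LogisticRegression(penalty="l1", solver="liblinear")),
])

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"lr__C": [0.01, 0.1, 1.0, 10.0]}

# Grid search with 5-fold cross validation, accuracy scoring, 2 parallel jobs.
gs_lr = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=2)
gs_lr.fit(X_train, y_train)

print("Best parameters:", gs_lr.best_params_)
print("Training accuracy:", gs_lr.score(X_train, y_train))
print("Cross-validated accuracy:", gs_lr.best_score_)
print("Test accuracy:", gs_lr.score(X_test, y_test))
```

Because the logistic regression is the first pipeline step, the fitted coefficients are available as `gs_lr.best_estimator_.steps[0][1].coef_`, which is how the notebook cells above access them.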
ii. Confusion Matrix
The model’s predictions were obtained by applying the trained
model to the test data split. These predictions were then passed
to a function that generates a confusion matrix. The confusion
matrix for this case has two possible prediction classes: “no
recurrence events” and “recurrence events”. As shown in Figure 2,
True Positives (TP) represent cases where the model predicted
“recurrence events” and there were indeed recurrence events.
Similarly, True Negatives (TN) are cases where the model
predicted “no recurrence events” and there were no recurrence
events. In contrast, False Positives (FP) describe cases where
the model predicted “recurrence events” but there were actually
no recurrence events, while False Negatives (FN) represent
cases where the model predicted “no recurrence events” but
there were actually recurrence events.
Figure 2: Confusion Matrix
In this case, it is a 2×2 matrix with four values. The top-left
value (44) represents the true negatives (TN), the top-right (7)
represents false positives (FP), the bottom-left (12) represents
false negatives (FN), and the bottom-right (9) represents true
positives (TP). This matrix summarizes the model’s
performance in terms of correct and incorrect predictions.
V. RESULT AND DISCUSSION
Variable selection for clustering and classification (VSCC) is a
dimension reduction method similar to principal component
analysis or factor analysis. Its main objective is to identify a
subset of variables that minimizes within-group variability
while maximizing between-group variability [8]. Unlike other
techniques, such as logistic regression, VSCC automates the
variable selection and reduction process in a more efficient
manner. A comparison of the predictive accuracy between
VSCC and logistic regression could provide valuable insights
into which variables are most relevant in predicting cancer
recurrence. This approach has the potential to enhance the
accuracy and efficiency of predicting recurrence, and thus
contribute to the development of individualized treatment plans
for cancer patients.
For a one-unit increase in a predictor, the odds of the
observation belonging to the modelled class (here, recurrence)
are multiplied by the exponentiated coefficient, holding all
other variables constant. In this situation, if the patient is
in the AgeGroup 60-69, the odds of breast cancer recurrence are
about 1.24 times the odds for a patient in the reference age
group (Table 1). Conversely, if the patient has a TumorSize of
10-14, the odds of breast cancer recurrence are multiplied by
only 0.61. For odds ratios less than 1, which correspond to
negative coefficients, we can take the reciprocal (1/odds ratio)
to better understand them. Here 1/0.61 ≈ 1.65, meaning that a
TumorSize of 10-14 multiplies the odds of no recurrence by about
1.65 relative to the reference category.
Figure 3: Top and bottom coefficients of breast cancer
predictors
On overfitting, the model was then tested with different node
sizes and decay rates. The results showed that decreasing the
node size to 4 and increasing the decay rate to 0.5 resulted in a
slightly better classification accuracy of ∼ 77.5%, with a lower
false negative rate at ∼ 50%. This suggests that smaller node
sizes and higher decay rates can help prevent overfitting in this
dataset. However, it should be noted that even with these
changes, the classification accuracy for this small dataset may
not be as high as desired. This is likely due to the limited
amount of training data available. In order to improve the
accuracy of the model, a larger training set would be necessary.
When working with a small dataset like this, it is important to
carefully consider the potential for overfitting and take steps to
prevent it, such as reducing the number of nodes and increasing
the decay rate. However, the limited amount of data may still
present challenges in achieving high classification accuracy.
Therefore, obtaining a larger dataset or finding alternative
methods of classification may be necessary to achieve more
accurate results.
Figure 4: plot of the features and their coefficients
The neural network model showed similar predictive accuracy
as other models previously implemented. However, its false
negative rate was slightly higher compared to logistic
regression, boosting, bagging, and general classification
models. We experimented with changing the size, decay, and
maximum iterations parameters to improve prediction accuracy
and decrease false negatives, but the original parameters
performed the best. While it may not be as effective as the
random forest approach in terms of false negatives, the neural
network model could potentially be more advantageous for
larger breast cancer datasets. If more observations were added
to the breast cancer dataset or if it was combined with another
dataset with similar patient attributes, the neural network model
could potentially become the top-performing predictive model.
VI. CONCLUSION
The issue of accurately predicting a patient’s risk of cancer
recurrence has been the subject of numerous statistical
techniques using varying dataset parameters. Across all models,
the achieved prediction accuracy falls within the 70-75% range,
which is considered reasonably accurate. When compared to
previous attempts to classify the data [8], it is revealed that the
current highest prediction accuracy is around 77%, consistent
with these results. These findings suggest that the likelihood of
cancer recurrence in a patient is inherently unpredictable and
sporadic, given the available data. Therefore, in the context of
the problem, these accuracies may not warrant significant
logistical or infrastructural changes within the healthcare
system, as the issue at hand involves patients’ well-being and
requires a high level of accuracy.
Feature               coef
DegMalig              1.405519
AgeGroup 30-39        1.322908
Quadrant right_up     1.251688
Radiation yes         1.246216
AgeGroup 60-69        1.242755
InvNodes 6-8          1.203179
InvNodes 12-14        1.099191
Menopause lt40        1.095823
TumorSize 20-24       1.095643
TumorSize 50-54       1.081133
TumorSize 30-34       1.080316
Menopause premeno     1.070383
TumorSize 25-29       1.070375
InvNodes 15-17        1.051857
InvNodes 9-11         1.048548
NodeCaps yes          1.002740
InvNodes 3-5          1.000000
AgeGroup 40-49        1.000000
Quadrant left_low     1.000000
Quadrant central      1.000000
InvNodes 24-26        1.000000
AgeGroup 70-79        1.000000
AgeGroup 50-59        1.000000
TumorSize 35-39       1.000000
TumorSize 15-19       1.000000
Quadrant left_up      0.987809
TumorSize 45-49       0.946952
TumorSize 40-44       0.941409
Quadrant right_low    0.894448
Breast right          0.881210
TumorSize 5-9         0.798132
NodeCaps no           0.757068
TumorSize 10-14       0.607103
Table 1: Top and bottom coefficients of breast cancer
predictors
In addition to the crucial aspect of high accuracy, this report
also consistently addresses the significance of false negative
rates. In the particular context of this problem, it is vital for the
false negative percentage to be kept low because informing a
patient that they are unlikely to experience cancer recurrence
when they actually are at risk can have serious consequences.
Unfortunately, the majority of models utilized in this study
showed alarmingly high false negative percentages, rendering
them unfit for practical use in the field of oncology. However,
the random forest model showed the lowest false negative rate
at approximately 40%, coupled with a commendable prediction
accuracy compared to the other models. Although this model
may not be suitable as a standalone predictor, it could be
employed in conjunction with a more accurate model,
potentially making it useful in the field of oncology.
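The false negative rate can be computed directly from confusion-matrix counts. The sketch below uses the counts reported for the logistic regression model in this report (TN = 44, FP = 7, FN = 12, TP = 9) and assumes the usual definition FN / (FN + TP); the variable names are illustrative only.

```python
# Counts reported for the logistic regression confusion matrix on the test split.
tn, fp, fn, tp = 44, 7, 12, 9

# Patients who actually had a recurrence.
actual_recurrences = fn + tp

# False negative rate: share of actual recurrences the model missed.
false_negative_rate = fn / actual_recurrences

# Sensitivity (recall) is the complement of the false negative rate.
sensitivity = tp / actual_recurrences

print(f"False negative rate: {false_negative_rate:.1%}")
print(f"Sensitivity: {sensitivity:.1%}")
```

Under this definition, more than half of the actual recurrences in the test split were missed by the logistic regression model, which illustrates why the false negative rate is tracked alongside accuracy.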
The secondary analysis revealed that each method had its own
advantages and disadvantages. While the logistic regression
method had a high false negative rate, it also generated odds
ratios that could be valuable for healthcare professionals to
make more informed prognoses for patients. The classification
methods were able to incorporate all the variables in the dataset,
making them more versatile for similar datasets. The neural
network approach, although on par with the other models, could
potentially improve its predictive accuracy with access to more
data for training. Lastly, the VSCC method provided a useful
comparison with other models and helped prioritize variables
and determine the necessary number of attributes.
REFERENCES
[1] Breast Cancer Statistics, Cancer Research UK,
cancerresearchuk.org/health-professional/cancer-statistics.
[2] Zwitter, Matjaz and Soklic, Milan. (1988). Breast Cancer. UCI Machine
Learning Repository. https://doi.org/10.24432/C51P4M.
[3] Histological grading and prognosis in breast cancer, Bloom &
Richardson, British Journal of Cancer, 1957.
[4] Breast size, handedness and breast cancer risk, Hsieh &
Trichopoulos, Harvard School of Public Health, 1991.
[5] Why is carcinoma of the breast more frequent in the upper outer
quadrant? A case series based on needle core biopsy diagnoses,
Lee AH, Nottingham City Hospital, 2005.
[6] Ensemble Methods, Zhou, CRC Press, 2012.
[7] The Elements of Statistical Learning, Hastie et al., Springer, 2008.
[8] Variable selection for clustering and classification, Andrews &
McNicholas, Journal of Classification, 2014.
[9] Supervised Classification on Breast Cancer Data, Vanschoren et
al., OpenML, https://www.openml.org/t/13
