School of Medical Sciences, Universiti Sains Malaysia
Physicians usually have these objectives when looking at patient data:
This talk is about the second objective.
For example,
physicians working at the ED will want to guess the correct triage category when a patient is brought to the ED with chest pain, sweating, and a history of diabetes mellitus
an epidemiologist may want to predict the most likely health outcome for patients who smoke cigarettes, practice a sedentary lifestyle, and have diabetes mellitus.
However, physicians’ minds, no matter how bright or experienced, will not be able to weigh all of these factors consistently and accurately.
Predictive analytics helps physicians make more accurate guesses:
The end result is a more accurate guess, or prediction, of a diagnosis or outcome.
I am a medical epidemiologist and a fellow of the American College of Epidemiology.
These are the links to my Scopus publications, my GitHub, and my personal webpage.
I am also on Twitter and I welcome opportunities for future collaborations and training.
Founding council member of the Malaysian Association of Epidemiology
Tidymodels
In clinical and medical research scenario:
Inferential statistics is formally defined as the practice of using data from a sample to draw conclusions about the population from which the sample was taken.
The goal of inferential statistics is to make generalizations about a population. For example, physicians want to understand the relationship between certain variables (aka risk factors) with having a disease or having a certain outcome of the disease.
Read more here
Is this inference?
In prediction, physicians use an existing data set and choose models or algorithms so that, in the end, they can reliably identify the correct diagnosis or outcome of a disease.
The outcome can be categorical
The outcome can be numerical values
To perform prediction (predictive analytics), physicians use machine learning methods. For example
Usually, predictive analytics can be grouped into supervised and unsupervised learning.
Supervised learning is a machine learning approach that’s defined by its use of labeled datasets.
Regression problems or models: for models predicting a numeric outcome. Regression is a type of supervised learning that uses an algorithm to understand the relationship between dependent and independent variables. Regression models are helpful for predicting numerical values from different data points, such as sales revenue projections for a given business.
Classification problems or models: for models predicting a categorical response. Classification uses an algorithm to accurately assign test data into specific categories, such as separating apples from oranges; in the real world, supervised learning algorithms can be used to classify spam into a separate folder from your inbox.
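In R's tidymodels, the two problem types map onto parsnip model specifications. A minimal sketch; the engine choices and object names here are illustrative, not taken from the talk:

```r
library(parsnip)

# Regression: predict a numeric outcome (e.g., projected sales revenue)
reg_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Classification: predict a categorical outcome (e.g., spam vs not spam)
cls_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")
```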
The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.
Open a new R project, then load the packages:
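A minimal sketch, assuming the data file sits in the project directory; the object name `stroke` is my own:

```r
library(tidymodels)  # rsample, recipes, parsnip, tune, yardstick, ...
library(haven)       # to read Stata (.dta) files

# as_factor() converts Stata labelled variables
# (sex, race, status2, ...) into R factors
stroke <- read_dta("stroke_fatality.dta") |>
  as_factor()

glimpse(stroke)
```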
stroke_fatality.dta (in Stata format), read as a data frame. Variables:
Rows: 226
Columns: 19
$ sex <fct> female, male, female, female, male, female, female, female, …
$ race <fct> malay, malay, malay, malay, malay, chinese, malay, malay, ma…
$ status2 <fct> dead, alive, alive, alive, alive, dead, alive, dead, alive, …
$ gcs <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15,…
$ sbp <dbl> 150, 152, 231, 110, 199, 190, 145, 161, 222, 161, 149, 153, …
$ dbp <dbl> 87, 108, 117, 79, 134, 101, 102, 96, 129, 107, 90, 61, 95, 1…
$ hr <dbl> 92, 87, 64, 90, 72, 63, 102, 81, 72, 94, 59, 81, 61, 120, 67…
$ hb <dbl> 10.4, 13.0, 11.0, 14.3, 15.7, 11.7, 13.4, 11.8, 12.6, 16.4, …
$ plat <dbl> 249, 156, 179, 233, 351, 133, 290, 251, 196, 188, 139, 306, …
$ wbc <dbl> 12.5, 7.4, 22.4, 9.6, 18.7, 11.3, 15.8, 8.5, 9.0, 9.5, 11.0,…
$ na <dbl> 138, 132, 135, 132, 138, 140, 134, 135, 129, 137, 141, 137, …
$ potas <dbl> 3.6, 4.1, 4.7, 3.8, 3.8, 3.0, 4.1, 4.0, 4.0, 3.7, 4.1, 4.2, …
$ gluc <dbl> NA, 5.1, NA, NA, NA, NA, NA, NA, NA, 5.5, NA, NA, NA, NA, NA…
$ cbs <dbl> 11.4, NA, 18.6, 6.8, 6.5, 8.4, 13.4, 14.9, 6.6, 5.8, 7.7, 6.…
$ chol <dbl> NA, 4.7, NA, NA, NA, NA, NA, NA, NA, 6.9, NA, NA, NA, NA, NA…
$ tg <dbl> NA, 1.1, NA, NA, NA, NA, NA, NA, NA, 1.0, NA, NA, NA, NA, NA…
$ urea <dbl> 5.1, 8.4, 11.0, 7.4, 3.8, 4.0, 3.3, 5.6, 5.8, 5.8, 7.3, 7.6,…
$ hpt2 <fct> yes, yes, yes, yes, yes, no, yes, yes, yes, no, yes, no, yes…
$ icd10cat2 <fct> "CI,Others", "CI,Others", "Haemorrhagic", "Haemorrhagic", "H…
Outcome variable: status2 (dead vs alive)
A stratified random sample performs the 60/40 split within each of these data subsets and then pools the results together.
In rsample, this is achieved using the strata argument. Usually the split is done 80/20 or 70/30.
Resource here
<Training/Testing/Total>
<134/92/226>
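The stratified 60/40 split shown above can be sketched with rsample; the seed and object names are assumptions:

```r
set.seed(123)  # assumed seed; exact row counts depend on it

# strata = status2 keeps the dead/alive proportion similar
# in both the training and testing sets
stroke_split <- initial_split(stroke, prop = 0.6, strata = status2)

stroke_train <- training(stroke_split)
stroke_test  <- testing(stroke_split)
```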
As the outcome is categorical, we will use a logistic regression model from the glmnet package.
tune() acts as a placeholder for the penalty hyperparameter; tuning will find the best value for making predictions with the glmnet package:
Resource is here
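A sketch of the model specification; `mixture = 1` (a pure lasso penalty) and the object name are assumptions, since the talk only states that the penalty is tuned:

```r
# penalty = tune() is a placeholder to be filled in during tuning;
# mixture = 1 selects the lasso penalty (assumption)
lr_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")
```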
recipe() : creates the preprocessing recipe from the model formula and training data
step_dummy() : converts characters or factors into numeric binary model terms
step_zv() : removes indicator variables that only contain a single unique value (e.g. all zeros)
step_normalize() : centers and scales numeric variables
Create a workflow for the ML algorithm
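The preprocessing and workflow steps can be sketched as follows, assuming the glmnet model specification from the previous step (called `lr_spec` here) and a training set `stroke_train`; the predictor set (all remaining variables) is an assumption:

```r
stroke_rec <- recipe(status2 ~ ., data = stroke_train) |>
  step_dummy(all_nominal_predictors()) |>    # factors -> binary dummies
  step_zv(all_predictors()) |>               # drop single-value columns
  step_normalize(all_numeric_predictors())   # center and scale numerics

# bundle the model specification and the recipe into one workflow
lr_wflow <- workflow() |>
  add_model(lr_spec) |>
  add_recipe(stroke_rec)
```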
Perform fine-tuning
dials::grid_regular() : creates an expanded grid based on a combination of hyperparameter values
tune::tune_grid() : trains 30 penalized logistic regression models
control_grid() : controls aspects of the grid search process
Get the validation set metrics:
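A sketch of the tuning step; a single validation split is consistent with the n = 1 and missing std_err in the metrics table, but the resampling choice, seed, and object names are assumptions:

```r
# 30 candidate penalty values on a log scale
lr_grid <- grid_regular(penalty(), levels = 30)

# a single validation set carved out of the training data (assumption)
set.seed(234)
val_set <- validation_split(stroke_train, prop = 0.80, strata = status2)

lr_res <- lr_wflow |>
  tune_grid(resamples = val_set,
            grid      = lr_grid,
            control   = control_grid(save_pred = TRUE),
            metrics   = metric_set(roc_auc))

collect_metrics(lr_res)
```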
The roc_auc metric alone could lead us to multiple options for the best value of this hyperparameter:

# A tibble: 15 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.00137 roc_auc binary 0.716 1 NA Preprocessor1_Model12
2 0.00452 roc_auc binary 0.718 1 NA Preprocessor1_Model17
3 0.00574 roc_auc binary 0.720 1 NA Preprocessor1_Model18
4 0.00728 roc_auc binary 0.733 1 NA Preprocessor1_Model19
5 0.00924 roc_auc binary 0.758 1 NA Preprocessor1_Model20
6 0.0117 roc_auc binary 0.777 1 NA Preprocessor1_Model21
7 0.0149 roc_auc binary 0.797 1 NA Preprocessor1_Model22
8 0.0189 roc_auc binary 0.811 1 NA Preprocessor1_Model23
9 0.0240 roc_auc binary 0.817 1 NA Preprocessor1_Model24
10 0.0304 roc_auc binary 0.804 1 NA Preprocessor1_Model25
11 0.0386 roc_auc binary 0.814 1 NA Preprocessor1_Model26
12 0.0489 roc_auc binary 0.807 1 NA Preprocessor1_Model27
13 0.0621 roc_auc binary 0.798 1 NA Preprocessor1_Model28
14 0.0788 roc_auc binary 0.761 1 NA Preprocessor1_Model29
15 0.1 roc_auc binary 0.745 1 NA Preprocessor1_Model30
We prefer to choose a penalty value further along the x-axis, closer to where we start to see the decline in model performance.
Visualize the ROC curve using the best penalty value
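A sketch of selecting a penalty and plotting the ROC curve on the validation set; the chosen row and the `.pred_dead` column name (which assumes "dead" is the first factor level) are assumptions:

```r
# pick a candidate further along the penalty axis, guided by the
# table/plot of roc_auc values (row index chosen by inspection)
lr_best <- lr_res |>
  collect_metrics() |>
  arrange(penalty) |>
  slice(24)   # e.g., the penalty ~0.024 candidate

lr_auc <- lr_res |>
  collect_predictions(parameters = lr_best) |>
  roc_curve(status2, .pred_dead)

autoplot(lr_auc)
```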

errors in classifying subgroups of patients
errors in estimating risk levels
errors in predictions.
ethical differences
societal variations
For example: differences in skin color (when using machines to detect abnormalities on the skin)
Read more from this source

Statistical bias
Social bias
Strategies for mitigating bias across the different steps in machine learning systems development