Survival analysis estimates the probability that a particular status quo continues up to a given point in time. Equivalently, it estimates the probability that the status quo ends, i.e. that an event (a hazard) occurs. It is applied in several fields: in medicine, to estimate the probability of survival of a patient under treatment for a fatal disease; in engineering, to estimate reliability or predict failure; and in marketing and sales, for example to reduce churn.
Cricket
This piece illustrates the application of similar methods to the game of cricket. Survival here means ‘not getting out’ (not being dismissed) while batting, and the hazard is getting out. The ODI batting statistics of three Indian batsmen are used: Sachin Tendulkar, Sourav Ganguly and Rahul Dravid. The data has been sourced from ESPN.
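To make this mapping concrete, here is a minimal sketch of how an innings can be encoded as survival data. The values are hypothetical toy numbers, not the ESPN data, and the name `innings` is only illustrative. Balls faced act as the duration, a dismissal is the observed event, and a ‘not out’ innings is right-censored because the batsman was still surviving when the innings ended.

```r
library(survival)

# Hypothetical toy innings (not the ESPN data): balls faced and how the innings ended
innings <- data.frame(
  BF        = c(13, 83, 41, 8, 52),
  Dismissal = c("lbw", "stumped", "caught", "not out", "caught")
)

# Event indicator: 1 = dismissed (event observed), 0 = not out (right-censored)
innings$diss <- ifelse(innings$Dismissal == "not out", 0, 1)

# Surv() pairs each duration with its event indicator;
# censored durations are printed with a trailing "+"
innings_surv <- Surv(time = innings$BF, event = innings$diss)
print(innings_surv)
```

The same `Surv(BF, diss)` pairing is what the models later in the article are fitted on.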
```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# Data frames named ganguly, tendulkar and dravid are loaded

# Function to manipulate the data
mani <- function(x) {
  x <- x %>%
    filter(Pos != "-") %>%
    mutate(diss = ifelse(Dismissal == "not out", 0, 1)) %>% # This is the event of interest
    mutate(Runs = str_remove(Runs, "[*]")) %>%
    mutate(
      BF = as.integer(BF), Runs = as.integer(Runs), Mins = as.integer(Mins),
      Fours = as.integer(Fours), Sixes = as.integer(Sixes),
      SR = as.numeric(SR), Pos = as.integer(Pos),
      Dismissal = as.factor(Dismissal), Inns = as.factor(Inns),
      Date = dmy(Date)
    ) %>%
    separate(Opposition, c(NA, "Opposition"), extra = "merge", fill = "left") %>%
    separate(Ground, c(NA, "Ground"), extra = "drop", fill = "left") %>%
    # SR is runs per 100 balls, so missing runs are recovered as BF * SR / 100
    mutate(Runs = ifelse(is.na(Runs), round(BF * SR / 100, 0), Runs)) %>%
    # BF or Balls Faced will be used as unit of time
    mutate(h_cen = ifelse(Runs >= 50 & Runs < 100, 1, 0)) %>%
    mutate(cen = ifelse(Runs >= 100, 1, 0))
  # Adding dummy columns for possible future requirement
  # x <- dummy_cols(x, select_columns = c("Opposition", "Ground"))
  return(x)
}

# Adding player names in their data sets
ganguly$player <- rep("ganguly", nrow(ganguly))
tendulkar$player <- rep("tendulkar", nrow(tendulkar))
dravid$player <- rep("dravid", nrow(dravid))

# Combining data of the three players
data.raw <- rbind(ganguly, tendulkar, dravid)

# Structuring the data using the function created
data.df <- mani(data.raw)
head(data.df)
```
```
  Runs Mins BF Fours Sixes     SR Pos Dismissal Inns  X  Opposition     Ground
1    3   33 13     0     0  23.07   6       lbw    1 NA West Indies   Brisbane
2   46  117 83     3     0  55.42   3   stumped    1 NA     England Manchester
3   16   NA 41     3     0  39.02   3    caught    1 NA   Sri Lanka        RPS
4   36   59 52     3     1  69.23   3    caught    2 NA    Zimbabwe        SSC
5   59  102 75     7     0  78.66   7       lbw    1 NA   Australia        SSC
6   11   NA  8     1     0 137.50   8   not out    1 NA    Pakistan    Toronto
        Date  player diss h_cen cen
1 1992-01-11 ganguly    1     0   0
2 1996-05-26 ganguly    1     0   0
3 1996-08-28 ganguly    1     0   0
4 1996-09-01 ganguly    1     0   0
5 1996-09-06 ganguly    1     1   0
6 1996-09-17 ganguly    0     0   0
```
Estimating probability of staying ‘not out’
Kaplan-Meier analysis is non-parametric: it does not assume any particular distribution for the survival times. The basic estimator uses only the time and the event indicator, so the possible effect of a covariate such as Opposition is not considered unless the data is grouped by that covariate. Here the data is grouped by player, so that the survival statistics of the individual players can be observed. The data can be grouped further by Opposition or Ground.
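As a small illustration of what the estimator computes, the sketch below builds the Kaplan-Meier (product-limit) estimate by hand on hypothetical toy data and compares it with `survfit()` from the `survival` package. At each distinct number of balls, the survival probability is multiplied by the share of batsmen still at the crease who were not dismissed on that ball count. The data frame `toy` and its values are made up for illustration.

```r
library(survival)

# Hypothetical toy innings: balls faced (BF) and dismissal flag (1 = out, 0 = not out)
toy <- data.frame(
  BF   = c(5, 12, 12, 20, 31, 31, 44, 60),
  diss = c(1,  1,  0,  1,  1,  1,  0,  1)
)

# Library estimate (a single group, hence "~ 1")
km <- survfit(Surv(BF, diss) ~ 1, data = toy)

# Manual product-limit estimate at each distinct time
times  <- sort(unique(toy$BF))
n_risk <- sapply(times, function(t) sum(toy$BF >= t))                 # still batting just before t
n_out  <- sapply(times, function(t) sum(toy$BF == t & toy$diss == 1)) # dismissed at t
manual <- cumprod(1 - n_out / n_risk)

print(data.frame(balls = times, at_risk = n_risk, dismissed = n_out,
                 survival = round(manual, 3)))

# The hand-computed curve should agree with survfit()
print(all.equal(manual, km$surv))
```

Not-out innings never reduce the curve directly; they only leave the risk set, which is how censoring enters the estimate.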
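The model is fitted with `kap.mr.fit <- survfit(Surv(BF, diss) ~ player, data = data.df)`, and `summary(kap.mr.fit)$table` gives a condensed table per player. A detailed summary can be viewed by running `summary(kap.mr.fit)`.
The fitted curves are visualized below.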
```r
ggsurvplot(
  kap.mr.fit,
  pval = TRUE,                 # show p-value
  break.time.by = 25,          # break X axis by 25 balls
  # risk.table = "abs_pct",    # absolute number and percentage at risk
  # risk.table.y.text = FALSE, # show bars instead of names in text annotations
  linetype = "strata",         # change line type by groups
  conf.int = TRUE,             # show confidence intervals
  # conf.int.style = "step",   # customize style of confidence intervals
  surv.median.line = "hv",     # specify median survival
  ggtheme = theme_bw(),        # change ggplot2 theme
  # Strata are ordered alphabetically by player (dravid, ganguly, tendulkar),
  # so the legend labels must follow the same order
  legend.labs = c("Dravid", "Ganguly", "Tendulkar"),
  ncensor.plot = TRUE,         # plot the number of censored subjects (not outs) at time t
  # palette = c("#000000", "#2E9FDF", "#FF0000"),
  xlab = "Balls"               # x-axis label, set through ggsurvplot's own argument
)
```
Dravid tends to face more balls before being dismissed in half of his innings (a higher median), reflecting a more defensive or enduring playing style. Ganguly has the highest restricted mean, which indicates a slightly higher average number of balls faced, but with more variability. Tendulkar, despite facing the most balls overall, has a lower median, suggesting that his innings are often shorter but less variable at short durations. These results give a nuanced view of each player’s batting style and endurance in terms of balls faced per dismissal.
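The median and restricted mean discussed above, and the p-value shown on the plot, can all be read from the fitted object. The sketch below uses hypothetical toy data for two made-up batsmen, "A" and "B", and assumes the default behaviour of `ggsurvplot(pval = TRUE)`, which reports a log-rank test.

```r
library(survival)

# Hypothetical toy data for two made-up batsmen (not the ESPN data)
toy <- data.frame(
  player = rep(c("A", "B"), each = 7),
  BF     = c(4, 9, 15, 22, 30, 41, 50,   10, 25, 38, 52, 70, 95, 110),
  diss   = c(1, 1,  1,  0,  1,  1,  1,    1,  1,  0,  1,  1,  0,   1)
)

fit <- survfit(Surv(BF, diss) ~ player, data = toy)

# Median balls faced per player: where each survival curve first falls to 0.5 or below
medians <- summary(fit)$table[, "median"]
print(medians)

# Restricted mean: area under each curve up to a common cutoff, printed on request
print(fit, print.rmean = TRUE)

# Log-rank test comparing the two curves
lr <- survdiff(Surv(BF, diss) ~ player, data = toy)
p_value <- 1 - pchisq(lr$chisq, df = 1)  # df = number of groups - 1
print(p_value)
```

The median is robust to a few very long innings, while the restricted mean is pulled up by them, which is why the two can rank players differently.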
Estimating probability of Sixers
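A similar approach was used to estimate the probability of at least one sixer. The event, or hazard (for the fielding team), is the batsman hitting a sixer. Note that BF is the total number of balls faced in the innings, not the ball on which the first six was hit, so the time axis is an approximation. The cumulative event plot reveals interesting insights.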
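As a minimal sketch of this event coding, the example below marks an innings as an event when at least one six was hit and treats innings without a six as censored. The data is hypothetical. It also shows that the cumulative event curve is simply one minus the Kaplan-Meier survival curve.

```r
library(survival)

# Hypothetical toy innings (made-up values): balls faced and sixes hit
toy <- data.frame(
  BF    = c(6, 14, 23, 35, 48, 61, 77, 90),
  Sixes = c(0,  1,  0,  0,  2,  0,  1,  3)
)

# Event: at least one six in the innings; innings without a six are censored
toy$six <- ifelse(toy$Sixes > 0, 1, 0)

fit <- survfit(Surv(BF, six) ~ 1, data = toy)

# Cumulative event probability: 1 - S(t), the estimated chance of a six by ball t
cum_event <- 1 - fit$surv
print(data.frame(balls = fit$time, prob_six = round(cum_event, 3)))
```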
```r
# Creating new data set with result of sixes
data.new <- data.df %>%
  mutate(six = ifelse(Sixes > 0, 1, 0))

# Fitting the model
kap.six.fit <- survfit(Surv(BF, six) ~ player, data = data.new)
# summary(kap.six.fit)$table

ggsurvplot(
  kap.six.fit,
  pval = TRUE,
  break.time.by = 25,
  linetype = "strata",
  conf.int = TRUE,
  surv.median.line = "hv",
  ggtheme = theme_bw(),
  # Strata are ordered alphabetically by player, so the labels follow that order
  legend.labs = c("Dravid", "Ganguly", "Tendulkar"),
  ncensor.plot = TRUE,
  fun = "event",   # plot cumulative events (1 - survival)
  xlab = "Balls",
  ylab = "Sixer"
)
```
This analysis suggests that Ganguly is the most aggressive or efficient in terms of hitting sixes, as reflected by his lower median number of balls faced and a relatively narrow confidence interval. Tendulkar also demonstrates a balance between frequency and consistency in hitting sixes. Dravid, on the other hand, appears to be more conservative, facing more balls before hitting sixes, which aligns with his known playing style. The estimate for Dravid is also less certain, as suggested by the lack of an upper confidence limit. These insights provide a quantitative perspective on the players’ styles concerning hitting sixes.
A similar analysis was conducted using the Cox proportional hazards model, which can take the effect of other parameters into account. ‘Runs’ was used as an additional parameter in the new model, and the plot was drawn again.
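The code for that model is not shown in the article, so the following is only a sketch under stated assumptions: it fits a Cox model with `player` and `Runs` as covariates on simulated, hypothetical innings. Since runs are only known once an innings is over, the `Runs` coefficient here is descriptive rather than predictive.

```r
library(survival)

set.seed(42)  # simulated, hypothetical innings; not the ESPN data
n <- 90
sim <- data.frame(
  player = rep(c("A", "B", "C"), each = n / 3),
  BF     = ceiling(rexp(n, rate = 1 / 40)),   # balls faced
  diss   = rbinom(n, size = 1, prob = 0.85)   # 1 = dismissed, 0 = not out
)
# Runs = balls faced * strike rate / 100, with a random strike rate per innings
sim$Runs <- round(sim$BF * runif(n, 40, 120) / 100)

# Cox proportional hazards model with player and Runs as covariates
cox_fit <- coxph(Surv(BF, diss) ~ player + Runs, data = sim)

print(summary(cox_fit)$coefficients)
print(exp(coef(cox_fit)))   # hazard ratios: values below 1 mean a lower dismissal hazard

# Check of the proportional hazards assumption
print(cox.zph(cox_fit))
```

Unlike the grouped Kaplan-Meier fit, the Cox model gives one coefficient per covariate, so the effect of each can be read while the others are held fixed.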
Final Analyses
The final stage of the analysis uses a subset of the data used previously. Specifically, a few opposition teams were selected against whom the number of matches was substantially higher than against the rest.
```r
# Defining the opposition teams
teams <- c("Pakistan", "Sri Lanka", "Australia", "West Indies", "South Africa",
           "New Zealand", "Zimbabwe", "England", "Kenya", "Bangladesh")

# Creating new data sets by filtering the teams
data.pakistan  <- data.new %>% filter(Opposition == "Pakistan")
data.australia <- data.new %>% filter(Opposition == "Australia")
data.england   <- data.new %>% filter(Opposition == "England")

# Fitting Cox models
cox.diss.pakistan  <- survfit(coxph(Surv(BF, diss) ~ strata(player), data = data.pakistan))
cox.diss.australia <- survfit(coxph(Surv(BF, diss) ~ strata(player), data = data.australia))
cox.diss.england   <- survfit(coxph(Surv(BF, diss) ~ strata(player), data = data.england))

# Cumulative dismissal plots, one per opposition
diss.pakistan.plot <- autoplot(cox.diss.pakistan, conf.int = TRUE, fun = "event", pval = TRUE) +
  labs(x = "Balls", y = "Dismissal") +
  theme(legend.position = "top")
diss.australia.plot <- autoplot(cox.diss.australia, conf.int = TRUE, fun = "event", pval = TRUE) +
  labs(x = "Balls", y = "Dismissal") +
  theme(legend.position = "top")
diss.england.plot <- autoplot(cox.diss.england, conf.int = TRUE, fun = "event", pval = TRUE) +
  labs(x = "Balls", y = "Dismissal") +
  theme(legend.position = "top")

# Results
cox.diss.pakistan
cox.diss.australia
cox.diss.england
```
Dravid: Shows the highest endurance against England, facing the most balls on average, and a solid performance against Pakistan. However, he struggles more against Australia.
Ganguly: Exhibits a more aggressive style, facing fewer balls on average against all teams, with the least endurance against Australia.
Tendulkar: Demonstrates varying strategies, with aggressive play against Pakistan (facing the fewest balls) and better endurance against Australia and England.
```r
# Plotting and arranging
ggarrange(diss.pakistan.plot, diss.australia.plot, diss.england.plot, ncol = 3)
```
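The per-opposition analysis repeats the same steps for each team. As a hedged alternative, the sketch below splits the data by opposition and fits one curve set per team in a loop. It uses hypothetical toy data and a plain Kaplan-Meier fit rather than the stratified Cox fit used in the article.

```r
library(survival)

# Hypothetical toy data (made-up values, not the ESPN data)
toy <- data.frame(
  Opposition = rep(c("Pakistan", "Australia", "England"), each = 8),
  player     = rep(rep(c("A", "B"), each = 4), times = 3),
  BF         = c(12, 30, 45, 80,   8, 20, 33, 51,
                  5, 14, 27, 60,  10, 22, 40, 75,
                 18, 35, 66, 90,   7, 16, 29, 48),
  diss       = 1  # every toy innings ends in a dismissal, to keep the example simple
)

# One fit per opposition; split() orders the teams alphabetically
fit_by_team <- lapply(split(toy, toy$Opposition), function(d) {
  survfit(Surv(BF, diss) ~ player, data = d)
})

# Median balls faced per player (rows) and opposition (columns)
medians <- sapply(fit_by_team, function(f) summary(f)$table[, "median"])
print(medians)
```

Adding another opposition then only requires more rows in the data, not another copy of the fitting code.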
Further possibilities
After building any model, it should be tested on held-out test data to check its accuracy. Validation is required to understand how robust the model is. If the results are not satisfactory, the model is changed to improve accuracy, and the process is repeated as often as required.
The predictions are based on certain parameters, but the outcome will depend on more parameters than these. For example, the above analysis considers the opposition team as a parameter. However, the ground and the weather, which have not been considered, will also affect the outcome. It is always possible to add more parameters to improve the results.
The effect of a parameter may also change over time. For example, the opposition may change its line-up, say by adding a fast bowler, and batsmen may evolve to tackle a difficult opponent. The effect of the opponent on the outcome will then not remain the same over time. More complex methods, such as accelerated failure time models, can be applied to improve the model.
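A minimal sketch of such a validation step is given below, on simulated data. The covariate `Pos` (batting position), the 70/30 split and the simulated relationship are all assumptions made for illustration. The concordance index on the held-out set measures how well the model ranks innings by risk, with 0.5 corresponding to random ordering.

```r
library(survival)

set.seed(7)  # simulated, hypothetical innings; not the ESPN data
n <- 200
sim <- data.frame(Pos = sample(1:7, n, replace = TRUE))  # batting position
# Assumption for the simulation: lower-order positions have shorter innings on average
sim$BF   <- ceiling(rexp(n, rate = (1 / 40) * exp(0.15 * (sim$Pos - 1))))
sim$diss <- rbinom(n, size = 1, prob = 0.85)

# 70/30 train/test split
train_idx <- sample(seq_len(n), size = 0.7 * n)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training set only
fit <- coxph(Surv(BF, diss) ~ Pos, data = train)

# Concordance on the held-out test set
c_test <- concordance(fit, newdata = test)
print(c_test$concordance)
```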
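As a rough sketch of the accelerated failure time approach mentioned above, the example below fits a Weibull model with `survreg()` on simulated data. The players "A", "B" and "C", their scales and the choice of the Weibull distribution are assumptions for illustration, not results from the article.

```r
library(survival)

set.seed(11)  # simulated, hypothetical innings; not the ESPN data
n <- 120
sim <- data.frame(player = rep(c("A", "B", "C"), each = n / 3))

# Hypothetical average innings length (in balls) for each made-up player
scale_by_player <- c(A = 30, B = 40, C = 55)
sim$BF   <- ceiling(rexp(n, rate = 1 / scale_by_player[sim$player]))
sim$diss <- rbinom(n, size = 1, prob = 0.85)

# Weibull accelerated failure time model
aft_fit <- survreg(Surv(BF, diss) ~ player, data = sim, dist = "weibull")

# Exponentiated player coefficients are time ratios relative to player A:
# a value above 1 means innings are stretched, i.e. more balls before dismissal
print(exp(coef(aft_fit)))
```

Where the Cox model describes how covariates scale the hazard, the AFT model describes how they stretch or shrink the time to dismissal, which can be easier to interpret in balls faced.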
These are beyond the scope of this article. Feel free to try them out and don’t forget to share your results. Feel free to reach out if you need help.
Survival Analysis can be used to help reduce churn, reduce warranty cost and improve pricing, predict failure in machinery and more. If you are curious about how it can help you achieve your business goals, do not hesitate to contact me.
---
title: "Survival Analysis in Cricket"
author: "Asitav Sen"
date: "2020-05-27"
date-modified: "1/18/2025"
categories: [article, analysis, R]
format:
  html:
    page-layout: article
    lightbox: auto
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rvest)
library(lubridate)
library(dplyr)
library(tidyverse)
library(survival)
library(tidyr)
library(ggplot2)
library(timereg)
library(stringr)
library(ggfortify)
library(gridExtra)
library(caret)
library(survminer)
ganguly <- read.csv("ganguly.csv")
dravid <- read.csv("dravid.csv")
tendulkar <- read.csv("god.csv")
```