Survival Analysis in Cricket

article
analysis
R
Author

Asitav Sen

Published

May 27, 2020

Modified

January 18, 2025

Survival Analysis

Survival analysis estimates the probability that a particular status quo continues up to a given point in time. Equivalently, it estimates the probability that the status quo ends, i.e. that an event (or hazard) occurs. It is applied in several fields: in medicine, to estimate the probability that a patient being treated for a fatal disease survives; in engineering, to estimate reliability or predict failure; and in marketing and sales, for example to reduce churn.

Cricket

This piece illustrates the same approach applied to cricket. Survival here means ‘not getting out’ (not being dismissed) while batting, so the hazard is getting out. ODI batting statistics of three Indian batsmen are used: Sachin Tendulkar, Sourav Ganguly and Rahul Dravid. The data was sourced from ESPN.

Code
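#packages used below; the sources of autoplot() and ggarrange() are assumptions
library(dplyr)      # %>%, filter, mutate
library(stringr)    # str_remove
library(lubridate)  # dmy
library(tidyr)      # separate
library(survival)   # Surv, survfit, coxph
library(survminer)  # ggsurvplot
library(ggfortify)  # autoplot for survfit objects (assumed)
library(ggpubr)     # ggarrange (assumed)

#data frames named ganguly, tendulkar and dravid are loaded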

#function to manipulate data
mani<-function(x){
x<-x%>%
  filter(Pos != "-")%>%
  mutate(diss=ifelse((Dismissal=="not out"),0,1))%>%    #This is the event of interest
  mutate(Runs=str_remove(Runs,"[*]"))%>%
  mutate(BF=as.integer(BF), Runs=as.integer(Runs),
         Mins=as.integer(Mins),Fours=as.integer(Fours),Sixes=as.integer(Sixes),
         SR=as.numeric(SR),Pos=as.integer(Pos),
         Dismissal=as.factor(Dismissal), Inns=as.factor(Inns),
         Date=dmy(Date))%>%
  separate(Opposition,c(NA,"Opposition"),extra = "merge", fill = "left")%>%
  separate(Ground,c(NA,"Ground"),extra = "drop", fill = "left")%>%
  mutate(Runs=ifelse(is.na(Runs),round(BF*SR/100,0),Runs))%>%   #fill missing Runs from SR (runs per 100 balls); BF (balls faced) is the unit of time
  mutate(h_cen=ifelse((Runs>=50 & Runs<100),1,0))%>%
  mutate(cen=ifelse(Runs>=100,1,0))
#Adding dummy columns for possible future requirement
#x<-dummy_cols(x,select_columns = c("Opposition","Ground"))     
return(x)
}

#Adding player names in their data sets

ganguly$player<-c(rep("ganguly",nrow(ganguly))) 
tendulkar$player<-c(rep("tendulkar",nrow(tendulkar)))
dravid$player<-c(rep("dravid",nrow(dravid)))

#Combining data of the three players

data.raw<-rbind(ganguly,tendulkar, dravid)

#structuring the data using the function created

data.df<-mani(data.raw)
head(data.df)
  Runs Mins BF Fours Sixes     SR Pos Dismissal Inns  X  Opposition     Ground
1    3   33 13     0     0  23.07   6       lbw    1 NA West Indies   Brisbane
2   46  117 83     3     0  55.42   3   stumped    1 NA     England Manchester
3   16   NA 41     3     0  39.02   3    caught    1 NA   Sri Lanka        RPS
4   36   59 52     3     1  69.23   3    caught    2 NA    Zimbabwe        SSC
5   59  102 75     7     0  78.66   7       lbw    1 NA   Australia        SSC
6   11   NA  8     1     0 137.50   8   not out    1 NA    Pakistan    Toronto
        Date  player diss h_cen cen
1 1992-01-11 ganguly    1     0   0
2 1996-05-26 ganguly    1     0   0
3 1996-08-28 ganguly    1     0   0
4 1996-09-01 ganguly    1     0   0
5 1996-09-06 ganguly    1     1   0
6 1996-09-17 ganguly    0     0   0

Estimating probability of staying ‘not out’

Kaplan-Meier analysis is non-parametric: it makes no assumption about the shape of the survival curve and uses only the event indicator and the time. Covariates, e.g. the possible effect of the opposition, are not considered unless the data is grouped by them. Here the data is grouped by player, so that survival statistics are obtained for each player individually. It could be grouped further by Opposition or Ground.
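To make the estimator concrete, the sketch below computes a Kaplan-Meier curve by hand on a made-up set of eight innings and compares it with survfit(). At each ball count where a dismissal occurs, the running survival estimate is multiplied by (1 - d/n), where d is the number of dismissals at that ball count and n is the number of innings still in progress just before it. The toy data frame and the names km_by_hand and km_survfit are illustrative assumptions and are not taken from the ESPN data.

```r
library(survival)  # ships with standard R distributions

# Made-up toy innings, for illustration only (NOT from the ESPN data):
# balls faced, and whether the innings ended in a dismissal (1) or not out (0)
toy <- data.frame(
  BF   = c(5, 12, 12, 20, 33, 33, 47, 60),
  diss = c(1,  1,  0,  1,  1,  1,  0,  1)
)

# Distinct ball counts at which at least one dismissal happened
event_times <- sort(unique(toy$BF[toy$diss == 1]))

# n: innings still in progress just before each event time
n_at_risk <- sapply(event_times, function(tt) sum(toy$BF >= tt))
# d: dismissals exactly at each event time
n_events  <- sapply(event_times, function(tt) sum(toy$BF == tt & toy$diss == 1))

# Kaplan-Meier estimate: running product of (1 - d/n)
km_by_hand <- cumprod(1 - n_events / n_at_risk)

# The same estimate from survfit(), read off at the event times
fit <- survfit(Surv(BF, diss) ~ 1, data = toy)
km_survfit <- summary(fit, times = event_times)$surv

print(data.frame(event_times, n_at_risk, n_events, km_by_hand, km_survfit))
```

The not-out innings (diss = 0) never count as events, but they stay in the at-risk count until the ball on which they ended. That is how censoring enters the estimate.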

Code
kap.mr.fit<-survfit(Surv(BF,diss) ~ player, data=data.df)
summary(kap.mr.fit)$table
                 records n.max n.start events    rmean se(rmean) median 0.95LCL
player=dravid        318   318     318    278 52.39229  2.131667     44      38
player=ganguly       300   300     300    279 53.35585  2.532818     41      32
player=tendulkar     452   452     452    412 50.24463  2.051203     35      30
                 0.95UCL
player=dravid         56
player=ganguly        45
player=tendulkar      42

A detailed summary can be viewed by running ‘summary(kap.mr.fit)’.

This is visualized below

Code
ggsurvplot(
  kap.mr.fit,
  pval = TRUE, # show p-value
  break.time.by = 25, #break X axis by 25 balls
  #risk.table = "abs_pct", # absolute number and percentage at risk
  #risk.table.y.text = FALSE,# show bars instead of names in text annotations
  linetype = "strata",
  # Change line type by groups
  conf.int = TRUE,
  # show confidence intervals
  #conf.int.style = "step",  # customize style of confidence intervals
  surv.median.line = "hv",
  # Specify median survival
  ggtheme = theme_bw(),
  # Change ggplot2 theme
  legend.labs =
    c("Dravid", "Ganguly", "Tendulkar"),
  # legend labels, in the same (alphabetical) order as the strata
  ncensor.plot = TRUE,
  # plot the number of censored innings (not outs) at time t
  #palette = c("#000000", "#2E9FDF","#FF0000")
)+
  labs(x="Balls")

Dravid has the highest median: his estimated probability of still being not out falls to 50% at 44 balls (95% CI 38 to 56), compared with 41 balls for Ganguly (32 to 45) and 35 for Tendulkar (30 to 42). This is consistent with a more defensive, enduring style. The restricted means tell a slightly different story: Ganguly's is the highest (53.4 balls) and has the largest standard error (2.5), while Tendulkar's is the lowest (50.2), but the three values differ by only about three balls, which is small relative to their standard errors. Tendulkar has the most innings in the data (452) yet the lowest median, so his innings more often ended early. Together, these figures give a compact view of how long each batsman typically survived at the crease.

Estimating probability of Sixers

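A similar approach was used to estimate the probability of hitting at least one six (a ‘sixer’) in an innings. The event, or hazard (for the fielding team), is the batsman hitting a six. Note that BF is the length of the whole innings, not the ball on which the first six was hit, so the time to the event is an approximation. The plot, which shows the cumulative probability of the event rather than survival, reveals some interesting patterns.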

Code
#creating new data set with result of sixes
data.new<-data.df%>%
  mutate(six=ifelse(Sixes>0,1,0))

#fitting the model
kap.six.fit<-survfit(Surv(BF,six) ~ player, data=data.new)

#summary(kap.six.fit)$table
ggsurvplot(
  kap.six.fit,
  pval = TRUE,
  break.time.by = 25,
  linetype = "strata",
  conf.int = TRUE,
  surv.median.line = "hv",
  ggtheme = theme_bw(),
  legend.labs =
    c("Dravid", "Ganguly", "Tendulkar"),   # same (alphabetical) order as the strata
  ncensor.plot = TRUE,
  fun = "event"
)+
  labs(x="Balls", y="Sixer")

This analysis suggests that Ganguly is the quickest of the three to hit a six, as reflected by his lower median number of balls faced and a relatively narrow confidence interval. Tendulkar sits in between. Dravid appears more conservative, facing more balls before hitting a six, which matches his known playing style. The estimate for Dravid is also less precise: the upper confidence limit of his median cannot be estimated from the data. These results give a quantitative view of how the three players differ in hitting sixes.

A similar analysis was conducted using the Cox proportional hazards model, which can account for the effect of other covariates. ‘Runs’ was added as a covariate in the new model and the hazard plot was drawn again.
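The code for this model is not shown in the article, so the following is only a sketch of what such a fit could look like. It assumes the formula Surv(BF, six) ~ Runs + strata(player), which is a guess at the specification, and it runs on simulated data (the data frame sim is a hypothetical stand-in for data.new, with random values rather than real scores).

```r
library(survival)

set.seed(1)
n <- 150

# Simulated stand-in for data.new; values are random, NOT real scores
sim <- data.frame(
  player = rep(c("dravid", "ganguly", "tendulkar"), each = n / 3),
  BF     = rpois(n, 40) + 1L              # balls faced, at least 1
)
sim$Runs <- rpois(n, sim$BF * 0.8)        # runs loosely tied to balls faced
sim$six  <- rbinom(n, 1, plogis(-2 + 0.03 * sim$Runs))  # 1 = at least one six

# Runs enters as a covariate; strata(player) gives each player
# a separate baseline hazard instead of a single shared one
cox.six.fit <- coxph(Surv(BF, six) ~ Runs + strata(player), data = sim)
print(summary(cox.six.fit)$coefficients)

# One adjusted curve per player, evaluated at the mean of Runs
cox.six.curves <- survfit(cox.six.fit)
print(cox.six.curves)
```

Calling plot(cox.six.curves, fun = "event") on the result would draw the cumulative probability of a six per player, comparable to the Kaplan-Meier plot above but adjusted for Runs.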


Final Analyses

The final stage of the analysis uses a subset of the data. Specifically, a few opposition teams were selected against whom the number of matches was substantially higher than against the rest.

Code
#Defining the opposition teams (only Pakistan, Australia and England are analysed below)

teams<-c("Pakistan","Sri Lanka", "Australia", "West Indies", "South Africa", "New Zealand",
"Zimbabwe", "England", "Kenya", "Bangladesh")

#Creating new data sets by filtering on the opposition team

data.pakistan<-data.new%>%filter(Opposition=="Pakistan")
data.australia<-data.new%>%filter(Opposition=="Australia")
data.england<-data.new%>%filter(Opposition=="England")

# Fitting Cox models stratified by player (no covariates), one per opposition

cox.diss.pakistan<-survfit(coxph(Surv(BF,diss) ~ strata(player),
                    data = data.pakistan))
cox.diss.australia<-survfit(coxph(Surv(BF,diss) ~ strata(player),
                    data = data.australia))
cox.diss.england<-survfit(coxph(Surv(BF,diss) ~ strata(player),
                    data = data.england))


diss.pakistan.plot<-autoplot(cox.diss.pakistan,conf.int = TRUE,fun='event',pval=TRUE)+labs(x="Balls",y="Dismissal")+theme(legend.position = "top")

diss.australia.plot<-autoplot(cox.diss.australia,conf.int = TRUE,fun='event',pval=TRUE)+labs(x="Balls",y="Dismissal")+theme(legend.position = "top")

diss.england.plot<-autoplot(cox.diss.england,conf.int = TRUE,fun='event',pval=TRUE)+labs(x="Balls",y="Dismissal")+theme(legend.position = "top")


#Results

cox.diss.pakistan
Call: survfit(formula = coxph(Surv(BF, diss) ~ strata(player), data = data.pakistan))

           n events median 0.95LCL 0.95UCL
dravid    55     52     42      30      72
ganguly   50     48     36      28      47
tendulkar 67     63     27      18      48
Code
cox.diss.australia
Call: survfit(formula = coxph(Surv(BF, diss) ~ strata(player), data = data.australia))

           n events median 0.95LCL 0.95UCL
dravid    39     39     29      21      57
ganguly   33     33     24      12      42
tendulkar 70     69     38      26      66
Code
cox.diss.england
Call: survfit(formula = coxph(Surv(BF, diss) ~ strata(player), data = data.england))

           n events median 0.95LCL 0.95UCL
dravid    29     26     50      23      79
ganguly   26     25     43      31      79
tendulkar 37     33     40      29      59

Dravid: Has his highest median against England (50 balls) and a solid one against Pakistan (42), but a much lower one against Australia (29).

Ganguly: Sits between the other two against Pakistan (36) and England (43), and has the lowest median of the three against Australia (24).

Tendulkar: Has the lowest median of the three against Pakistan (27) and England (40), which may reflect more aggressive play, but the highest against Australia (38).

The confidence intervals are wide and overlap heavily within each opposition (e.g. 23 to 79 for Dravid against England), so these differences should be read as tentative.


Code
#Plotting and arranging 
ggarrange(diss.pakistan.plot,diss.australia.plot,diss.england.plot,ncol=3)


Further possibilities

After building any model, it should be evaluated on test data to check its accuracy. Validation is required to understand how robust the model is. If the results are not satisfactory, the model is changed to improve accuracy, and the process is repeated as required.
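As a rough sketch of such a check, the example below holds out part of the data, fits a Cox model on the rest and measures the concordance (C-index) on the held-out innings. The data is simulated, and the choice of player and batting position (Pos) as predictors is an assumption made for illustration; the article does not prescribe a validation scheme.

```r
library(survival)

set.seed(42)
n <- 300

# Simulated innings; a hypothetical stand-in for data.df
sim <- data.frame(
  player = factor(sample(c("dravid", "ganguly", "tendulkar"), n, replace = TRUE)),
  Pos    = sample(1:7, n, replace = TRUE)
)
# Balls faced: lower-order positions are simulated with shorter innings
sim$BF   <- rgeom(n, prob = 0.015 + 0.003 * sim$Pos) + 1L
sim$diss <- rbinom(n, 1, 0.9)   # roughly 10% not outs (censored)

# Hold out 30% of the innings as test data
test_idx <- sample(n, size = round(0.3 * n))
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

fit <- coxph(Surv(BF, diss) ~ player + Pos, data = train)

# Concordance on unseen innings:
# 0.5 is no better than chance, 1 is a perfect ranking of innings lengths
c_test <- concordance(fit, newdata = test)
print(c_test$concordance)
```

A large gap between the concordance on train and on test would suggest that the model is overfitting.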

The predictions depend on the parameters included, and the outcome depends on more parameters than these. For example, the above analysis considers the opposition team, but the ground and the weather, which also affect the outcome, have not been considered. More parameters can always be added to improve the results.

The effect of a parameter may also change over time. For example, the opposition may change its line-up, say by adding a fast bowler, or the batsmen may adapt to tackle a difficult opponent. The effect of the opponent on the outcome then does not stay the same over time. More complex methods, such as accelerated failure time models, can be applied to improve the model.
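For readers who want a starting point, here is a minimal sketch of an accelerated failure time model using survreg() from the survival package. The Weibull distribution, the two-player simulated data frame sim and the scale values are all assumptions chosen for illustration. A plain AFT model like this one does not by itself capture effects that change over time.

```r
library(survival)

set.seed(7)
n <- 200

# Simulated innings for two players; NOT real data
sim <- data.frame(
  player = factor(rep(c("dravid", "ganguly"), each = n / 2))
)
# Weibull innings lengths; the second player is simulated with a smaller scale
true_scale <- ifelse(sim$player == "dravid", 60, 45)
sim$BF   <- ceiling(rweibull(n, shape = 1.2, scale = true_scale))
sim$diss <- rbinom(n, 1, 0.9)   # roughly 10% not outs (censored)

# survreg() needs strictly positive times, so innings with BF == 0
# would have to be dropped or shifted before fitting real data
aft.fit <- survreg(Surv(BF, diss) ~ player, data = sim, dist = "weibull")

# exp() of the player coefficient is a time ratio relative to the baseline
# player: a value below 1 means innings are shorter by that factor
print(exp(coef(aft.fit)))
```

Unlike the Cox model, which describes how covariates scale the hazard, an AFT model describes how they stretch or shrink the time to dismissal, which can be easier to explain in terms of balls faced.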

These are beyond the scope of this article. Feel free to try them out and don’t forget to share your results. Feel free to reach out if you need help.


Survival Analysis can be used to help reduce churn, reduce warranty cost and improve pricing, predict failure in machinery and more. If you are curious about how it can help you achieve your business goals, do not hesitate to contact me.
