Jonathan Sedar

Personal website and new home of The Sampler blog

Survival Analysis: Part1 - What is it?

Posted at — 24 Mar 2015

Editor's note: this post is from The Sampler archives in 2015. During the last 4 years a lot has changed, not least that now most companies in most sectors have contracted / employed data scientists, and built / bought / run systems and projects. Most professionals will at least have an opinion now on data science, occasionally earned.

Let's see how well this post has aged - feel free to comment below - and I might write a followup at some point.

In this series of blogposts we'll explain tools & techniques of dealing with time-to-event data, and demonstrate how survival analysis is integral to many business processes.

Survival analysis is long established within actuarial science and medical research but seems infrequently used in general data science projects. In this series of blogposts we will:

By the end you should have a better understanding of the theory, some tools and techniques, and hopefully gain some ideas about how survival analysis can be applied to all manner of event-based processes that are often crucial to business operations.

A Quick Definition and Some Examples

Wikipedia defines survival analysis as:

a branch of statistics that deals with analysis of time duration until one or more events happen, such as death in biological organisms and failure in mechanical systems.

We might for example, expect a life actuary to try to predict what proportion of the general population will survive past a certain age1. She might furthermore want to know the ‘rate of change of survival’ (the hazard function), and the characteristics of individuals which most influence their survival.

More generally, we can use survival analysis to model the expected time-to-event for a wide variety of situations:

A worked example: truck maintenance scheduling

Imagine we have a fleet of haulage trucks and we're particularly interested in the elapsed time between purchase and first maintenance event (aka repairs aka servicing).

We could use this analysis to:

The following sections refer to this example.

The Basics

Plot a survival curve

Illustration of a survival curve and half-life

Observe:

The survival function $S(t)$

The survival function, $S(t)$, of an individual is the probability that they survive until at least time $t$.

$$ \begin{align} S(t) = P(T > t) \end{align} $$

…where $t$ is a time of interest and $T$ is the time of event.

The survival curve is non-increasing (the event may not reoccur for an individual) and is limited within the range $[0,1]$.

Note:

The half-life $t_{\frac{1}{2}}$

We saw the half-life of truck repair illustrated above (the orange arrows). Most people are familiar with measuring this:

  1. select a group of individuals - our fleet of trucks - and measure how long it takes for the event of interest to occur
  2. once the event of interest - the first service repair - occurs for half of the population, that period is aka the half-life

There's nothing particularly special about the half-life, and we might be interested in the time taken for e.g. 25% of the trucks to come in for first service, or 75% or 90% etc.

The hazard function $\lambda(t)$

The hazard function (aka decay function) $\lambda(t)$ is related to the survival function, and tells us the probability that the event $T$ occurs in the next instant $t + \delta t$, given that the individual has reached timestep $t$:

$$ \begin{align} \lambda(t) = \lim_{\delta{t} \rightarrow 0} \frac{\Pr(t \leq T < t + \delta t \ | \ T > t)}{\delta{t}} \end{align} $$

With some maths3 one can work back to the survival function $S(t)$:

$$ \begin{align} S(t) = exp(- \int\limits_{0}^{t} \lambda(u)du) \end{align} $$

The hazard function $\lambda(t)$ is non-parametric, so we can fit a pattern of events that is not necessarily monotonic.

Other measurements and important considerations

The cumulative hazard function

An alternative representation of the time-to-event behaviour is the cumulative hazard function $\Lambda(t)$, which is essentially the summing of the hazard function over time, and is used by some models for its greater stability.

We can show:

$$ \begin{align} \Lambda(t) &= \int\limits_{0}^{t} \lambda(u)du \newline &= -\log S(t) \end{align} $$

… and the simple relation of $\Lambda(t)$ to the survival function $S(t)$ is a nice property, exploited in particular by the Cox Proportional Hazards and Aalen Additive models, which we'll demonstrate later.

Stating it slightly differently, we can parametrically relate the attributes of the individuals to their survival curve:

$$ \begin{align} S(t) = -e^{\sum \Lambda_i(t)} \end{align} $$

This powerful approach is known as Survival Regression. In the trucks example, we might want to know the relative impact of engine size, hours of service, geographical regions driven etc upon the time from first purchase to first service.

Censoring and truncation

We're measuring time-to-event in the real world and so there's practical constraints on the period of study and how to treat individuals that fall outside that period. Censoring is when the event of interest (repair, first sale, etc) occurs outside the study period, and truncation is due to the study design.

The following discussion continues our hypothetical truck maintenance study:

Example of censoring in our truck maintenance scenario

Observe:

  1. We are unlikely to encounter such right-truncation in practice, since we're dealing with well-kept database records for purchases.

Censoring and truncation differ from one analysis to the next and it's always vitally important to understand the limitations of the study and state the heuristics used. Generally, one can expect to deal often with right-censoring, occasionally with left-truncation, and very rarely with left-censoring or right-truncation.

What models can we use?

The very simplest survival models are really just tables of event counts: non-parametric, easily computed and a good place to begin modelling to check assumptions, data quality and end-user requirements etc.

Kaplan-Meier Model

This model gives us a maximum-likelihood estimate of the survival function $\hat S(t)$ like that shown in the first diagram above.

$$ \begin{align} \hat{S}(t) = \prod\limits_{t_{i} < t} \frac{n_i - d_i}{n_i} \end{align} $$

where:

The cumulative product gives us a non-increasing curve where we can read off, at any timestep during the study, the estimated probability of survival from the start to that timestep. We can also compute the estimated survival time or median survival time (half-life) as shown above.

Nelson-Aalen Model

A close alternative method is the Nelson-Aalen model, which estimates the cumulative hazard function $\Lambda(t) = -log S(t)$ and is more stable since it has a summation form:

$$ \begin{align} \hat \Lambda(t) = \sum\limits_{t_i\leq t} \frac{d_i}{n_i} \end{align} $$

where again:

This approach of estimating the hazard function is the basis for many more methods, including:

Cox Proportional Hazards Model

Having computed the survival function4 for a population, the logical next step is to understand the effects of different characteristics of the individuals. In our truck example above, we might want to know whether maintenance periods are affected more or less by mileage, or by types of roads driven, or the manufacturer, model or load-capacity of truck etc.

The Cox PH model is stated as an exponential decay model, yielding a semi-parametric method of estimating the hazard function at time $t$ given a baseline hazard that's modified by a set of covariates:

$$ \begin{align} \lambda(t|X) &= \lambda_0(t) e^{\beta_1x_1 + \cdots + \beta_px_p} \newline &= \lambda_0(t) e^{\beta\vec{x}} \end{align} $$

where:

We can now make comparative statements such as:

There's many more models…

Survival analysis has been developed by many hands over many years and there's an embarrassment of riches in the literature, including:

… but we'll save the detail on those for specific examples in future posts in this series.

Final note

It takes a surprising amount of detail to explain the basics of survival analysis, so thank you for reading this far! As noted above, time-to-event analyses are very widely applicable to all sorts of real-world behaviours - not just studies of lifespan in actuarial or medical science.

In the rest of this series we'll use a publicly available dataset to demonstrate implementing a survival model, interpreting the results, and we'll try to learn something along the way.


  1. Life insurance was amongst the first commercial endeavors of the modern era to use data analysis: with the first life tables compiled by the astronomer and physicist Edmond Halley in the 1690's. Halley was involved in a surprising amount of analytical research, and as a contemporary of Flamsteed, Hooke, Wren and Newton he made a tremendous contribution to mathematics. This book is a good place to start. In the mid 1700's mathematical tools were in place to allow Edward Rowe Mores to establish the world's first mutual life insurer based on sound actuarial principles.

  2. If you have an old watch (pre-2000) with “T SWISS MADE T” somewhere on the dial, the T indicates that a luminous Tritium-containing compound is painted onto the hands and dial. You'll probably also sadly notice that it's completely failing to glow in the dark because the radioactive half-life of ~12.4 years means there's very little Tritium left now.

  3. For a thorough but easily-digested derivation of survival and hazard functions, see the Wikipedia page.

  4. The survival, hazard and cumulative hazard functions are often mentioned seemingly-interchangeably in casual explanations of survival analysis, since usually if you have one function you can have the other(s). In our experience its best to present results to general audiences as a computed survival curve regardless of the model used.


comments powered by Disqus