Survival analysis
Survival analysis is a branch of statistics for analyzing the expected duration of time until an event occurs, such as death in biological organisms or failure in mechanical systems. This topic is called reliability theory, reliability analysis or reliability engineering in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer questions such as: What proportion of a population will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?
To answer such questions, it is necessary to define "lifetime". In the case of biological survival, death is unambiguous, but for mechanical reliability, failure may not be well defined: in some mechanical systems failure is partial, a matter of degree, or otherwise not localized in time. Even in biological problems, some events may have the same ambiguity. The theory outlined below assumes well-defined events at specific times; other cases may be better treated by models which explicitly account for ambiguous events.
More generally, survival analysis involves the modelling of time-to-event data. In the survival analysis literature, death or failure is considered an "event". Traditionally only a single event occurs for each subject, after which the organism or mechanism is dead or broken. Recurring event or repeated event models relax that assumption. The study of recurring events is relevant in systems reliability, and in many areas of social sciences and medical research.
Introduction to survival analysis
Survival analysis is used in several ways:
- To describe the survival times of members of a group
  - Life tables
  - Kaplan–Meier curves
  - Survival function
  - Hazard function
- To compare the survival times of two or more groups
  - Log-rank test
- To describe the effect of categorical or quantitative variables on survival
  - Cox proportional hazards regression
  - Parametric survival models
  - Survival trees
  - Survival random forests
Definitions of common terms in survival analysis
- Event: Death, disease occurrence, disease recurrence, recovery, or other experience of interest
- Time: The time from the beginning of an observation period to an event, or end of the study, or loss of contact or withdrawal from the study.
- Censoring / Censored observation: Censoring occurs when we have some information about individual survival time, but we do not know the survival time exactly. The subject is censored in the sense that nothing is observed or known about that subject after the time of censoring. A censored subject may or may not have an event after the end of observation time.
- Survival function S(t): The probability that a subject survives longer than time t.
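The last definition can be written compactly as a formula. The symbol T for the random survival time of a subject is notation assumed here for illustration; it does not appear in the definitions above.

```latex
S(t) = \Pr(T > t), \qquad t \ge 0
```

Under this notation S(0) = 1 whenever no subject has an event at time zero, and S(t) cannot increase as t grows.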
Example: Acute myelogenous leukemia survival data
The aml data set, sorted by survival time, is shown in the table below.
| observation | time | status | x |
| 12 | 5 | 1 | Nonmaintained |
| 13 | 5 | 1 | Nonmaintained |
| 14 | 8 | 1 | Nonmaintained |
| 15 | 8 | 1 | Nonmaintained |
| 1 | 9 | 1 | Maintained |
| 16 | 12 | 1 | Nonmaintained |
| 2 | 13 | 1 | Maintained |
| 3 | 13 | 0 | Maintained |
| 17 | 16 | 0 | Nonmaintained |
| 4 | 18 | 1 | Maintained |
| 5 | 23 | 1 | Maintained |
| 18 | 23 | 1 | Nonmaintained |
| 19 | 27 | 1 | Nonmaintained |
| 6 | 28 | 0 | Maintained |
| 20 | 30 | 1 | Nonmaintained |
| 7 | 31 | 1 | Maintained |
| 21 | 33 | 1 | Nonmaintained |
| 8 | 34 | 1 | Maintained |
| 22 | 43 | 1 | Nonmaintained |
| 9 | 45 | 0 | Maintained |
| 23 | 45 | 1 | Nonmaintained |
| 10 | 48 | 1 | Maintained |
| 11 | 161 | 0 | Maintained |
- Time is indicated by the variable "time", which is the survival or censoring time
- Event is indicated by the variable "status": 0 = no event (censored), 1 = event
- Treatment group: the variable "x" indicates if maintenance chemotherapy was given
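As a minimal sketch of how the table and its three variables might be held in code, the rows can be transcribed as tuples and summarized. The names `AML`, `n_events`, `censored_times` and `by_group` are illustrative choices, not part of the data set.

```python
from collections import Counter

# Rows transcribed from the aml table: (observation, time in weeks, status, x).
# status: 1 = event, 0 = censored.
AML = [
    (12, 5, 1, "Nonmaintained"), (13, 5, 1, "Nonmaintained"),
    (14, 8, 1, "Nonmaintained"), (15, 8, 1, "Nonmaintained"),
    (1, 9, 1, "Maintained"), (16, 12, 1, "Nonmaintained"),
    (2, 13, 1, "Maintained"), (3, 13, 0, "Maintained"),
    (17, 16, 0, "Nonmaintained"), (4, 18, 1, "Maintained"),
    (5, 23, 1, "Maintained"), (18, 23, 1, "Nonmaintained"),
    (19, 27, 1, "Nonmaintained"), (6, 28, 0, "Maintained"),
    (20, 30, 1, "Nonmaintained"), (7, 31, 1, "Maintained"),
    (21, 33, 1, "Nonmaintained"), (8, 34, 1, "Maintained"),
    (22, 43, 1, "Nonmaintained"), (9, 45, 0, "Maintained"),
    (23, 45, 1, "Nonmaintained"), (10, 48, 1, "Maintained"),
    (11, 161, 0, "Maintained"),
]

n_total = len(AML)
n_events = sum(status for _, _, status, _ in AML)      # subjects with an event
n_censored = n_total - n_events                         # subjects censored
censored_times = sorted(t for _, t, status, _ in AML if status == 0)
by_group = Counter(group for _, _, _, group in AML)     # subjects per treatment group

print(n_total, n_events, n_censored)
print(censored_times)
print(dict(by_group))
```

The counts give a quick check that the transcription matches the table before any survival estimate is computed.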
Kaplan–Meier plot for the aml data
The survival function S(t) is the probability that a subject survives longer than time t. S(t) is theoretically a smooth curve, but it is usually estimated using the Kaplan–Meier curve. The graph shows the KM plot for the aml data and can be interpreted as follows:
- The x axis is time, from zero to the last observed time point.
- The y axis is the proportion of subjects surviving. At time zero, 100% of the subjects are alive without an event.
- The solid line shows the progression of event occurrences.
- A vertical drop indicates an event. In the aml table shown above, two subjects had events at five weeks, two had events at eight weeks, one had an event at nine weeks, and so on. These events at five weeks, eight weeks and so on are indicated by the vertical drops in the KM plot at those time points.
- At the far right end of the KM plot there is a tick mark at 161 weeks. The vertical tick mark indicates that a patient was censored at this time. In the aml data table five subjects were censored, at 13, 16, 28, 45 and 161 weeks. There are five tick marks in the KM plot, corresponding to these censored observations.
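The vertical drops and tick marks described above can be located with a short calculation. The following is a sketch of the Kaplan–Meier product-limit estimate in plain Python, assuming the times and statuses are transcribed from the aml table in sorted order; the function name `kaplan_meier` and the variable names are illustrative.

```python
# Survival or censoring times (weeks) and event indicators from the aml table.
times = [5, 5, 8, 8, 9, 12, 13, 13, 16, 18, 23, 23,
         27, 28, 30, 31, 33, 34, 43, 45, 45, 48, 161]
status = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
          1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0]


def kaplan_meier(times, status):
    """Return (time, n_risk, n_event, survival) for each distinct event time."""
    rows = []
    surv = 1.0
    event_times = sorted({t for t, s in zip(times, status) if s == 1})
    for t in event_times:
        # At risk: every subject whose event or censoring time is >= t.
        n_risk = sum(1 for u in times if u >= t)
        n_event = sum(1 for u, s in zip(times, status) if u == t and s == 1)
        surv *= 1.0 - n_event / n_risk   # the curve drops by this factor at t
        rows.append((t, n_risk, n_event, surv))
    return rows


km = kaplan_meier(times, status)
drop_times = [t for t, _, _, _ in km]                       # vertical drops
tick_times = [t for t, s in zip(times, status) if s == 0]   # censoring tick marks

for t, n_risk, n_event, surv in km:
    print(f"{t:>3} {n_risk:>2} {n_event} {surv:.4f}")
```

Between consecutive drop times the estimate stays constant, which is what gives the plotted curve its staircase shape.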
Life table for the aml data
| time | n.risk | n.event | survival | std.err | lower 95% CI | upper 95% CI |
| 5 | 23 | 2 | 0.913 | 0.0588 | 0.8049 | 1 |
| 8 | 21 | 2 | 0.8261 | 0.079 | 0.6848 | 0.996 |
| 9 | 19 | 1 | 0.7826 | 0.086 | 0.631 | 0.971 |
| 12 | 18 | 1 | 0.7391 | 0.0916 | 0.5798 | 0.942 |
| 13 | 17 | 1 | 0.6957 | 0.0959 | 0.5309 | 0.912 |
| 18 | 14 | 1 | 0.646 | 0.1011 | 0.4753 | 0.878 |
| 23 | 13 | 2 | 0.5466 | 0.1073 | 0.3721 | 0.803 |
| 27 | 11 | 1 | 0.4969 | 0.1084 | 0.324 | 0.762 |
| 30 | 9 | 1 | 0.4417 | 0.1095 | 0.2717 | 0.718 |
| 31 | 8 | 1 | 0.3865 | 0.1089 | 0.2225 | 0.671 |
| 33 | 7 | 1 | 0.3313 | 0.1064 | 0.1765 | 0.622 |
| 34 | 6 | 1 | 0.2761 | 0.102 | 0.1338 | 0.569 |
| 43 | 5 | 1 | 0.2208 | 0.0954 | 0.0947 | 0.515 |
| 45 | 4 | 1 | 0.1656 | 0.086 | 0.0598 | 0.458 |
| 48 | 2 | 1 | 0.0828 | 0.0727 | 0.0148 | 0.462 |
The life table summarizes the events and the proportion surviving at each event time point. The columns in the life table have the following interpretation:
- time gives the time points at which events occur.
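- n.risk is the number of subjects at risk immediately before the time point, t. Being "at risk" means that the subject has not had an event before time t, and has not been censored before time t. Subjects censored exactly at time t are still counted as at risk at t; for example, the subject censored at 13 weeks is included in the 17 at risk at that time.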
- n.event is the number of subjects who have events at time t.
- survival is the proportion surviving, as determined using the Kaplan–Meier product-limit estimate.
- std.err is the standard error of the estimated survival. The standard error of the Kaplan–Meier product-limit estimate is calculated using Greenwood's formula, and depends on the number at risk, the number of deaths and the proportion surviving.
- lower 95% CI and upper 95% CI are the lower and upper 95% confidence bounds for the proportion surviving.
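The survival, std.err and confidence-bound columns can be approximately reproduced from the time, n.risk and n.event columns alone. The sketch below applies Greenwood's formula for the standard error. The confidence bounds use a log-transformed interval capped at 1, which is an assumption: the text does not state how the bounds were computed, although this choice agrees with the tabulated values. The names `LIFE_TABLE` and `greenwood_table` are illustrative.

```python
import math

# (time, n.risk, n.event) taken from the life table above.
LIFE_TABLE = [
    (5, 23, 2), (8, 21, 2), (9, 19, 1), (12, 18, 1), (13, 17, 1),
    (18, 14, 1), (23, 13, 2), (27, 11, 1), (30, 9, 1), (31, 8, 1),
    (33, 7, 1), (34, 6, 1), (43, 5, 1), (45, 4, 1), (48, 2, 1),
]


def greenwood_table(rows, z=1.96):
    """Return (time, survival, std_err, lower, upper) for each row.

    Assumes n_event < n_risk in every row, so no term divides by zero.
    """
    out = []
    surv = 1.0
    var_sum = 0.0  # running sum of d / (n * (n - d)) from Greenwood's formula
    for t, n, d in rows:
        surv *= (n - d) / n
        var_sum += d / (n * (n - d))
        std_err = surv * math.sqrt(var_sum)
        # Assumed interval: exp(log(S) +/- z * se(log S)), upper bound capped at 1.
        half_width = z * math.sqrt(var_sum)
        lower = surv * math.exp(-half_width)
        upper = min(1.0, surv * math.exp(half_width))
        out.append((t, surv, std_err, lower, upper))
    return out


table = greenwood_table(LIFE_TABLE)
for t, surv, se, lo, hi in table:
    print(f"{t:>3} {surv:.4f} {se:.4f} {lo:.4f} {hi:.3f}")
```

Working on the log scale keeps the lower bound above zero, and the cap keeps the upper bound from exceeding 1, as in the first row of the table.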