In statistics, a likelihood function (often simply the likelihood) is a function of the parameters of a statistical model. The likelihood of a set of parameter values, θ, given outcomes x, is equal to the probability of those observed outcomes given those parameter values, that is $\mathcal{L}(\theta \mid x) = P(x \mid \theta)$.
Likelihood functions play a key role in statistical inference, especially methods of estimating a parameter from a set of statistics. In informal contexts, "likelihood" is often used as a synonym for "probability." But in statistical usage, a distinction is made depending on the roles of the outcome or parameter. Probability is used when describing a function of the outcome given a fixed parameter value. For example, if a coin is flipped 10 times and it is a fair coin, what is the probability of it landing heads-up every time? Likelihood is used when describing a function of a parameter given an outcome. For example, if a coin is flipped 10 times and it has landed heads-up 10 times, what is the likelihood that the coin is fair?
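The coin questions above can be made concrete with a short sketch. It assumes independent flips and a binomial model; the function names are illustrative and not part of the article.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of k heads in n independent flips with heads probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability: the parameter is fixed (fair coin) and the outcome is the variable.
prob_ten_heads = binomial_pmf(10, 10, 0.5)
print(f"P(10 heads | p = 0.5) = {prob_ten_heads:.6f}")

# Likelihood: the outcome is fixed (10 heads in 10 flips) and the parameter varies.
def likelihood(p):
    return binomial_pmf(10, 10, p)

for p in (0.5, 0.7, 0.9, 1.0):
    print(f"L(p = {p} | 10 heads) = {likelihood(p):.6f}")
```

The same formula is evaluated in both cases; only the argument that is allowed to vary changes.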
Definition
The likelihood function is defined differently for discrete and continuous probability distributions.
Discrete probability distribution
Let X be a random variable with a discrete probability distribution p depending on a parameter θ. Then the function
 $\mathcal{L}(\theta \mid x) = p_\theta(x) = P_\theta(X=x),$
considered as a function of θ, is called the likelihood function (of θ, given the outcome x of X). Sometimes the probability of the value x of X for the parameter value θ is written as $P(X=x \mid \theta)$, but it is often written as $P(X=x;\theta)$ to emphasize that this value is not a conditional probability, because θ is a parameter and not a random variable.
Continuous probability distribution
Let X be a random variable with a continuous probability distribution with density function f depending on a parameter θ. Then the function
 $\mathcal{L}(\theta \mid x) = f_{\theta}(x),$
considered as a function of θ, is called the likelihood function (of θ, given the outcome x of X). Sometimes the density function for the value x of X for the parameter value θ is written as $f(x \mid \theta)$, but should not be considered as a conditional probability density.
The actual value of a likelihood function bears no meaning. Its use lies in comparing one value with another. For example, one value of the parameter may be more likely than another, given the outcome of the sample. Or a specific value will be most likely: the maximum likelihood estimate. Comparison may also be performed in considering the quotient of two likelihood values. That is why $\mathcal{L}(\theta \mid x)$ is generally permitted to be any positive multiple of the above defined function $\mathcal{L}$. More precisely, then, a likelihood function is any representative from an equivalence class of functions,
 $\mathcal{L} \in \left\lbrace \alpha \; P_\theta : \alpha > 0 \right\rbrace,$
where the constant of proportionality α > 0 is not permitted to depend upon θ, and is required to be the same for all likelihood functions used in any one comparison. In particular, the numerical value $\mathcal{L}(\theta \mid x)$ alone is immaterial; all that matters are maximum values of $\mathcal{L}$, or likelihood ratios, such as those of the form
 $\frac{\mathcal{L}(\theta_2 \mid x)}{\mathcal{L}(\theta_1 \mid x)} = \frac{\alpha P(X=x \mid \theta_2)}{\alpha P(X=x \mid \theta_1)} = \frac{P(X=x \mid \theta_2)}{P(X=x \mid \theta_1)},$
that are invariant with respect to the constant of proportionality α.
For more about making inferences via likelihood functions, see also the method of maximum likelihood, and likelihood-ratio testing.
Log-likelihood
For many applications involving likelihood functions, it is more convenient to work with the natural logarithm of the likelihood function, called the log-likelihood, than it is to work with the likelihood function itself. Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques. Finding the maximum of a function often involves taking the derivative of a function and solving for the parameter being maximized, and this is often easier when the function being maximized is a log-likelihood rather than the original likelihood function.
For example, some likelihood functions are for the parameters that explain a collection of statistically independent observations. In such a situation, the likelihood function factors into a product of individual likelihood functions. The logarithm of this product is a sum of individual logarithms, and the derivative of a sum of terms is often easier to compute than the derivative of a product. In addition, several common distributions have likelihood functions that contain products of factors involving exponentiation. The logarithm of such a function is a sum of products, again easier to differentiate than the original function.
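As a rough illustration of the product-to-sum step, the sketch below uses an exponential distribution with an unknown rate. The distribution, the data values and the search grid are assumptions chosen for brevity. It checks that the likelihood and the log-likelihood peak at the same grid point.

```python
import math

def log_density(x, rate):
    """Log of the exponential density rate * exp(-rate * x)."""
    return math.log(rate) - rate * x

def log_likelihood(rate, data):
    """Joint log-likelihood of independent observations: a sum of terms."""
    return sum(log_density(x, rate) for x in data)

def likelihood(rate, data):
    """Joint likelihood of independent observations: a product of densities."""
    return math.prod(rate * math.exp(-rate * x) for x in data)

data = [0.5, 1.2, 0.3, 2.0]              # hypothetical observations
grid = [i / 100 for i in range(1, 501)]  # candidate rates 0.01 .. 5.00

best_by_likelihood = max(grid, key=lambda r: likelihood(r, data))
best_by_log_likelihood = max(grid, key=lambda r: log_likelihood(r, data))
print(best_by_likelihood, best_by_log_likelihood)
```

Because the logarithm is monotonically increasing, both searches select the same rate.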
A. W. F. Edwards referred to the log-likelihood ratio as the support, and the log-likelihood function as the support function.^{[1]} However, there is potential for confusion with the mathematical meaning of 'support', and this terminology is not widely used outside Edwards' main applied field of phylogenetics.
Example: the gamma distribution
As an example, consider the gamma distribution, which has two parameters, α and β. The likelihood function is
 $\mathcal{L}(\alpha, \beta \mid x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$.
Suppose we wish to find the maximum likelihood estimate of β for a single observed value x. This function looks rather daunting. Its logarithm, however, is much simpler to work with:
 $\log \mathcal{L}(\alpha, \beta \mid x) = \alpha \log \beta - \log \Gamma(\alpha) + (\alpha-1) \log x - \beta x.$
Maximizing the log-likelihood first requires taking the partial derivative with respect to β:
 $\frac{\partial \log \mathcal{L}(\alpha, \beta \mid x)}{\partial \beta} = \frac{\alpha}{\beta} - x$.
If there are a number of independent random samples x_{1}, ..., x_{n}, then the joint log-likelihood will be the sum of individual log-likelihoods, and the derivative of this sum will be a sum of derivatives of each individual log-likelihood:
 $\frac{\partial \log \mathcal{L}(\alpha, \beta \mid x_1, \ldots, x_n)}{\partial \beta} = \frac{\partial \log \mathcal{L}(\alpha, \beta \mid x_1)}{\partial \beta} + \cdots + \frac{\partial \log \mathcal{L}(\alpha, \beta \mid x_n)}{\partial \beta} = \frac{n \alpha}{\beta} - \sum_{i=1}^n x_i.$
To complete the maximization procedure for the joint log-likelihood, the equation is set to zero and solved for β:
 $\hat\beta = \frac{\alpha}{\bar{x}}.$
Here $\hat\beta$ denotes the maximum-likelihood estimate, and $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is the sample mean of the observations.
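The closed-form estimate can be checked numerically. The following is a minimal sketch that treats the shape α as known and uses made-up observations; the function names are illustrative.

```python
import math

def gamma_log_likelihood(alpha, beta, data):
    """Joint log-likelihood of independent gamma observations (shape alpha, rate beta)."""
    n = len(data)
    return (n * alpha * math.log(beta)
            - n * math.lgamma(alpha)
            + (alpha - 1) * sum(math.log(x) for x in data)
            - beta * sum(data))

def score_beta(alpha, beta, data):
    """Partial derivative of the joint log-likelihood with respect to beta."""
    return len(data) * alpha / beta - sum(data)

alpha = 2.0                        # shape, assumed known for this sketch
data = [1.1, 0.7, 2.3, 1.6, 0.9]   # hypothetical observations
x_bar = sum(data) / len(data)
beta_hat = alpha / x_bar           # closed-form maximum likelihood estimate

print("beta_hat =", beta_hat)
print("score at beta_hat =", score_beta(alpha, beta_hat, data))
```

The score is zero (up to rounding) at the estimate, and the log-likelihood is lower at nearby values of β on either side.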
Likelihood function of a parameterized model
Among many applications, we consider here one of broad theoretical and practical importance. Given a parameterized family of probability density functions (or probability mass functions in the case of discrete distributions)
 $x \mapsto f(x \mid \theta),$
where θ is the parameter, the likelihood function is
 $\theta \mapsto f(x \mid \theta),$
written
 $\mathcal{L}(\theta \mid x) = f(x \mid \theta),$
where x is the observed outcome of an experiment. In other words, when f(x | θ) is viewed as a function of x with θ fixed, it is a probability density function, and when viewed as a function of θ with x fixed, it is a likelihood function.
This is not the same as the probability that those parameters are the right ones, given the observed sample. Attempting to interpret the likelihood of a hypothesis given observed evidence as the probability of the hypothesis is a common error, with potentially disastrous consequences in medicine, engineering or jurisprudence. See prosecutor's fallacy for an example of this.
From a geometric standpoint, if we consider f(x, θ) as a function of two variables then the family of probability distributions can be viewed as a family of curves parallel to the x-axis, while the family of likelihood functions are the orthogonal curves parallel to the θ-axis.
Likelihoods for continuous distributions
The use of the probability density instead of a probability in specifying the likelihood function above may be justified in a simple way. Suppose that, instead of an exact observation, x, the observation is the value in a short interval (x_{j−1}, x_{j}), with length Δ_{j}, where the subscripts refer to a predefined set of intervals. Then the probability of getting this observation (of being in interval j) is approximately
 $\mathcal{L}_\text{approx}(\theta \mid x \text{ in interval } j) = f(x_{*} \mid \theta) \Delta_j,$
where x_{*} can be any point in interval j. Then, recalling that the likelihood function is defined up to a multiplicative constant, it is just as valid to say that the likelihood function is approximately
 $\mathcal{L}_\text{approx}(\theta \mid x \text{ in interval } j) = f(x_{*} \mid \theta),$
and then, on considering the lengths of the intervals to decrease to zero,
 $\mathcal{L}(\theta \mid x) = f(x \mid \theta).$
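The limiting argument can be illustrated numerically. This sketch assumes a normal distribution and an interval centred on the observation; both choices are for illustration only, since the text allows any point in the interval. It shows the interval probability divided by the interval length approaching the density as the length shrinks.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def interval_probability_per_length(x, delta, mu, sigma):
    """P(x - delta/2 < X < x + delta/2) divided by the interval length delta."""
    prob = normal_cdf(x + delta / 2, mu, sigma) - normal_cdf(x - delta / 2, mu, sigma)
    return prob / delta

x, mu, sigma = 0.3, 0.0, 1.0   # hypothetical observation and parameter values
for delta in (1.0, 0.1, 0.01, 0.001):
    print(delta, interval_probability_per_length(x, delta, mu, sigma))
print("density:", normal_pdf(x, mu, sigma))
```

Dividing by the interval length is allowed because that length does not depend on the parameter, so it only rescales the likelihood.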
Likelihoods for mixed continuous–discrete distributions
The above can be extended in a simple way to allow consideration of distributions which contain both discrete and continuous components. Suppose that the distribution consists of a number of discrete probability masses p_{k}(θ) and a density f(x | θ), where the sum of all the p's added to the integral of f is always one. Assuming that it is possible to distinguish an observation corresponding to one of the discrete probability masses from one which corresponds to the density component, the likelihood function for an observation from the continuous component can be dealt with as above by setting the interval length short enough to exclude any of the discrete masses. For an observation from the discrete component, the probability can either be written down directly or treated within the above context by saying that the probability of getting an observation in an interval that does contain a discrete component (of being in interval j which contains discrete component k) is approximately
 $\mathcal{L}_\text{approx}(\theta \mid x \text{ in interval } j \text{ containing discrete mass } k) = p_k(\theta) + f(x_{*} \mid \theta) \Delta_j,$
where $x_{*}$ can be any point in interval j. Then, on considering the lengths of the intervals to decrease to zero, the likelihood function for an observation from the discrete component is
 $\mathcal{L}(\theta \mid x) = p_k(\theta),$
where k is the index of the discrete probability mass corresponding to observation x.
The fact that the likelihood function can be defined in a way that includes contributions that are not commensurate (the density and the probability mass) arises from the way in which the likelihood function is defined up to a constant of proportionality, where this "constant" can change with the observation x, but not with the parameter θ.
Example 1
Let $p_\text{H}$ be the probability that a certain coin lands heads up (H) when tossed. So, the probability of getting two heads in two tosses (HH) is $p_\text{H}^2$. If $p_\text{H} = 0.5$, then the probability of seeing two heads is 0.25.
 $P(\text{HH} \mid p_\text{H}=0.5) = 0.25.$
Another way of saying this is that the likelihood that $p_\text{H} = 0.5$, given the observation HH, is 0.25, that is
 $\mathcal{L}(p_\text{H}=0.5 \mid \text{HH}) = P(\text{HH} \mid p_\text{H}=0.5) = 0.25.$
But this is not the same as saying that the probability that $p_\text{H} = 0.5$, given the observation HH, is 0.25. The likelihood that $p_\text{H} = 1$, given the observation HH, is 1, but it is not true that the probability that $p_\text{H} = 1$, given the observation HH, is 1. Two heads in a row does not prove that the coin always comes up heads, because two heads in a row is possible for any $p_\text{H} > 0$.
The likelihood function is not a probability density function. The integral of a likelihood function is not in general 1. In this example, the integral of the likelihood over the interval [0, 1] in $p_\text{H}$ is 1/3, demonstrating that the likelihood function cannot be interpreted as a probability density function for $p_\text{H}$.
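The value 1/3 can be confirmed with a simple numerical integration. The midpoint rule and the step count below are arbitrary choices for this sketch.

```python
def likelihood_hh(p_heads):
    """Likelihood of heads probability p_heads given the observation HH."""
    return p_heads ** 2

def midpoint_integral(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

area = midpoint_integral(likelihood_hh, 0.0, 1.0)
print("L(0.5 | HH) =", likelihood_hh(0.5))
print("integral over [0, 1] =", area)
```

Since the area is not 1, the likelihood would have to be rescaled before it could serve as a density for the heads probability.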
Example 2
Consider a jar containing N lottery tickets numbered from 1 through N. If you pick a ticket randomly then you get positive integer n, with probability 1/N if n ≤ N and with probability zero if n > N. This can be written
 $P(n \mid N) = \frac{[n \le N]}{N}$
where the Iverson bracket [n ≤ N] is 1 when n ≤ N and 0 otherwise.
When considered a function of n for fixed N this is the probability distribution, but when considered a function of N for fixed n this is a likelihood function. The maximum likelihood estimate for N is N_{0} = n (by contrast, the unbiased estimate is 2n − 1).
This likelihood function is not a probability distribution, because the total
 $\sum_{N=1}^\infty P(n \mid N) = \sum_{N} \frac{[N \ge n]}{N} = \sum_{N=n}^\infty \frac{1}{N}$
is a divergent series.
Suppose, however, that you pick two tickets rather than one.
The probability of the outcome {n_{1}, n_{2}}, where n_{1} < n_{2}, is
 $P(\{n_1, n_2\} \mid N) = \frac{[n_2 \le N]}{\binom N 2}.$
When considered a function of N for fixed n_{2}, this is a likelihood function. The maximum likelihood estimate for N is N_{0} = n_{2}.
This time the total
 $\sum_{N=1}^\infty P(\{n_1, n_2\} \mid N) = \sum_{N} \frac{[N \ge n_2]}{\binom N 2} = \frac{2}{n_2 - 1}$
is a convergent series, and so this likelihood function can be normalized into a probability distribution.
If you pick 3 or more tickets, the likelihood function has a well defined mean value, which is larger than the maximum likelihood estimate. If you pick 4 or more tickets, the likelihood function has a well defined standard deviation too.
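The ticket example lends itself to a quick numerical check. In this sketch the ticket numbers and the truncation point of the infinite sum are arbitrary assumptions.

```python
from math import comb

def likelihood_one_ticket(N, n):
    """P(n | N) = [n <= N] / N for a single drawn ticket n."""
    return 1 / N if n <= N else 0.0

def likelihood_two_tickets(N, n2):
    """P({n1, n2} | N) = [n2 <= N] / C(N, 2), where n2 is the larger ticket."""
    return 1 / comb(N, 2) if n2 <= N else 0.0

n = 5  # hypothetical single ticket
mle_one = max(range(1, 50), key=lambda N: likelihood_one_ticket(N, n))

n2 = 7  # hypothetical larger ticket of a pair
mle_two = max(range(1, 50), key=lambda N: likelihood_two_tickets(N, n2))

# Truncated version of the infinite sum over N for the two-ticket likelihood.
partial_sum = sum(likelihood_two_tickets(N, n2) for N in range(1, 100_001))

print("MLE with one ticket:", mle_one)
print("MLE with two tickets:", mle_two)
print("partial sum:", partial_sum, "limit:", 2 / (n2 - 1))
```

The partial sums for two tickets settle near 2/(n2 − 1), whereas the one-ticket sums grow like a harmonic series and never settle.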
Relative likelihood
Relative likelihood function
Suppose that the maximum likelihood estimate for θ is $\hat\theta$. Relative plausibilities of other θ values may be found by comparing the likelihood of those other values with the likelihood of $\hat\theta$. The relative likelihood of θ is defined^{[2]}^{[3]} as $\mathcal{L}(\theta \mid x)/\mathcal{L}(\hat\theta \mid x).$
A 10% likelihood region for θ is
 $\{\theta : \mathcal{L}(\theta \mid x)/\mathcal{L}(\hat\theta \mid x) \ge 0.10\},$
and more generally, a p% likelihood region for θ is defined^{[2]}^{[3]} to be
 $\{\theta : \mathcal{L}(\theta \mid x)/\mathcal{L}(\hat\theta \mid x) \ge p/100\}.$
If θ is a single real parameter, a p% likelihood region will typically comprise an interval of real values. In that case, the region is called a likelihood interval.^{[2]}^{[3]}^{[4]}
Likelihood intervals can be compared to confidence intervals. If θ is a single real parameter, then under certain conditions, a 14.7% likelihood interval for θ will be the same as a 95% confidence interval.^{[2]} In a slightly different formulation suited to the use of log-likelihoods, the e^{−2} likelihood interval is the same as the 0.954 confidence interval (under certain conditions).^{[4]}
The idea of basing an interval estimate on the relative likelihood goes back to Fisher in 1956 and has been used by many authors since then.^{[4]} If a likelihood interval is specifically to be interpreted as a confidence interval, then this idea is immediately related to the likelihood ratio test which can be used to define appropriate intervals for parameters. This approach can be used to define the critical points for the likelihood ratio statistic to achieve the required coverage level for a confidence interval. However a likelihood interval can be used as such, having been determined in a well-defined way, without claiming any particular coverage probability.
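The figures 14.7% and 0.954 can be reproduced under one common reading of "certain conditions": the log relative likelihood is taken to be −z²/2 for a standard normal z. That reading is an assumption of this sketch, not a statement from the cited sources.

```python
import math

# Two-sided 95% quantile of the standard normal distribution (tabulated value).
z_95 = 1.959964

# Relative likelihood cut-off that matches a 95% interval under the assumption above.
cutoff_95 = math.exp(-z_95 ** 2 / 2)

# A cut-off of e^{-2} corresponds to |z| <= 2; its coverage is P(|Z| <= 2).
coverage_e2 = math.erf(2 / math.sqrt(2))

print(f"cut-off for 95% coverage: {cutoff_95:.4f}")
print(f"coverage for cut-off exp(-2): {coverage_e2:.4f}")
```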
Relative likelihood of models
The definition of relative likelihood can also be generalized to compare different (fitted) statistical models. This generalization is based on Akaike information criterion, or more usually, AICc (Akaike Information Criterion with correction). Suppose that, for some dataset, we have two statistical models, M_{1} and M_{2}, with fixed parameters. Also suppose that AICc(M_{1}) ≤ AICc(M_{2}). Then the relative likelihood of M_{2} with respect to M_{1} is defined^{[5]} to be
 exp((AICc(M_{1})−AICc(M_{2}))/2)
To see that this is a generalization of the earlier definition, suppose that we have some model M with a (possibly multivariate) parameter θ. Then for any θ, set M_{2} = M(θ), and also set M_{1} = M($\hat\theta$). The general definition now gives the same result as the earlier definition.
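The model-level definition is a one-line computation. The sketch below assumes the AICc values have already been obtained elsewhere; the function name and the numbers are illustrative.

```python
import math

def relative_likelihood(aicc_best, aicc_other):
    """Relative likelihood of the other model with respect to the lower-AICc model."""
    if aicc_best > aicc_other:
        raise ValueError("aicc_best must not exceed aicc_other")
    return math.exp((aicc_best - aicc_other) / 2)

# Hypothetical AICc values for two fitted models.
print(relative_likelihood(100.0, 102.0))
print(relative_likelihood(100.0, 110.0))
```

The result lies in (0, 1], and equals 1 when the two models have the same AICc.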
Likelihoods that eliminate nuisance parameters
In many cases, the likelihood is a function of more than one parameter but interest focuses on the estimation of only one, or at most a few of them, with the others being considered as nuisance parameters. Several alternative approaches have been developed to eliminate such nuisance parameters so that a likelihood can be written as a function of only the parameter (or parameters) of interest; the main approaches being marginal, conditional and profile likelihoods.^{[6]}^{[7]}
These approaches are useful because standard likelihood methods can become unreliable or fail entirely when there are many nuisance parameters or when the nuisance parameters are high-dimensional. This is particularly true when the nuisance parameters can be considered to be "missing data"; they represent a non-negligible fraction of the number of observations and this fraction does not decrease when the sample size increases. Often these approaches can be used to derive closed-form formulae for statistical tests when direct use of maximum likelihood requires iterative numerical methods. These approaches find application in some specialized topics such as sequential analysis.
Conditional likelihood
Sometimes it is possible to find a sufficient statistic for the nuisance parameters, and conditioning on this statistic results in a likelihood which does not depend on the nuisance parameters.
One example occurs in 2×2 tables, where conditioning on all four marginal totals leads to a conditional likelihood based on the noncentral hypergeometric distribution. This form of conditioning is also the basis for Fisher's exact test.
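A minimal sketch of the 2×2 case follows. It assumes the parameter of interest is the odds ratio (called psi here) and uses the standard form of Fisher's noncentral hypergeometric distribution. The function name, argument names and table counts are illustrative assumptions.

```python
from math import comb

def conditional_likelihood(psi, x, m1, m2, k):
    """P(X = x | margins; psi) for a 2x2 table.

    m1, m2: row totals; k: total of the first column;
    x: count in the top-left cell; psi: odds ratio.
    """
    lo, hi = max(0, k - m2), min(m1, k)
    if not lo <= x <= hi:
        return 0.0

    def weight(u):
        return comb(m1, u) * comb(m2, k - u) * psi ** u

    return weight(x) / sum(weight(u) for u in range(lo, hi + 1))

# Hypothetical table: 7 of 10 successes in row 1 and 3 of 10 in row 2.
for psi in (1.0, 2.0, 5.0, 10.0):
    print(psi, conditional_likelihood(psi, 7, 10, 10, 10))
```

At psi = 1 the expression reduces to the ordinary hypergeometric probability, which is the null distribution used by Fisher's exact test.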
Marginal likelihood
Sometimes we can remove the nuisance parameters by considering a likelihood based on only part of the information in the data, for example by using the set of ranks rather than the numerical values. Another example occurs in linear mixed models, where considering a likelihood for the residuals only after fitting the fixed effects leads to residual maximum likelihood estimation of the variance components.
Profile likelihood
It is often possible to write some parameters as functions of other parameters, thereby reducing the number of independent parameters.
(The function is the parameter value which maximizes the likelihood given the value of the other parameters.)
This procedure is called concentration of the parameters and results in the concentrated likelihood function, also occasionally known as the maximized likelihood function, but most often called the profile likelihood function.
For example, consider a regression analysis model with normally distributed errors. The most likely value of the error variance is the variance of the residuals. The residuals depend on all other parameters. Hence the variance parameter can be written as a function of the other parameters.
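The variance example can be sketched for the simplest regression, an intercept-only model with normal errors. The model choice, the data and the function names are assumptions made to keep the example short.

```python
import math

def normal_log_likelihood(mu, sigma2, data):
    """Log-likelihood of independent normal observations with mean mu and variance sigma2."""
    n = len(data)
    rss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

def profiled_variance(mu, data):
    """Variance that maximizes the likelihood for a given mean: the mean squared residual."""
    return sum((x - mu) ** 2 for x in data) / len(data)

def profile_log_likelihood(mu, data):
    """Log-likelihood with the variance replaced by its maximizing value for this mu."""
    return normal_log_likelihood(mu, profiled_variance(mu, data), data)

data = [2.1, 1.9, 2.4, 2.0, 1.6]   # hypothetical observations
for mu in (1.8, 2.0, 2.2):
    print(mu, profile_log_likelihood(mu, data))
```

The profile log-likelihood depends on the mean alone, because the variance has been written as a function of the remaining parameter.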
Unlike conditional and marginal likelihoods, profile likelihood methods can always be used, even when the profile likelihood cannot be written down explicitly. However, the profile likelihood is not a true likelihood, as it is not based directly on a probability distribution, and this leads to some less satisfactory properties. Attempts have been made to improve this, resulting in modified profile likelihood.
The idea of profile likelihood can also be used to compute confidence intervals that often have better small-sample properties than those based on asymptotic standard errors calculated from the full likelihood. In the case of parameter estimation in partially observed systems, the profile likelihood can be also used for identifiability analysis.^{[8]}
Results from profile likelihood analysis can be incorporated in uncertainty analysis of model predictions.^{[9]}
An implementation is available in the MATLAB Toolbox PottersWheel.
Partial likelihood
A partial likelihood is a factor component of the likelihood function that isolates the parameters of interest.^{[10]} It is a key component of the proportional hazards model.
In English, "likelihood" has been distinguished as being related to but weaker than "probability" since its earliest uses. The comparison of hypotheses by evaluating likelihoods has been used for centuries, for example by John Milton in Areopagitica: "when greatest likelihoods are brought that such things are truly and really in those persons to whom they are ascribed".
In the Netherlands Christiaan Huygens used the concept of likelihood in his book "Van rekeningh in spelen van geluck" ("On Reasoning in Games of Chance") in 1657.
In Danish, "likelihood" was used by Thorvald N. Thiele in 1889.^{[11]}^{[12]}^{[13]}
In English, "likelihood" appears in many writings by Charles Sanders Peirce, where model-based inference (usually abduction but sometimes including induction) is distinguished from statistical procedures based on objective randomization. Peirce's preference for randomization-based inference is discussed in "Illustrations of the Logic of Science" (1877–1878) and "A Theory of Probable Inference" (1883).
"probabilities that are strictly objective and at the same time very great, although they can never be absolutely conclusive, ought nevertheless to influence our preference for one hypothesis over another; but slight probabilities, even if objective, are not worth consideration; and merely subjective likelihoods should be disregarded altogether. For they are merely expressions of our preconceived notions" (7.227 in his Collected Papers).
"But experience must be our chart in economical navigation; and experience shows that likelihoods are treacherous guides. Nothing has caused so much waste of time and means, in all sorts of researchers, as inquirers' becoming so wedded to certain likelihoods as to forget all the other factors of the economy of research; so that, unless it be very solidly grounded, likelihood is far better disregarded, or nearly so; and even when it seems solidly grounded, it should be proceeded upon with a cautious tread, with an eye to other considerations, and recollection of the disasters caused." (Essential Peirce, volume 2, pages 108–109)
Like Thiele, Peirce considers the likelihood for a binomial distribution. Peirce uses the logarithm of the odds ratio throughout his career. Peirce's propensity for using the log odds is discussed by Stephen Stigler.
In Great Britain, "likelihood" was popularized in mathematical statistics by R.A. Fisher in 1922:^{[14]} "On the mathematical foundations of theoretical statistics". In that paper, Fisher also uses the term "method of maximum likelihood". Fisher argues against inverse probability as a basis for statistical inferences, and instead proposes inferences based on likelihood functions. Fisher's use of "likelihood" fixed the terminology that is used by statisticians throughout the world.
See also
Notes
References
  jstor = 2958222
  jstor = 2344804


  jstor = 2676741

External links
 Likelihood function at Planetmath