Logistic Regression Analysis

Binary logistic regression analysis is a statistical method, applied mainly to retrospective data, that explores and models the relationship between a random dichotomous variable and one or more random independent variables (continuous or categorical) [68–70].

From: Analysis in Nutrition Research, 2019

Logistic Regression

Julien I.E. Hoffman, in Biostatistics for Medical and Biomedical Practitioners, 2015

Multiple Explanatory Variables

An advantage of logistic regression is that it allows the evaluation of multiple explanatory variables by extension of the basic principles. The general equation is

P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}} = \frac{1}{1 + e^{-(\beta_0 + \sum_i \beta_i X_i)}}.

For example, consider predicting the probability of artificial ventilation from birth weight, gestational age, and maternal age, first separately (Figure 33.4).

Figure 33.4. Individual probability curves for birth weight and gestational age. The curve for maternal age was almost flat and is not shown.

The probability of artificial ventilation is almost 1 for the smallest and youngest neonates. It is possible to interpret each curve as splitting the area into a portion with risk of ventilation and a portion with no risk of ventilation; the two vertical dashed lines show that at a birth weight of 1400 g, the odds of artificial ventilation are about one-quarter the odds of no artificial ventilation.

The statistics of these curves are shown in Table 33.3.

Table 33.3. Probability of artificial ventilation

Birth weight: whole model test
Model        –Log-Likelihood    DF    Chi-Square    Prob > ChiSq
Difference   54.14953           1     108.2991      <0.0001
Full         163.92351
Reduced      218.07303
RSquare (U): 0.2483    Observations (or sum weights): 315

Gestational age: whole model test
Model        –Log-Likelihood    DF    Chi-Square    Prob > ChiSq
Difference   76.10544           1     152.2109      <0.0001
Full         141.96759
Reduced      218.07303
RSquare (U): 0.3490    Observations (or sum weights): 315

Maternal age: whole model test
Model        –Log-Likelihood    DF    Chi-Square    Prob > ChiSq
Difference   4.89681            1     9.793625      0.0018
Full         213.17622
Reduced      218.07303
RSquare (U): 0.0225    Observations (or sum weights): 315

All three explanatory factors are significant on their own, with the best single predictor being gestational age. Although maternal age as a predictor is significant, with P = 0.0018, the RSquare value of 0.0225 shows that maternal age by itself can explain only 2.25% of the variability.

The question is what happens if all three explanatory factors are included in a single equation. Will it provide more information than any of the single regressions? Will it show whether any of the variables are redundant? The results are shown in Table 33.4.

Table 33.4. Regression with three variables

Whole model test
Model        –Log-Likelihood    DF    Chi-Square    Prob > ChiSq
Difference   79.86792           3     159.7358      <0.0001
Full         138.20511
Reduced      218.07303
RSquare (U): 0.3662    Observations (or sum weights): 315    Converged by gradient

Parameter estimates
Term                           Estimate       Std error    Chi-Square    Prob > ChiSq    Lower 95%     Upper 95%
Intercept                      20.4402255     2.622536     60.75         <0.0001         15.5825235    25.9006
Birth weight in grams          −0.0008706     0.0008508    1.05          0.3062          −0.0025389    0.00080
Gestational age at delivery    −0.5993864     0.100158     35.81         <0.0001         −0.8054551    −0.411
Maternal age at delivery       −0.0591915     0.0244469    5.86          0.0155          −0.1085991    −0.0123

(Estimates are for the log odds of 0/1.)

The RSquare has increased slightly to 0.3662 from the highest single value of 0.3490 for gestational age alone. Birth weight is no longer a predictor; its chi-square is not significant (P = 0.3062) and the confidence limits for its coefficient span zero, running from negative to positive. Therefore, omit birth weight and produce a final regression with two variables (Table 33.5).

Table 33.5. Two variable regression

Whole model test
Model        –Log-Likelihood    DF    Chi-Square    Prob > ChiSq
Difference   79.34711           2     158.6942      <0.0001
Full         138.72592
Reduced      218.07303
RSquare (U): 0.3639    Observations (or sum weights): 315

Lack of fit
Source        DF     –Log-Likelihood    Chi-Square    Prob > ChiSq
Lack of fit   258    117.22481          234.4496      0.8509
Saturated     260    21.50111
Fitted        2      138.72592

Parameter estimates
Term                           Estimate       Std error    Chi-Square    Prob > ChiSq
Intercept                      21.5626217     2.4170882    79.58         <0.0001
Gestational age at delivery    −0.6703814     0.0744915    80.99         <0.0001
Maternal age at delivery       −0.0603604     0.0244109    6.11          0.0134

(Estimates are for the log odds of 0/1.)
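Tables 33.4 and 33.5 correspond to a common workflow: fit the full model, examine the Wald chi-squares and confidence limits, and refit without the nonsignificant predictor. A minimal Python sketch of that workflow with statsmodels is given below; the data file and column names are assumptions for illustration, not the book's dataset.

import pandas as pd
import statsmodels.formula.api as smf

# Assumed columns: ventilated (0/1), birth_weight_g, gest_age_wk, maternal_age_yr
df = pd.read_csv("neonates.csv")

# Full three-variable model (cf. Table 33.4)
full = smf.logit("ventilated ~ birth_weight_g + gest_age_wk + maternal_age_yr", data=df).fit(disp=0)
print(full.summary())      # birth weight: large p-value, confidence interval spanning zero

# Refit after dropping the nonsignificant predictor (cf. Table 33.5)
reduced = smf.logit("ventilated ~ gest_age_wk + maternal_age_yr", data=df).fit(disp=0)
print(reduced.summary())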

The final equation is

P = \frac{1}{1 + e^{-(21.5626 - 0.6704\,GA - 0.0604\,MA)}},

where GA is gestational age and MA is maternal age.
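As a quick check, here is a minimal Python sketch that evaluates this fitted equation directly (coefficients rounded as in the text):

import math

def p_ventilation(ga_weeks, ma_years, b0=21.5626, b_ga=-0.6704, b_ma=-0.0604):
    """Probability of artificial ventilation from the fitted two-variable model."""
    logit = b0 + b_ga * ga_weeks + b_ma * ma_years
    return 1.0 / (1.0 + math.exp(-logit))

print(round(p_ventilation(25, 20), 3))   # ≈ 0.973
print(round(p_ventilation(35, 20), 3))   # ≈ 0.043

These values agree, to rounding, with the corresponding entries of Table 33.6 below.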

For gestational ages of 25, 30, and 35 weeks, and maternal ages of 20 and 40 years, the probabilities of needing artificial ventilation calculated from the formula are given in Table 33.6.

Table 33.6. Selected probabilities

Gestational age (weeks) Maternal age (years) Probability of artificial ventilation
25 20 0.9732
25 40 0.9158
30 20 0.5603
30 40 0.2758
35 20 0.0427
35 40 0.0139

In keeping with the coefficients that were determined, maternal age plays a small role in determining the need for artificial ventilation.

Alternatively, write ln[P/(1 − P)] = 21.5626 − 0.6704 GA − 0.0604 MA. The value of 21.5626 for β0 is the intercept, the log odds of artificial ventilation when all explanatory variables are zero. β1, the coefficient for gestational age, is −0.6704; therefore the logit decreases by 0.6704 units for each one-week increase in gestational age if maternal age is held constant.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128023877000330

Ancestry Estimation

Elizabeth A. DiGangi, Joseph T. Hefner, in Research Methods in Human Skeletal Biology, 2013

Logistic Regression

Logistic regression (LR) is a statistical method similar to linear regression in that LR finds an equation that predicts an outcome for a binary variable, Y, from one or more predictor variables, X. However, unlike linear regression, the predictor variables can be categorical or continuous, as the model does not strictly require continuous data. To predict group membership, LR uses the log odds rather than probabilities, and an iterative maximum likelihood method rather than least squares, to fit the final model. This means the researcher has more freedom when using LR, and the method may be more appropriate for nonnormally distributed data or when the samples have unequal covariance matrices. Logistic regression assumes independence among variables, which is not always met in morphoscopic datasets. However, as is often the case, the applicability of the method (and how well it works, e.g., the classification error) often trumps statistical assumptions. One drawback of LR is that the method cannot produce typicality probabilities (useful for forensic casework), but these values may be substituted with nonparametric methods such as ranked probabilities and ranked interindividual similarity measures (Ousley and Hefner, 2005).

Logistic regression analysis can also be carried out in SPSS® using the NOMREG procedure. We suggest a forward stepwise selection procedure. When we ran that analysis on a sample of data collected by JTH (2009), the stepwise LR selected five variables: (1) inferior nasal aperture, (2) interorbital breadth, (3) nasal aperture width, (4) nasal bone structure, and (5) post-bregmatic depression. The likelihood ratio tests (Table 5.7) are significant, indicating that each selected variable contributes significantly to the final LR model. The Cox and Snell pseudo R-squared statistic (not shown) of 0.553 implies that approximately 55% of the variation in morphoscopic trait expression is explained by ancestry. This LR model is accurate for nearly 90% of the individuals in the sample (Table 5.8).

TABLE 5.7. Likelihood Ratio Tests for the Two-Way Logistic Regression

Effect       –2 Log Likelihood of Reduced Model    Chi-Square    df    Sig.
Intercept    183.2665                              0             0
INA          238.2002                              54.9337       5     0.00000
IOB          207.4619                              24.1953       2     0.00001
NAW          193.9302                              10.6637       2     0.00484
NBS          199.0345                              15.7680       4     0.00335
PBD          191.4299                              8.1634        2     0.01688

INA = inferior nasal aperture; IOB = interorbital breadth; NAW = nasal aperture width; NBS = nasal bone structure; PBD = post-bregmatic depression.

TABLE 5.8. Classification Matrix for Two-Way Logistic Regression

Actual group    Predicted Black    Predicted White    % Correct
Black           200                17                 92.17
White           19                 117                86.03
Total                                                 89.80
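The same kind of two-group model and classification matrix can also be produced outside SPSS. The following is a minimal scikit-learn sketch, assuming the five selected traits are stored as numeric scores in a data file; the file name and column labels are hypothetical.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.read_csv("morphoscopic_traits.csv")        # assumed file of ordinal trait scores
X = df[["INA", "IOB", "NAW", "NBS", "PBD"]]        # the five stepwise-selected traits
y = df["group"]                                    # "Black" / "White"

model = LogisticRegression(max_iter=1000).fit(X, y)
pred = model.predict(X)
print(confusion_matrix(y, pred))                   # analogue of the classification matrix in Table 5.8
print(model.score(X, y))                           # overall proportion correctly classified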

Each of these presented methods has advantages and disadvantages and each is suited to a particular task. We present these two statistics not to suggest they are the best or most appropriate methods but to demonstrate the flexibility of statistical methods to handle categorical data and to encourage the reader to explore these and other statistics for use in their own projects.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123851895000054

Core Technologies: Machine Learning and Natural Language Processing

Prakash Nadkarni, in Clinical Research Computing, 2016

4.5.3 SVMs versus logistic regression

Like logistic regression, SVMs can also be generalized to categorical output variables that take more than two values. On the other hand, the kernel trick can also be employed for logistic regression (this is called "kernel logistic regression"). Logistic regression, like linear regression, makes use of all data points, but points far from the margin have much less influence because of the logit transform; so, even though the math is different, logistic regression often ends up giving results similar to SVMs.

As to the choice of SVMs versus logistic regression, it often makes sense to try both. SVMs sometimes give a better fit and are computationally more efficient—logistic regression uses all data points but then the values away from the margin are discounted, while SVM uses only the support-vector data points to begin with. However, SVM is a bit of a "black box" in terms of interpretability. In logistic regression, on the other hand, the contribution of individual variables to the final fit can be better understood, and in back-fitting of the data, the outputs can be directly interpreted as probabilities.
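A minimal scikit-learn sketch of such a side-by-side comparison on synthetic data (the dataset and settings here are illustrative assumptions, not a recommendation):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
logit = LogisticRegression().fit(X, y)
svm = SVC(kernel="linear").fit(X, y)

# The two linear classifiers usually agree on most points...
print((logit.predict(X) == svm.predict(X)).mean())
# ...but only logistic regression directly yields probabilities and interpretable coefficients.
print(logit.predict_proba(X[:3]))
print(logit.coef_)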

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978012803130800004X

Research and Methods

Andrew C. Leon, in Comprehensive Clinical Psychology, 1998

3.12.4.5.3 Logistic regression

Logistic regression analysis is used to examine the association of (categorical or continuous) independent variable(s) with one dichotomous dependent variable. This is in contrast to linear regression analysis in which the dependent variable is a continuous variable. The discussion of logistic regression in this chapter is brief. Hosmer and Lemeshow (1989) provide a comprehensive introduction to logistic regression analysis.

Consider an example in which logistic regression could be used to examine the research question, "Is a history of suicide attempts associated with the risk of a subsequent (i.e., prospectively observed) attempt?" The logistic regression model compares the odds of a prospective attempt in those with and without prior attempts. The ratio of those odds is called the odds ratio. A logistic regression does not analyze the odds, but a natural logarithmic transformation of the odds, the log odds. Although the calculations are more complicated when there are multiple independent variables, computer programs can be used to perform the analyses. However, because of the logarithmic transformation of the odds ratio, the interpretation of results from the computer output is not necessarily straightforward. Interpretation requires a transformation back to the original scale by taking the inverse of the natural log of the regression coefficient, which is called exponentiation. The exponentiated regression coefficient represents the strength of the association of the independent variable with the outcome. More specifically, it represents the increase (or decrease) in risk of the outcome that is associated with the independent variable. The exponentiated regression coefficient represents the difference in risk of the outcome (e.g., suicide attempt) for two subjects who differ by one point on the independent variable. In this case, that is the difference between those with and without history of attempts (i.e., when history of attempts is coded: 0 = no and 1 = yes).
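For instance, a tiny numerical illustration of this exponentiation step (the coefficient value here is hypothetical, not from a real analysis):

import math

b_prior_attempts = 0.8              # hypothetical coefficient for history of attempts (0 = no, 1 = yes)
odds_ratio = math.exp(b_prior_attempts)
print(round(odds_ratio, 2))         # ≈ 2.23: odds of a prospective attempt roughly doubled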

The logistic regression model can be extended to include several independent variables (i.e., hypothesized risk factors). For instance, are history of attempts, severity of depression, and employment status risk factors for suicidal behavior, controlling for diagnosis, age, and gender? Each odds ratio from such a model represents the change in risk of the outcome (i.e., a suicide attempt) that is associated with the independent variable, controlling for the other independent variables.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B0080427073002649

Towards Automatic Risk Analysis for Hereditary Non-Polyposis Colorectal Cancer Based on Pedigree Data

Münevver Köküer, ... Roger Green, in Outcome Prediction in Cancer, 2007

3.3.1. Classification Based on Logistic Regression

Logistic regression is part of a category of statistical models called "generalized linear models" and many of its applications can be found in the medical field. An assessment of clinical findings in HNPCC is given in Wijnen et al. (1998).

Logistic regression is a method for classifying a given input vector x = (x1, x2, …, xD) into one of two classes. It is based on the model that the logarithm of the odds of belonging to one class is a linear function of the feature-vector elements used for classification, i.e.

(4)\quad \ln\!\left(\frac{p}{1-p}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_D x_D,

where p is the probability of belonging to one class, p/(1 − p) is the odds, and α and β1, β2, …, βD are regression coefficients that are to be estimated from the data. The most widely used method to estimate these coefficients is maximum likelihood.

Due to the above-mentioned characteristics of LR, the HNPCC pedigree data are analysed separately for each of the risk classes (high, intermediate, and low) in turn, to predict the probability of belonging to that class; i.e. for each risk class, the other two are combined together.
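In other words, the three-class problem is handled as three separate one-versus-rest binary regressions. A minimal Python sketch of that scheme follows; the labels and features below are illustrative placeholders, not the HNPCC pedigree data.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
risk = np.array(["high", "low", "intermediate", "high", "low", "intermediate"])  # illustrative labels
X = rng.random((len(risk), 4))                                                   # illustrative feature vectors

for target in ("high", "intermediate", "low"):
    y = (risk == target).astype(int)       # current risk class vs the other two combined
    clf = LogisticRegression().fit(X, y)
    p = clf.predict_proba(X)[:, 1]         # estimated probability of belonging to the target class
    print(target, p.round(2))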

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780444528551500143

Selected Statistical Methods in QSAR

Kunal Roy, ... Rudra Narayan Das, in Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment, 2015

6.3.2 Logistic regression

Logistic regression [19] is a statistical classification model that measures the relationship between a categorical dependent variable (having only two categories) and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable. Logistic regression does not assume a linear relationship between the dependent and independent variables. The independent variables need neither be normally distributed, nor linearly related, nor of equal variance within each group.

The form of the logistic regression equation is

(6.32)\quad \operatorname{logit}[p(x)] = \log\!\left[\frac{p(x)}{1-p(x)}\right] = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots

where logit[p(x)] is the log (to base e) of the odds that the dependent variable is 1, and p can only range from 0 to 1. In Eq. (6.32), a is the model's intercept, and x1, …, xk are molecular descriptors with their corresponding regression coefficients b1, …, bk (for molecular descriptors 1 through k). For an unknown compound, LR calculates the probability that the compound shows a certain target property (say, active or inactive); that is, LR estimates the probability of the compound being an active substance. If the calculated p(x) is greater than 0.5 (equivalently, if logit[p(x)] is greater than 0), then it is more probable that the compound is active. Similar to MLR, the regression coefficients in LR describe the influence of a molecular descriptor on the outcome of the prediction. When a coefficient has a large value, the molecular descriptor strongly affects the probability of the outcome, whereas a zero-valued coefficient shows that the molecular descriptor has no influence on the outcome probability. Likewise, the sign of a coefficient matters: a positive coefficient increases the probability of the outcome, while a negative coefficient decreases it.

The regression coefficients are usually estimated using the maximum likelihood (ML) method. In logistic regression, two hypotheses are of interest: the null hypothesis, under which all the coefficients in the regression equation take the value zero, and the alternative hypothesis, under which the model with the predictors currently under consideration is accurate and differs significantly from the null. The likelihood ratio test is based on the significance of the difference between the likelihood ratio statistic for the researcher's model with predictors and that for the baseline model with only a constant in it (this difference is called the model chi square). Significance at the p = 0.05 level or less means the researcher's model with the predictors is significantly different from the one with the constant only (all b coefficients being zero).
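A minimal sketch of this likelihood ratio test in Python, comparing the constant-only model with the full model (the descriptor matrix and activity labels below are simulated placeholders):

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # stand-in for k = 3 molecular descriptors
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)   # stand-in for active/inactive labels

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)     # researcher's model with predictors
null = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0)   # baseline model, constant only

model_chi_square = 2 * (full.llf - null.llf)            # likelihood ratio statistic (model chi square)
p_value = chi2.sf(model_chi_square, df=X.shape[1])
print(model_chi_square, p_value)                        # p <= 0.05: the predictors add significantly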

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128015056000065

Data Mining

John A. Bunge, Dean H. Judson, in Encyclopedia of Social Measurement, 2005

Logistic Regression

Logistic regression is a well-known procedure that can be used for classification. This is a variant of multiple regression in which the response is binary rather than quantitative. In the simplest version, the feature variables are taken to be nonrandom. The response, which is the class, is a binary random variable that takes on the value 1 (for the class of interest) with some probability p, and the value 0 with probability 1 − p. The "success probability" p is a function of the values of the feature variables; specifically, the logarithm of the odds, or "log odds," log[p/(1 − p)], is a linear function of the predictor variables. To use logistic regression for classification, a cutoff value is set, typically 0.5; a case is assigned to class 1 if its estimated or fitted success probability is greater than (or equal to) the cutoff, and it is assigned to class 0 if the estimated probability is less than the cutoff. Because of the nature of the functions involved, this is equivalent to a linear classification boundary, although it is not (necessarily) the same as would be derived from linear discriminant analysis.

Like standard multiple regression, logistic regression carries hypothesis tests for the significance of each variable, along with other tests, estimates, and goodness-of-fit assessments. In the classification setting, the variable significance tests can be used for feature selection: modern computational implementations incorporate several variants of stepwise (iterative) variable selection. Because of the conceptual analogy with ordinary multiple regression and the ease of automated variable selection, logistic classification is probably the most frequently used data mining procedure. Another advantage is that it produces a probability of success, given the values of the feature variables, rather than just a predicted class, which enables sorting the observations by probability of success and setting an arbitrary cutoff for classification, not necessarily 0.5. But wherever the cutoff is set, logistic classification basically entails a linear classification boundary, and this imposes a limit on the potential efficacy of the classifier. Some flexibility can be achieved by introducing transformations (e.g., polynomials) and interactions among the feature variables.
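A minimal scikit-learn sketch of the cutoff idea (the synthetic data and the 0.5 threshold are only illustrative defaults):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=1)
model = LogisticRegression().fit(X, y)

p_success = model.predict_proba(X)[:, 1]          # fitted probability of class 1 for each case
cutoff = 0.5                                      # can be moved to trade one error type against the other
assigned = (p_success >= cutoff).astype(int)      # class assignment from the cutoff rule

ranked = np.argsort(-p_success)                   # cases sorted by fitted probability of success
print(assigned[:10], ranked[:10])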

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B0123693985001596

Statistics, Use in Immunology

Amelia Dale Horne, in Encyclopedia of Immunology (Second Edition), 1998

Advanced methods

Logistic regression

Logistic regression differs from simple or multiple regression in that the response variable is discrete and typically dichotomous rather than continuous, and also in that the distribution of the errors is binomial rather than normal. Thus, this method is adaptable to the study of binary immune response status (e.g. positive or negative). It may be used to assess the association between immune response and one or more explanatory factors, or it may be employed for classifying subjects as does discriminant analysis (see below).

Proportional hazards (Cox) regression

This method, a type of survival or time-to-event analysis, has been used to evaluate the effect of covariates on time to seroconversion. Survival methods are especially useful when individual subjects in a study are not followed for equal periods of time, or when there is censoring (which occurs when an individual is lost to follow-up during the study). In order for this analysis to be informative about serological data, blood samples must be drawn at numerous time points. However, it may not be possible to know precisely when seroconversion actually occurs – only when it is discovered through a laboratory test.

Discriminant analysis

This technique has been used to assess the effects of covariates on seroconversion status in response to vaccination. The basic purpose of discriminant analysis is to classify an individual with specified covariate characteristics into one of two or more population groups. In serological analysis, these groups are usually only two, responders and nonresponders. The linear discriminant function is appropriate when the explanatory variables in each comparison group are approximately multivariately normally distributed with equal variances/covariances. If explanatory factors are qualitative rather than continuous, then logistic regression, which does not require normality, may be preferable.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B0122267656005752

Machine Learning Contribution to Solve Prognostic Medical Problems

Flavio Baronti, ... Antonina Starita, in Outcome Prediction in Cancer, 2007

4.4. Comparison with logistic regression

One of the most common data explanation methods in medicine is logistic regression (LR) (a detailed explanation of the method can be found in Hosmer and Lemeshow, 2000). The logistic regression method builds a probabilistic model of the outcome, obtained through a logit transformation of a linear combination of the input values.

The model is based on the following equation:

(1)\quad f(x) = \frac{1}{1 + e^{-(\alpha + \beta \cdot x)}},

where α is a scalar, β is a vector, and x is the input vector.

Skipping over the mathematical motivations, the usefulness of logistic regression is not only in building a model which can predict unseen data, but chiefly in the interpretation of the coefficients found (the βs in our equation). An important quantity relative to the input variables is the odds ratio; let us see, for instance, what this means for a single dichotomous input variable, xj ∈ {0, 1}.

In this context, the odds ratio is simply calculated as e^βj, and describes how much more likely the outcome is when the input condition is present (xj = 1) than when it is absent (xj = 0). This value is therefore clearly very important in deciding how much an input variable influences the final outcome. Similarly, for a real variable xj ∈ ℝ, the odds ratio e^(cβj) quantifies how much more likely the outcome is for an increase of c units in the variable.

Another advantage of LR models, aside from their strong mathematical and statistical foundations, is their simplicity of setup: there are no parameters to fine-tune as in most ML methods, and they are readily available in standard statistical analysis packages.

4.4.1. Logistic regression results

Applying logistic regression to the HNSCC dataset presents two main difficulties:

1. The heterozygote should be considered equal to either one of the homozygotes. This relationship can either be ignored or manually enforced, since the logistic regression model cannot be instructed to automatically derive it.

2. The dataset must be complete: no missing values are allowed. To achieve this, we removed from the dataset all the rows with at least one missing value. This, however, causes roughly two-thirds of the data to be thrown away. A better strategy is to identify some subsets of the attributes and apply the same procedure only to those subsets; having fewer attributes makes it more likely for an instance to be complete (a minimal sketch of this complete-case approach is given after this list).
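A minimal pandas/statsmodels sketch of the complete-case approach described above; the file name, attribute subset, and outcome coding (0/1) are assumptions for illustration, not the actual HNSCC dataset.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("hnscc.csv")                       # assumed file: genotype columns, packyears, sex, outcome (0/1)
subset = ["packyears", "sex", "nat2", "cyp1a1"]     # hypothetical attribute subset
complete = df[subset + ["outcome"]].dropna()        # keep only rows with no missing values

X = sm.add_constant(pd.get_dummies(complete[subset], drop_first=True).astype(float))
result = sm.Logit(complete["outcome"], X).fit(disp=0)
print(result.summary())                             # Wald tests; odds ratios follow from exp(coefficients)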

The first test was performed with all the attributes: after removing all the instances with missing data, only 120 were left. The only significant variable found was packyears (Wald's test z = 3.99, p = 10⁻⁴), with an odds ratio of 1.04 ± 0.01. These results were consistent even after manually deciding, for each gene, which homozygote the heterozygote should be considered equal to.

The next tests were performed by splitting the genes into two subsets and taking into account only one of the two groups at a time; this means fewer columns, which in turn produces more complete rows. The two tests could in fact use 141 and 302 instances, respectively; however, the only strongly significant variable was again packyears. On the second subset, containing only the nat and cyp genes, sex was also retained as significant (p = 0.003).

The main drawback of logistic regression analysis on this dataset is probably not the need for complete data, but its restriction to linear combinations of the risk factors. For instance, while the allelic variants of two genes may not individually induce any increase in risk, it is possible that their combined effect does. LR cannot detect these nonlinear associations, and is thus a limited tool for exploring the possibly complex interactions between attributes.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978044452855150012X

Biostatistical Basis of Inference in Heart Failure Study

Longjian Liu MD, PHD, MSC (LSHTM), FAHA, in Heart Failure: Epidemiology and Research Methods, 2018

Basic concept of logistic regression

Logistic regression is simply a nonlinear transformation of linear regression. The "logistic" distribution is an S-shaped distribution function. The logit transformation constrains the estimated probabilities to lie between 0 and 1. Fig. 4.19 depicts this distribution.

The logistic function: \sigma(x) = \frac{1}{1 + \exp(-x)}

How does logistic regression work?

Logistic regression is a technique for analyzing problems in which there are one or more independent variables that determine a dependent variable (outcome). In most cases, the dependent variable is a dichotomous variable (in which there are only two possible outcomes).

The goal of logistic regression is to find the best fitting model to describe the relationship between the dichotomous characteristic of interest (dependent variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and their standard errors and significance levels) of a formula to predict a logit transformation of the probability of the presence of the characteristic of interest:

\operatorname{logit}(p) = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_i X_i

where p is the probability of the presence of the characteristic of interest and bi is the regression coefficient for Xi.

Mathematically, logistic regression uses a maximum likelihood estimation procedure rather than the least squares estimation procedure that is used in linear regression.

The logit transformation is defined as the logged odds:

\text{Odds} = \frac{P}{1-P} = \frac{\text{probability of presence of the characteristic}}{\text{probability of absence of the characteristic}}

\operatorname{logit}(P) = \ln\!\left[\frac{P}{1-P}\right] = b_0 + b_1 X_1 + \cdots + b_i X_i

where P is the probability that the event Y occurs, P = P(Y = 1); P/(1 − P) is the "odds ratio"; and ln[P/(1 − P)] is the log odds ratio, or "logit."

The equation may also be inverted to give an expression for the probability P as

P(x) = \frac{1}{1 + \exp[-(b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_i X_i)]}

Odds ratio (OR) = exp(b)

Note that the logistic regression formula looks a little complicated. However, the calculations can be done quickly by computer software, such as SAS.
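A minimal Python sketch of these formulas, with hypothetical coefficient values purely for illustration:

import math

b0, b1 = -2.0, 0.8                        # hypothetical intercept and coefficient for a single predictor X1
x1 = 1.5

logit_p = b0 + b1 * x1                    # logit(p) = b0 + b1*X1
p = 1.0 / (1.0 + math.exp(-logit_p))      # invert the logit to obtain the probability
odds = p / (1.0 - p)                      # odds = P / (1 - P)
odds_ratio = math.exp(b1)                 # OR = exp(b) for a one-unit increase in X1

print(round(p, 3), round(odds, 3), round(odds_ratio, 3))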

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780323485586000049