In this section we discuss correlation analysis, a technique used to quantify the association between two continuous variables. For example, we might want to quantify the association between body mass index and systolic blood pressure, or between hours of exercise per week and percent body fat. Regression analysis is a related technique for assessing the relationship between an outcome variable and one or more risk factors or confounding variables (confounding is discussed later). The outcome variable is also called the response or dependent variable, and the risk factors and confounders are called the predictors, or explanatory or independent variables. In regression analysis, the dependent variable is denoted "Y" and the independent variables are denoted "X".

NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict beyond the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a situation in which inferences should be limited to identifying associations. The terms "independent" and "dependent" variable are less subject to these interpretations, as they do not strongly imply cause and effect.

Learning Objectives

After completing this module, the student will be able to:

- Define and provide examples of dependent and independent variables in a study of a public health problem
- Compute and interpret a correlation coefficient
- Compute and interpret coefficients in a linear regression analysis


Correlation Analysis

In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson product-moment correlation coefficient. The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other).

The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association.

For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.

It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient does not detect this. Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical displays are particularly useful for exploring associations between variables.
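The point about non-linear associations can be illustrated with a minimal sketch. The data below are hypothetical (not from this module): y is completely determined by x through a U-shaped curve, yet the correlation coefficient comes out at zero because there is no linear trend.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect, but curved, relationship (y = x squared).
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_curved = pearson_r(x, y)
print(f"r for the U-shaped data: {r_curved:.2f}")
```

A scatter plot of these points would reveal the pattern immediately, which is why plotting the data first is recommended.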

The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.

[Figure: four hypothetical scatter plots, Scenarios 1 to 4]

- Scenario 1 depicts a strong positive association (r = 0.9), similar to what we might see for the correlation between infant birth weight and birth length.
- Scenario 2 depicts a weaker association (r = 0.2) that we might expect to see between age and body mass index (which tends to increase with age).
- Scenario 3 might depict the lack of association (r approximately 0) between the extent of media exposure in adolescence and the age at which adolescents initiate sexual activity.
- Scenario 4 might depict the strong negative association (r = -0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.

Example - Correlation of Gestational Age and Birth Weight

A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

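| Infant ID # | Gestational age (weeks) | Birth weight (grams) |
|---|---|---|
| 1 | 34.7 | 1895 |
| 2 | 36.0 | 2030 |
| 3 | 29.3 | 1440 |
| 4 | 40.1 | 2835 |
| 5 | 35.7 | 3090 |
| 6 | 42.4 | 3827 |
| 7 | 40.3 | 3260 |
| 8 | 37.3 | 2690 |
| 9 | 40.9 | 3285 |
| 10 | 38.3 | 2920 |
| 11 | 38.5 | 3430 |
| 12 | 41.4 | 3657 |
| 13 | 39.7 | 3685 |
| 14 | 39.7 | 3345 |
| 15 | 41.1 | 3260 |
| 16 | 38.0 | 2680 |
| 17 | 38.7 | 2005 |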

We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus y = birth weight and x = gestational age. The data are displayed in a scatter diagram in the figure below.

[Figure: scatter diagram of birth weight (grams) against gestational age (weeks)]

Each point represents an (x, y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that the independent variable (gestational age) is on the horizontal axis (or X-axis), and the dependent variable (birth weight) is on the vertical axis (or Y-axis). The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.

Computing the Correlation Coefficient

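The formula for the sample correlation coefficient is:

$$ r = \frac{\mathrm{Cov}(x,y)}{\sqrt{s_x^2 \, s_y^2}} $$

where Cov(x,y) is the covariance of x and y, defined as

$$ \mathrm{Cov}(x,y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} $$

and $s_x^2$ and $s_y^2$ are the sample variances of x and y, defined as follows:

$$ s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \quad \text{and} \quad s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} $$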

The variances of x and y measure the variability of the x scores and y scores around their respective sample means, considered separately. The covariance measures the variability of the (x, y) pairs around the mean of x and the mean of y, considered simultaneously.
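The definitions above translate almost line for line into code. The following is a minimal sketch using only the Python standard library; the function names and the small demonstration data are illustrative assumptions, not part of the module.

```python
import math

def sample_variance(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def sample_covariance(x, y):
    """Sample covariance: products of paired deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Sample correlation coefficient r = Cov(x, y) / sqrt(s_x^2 * s_y^2)."""
    return sample_covariance(x, y) / math.sqrt(sample_variance(x) * sample_variance(y))

# Hypothetical data: a perfectly increasing and a perfectly decreasing line.
x_demo = [1.0, 2.0, 3.0, 4.0]
r_up = correlation(x_demo, [2.0, 4.0, 6.0, 8.0])
r_down = correlation(x_demo, [8.0, 6.0, 4.0, 2.0])
print(r_up, r_down)
```

The two demonstration values sit at the extremes of the range of r (+1 and -1, up to floating-point rounding), matching the interpretation of sign and magnitude given earlier.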

To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight.

We first summarize the gestational age data. The mean gestational age is:

$$ \bar{x} = \frac{\sum x}{n} = \frac{652.1}{17} = 38.4 $$

To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.

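| Infant ID # | Gestational age (weeks) | $x - \bar{x}$ | $(x - \bar{x})^2$ |
|---|---|---|---|
| 1 | 34.7 | -3.7 | 13.69 |
| 2 | 36.0 | -2.4 | 5.76 |
| 3 | 29.3 | -9.1 | 82.81 |
| 4 | 40.1 | 1.7 | 2.89 |
| 5 | 35.7 | -2.7 | 7.29 |
| 6 | 42.4 | 4.0 | 16.00 |
| 7 | 40.3 | 1.9 | 3.61 |
| 8 | 37.3 | -1.1 | 1.21 |
| 9 | 40.9 | 2.5 | 6.25 |
| 10 | 38.3 | -0.1 | 0.01 |
| 11 | 38.5 | 0.1 | 0.01 |
| 12 | 41.4 | 3.0 | 9.00 |
| 13 | 39.7 | 1.3 | 1.69 |
| 14 | 39.7 | 1.3 | 1.69 |
| 15 | 41.1 | 2.7 | 7.29 |
| 16 | 38.0 | -0.4 | 0.16 |
| 17 | 38.7 | 0.3 | 0.09 |
| Total | 652.1 | | 159.45 |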

The variance of gestational age is:

$$ s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{159.45}{16} = 10.0 $$

Next, we summarize the birth weight data. The mean birth weight is:

$$ \bar{y} = \frac{\sum y}{n} = \frac{49{,}334}{17} = 2{,}902 $$

The variance of birth weight is computed just as we did for gestational age, as shown in the table below.

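| Infant ID # | Birth weight (grams) | $y - \bar{y}$ | $(y - \bar{y})^2$ |
|---|---|---|---|
| 1 | 1895 | -1007 | 1,014,049 |
| 2 | 2030 | -872 | 760,384 |
| 3 | 1440 | -1462 | 2,137,444 |
| 4 | 2835 | -67 | 4,489 |
| 5 | 3090 | 188 | 35,344 |
| 6 | 3827 | 925 | 855,625 |
| 7 | 3260 | 358 | 128,164 |
| 8 | 2690 | -212 | 44,944 |
| 9 | 3285 | 383 | 146,689 |
| 10 | 2920 | 18 | 324 |
| 11 | 3430 | 528 | 278,784 |
| 12 | 3657 | 755 | 570,025 |
| 13 | 3685 | 783 | 613,089 |
| 14 | 3345 | 443 | 196,249 |
| 15 | 3260 | 358 | 128,164 |
| 16 | 2680 | -222 | 49,284 |
| 17 | 2005 | -897 | 804,609 |
| Total | 49,334 | | 7,767,660 |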

The variance of birth weight is:

$$ s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} = \frac{7{,}767{,}660}{16} = 485{,}478.8 $$

Next, we compute the covariance. To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, that is:

$$ (x - \bar{x})(y - \bar{y}) $$

The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply.

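| Infant ID # | $x - \bar{x}$ | $y - \bar{y}$ | $(x - \bar{x})(y - \bar{y})$ |
|---|---|---|---|
| 1 | -3.7 | -1007 | 3,725.9 |
| 2 | -2.4 | -872 | 2,092.8 |
| 3 | -9.1 | -1462 | 13,304.2 |
| 4 | 1.7 | -67 | -113.9 |
| 5 | -2.7 | 188 | -507.6 |
| 6 | 4.0 | 925 | 3,700.0 |
| 7 | 1.9 | 358 | 680.2 |
| 8 | -1.1 | -212 | 233.2 |
| 9 | 2.5 | 383 | 957.5 |
| 10 | -0.1 | 18 | -1.8 |
| 11 | 0.1 | 528 | 52.8 |
| 12 | 3.0 | 755 | 2,265.0 |
| 13 | 1.3 | 783 | 1,017.9 |
| 14 | 1.3 | 443 | 575.9 |
| 15 | 2.7 | 358 | 966.6 |
| 16 | -0.4 | -222 | 88.8 |
| 17 | 0.3 | -897 | -269.1 |
| Total | | | 28,768.4 |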

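The covariance of gestational age and birth weight is:

$$ \mathrm{Cov}(x,y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} = \frac{28{,}768.4}{16} = 1{,}798.0 $$

Finally, we can now compute the sample correlation coefficient:

$$ r = \frac{\mathrm{Cov}(x,y)}{\sqrt{s_x^2 \, s_y^2}} = \frac{1{,}798.0}{\sqrt{10.0 \times 485{,}478.8}} = \frac{1{,}798.0}{2{,}203.4} = 0.82 $$

Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.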
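The hand computation for the 17 infants can be checked with a short script. This is a sketch using only the standard library; the variable names are illustrative. It carries the unrounded mean through the calculation, so intermediate values differ slightly from the tables (which use the rounded mean of 38.4), but r still rounds to 0.82.

```python
import math

# Data from the table of 17 infants above.
gestational_age = [34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
                   38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38.0, 38.7]
birth_weight = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
                2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

n = len(gestational_age)
mean_x = sum(gestational_age) / n
mean_y = sum(birth_weight) / n

# Sample variances and covariance, each with divisor n - 1.
var_x = sum((x - mean_x) ** 2 for x in gestational_age) / (n - 1)
var_y = sum((y - mean_y) ** 2 for y in birth_weight) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(gestational_age, birth_weight)) / (n - 1)

r = cov_xy / math.sqrt(var_x * var_y)

print(f"mean gestational age = {mean_x:.1f} weeks")
print(f"mean birth weight    = {mean_y:.0f} grams")
print(f"covariance           = {cov_xy:.1f}")
print(f"r                    = {r:.2f}")
```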

As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller.1
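As a rough illustration of such a test, the sketch below uses the common t statistic for a correlation coefficient, t = r·sqrt((n - 2) / (1 - r²)) with n - 2 degrees of freedom. This is an assumption about which procedure to show; it is not necessarily the one presented in the cited text, and it presumes the usual conditions (a random sample and approximately bivariate normal data). The critical value is taken from a standard t table.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: population correlation = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Gestational age example: r = 0.82 from n = 17 infants.
t_stat = correlation_t_statistic(0.82, 17)

# Two-sided 5% critical value for 15 degrees of freedom, from a standard t table.
critical_value = 2.131
is_significant = abs(t_stat) > critical_value

print(f"t = {t_stat:.2f}, significant at the 5% level: {is_significant}")
```

Under these assumptions the observed correlation of 0.82 would be judged statistically significantly different from zero.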

Regression Analysis

Regression analysis is a widely used technique that is useful for many applications. We introduce the technique here and expand on its uses in subsequent modules.

Simple Linear Regression

Simple linear regression is a technique that is appropriate to understand the association between one independent (or predictor) variable and one continuous dependent (or outcome) variable. For example, suppose we want to assess the association between total cholesterol (in milligrams per deciliter, mg/dL) and body mass index (BMI, measured as the ratio of weight in kilograms to height in meters squared), where total cholesterol is the dependent variable and BMI is the independent variable. In regression analysis, the dependent variable is denoted Y and the independent variable is denoted X. So, in this case, Y = total cholesterol and X = BMI.

When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression analysis. This analysis assumes that there is a linear association between the two variables. (If a different relationship is hypothesized, such as a curvilinear or exponential relationship, alternative regression analyses are performed.)

The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol. Each point represents the observed (x, y) pair, in this case, BMI and the corresponding total cholesterol measured in each participant. Note that the independent variable (BMI) is on the horizontal axis and the dependent variable (Total Serum Cholesterol) on the vertical axis.

BMI and Total Cholesterol

[Figure: scatter diagram of total cholesterol against BMI]

The graph shows that there is a positive or direct association between BMI and total cholesterol; participants with lower BMI are more likely to have lower total cholesterol levels and participants with higher BMI are more likely to have higher total cholesterol levels. In contrast, the graph below depicts the relationship between BMI and HDL cholesterol in the same sample of n=20 participants.

BMI and HDL Cholesterol

[Figure: scatter diagram of HDL cholesterol against BMI]

This graph shows a negative or inverse association between BMI and HDL cholesterol, i.e., those with lower BMI are more likely to have higher HDL cholesterol levels and those with higher BMI are more likely to have lower HDL cholesterol levels.

For either of these relationships we could use simple linear regression analysis to estimate the equation of the line that best describes the association between the independent variable and the dependent variable. The simple linear regression equation is as follows:

$$ \hat{Y} = b_0 + b_1 X $$

where $\hat{Y}$ is the predicted or expected value of the outcome, X is the predictor, b0 is the estimated Y-intercept, and b1 is the estimated slope. The Y-intercept and slope are estimated from the sample data, and they are the values that minimize the sum of the squared differences between the observed and the predicted values of the outcome, i.e., the estimates minimize:

$$ \sum (Y - \hat{Y})^2 $$

These differences between observed and predicted values of the outcome are called residuals. The estimates of the Y-intercept and slope minimize the sum of the squared residuals, and are called the least squares estimates.1
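For readers who want to see where the least squares estimates come from, the following is a brief sketch of the standard derivation (the indexing by i and the function name S are notational assumptions). Setting the partial derivatives of the sum of squared residuals to zero gives two equations whose solution is expressed in terms of the sample means, standard deviations, and correlation coefficient.

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
  &\;\Longrightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0
  &\;\Longrightarrow\; b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
  = \frac{\mathrm{Cov}(x,y)}{s_x^2} = r \, \frac{s_y}{s_x}
\end{align*}
```

The last equality uses the definition of the sample correlation coefficient, r = Cov(x,y) / (s_x s_y).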

Residuals

Conceptually, if the values of X provided a perfect prediction of Y, then the sum of the squared differences between observed and predicted values of Y would be 0. That would mean that variability in Y could be completely explained by differences in X. However, if the differences between observed and predicted values are not 0, then we are unable to entirely account for differences in Y based on X, and there are residual errors in the prediction. The residual error could result from inaccurate measurements of X or Y, or there could be other variables besides X that affect the value of Y.

Based on the observed data, the best estimate of a linear relationship will be obtained from an equation for the line that minimizes the differences between observed and predicted values of the outcome. The Y-intercept of this line is the value of the dependent variable (Y) when the independent variable (X) is zero. The slope of the line is the change in the dependent variable (Y) relative to a one unit change in the independent variable (X). The least squares estimates of the Y-intercept and slope are computed as follows:

$$ b_1 = r \, \frac{s_y}{s_x} $$

and

$$ b_0 = \bar{y} - b_1 \bar{x} $$

where r is the sample correlation coefficient, $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the standard deviations of the independent variable x and the dependent variable y, respectively.
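A minimal sketch of the least squares estimates, computing the slope as the correlation coefficient times the ratio of the standard deviations and the intercept from the sample means. The function name and the five data points are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

def least_squares_line(x, y):
    """Return (b0, b1) with b1 = r * s_y / s_x and b0 = mean(y) - b1 * mean(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    # Sample covariance and correlation coefficient.
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    r = cov_xy / (s_x * s_y)
    b1 = r * s_y / s_x
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data, for illustration only.
x_demo = [1, 2, 3, 4, 5]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(x_demo, y_demo)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

For these points the slope is about 1.99 and the intercept about 0.05, so each one unit increase in x is associated with an increase of roughly two units in y.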

BMI and Total Cholesterol

The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and total cholesterol are b0 = 28.07 and b1 = 6.49. These are computed from the sample correlation coefficient, means, and standard deviations using the formulas above.

The estimate of the Y-intercept (b0 = 28.07) represents the estimated total cholesterol level when BMI is zero. Because a BMI of zero is meaningless, the Y-intercept is not informative. The estimate of the slope (b1 = 6.49) represents the change in total cholesterol relative to a one unit change in BMI. For example, if we compare two participants whose BMIs differ by 1 unit, we would expect their total cholesterols to differ by approximately 6.49 units (with the person with the higher BMI having the higher total cholesterol).

The equation of the regression line is as follows:

$$ \hat{Y} = 28.07 + 6.49 \, \mathrm{BMI} $$

The graph below shows the estimated regression line superimposed on the scatter diagram.

[Figure: estimated regression line superimposed on the scatter diagram of total cholesterol against BMI]

The regression equation can be used to estimate a participant's total cholesterol as a function of his/her BMI. For example, suppose a participant has a BMI of 25. We would estimate their total cholesterol to be 28.07 + 6.49(25) = 190.32. The equation can also be used to estimate total cholesterol for other values of BMI. However, the equation should only be used to estimate cholesterol levels for persons whose BMIs are in the range of the data used to generate the regression equation. In our sample, BMI ranges from 20 to 32, so the equation should only be used to generate estimates of total cholesterol for persons with BMI in that range.
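The restriction to the observed range of BMI can be built directly into a prediction helper. The sketch below uses the coefficients and range quoted in the text; the function and constant names are hypothetical, and raising an error for out-of-range values is one possible design choice among several.

```python
# Coefficients and observed BMI range as reported in the text.
B0 = 28.07
B1 = 6.49
BMI_MIN, BMI_MAX = 20.0, 32.0

def predict_total_cholesterol(bmi):
    """Estimate total cholesterol (mg/dL) from BMI, refusing to extrapolate."""
    if not BMI_MIN <= bmi <= BMI_MAX:
        raise ValueError(
            f"BMI {bmi} is outside the range of the data ({BMI_MIN} to {BMI_MAX})"
        )
    return B0 + B1 * bmi

estimate = predict_total_cholesterol(25)
print(f"Estimated total cholesterol at BMI 25: {estimate:.2f} mg/dL")
```

Calling the function with a BMI of 40, for example, raises a `ValueError` rather than returning an extrapolated estimate.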

There are statistical tests that can be performed to assess whether the estimated regression coefficients (b0 and b1) are statistically significantly different from zero. The test of most interest is usually H0: β1 = 0 versus H1: β1 ≠ 0, where β1 is the population slope. If the population slope is significantly different from zero, we conclude that there is a statistically significant association between the independent and dependent variables.

BMI and HDL Cholesterol

The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and HDL cholesterol are b0 = 111.77 and b1 = -2.35. These are computed in the same way, from the sample correlation coefficient, means, and standard deviations.

Again, the Y-intercept is uninformative because a BMI of zero is meaningless. The estimate of the slope (b1 = -2.35) represents the change in HDL cholesterol relative to a one unit change in BMI. If we compare two participants whose BMIs differ by 1 unit, we would expect their HDL cholesterols to differ by approximately 2.35 units (with the person with the higher BMI having the lower HDL cholesterol). The figure below shows the regression line superimposed on the scatter diagram for BMI and HDL cholesterol.

[Figure: estimated regression line superimposed on the scatter diagram of HDL cholesterol against BMI]

Linear regression analysis rests on the assumption that the dependent variable is continuous and that the distribution of the dependent variable (Y) at each value of the independent variable (X) is approximately normal. Note, however, that the independent variable can be continuous (e.g., BMI) or dichotomous (see below).

Comparing Mean HDL Levels with Regression Analysis

Consider a clinical trial to evaluate the efficacy of a new drug to increase HDL cholesterol. We could compare the mean HDL levels between treatment groups statistically using a two independent samples t test. Here we consider an alternative approach. Summary data for the trial are shown below:

| | Sample Size | Mean HDL | Standard Deviation of HDL |
|---|---|---|---|
| New Drug | 50 | 40.16 | 4.46 |
| Placebo | 50 | 39.21 | 3.91 |

HDL cholesterol is the continuous dependent variable and treatment assignment (new drug versus placebo) is the independent variable. Suppose the data on n=100 participants are entered into a statistical computing package. The outcome (Y) is HDL cholesterol in mg/dL and the independent variable (X) is treatment assignment. For this analysis, X is coded as 1 for participants who received the new drug and as 0 for participants who received the placebo. A simple linear regression equation is estimated as follows:

$$ \hat{Y} = 39.21 + 0.95 X $$

where $\hat{Y}$ is the estimated HDL level and X is a dichotomous variable (also called an indicator variable, in this case indicating whether the participant was assigned to the new drug or to placebo). The estimate of the Y-intercept is b0 = 39.21. The Y-intercept is the value of Y (HDL cholesterol) when X is zero. In this example, X = 0 indicates assignment to the placebo group. Thus, the Y-intercept is exactly equal to the mean HDL level in the placebo group. The slope is estimated as b1 = 0.95. The slope represents the estimated change in Y (HDL cholesterol) relative to a one unit change in X. A one unit change in X represents a difference in treatment assignment (placebo versus new drug). The slope represents the difference in mean HDL levels between the treatment groups. Thus, the mean HDL for participants receiving the new drug is:

$$ \hat{Y} = 39.21 + 0.95(1) = 40.16 $$
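The claim that the intercept equals the placebo mean and the slope equals the difference in means can be checked numerically. The sketch below uses made-up HDL values, because the trial's individual-level data are not given in the text; only the 0/1 coding of X follows the example.

```python
def least_squares_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical HDL values (mg/dL); NOT the trial data summarized above.
placebo = [38.0, 39.5, 40.1, 39.2]
new_drug = [40.0, 41.2, 39.6, 40.8]

# Indicator coding: 0 = placebo, 1 = new drug.
x = [0] * len(placebo) + [1] * len(new_drug)
y = placebo + new_drug

b0, b1 = least_squares_line(x, y)
mean_placebo = sum(placebo) / len(placebo)
mean_new_drug = sum(new_drug) / len(new_drug)

print(f"b0 = {b0:.2f} (placebo mean = {mean_placebo:.2f})")
print(f"b1 = {b1:.2f} (difference in means = {mean_new_drug - mean_placebo:.2f})")
```

With an indicator predictor, the fitted line simply connects the two group means, which is why this regression carries the same information as a comparison of the two means.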


A study was conducted to evaluate the association between a person's intelligence and the size of their brain. Participants completed a standardized IQ test and researchers used Magnetic Resonance Imaging (MRI) to determine brain size. Demographic information, including the patient's gender, was also recorded.

The Controversy Over Environmental Tobacco Smoke Exposure

There is convincing evidence that active smoking is a cause of lung cancer and heart disease. Many studies done in a wide variety of circumstances have consistently demonstrated a strong association and also indicate that the risk of lung cancer and cardiovascular disease (i.e., heart attacks) increases in a dose-related way. These studies have led to the conclusion that active smoking is causally related to lung cancer and cardiovascular disease. Studies in active smokers have had the advantage that the lifetime exposure to tobacco smoke can be quantified with reasonable accuracy, since the unit dose is consistent (one cigarette) and the habitual nature of tobacco smoking makes it possible for most smokers to provide a reasonable estimate of their total lifetime exposure quantified in terms of cigarettes per day or packs per day. Frequently, average daily exposure (cigarettes or packs) is combined with duration of use in years in order to quantify exposure as "pack-years".

It has been much more difficult to establish whether environmental tobacco smoke (ETS) exposure is causally related to chronic diseases like heart disease and lung cancer, because the total lifetime exposure dosage is lower, and it is much more difficult to accurately estimate total lifetime exposure. In addition, quantifying these risks is complicated by confounding factors. For example, ETS exposure is usually classified based on parental or spousal smoking, but these studies are unable to quantify other environmental exposures to tobacco smoke, and the inability to quantify and adjust for other environmental exposures such as air pollution makes it difficult to demonstrate an association even if one existed. As a result, there continues to be controversy over the risk imposed by environmental tobacco smoke (ETS). Some have gone so far as to claim that even very brief exposure to ETS can cause a myocardial infarction (heart attack), but a very large prospective cohort study by Enstrom and Kabat was unable to demonstrate significant associations between exposure to spousal ETS and coronary heart disease, chronic obstructive pulmonary disease, or lung cancer. (It should be noted, however, that the report by Enstrom and Kabat has been widely criticized for methodological problems, and these authors also had financial ties to the tobacco industry.)

Correlation analysis provides a useful tool for thinking about this controversy. Consider data from the British Doctors Cohort. They reported the annual mortality for a variety of diseases at four levels of cigarette smoking per day: never smoked, 1-14/day, 15-24/day, and 25+/day. In order to perform a correlation analysis, I rounded the exposure levels to 0, 10, 20, and 30 respectively.

| Cigarettes Smoked Per Day | CVD Mortality Per 100,000 Men Per Year | Lung Cancer Mortality Per 100,000 Men Per Year |
|---|---|---|
| 0 | 572 | 14 |
| 10 (actually 1-14) | 802 | 105 |
| 20 (actually 15-24) | 892 | 208 |
| 30 (actually >24) | 1025 | 355 |

The figures below show the two estimated regression lines superimposed on the scatter diagram. The correlation with amount of smoking was strong for both CVD mortality (r = 0.98) and for lung cancer (r = 0.99). Note also that the Y-intercept is a meaningful number here; it represents the predicted annual death rate from these diseases in individuals who never smoked. The Y-intercept for prediction of CVD is slightly higher than the observed rate in never smokers, while the Y-intercept for lung cancer is lower than the observed rate in never smokers.

The linearity of these relationships suggests that there is an incremental risk with each additional cigarette smoked per day, and the additional risk is estimated by the slopes. This perhaps helps us think about the consequences of ETS exposure. For example, the risk of lung cancer in never smokers is quite low, but there is a finite risk; various reports suggest a risk of 10-15 lung cancers/100,000 per year. If an individual who never smoked actively was exposed to the equivalent of one cigarette's smoke per day in the form of ETS, then the regression suggests that their risk would increase by 11.26 lung cancer deaths per 100,000 per year. However, the risk is clearly dose-related. Therefore, if a non-smoker was employed by a tavern with heavy levels of ETS, the risk might be substantially greater.
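The correlations, slopes, and intercepts quoted above can be reproduced from the four rows of the table. The sketch below uses the rounded exposure levels (0, 10, 20, 30); the function and variable names are illustrative.

```python
import math

def fit_line(x, y):
    """Return (intercept, slope, r) for a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

# Data from the table above (deaths per 100,000 men per year).
cigarettes_per_day = [0, 10, 20, 30]
cvd_mortality = [572, 802, 892, 1025]
lung_cancer_mortality = [14, 105, 208, 355]

cvd_b0, cvd_b1, cvd_r = fit_line(cigarettes_per_day, cvd_mortality)
lung_b0, lung_b1, lung_r = fit_line(cigarettes_per_day, lung_cancer_mortality)

print(f"CVD:         intercept = {cvd_b0:.1f}, slope = {cvd_b1:.2f}, r = {cvd_r:.2f}")
print(f"Lung cancer: intercept = {lung_b0:.1f}, slope = {lung_b1:.2f}, r = {lung_r:.2f}")
```

The lung cancer slope of 11.26 is the per-cigarette increment discussed in the text, and the two intercepts fall above (CVD) and below (lung cancer) the observed rates in never smokers, as described.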

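[Figure: estimated regression line for CVD mortality against cigarettes smoked per day]

[Figure: estimated regression line for lung cancer mortality against cigarettes smoked per day]

Finally, it should be noted that some findings suggest that the association between smoking and heart disease is non-linear at the very lowest exposure levels, meaning that non-smokers have a disproportionate increase in risk when exposed to ETS due to an increase in platelet aggregation.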

Summary

Correlation and linear regression analysis are statistical techniques to quantify associations between an independent, sometimes called a predictor, variable (X) and a continuous dependent outcome variable (Y). For correlation analysis, the independent variable (X) can be continuous (e.g., gestational age) or ordinal (e.g., increasing categories of cigarettes per day). Regression analysis can also accommodate dichotomous independent variables.

The procedures described here assume that the association between the independent and dependent variables is linear. With some adjustments, regression analysis can also be used to estimate associations that follow another functional form (e.g., curvilinear, quadratic). Here we consider associations between one independent variable and one continuous dependent variable. The regression analysis is called simple linear regression; simple in this case refers to the fact that there is a single independent variable. In the next module, we consider regression analysis with several independent variables, or predictors, considered simultaneously.