Practice Tip

By Jerome M. Staller, Ph.D.
February 09, 2004

Multiple regression analysis, a statistical tool often used as evidence in employment-discrimination suits, can also be useful in product liability matters to show both probable causation and the probable range of economic damages.

Basically, multiple regression analysis estimates the relationship between two or more independent, or explanatory, variables and one dependent variable. The dependent variable is the variable the model is attempting to explain.

For example, in a model used to examine longevity, age at death would be the dependent variable. Diet, race, sex and geographical location could constitute independent variables. The regression analysis may show that diet is a statistically insignificant predictor of longevity, while geographical location and sex might be shown to bear a significant relationship to longevity.

If the variables are chosen with care, and no independent variables that may have a strong influence on the dependent variable are left out, the regression analysis might suggest causation, and in many cases can be more reliable than anecdotal or observational evidence. For example, it may be observed that people living near a toxic waste dumpsite suffer a higher-than-normal rate of cancer. This is an observed correlation between two variables (the cancer rate and proximity to the site), but the correlation does not necessarily show causation: the cancer rate may be more directly attributable to the fact that the population near the site includes a higher percentage of smokers, or that those living near the site are economically deprived and have, on average, limited access to health care.

A well-crafted multiple regression analysis that carefully weighs all or most possible variables can show with a high degree of statistical certainty which factor is most accountable for the higher cancer rate.
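The dumpsite example can be made concrete with a small numerical sketch. Every figure below is invented for illustration: a hypothetical risk score that depends only on smoking, in a population where smokers happen to cluster near the site. The `ols` helper is a bare-bones least-squares routine written for this sketch and is not a substitute for a statistical package.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * p for x, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical residents as (lives_near_site, smoker); smokers cluster near the site.
people = [(1, 1)] * 3 + [(1, 0)] + [(0, 1)] + [(0, 0)] * 3

# Invented risk score driven by smoking only; proximity has no effect by construction.
risk = [1.0 + 4.0 * smoker for _, smoker in people]

# Proximity alone: its coefficient absorbs the omitted smoking effect.
naive = ols([[1, near] for near, _ in people], risk)

# Proximity and smoking together: the proximity coefficient falls to zero.
full = ols([[1, near, smoker] for near, smoker in people], risk)

print(f"proximity only:        near = {naive[1]:.2f}")
print(f"proximity and smoking: near = {full[1]:.2f}, smoker = {full[2]:.2f}")
```

In this constructed data set, proximity appears to matter only until smoking is added to the model, which is the omitted-variable problem described above.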

An example from a recent case helps illustrate how multiple regression analysis can be applied to a product liability matter.

Experts agree that stray voltage — ambient electrical current occurring due to faulty grounding of high-tension wires — can adversely affect milk production of dairy cows. The problem is generating a number of lawsuits brought by dairy farmers against electricity suppliers, but there is much disagreement over what level of stray voltage is harmful and at what level milk production is affected.

The challenge was to assess the economic damages claims made by a dairy farmer who alleged that stray voltage negligently inflicted on his herd by the rural electrical co-op in 1993 through 1994 affected milk production. Detailed records on each cow in the herd were available for the years 1989 through 2002, allowing for a reliable regression analysis. Available data for each cow included date of birth, daily milk production, and “SCC” (somatic cell count) scores indicating white-blood-cell count (lower scores indicated healthier animals). In addition to individual cow data, other descriptive data were considered, including herd size by year, new additions to the herd, the age of the cows, annual herd production, average milk production by age, and the average SCC score of the herd.

Correlations between pairs of variables were tested first. For cows younger than 5 years, a positive correlation between age and milk production was found; that is, milk production increased until cows reached the age of 5. After age 5, the correlation was negative: milk production decreased.

Similarly, the possible correlation between health (measured by SCC score) and milk production was investigated and, not surprisingly, there was an inverse correlation: low (good) SCC scores correlated with higher production. (Other factors that might have an impact on milk production, however, are harder to account for: the addition or removal of animals from the herd, new pastures, changes in barns and other unquantifiable factors might affect production but could not be explicitly modeled.)

The multiple regression analysis was then designed. The model examined the effects of the age of the cow, the cow's SCC score and the number of days the cow was “in milk” (producing). To test how stray voltage in 1993 through 1994 had affected herd milk production, a “dummy” variable was included. A dummy variable is a way to express a characteristic such as gender or race, or some other characteristic that is not easily quantifiable. The presence or absence of stray voltage can be modeled in a “yes” or “no” framework: the variable was coded “yes” for the years 1993-1994 and “no” for the years 1989-1992 and 1995-2002.
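A minimal sketch of how such a dummy variable enters a regression follows. It is not the model used in the case: the records are synthetic, the "true" effects in `TRUE` are invented, and the `ols` helper is a simple least-squares routine written for illustration. Only the yes/no coding of 1993-1994 is taken from the description above.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * p for x, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


VOLTAGE_YEARS = {1993, 1994}  # the years coded "yes"


def voltage_dummy(year):
    """1 for the years of alleged stray voltage, 0 for every other year."""
    return 1 if year in VOLTAGE_YEARS else 0


# Invented effects, used only to generate the synthetic records:
# intercept, age, days in milk, SCC score, stray-voltage dummy.
TRUE = [60.0, 2.0, 0.075, -4.4, -5.0]

rows, production = [], []
for year in range(1989, 2003):
    for cow in range(4):  # four hypothetical cows per year
        age = 2 + (year + cow) % 4
        days_in_milk = 150 + 30 * ((3 * year + 7 * cow) % 5)
        scc = 1 + (year + 2 * cow) % 6
        x = [1, age, days_in_milk, scc, voltage_dummy(year)]
        rows.append(x)
        # Noise-free output, so the fit should recover TRUE almost exactly.
        production.append(sum(b * v for b, v in zip(TRUE, x)))

estimate = ols(rows, production)
for name, value in zip(["intercept", "age", "days_in_milk", "scc", "voltage"], estimate):
    print(f"{name:>12}: {value:.3f}")
```

The coefficient on the dummy is the estimated difference in production between the "yes" years and the "no" years, holding age, days in milk and SCC score constant. Real herd records would contain noise, so the estimates would carry standard errors that this sketch does not compute.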

The implicit question asked by a regression is: “If we change the value of one of the independent (explanatory) variables, what is the effect on the dependent variable?” We used regression analysis to determine the effect of each variable: What is the effect on production if we change the cow's age? The “days in milk”? The cow's health? And, comparing the years 1993 and 1994 to the other years, what is the effect on milk production?

The analysis yielded the following results:

  • A 1-year age increase (up to age 5) correlates with an approximate 2-pound monthly increase in production.
  • A 1-day increase in “days in milk” amounted to a 0.075-pound production increase.
  • A one-point increase in the SCC score (white-blood-cell count) led to a 4.4-pound annual decrease in production.

The loss of milk production was calculated by comparing actual milk production to a forecast of milk production, assuming no other factor, including the alleged stray voltage, was present. The loss of production would be the difference between the actual and forecasted levels of production.
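The comparison of forecast to actual production can be sketched as follows. The coefficients and herd records are hypothetical, chosen only to show the arithmetic, and the linear form of the forecast is an assumption, since the article does not spell out the forecasting equation.

```python
# Hypothetical coefficients, invented for illustration (not the case's estimates).
COEF = {"intercept": 60.0, "age": 2.0, "days_in_milk": 0.075, "scc": -4.4}


def forecast(cow):
    """Production predicted from the modeled factors only (age, days in milk, SCC)."""
    return (COEF["intercept"]
            + COEF["age"] * cow["age"]
            + COEF["days_in_milk"] * cow["days_in_milk"]
            + COEF["scc"] * cow["scc"])


def production_loss(cows):
    """Total shortfall: forecast production minus actual production."""
    return sum(forecast(cow) - cow["actual"] for cow in cows)


# Two made-up cow records from the period at issue.
herd = [
    {"age": 3, "days_in_milk": 200, "scc": 4, "actual": 58.0},
    {"age": 4, "days_in_milk": 240, "scc": 2, "actual": 71.5},
]

print(f"estimated loss: {production_loss(herd):.1f} pounds")
```

Whatever the forecast does not explain ends up in the shortfall, which is why the resulting figure reflects every unmodeled factor and not stray voltage alone.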

This analysis permitted us to isolate loss due to “all other factors.” This loss necessarily included the loss possibly caused by stray voltage, and also included loss due to other unquantifiable factors, such as changes in husbandry techniques, addition of new animals to the herd, and changes in the herd environment (e.g., a new barn), since it was impossible to isolate the loss due to each of these unquantifiable factors separately. However, the loss figure arrived at by multiple regression analysis is more reliable and accurate than a figure based on an estimate that does not consider the interplay of all relevant factors, such as one based on average production per year.

In this case, we used multiple regression analysis to calculate economic loss. The technique can also be applied to determine causation in product liability matters where there are sufficient data. For example, in a case alleging defective automobile design, the dependent variable might be identified as accidents, or fatal accidents, or repairs. Independent variables, assuming reliable data, might be age of driver, weather conditions, road conditions, vehicle mileage, and drivers' records.

The most important factors in any multiple regression analysis are reliable data, a properly identified dependent variable and inclusion of all measurable independent variables that might have an effect on the dependent variable.

An excellent primer on the use of multiple regression analyses in litigation is found in the Reference Manual on Scientific Evidence (Federal Judicial Center, 2d ed., 2000). The chapter on multiple regression analysis includes a concise yet thorough appendix covering the basics.

A caveat: Multiple regression analysis can be a powerful evidentiary tool, but it can also be misapplied. Elaborate multiple regression analyses can be statistically “significant,” yet, on a practical level, trivial or useless in addressing the question at issue. Statistical significance is determined by a test of how strongly a dependent variable is associated with one or more of the independent variables. However, a variable can be statistically significant, but practically insignificant.

For example, in one toxic tort case, an elaborate multiple regression analysis conducted by plaintiffs in a class action matter demonstrated a positive correlation between lost IQ points as a result of exposure to toxic chemicals and diminished lifetime earnings, a seemingly logical result. However, when the plaintiffs' statistical model was dissected, it was apparent that the basic assumptions on which the model was based were: 1) that each lost IQ point correlated with the loss of about 1 month of education, and 2) that each additional year of education implied additional annual income of $1,800 to $2,000. This served as the basis for loss claims of $60,000 to $500,000 per plaintiff.

However, because no plaintiff in the class had been shown to have lost more than six IQ points, the model simply showed that no more than six months of schooling would be lost. The model failed to show that any plaintiff would miss graduation from high school as a result of this predicted loss. Cross-examined on the model, the statistician who designed it admitted that the model made no distinction between an extra few months of school in the 10th grade versus an extra few months of school in the 12th grade. Thus, the multistage model, while statistically correct in each of its constituent stages, was shown to be, overall, extremely unreliable as a predictor of lost income.
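The chain of assumptions in the plaintiffs' model can be followed with simple arithmetic. The sketch below uses only the figures quoted above; treating the per-year income figure as scaling proportionally to a fraction of a year is an assumption made here, and it mirrors the very premise that was criticized.

```python
MAX_IQ_POINTS_LOST = 6                          # largest loss shown for any plaintiff
MONTHS_OF_SCHOOL_PER_IQ_POINT = 1               # model assumption 1
INCOME_PER_YEAR_OF_SCHOOL = (1_800, 2_000)      # model assumption 2, dollars per year

# Stage 1: IQ points lost are converted to schooling lost, in years.
years_lost = MAX_IQ_POINTS_LOST * MONTHS_OF_SCHOOL_PER_IQ_POINT / 12

# Stage 2: schooling lost is converted to annual income (assumed proportional).
annual_income_effect = tuple(years_lost * dollars for dollars in INCOME_PER_YEAR_OF_SCHOOL)

print(f"schooling lost: {years_lost} years")
print(f"implied annual income effect: ${annual_income_effect[0]:,.0f} to ${annual_income_effect[1]:,.0f}")
```

Each stage may be defensible on its own, yet the chained result rests on the premise that half a year of schooling carries half the income value of a full year regardless of grade level, which is the point the statistician conceded on cross-examination.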

This points out another significant consideration when deciding to use multiple regression analysis evidence. To be effective, the analysis must be understood by a jury. Probabilities uncovered by regression analysis may be counterintuitive, showing that an apparently simple cause-and-effect relationship may not exist. To make such evidence accessible to a jury, a careful step-by-step explanation might be required. In such situations, an expert who can explain the regression process simply and clearly would be the most effective.



