8 February 2024

Class Actions And Mass Torts: To Sample, Or Not To Sample, That Is The Question

In mass torts and class actions, the question of whether to produce and analyze an entire population of data or a smaller sample is often a topic of discussion. The question can come up as early as the class certification phase of the dispute, when issues of commonality, typicality, adequacy, and numerosity are considered, or as late as the damages phase, when each side is attempting to quantify classwide damages.

The following discussion highlights some of the motives for analyzing a statistical sample of data rather than the entire population, specifically in the context of a class action or mass tort dispute. The article then outlines several approaches that can be used to minimize the sample size necessary to arrive at statistically valid results.

To Sample or Not to Sample?

Time and Cost

The considerable cost and time required to produce and analyze an entire population of data is often the biggest reason for defense counsel to consider a statistical sample. This is especially true when the population of data is not readily available (e.g., it exists only in hard copy form or is warehoused offsite in a storage facility), is housed in decentralized databases (especially those with inconsistent data definitions), has been migrated into a secondary system (e.g., post-merger or post-acquisition), or was converted from an in-house database solution to one of the various ERP solutions. Any of these scenarios can significantly increase the complexity and cost of compiling and then standardizing the entire population of data during discovery, and each is a good reason to at least consider a statistical sample.

While defense counsel's desire to minimize the financial and operational impact on their client via statistical sampling is understandable, it is unlikely to motivate plaintiff's counsel. However, when the cost of producing and analyzing the entire population of data is significant enough (especially when a third party is required, for reasons discussed below) and the conversation turns to cost sharing between the parties, plaintiff and defense counsel may both be motivated to come to an agreement on the reasonableness of a statistical sample.

Data Privacy

Data privacy considerations, and the cost of complying with the various state and federal privacy laws, are also reasons for counsel to consider, or raise with the court, producing a statistical sample rather than the entire population of data. For example, actions brought by classes of employees typically require producing data containing personally identifiable information (PII) such as Social Security numbers, addresses, and phone numbers. Producing this data for an entire class of employees and then managing its protection across potentially dozens of experts and law firms can be daunting. Each touchpoint carries the potential for a data breach, failed regulatory compliance, or additional litigation. In matters like these, where there are data privacy concerns, a statistical sample can be considered.

These exposure risks increase dramatically in class actions involving medical data, where HIPAA and other rules governing protected health information (PHI) dictate data management, security, and compliance. The cost of maintaining an environment required to house massive amounts of medical data for what could amount to years of litigation can be significant.1 Again, when possible, producing a small but statistically valid sample of data may be preferred to producing the entire population of sensitive data.

Counter-Considerations

One consideration when deciding to rely on a statistical sample, especially during the damages phase of the dispute, is that it could limit defense arguments that individual inquiry is necessary to estimate damages. Specifically, relying on the sample generally requires assuming that it is representative of the entire population, and if it is representative of the entire population, then individual inquiry likely is not necessary. For this reason, counsel may choose not to leverage a statistical sample outside of early risk assessments and exposure analyses designed for settlement purposes under attorney-client privilege. In the end, both parties need to weigh the downstream strategic risks of statistical sampling against those of producing and analyzing the entire population of data.

Another risk is drawing conclusions from a non-statistical sample. For example, the parties may agree (or be ordered by the court) to use a sample for discovery purposes during the class certification phase. This type of sample is not always statistically valid, and in such cases the parties should be cautious before drawing any conclusions about the population. Take, for example, a case where a judge in a weights and measures class action requires an evaluation of the pricing practices at one randomly selected store. From this evaluation, the court and the two parties might gain some insight into the retailer's pricing practices and procedures, but the results should not be used to make assertions about other stores owned by the retailer.

Minimizing Sample Size

Once the decision has been made to analyze a statistical sample of the data rather than the entire population, the question turns to how small a sample can be drawn while still producing statistically valid results. As discussed above, the decision to sample in a class action is often driven by cost and risk considerations. In either case, the party or parties endorsing the use of the sample are going to want to select as small a sample as possible. And while the sample size requirement is seemingly rigid and formulaic, there are approaches that can be leveraged to minimize its size. The following outlines a few of these approaches.
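As a frame of reference for the approaches below, the textbook formula for the sample size needed to estimate a population mean, under standard assumptions (a simple random sample and an approximately normal estimator), is:

n = (z × σ / E)²

where z is the critical value for the chosen confidence level, σ is the standard deviation of the underlying data, and E is the desired margin of error (the precision). Each approach below works by shrinking one of these inputs: the sidedness of the interval and the confidence level drive z, the type of variable and any stratification drive σ, and the precision requirement drives E.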

Two-Sided vs. One-Sided Confidence Intervals

The textbook statistical analysis often incorporates a two-sided confidence interval, e.g., “the class of insured patients was under-reimbursed by an insurance provider by between $100 and $200 (with 95 percent confidence).” And because it is so common and widely understood, the decision to perform a two-sided analysis is often made without much thought. However, a two-sided analysis requires a larger sample than a one-sided one. For example, the required sample size would be meaningfully reduced if the statement quoted above were instead changed to “the class of insured patients was under-reimbursed by an insurance provider by no more than $200 (with 95 percent confidence).” While this change in wording may not be suitable given the type of legal arguments being made, it is nonetheless worth considering in certain situations.
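As an illustration, the following Python sketch (using hypothetical inputs: a $500 standard deviation and a $50 margin of error) applies the formula above to both cases:

from statistics import NormalDist
import math

def sample_size_mean(std_dev, margin_of_error, confidence=0.95, two_sided=True):
    # n >= (z * sigma / E)^2 for estimating a population mean
    tail = (1 - confidence) / 2 if two_sided else (1 - confidence)
    z = NormalDist().inv_cdf(1 - tail)
    return math.ceil((z * std_dev / margin_of_error) ** 2)

print(sample_size_mean(500, 50, two_sided=True))   # 385 sampling units (z = 1.96)
print(sample_size_mean(500, 50, two_sided=False))  # 271 sampling units (z = 1.645)

At 95 percent confidence, the one-sided critical value falls from roughly 1.96 to 1.645, which cuts the required sample by about 30 percent.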

Estimating Proportional vs. Continuous Variables

Most statistical sampling in class action disputes involves estimating a continuous variable like dollars (e.g., lost “overtime earnings” in a wage and hour class action). Because a continuous variable like dollars can in theory take on any value, arbitrarily small (and even negative) or arbitrarily large, it can take a very large sample to estimate the variable with any degree of confidence and precision. Alternatively, estimating a proportional or binary variable, which is constrained to a value between 0 and 1 (e.g., 75 percent of the employees in the wage and hour class were underpaid), typically requires a significantly smaller sample to arrive at the same degree of confidence and precision.
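The contrast can be sketched with hypothetical figures: a proportion's variance is mathematically capped at p(1 − p) ≤ 0.25, while a dollar variable's standard deviation can be arbitrarily large relative to the precision sought.

from statistics import NormalDist
import math

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95 percent confidence, ~1.96

# Proportion: worst-case variance p(1 - p) = 0.25, precision of +/- 5 percentage points
n_proportion = math.ceil(Z95 ** 2 * 0.25 / 0.05 ** 2)

# Continuous: hypothetical underpayments with a $900 standard deviation,
# estimated to within +/- $50
n_continuous = math.ceil((Z95 * 900 / 50) ** 2)

print(n_proportion)  # 385 sampling units
print(n_continuous)  # 1245 sampling units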

Once again, though, this decision typically cannot be made in a bubble and is largely a function of the facts of the case and the underlying allegations. In fact, the stage of the dispute often dictates the type of variable that needs to be estimated. For example, the question of what “percent” of employees were underpaid (i.e., a proportional variable) is more likely to come up during the class certification phase, while the question of what “amount” was underpaid (i.e., a continuous variable) is more likely to come up during the damages phase. However, when counsel is looking to assess risk or potentially settle a matter early in the dispute, an expert has more flexibility to decide what type of variable to estimate. And if cost is a concern or there is significant apprehension about expanded discovery, a significantly smaller, yet insightful, sample can be relied upon to estimate a proportional rather than a continuous variable.

Confidence and Precision Requirements

The decision on the level of confidence and precision needed from the statistical sample also affects the size of the sample. The higher the level of confidence (e.g., 95 percent confidence vs. 90 percent confidence) and precision (e.g., 5 percent precision vs. 10 percent precision) preferred, the larger the sample size needed. While everyone would love to rely on results with 1 percent precision, that can require roughly ten thousand sampling units, whereas results with 5 percent precision may only require a few hundred. So, to the extent counsel or the joint parties can agree on a less stringent level of confidence and precision, a much smaller sample can likely be relied on. It is worth noting that tightening the confidence and/or precision of the sample increases the required sample size nonlinearly: because sample size grows with the inverse square of the margin of error, halving the margin of error from 10 percent to 5 percent roughly quadruples the required sample.
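The inverse-square relationship is easy to see with a short sketch, assuming a worst-case proportion (p = 0.5) at 95 percent confidence:

from statistics import NormalDist
import math

Z95 = NormalDist().inv_cdf(0.975)
for margin in (0.10, 0.05, 0.01):
    n = math.ceil(Z95 ** 2 * 0.25 / margin ** 2)
    print(f"+/- {margin:.0%} precision -> n = {n}")

# Output:
# +/- 10% precision -> n = 97
# +/- 5% precision -> n = 385
# +/- 1% precision -> n = 9604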

Stratified Random Sampling

Another significant driver of sample size is the variability inherent in the data being analyzed. The more variance in the underlying data, the larger the sample needs to be to produce statistically valid estimates. While this variability may seem to be out of the expert's control, there are actually ways to minimize its effect, thereby minimizing the required sample size. The introduction of a “stratified” random sample is one such way.

Conceptually, the introduction of stratification allows the expert to partition off groups of like records such that intragroup variation is smaller than what exists population-wide. For example, take a class of consumers who have allegedly been overcharged by a retailer. One could reasonably assume that the various geographic locations of the class members might drive different buying patterns among the class, resulting in significant variability in the underlying data. A consumer from the Midwest might only buy the goods in question during the summer months, while a consumer from the South might buy them year-round. Collectively, these different buying patterns could be adding variability to the data. In this case, some of the variability can be controlled for by “stratifying” the underlying sales data and statistical sample into geographic regions.2

Any intrinsic factor that could be adding variation to the underlying data can be considered for stratification. However, introducing too many strata can have the reverse effect of increasing the required sample size, since a minimum number of sampling units is typically needed in each stratum. For example, stratifying the consumer data described above by both state and year (over a five-year period) creates 250 individual strata (i.e., 50 states × 5 years), each of which might require a minimum of 20 sampling units, for a total of 5,000 sampling units, thereby defeating the goal of minimizing the sample size.3 As a starting point, four to six strata should be considered, although in practice it is not uncommon to see significantly more.4
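To make the mechanics concrete, the sketch below uses Neyman allocation, a standard method that divides a fixed total sample across strata in proportion to each stratum's size times its standard deviation, so the more variable strata receive more sampling units. All figures are hypothetical.

# Hypothetical strata: name -> (record count, standard deviation of spend)
strata = {"Midwest": (40_000, 120.0), "South": (60_000, 300.0)}
total_n = 400
MIN_PER_STRATUM = 20  # illustrative per-stratum floor; see footnote 3

# Neyman allocation: weight each stratum by (size x standard deviation)
weights = {name: size * sd for name, (size, sd) in strata.items()}
total_weight = sum(weights.values())
allocation = {name: max(MIN_PER_STRATUM, round(total_n * weight / total_weight))
              for name, weight in weights.items()}

print(allocation)  # {'Midwest': 84, 'South': 316}

Here the more variable South stratum absorbs most of the sample; in practice, rounding and per-stratum minimums may require small adjustments so the allocations still sum to the agreed total.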

In Conclusion

While statistical sampling can be cost effective, reduce discovery (and the risk of producing confidential information such as PII), and provide reliable results, there are still many reasons not to sample in a class action or mass tort. This article presents counsel with issues to consider when weighing the pros and cons of producing and analyzing a small statistical sample versus the entire population of data against the facts of the case and their legal strategy.

Footnotes

1. Note: A&M's Disputes and Investigations practice has its own Enhanced Security Zone, which is an isolated network segment that is specifically designed to host client data that requires heightened security governance. Our U.S.-based data centers are ISO27001:2022 certified, HIPAA and NIST CSF compliant (third-party annual attestation), as well as Cyber Essentials PLUS certified.

2. Note that whatever sample size is chosen (e.g., 100 sampling units), it is allocated across the various strata (e.g., 50 in the Midwest and 50 in the South) and is not a requirement for each stratum (e.g., 100 in the Midwest and 100 in the South).

3. Cochran (1977) suggests using at least 20 sampling units in each stratum, while Kish (1965) suggests using at least 10. (Cochran, W.G. 1977. Sampling Techniques. 3rd ed. John Wiley & Sons, New York; Kish, L. 1965. Survey Sampling. John Wiley & Sons, New York.)

4. Cochran, W.G. 1977. Sampling Techniques. 3rd ed. John Wiley & Sons, New York.

Originally Published 31 January 2024

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.
