Abstract
Students, lawyers, and other professionals are sometimes unfamiliar with scientific methodology or the proper use of statistics. Even so, it is likely that they will confront attempts to use social science to prove a null hypothesis, when the data might suggest otherwise if properly handled. As a didactic device, following Erasmus and C. S. Lewis, advice is provided on “how to misuse” social science for one's own purposes, in hopes that future scholars will not be misled by such inappropriate, though not infrequent, practices.
This author has always believed that science should serve the interests of truth, with as little political interference as possible. However, over 30 years of research experience and recent legal experience (Schumm, 2005, 2010d; Schumm & Nass, 2006; Schumm, Nazarinia, & Bosch, 2009) have taught me that politics and “public relations” (Kupelian, 2010, p. 233) can play a large role in how science is “managed” or “used.” Pedagogically, sometimes the truth is best explained by illustrating its opposite, even if facetiously, as in a parody (e.g., Erasmus, 1929; Lewis, 1958). Readers are supposed to recognize the parallels and be encouraged to resist them. Although Cohen (1988, p. 16) and others (Vicente & Torenvliet, 2000, pp. 260–261) have clearly explained how the null hypothesis can never be proven scientifically, there are certainly aspiring graduate students, professionals, and attorneys who might be tempted to win their arguments or cases by subterfuge when the desired outcome is a “proof” of the null hypothesis. Cohen stated that “The null hypothesis … is always false in the real world” (1990, p. 1308). If sample sizes are small, one cannot even conclude that there are no small or medium size effects, only perhaps that any effects are not large. That is why Cohen (1990) stated, “I have learned and taught that the primary product of a research inquiry is one or more measures of effect size, not p values” (p. 1310).
To illustrate the misuse of statistics in a coherent manner, a relevant real-world example was chosen: the harms related to tobacco use. While Edward Bernays made effective use of public relations to promote tobacco use by women (Kupelian, 2010, p. 232), here the focus will be on how science could be used to promote tobacco use in general or at least to deter arguments against it. The “instructions in obfuscation” are presented to the aspiring student, but each is followed by an explanation (“Reality check for the wise”) of how to avoid the problems created by poor statistical analysis and reasoning. Suggestions for “proving” a null hypothesis with respect to the possible dangers of tobacco use are arranged roughly from least to greatest difficulty. For example, in practice, a tobacco company might hire scientific experts and a legal team to defend or promote its right to sell tobacco products even if there were increasing evidence of the harmfulness of tobacco use. A major goal might be to discredit any research that suggested a rejection of the null hypothesis, in this case the hypothesis that there were no differences in health outcomes between tobacco users and non-tobacco users. While it is also possible to exaggerate the importance of statistically significant effects in ways that distort science, that issue is beyond the scope of this discussion. Although most arguments will apply broadly, some specific approaches are attuned to the peculiarities of the U.S. legal system.
The following, then, is a list of detailed “Approaches” to using incorrect—but apparently valid—arguments that misrepresent data or knowledge, with some citations of published articles that demonstrate how such practices are found even in highly cited, peer-reviewed articles.¹ Obviously, the entire text is couched as facetious advice, meant to call the reader's attention to common types of erroneous or manipulative argument. Some of these are clearly courtroom tactics; others are more sophisticated “mistreatments” of data that are found in empirical science journals. The intent is to sharpen the reader's instincts for the correct evaluation of experimental design, data analysis, and interpretation.
Approach A: Claim “No Differences”
The first and simplest approach is to claim that there are no known differences, at least with respect to selected outcomes, even if no research has ever been done on a topic. Because if no research has ever been done, then there are no known differences, so the argument is strictly true, and it's hard to argue with truth! To show such a claim to be incorrect, someone would have to conduct research and present it, which is a long, difficult process in most cases. If the misfortune should occur that some research has been conducted, perhaps much of it can be ignored; after all, one's interpretation of the research literature is a statement of professional opinion that may enhance only one side of an issue or “accidentally” ignore certain published studies or parts thereof. For example, if a journal article reported no differences on two of three variables, one might not call attention to the observed difference on the third variable. A corollary of this principle is to keep one's opponents, by any means necessary, from searching for or presenting research results if those results do not favor your case (e.g., Schumm, 2010b). If the court, judge, or journal reader never hears about data supporting an alternative perspective, then they cannot reason according to it.
If an opponent does bring research evidence favoring a rejection of the null hypothesis, then try to disqualify the expert's testimony on whatever grounds may appeal to the court, regardless of the validity of the argument. One useful approach is to ask if the researcher is personally familiar with the area; for example, if he is studying the effects of smoking, ask if he is a smoker himself and, if not, whether his personal medical or religious views account for that. If so, try to disqualify him as inherently biased and incapable of rendering a legitimate expert opinion, even if the law says it is illegal to disqualify an expert on the basis of his religion. Another approach is to pretend that research on a controversial topic has never existed (Schumm, 2010b).
Reality Check for the Wise:
If there is no extant research, it is presumptuous to assume that any hypothesis, null or otherwise, is true. Overlooking research that does not fit one's “position” is not considered ethical and certainly limits one's perspective. If reviews of scientific literature consistently overlook results from one “side” or another (Schumm, 2008, 2010c), or use different methodological criteria in assessing research from one side or the other, one should suspect bias. Experts' opinions should be evaluated on the accuracy of research-based interpretations or testimony, not their personal values, religious views, or demographic characteristics.
Approach B: Muddle the Definitions or Focus on Irrelevant Outcomes
Work to shift definitions away from behavior, especially risky behavior, to consider all manner of psychological issues. For example, in tobacco use, exactly how does one define who is a “smoker”? Is it someone who actually smokes tobacco? Is it someone who smokes but does not inhale? Is it someone who is attracted to, even craves, smoking tobacco but chooses not to use it? Is it someone who believes in smokers' rights for political reasons but does not smoke? Is it someone who smoked tobacco as an adolescent but later gave it up? What about someone who smokes for a few years, then quits, and then takes up smoking again? What if a person has smoked two packs a day for 40 years but quit in the past month—is that person still a “smoker” or not? Half the battle can be won if the definitions are “right.” For example, who would want to base laws about tobacco use on attraction to smoking? After all, people can't help their attractions to a drug. Definition of a smoker as anyone attracted to smoking does not negate the real damage done by smoking, and including those who do not smoke but are attracted to it in groups designated as smokers in research designs will greatly improve the chances of supporting the null hypothesis. For example, if you want to prove that lung disease has nothing to do with smoking, use attraction to smoking or political identification with smokers as your independent variable rather than actual smoking behavior. You will be much more likely to succeed in proving the null hypothesis by grouping non-smoking people with actual smokers.
There is another approach that is useful for muddying the waters. Suppose that second-hand smoke really did harm children. Surely, we wouldn't want to measure either second-hand smoke or harms accurately. So, we could recommend a “more inclusive” approach to second-hand smoke exposure. Instead of focusing on parents or relatives who smoke tobacco in their homes, broaden the focus to all those who have such products or related items (e.g., matches) in their homes, even if no one ever smokes. Argue that we must know how the presence or availability of smoking tobacco and related products might harm children, “the big picture.” As well, this will increase your sample size and appear to provide more statistical power, even if the measures are becoming increasingly irrelevant to the things that might actually harm children. In some studies, this approach has increased the sample size by nearly 150% (Wainright & Patterson, 2006, 2008; Patterson, 2009).
With respect to outcomes, there are always many possible outcomes. You don't want to do, or cite, research on outcomes most likely to be adversely affected by smoking, such as disease or other health outcomes. For example, if there is concern about the effects of parental smoking on children, don't examine effects of second-hand smoke. Rather, study whether parents who smoke feed their children at least one meal a day or whether they seldom abuse them physically or emotionally. In other words, focus on things that may be important but are unlikely to be related to whether a parent smokes or not. If you can study 50 such factors and find no support for differences between the children of smokers and non-smokers, then even if opponents find significant differences on relevant variables, their significant findings will seem minor in comparison to the multitude of null findings. It will sound profound when you say that you tested 50 factors and only one of them was related to tobacco use, even if lung disease was the one that was significant in terms of a rejection of the null hypothesis.
Reality Check for the Wise:
Clarity of definitions is essential in science. In our example, while attraction to smoking would increase the probability of smoking behavior, any damage done by smoking is most likely done by engaging in the behavior, not merely thinking about it. Furthermore, there are probably specific aspects of smoking behavior that increase the risk of adverse effects. The actual use of tobacco products probably matters far more than their availability. If causes or effects are not measured clearly, it is much more likely that a null hypothesis will be accepted. Unless theoretically relevant outcomes are assessed, it is far less likely that the null hypotheses will be rejected. You might report dozens of tests of some outcomes showing no differences between smoking and non-smoking parents, but those results would not negate the real harm done by parental second-hand smoke on children. The importance of valid measurement has been noted by Faigman, Kaye, Saks, and Sanders (2002, p. 142) and by Kaye and Freedman (2002).
Approach C: Establish Impossible Criteria for Disproving the Null Hypothesis
It is very important to start with an argument that cannot be refuted. For example, you might argue that unless more than 95% of all smokers (using a muddled definition for best effect, see above) die of lung cancer by age 30 compared to less than 5% of non-smokers, you will not consider the null hypothesis rejected. Argue that it is not “fair” to legally restrict tobacco use unless every single smoker (by your definition) dies of lung cancer within a short time of smoking while no non-smoker ever dies of lung cancer. If the question is whether tobacco use is necessarily a problematic behavior, “All we need is a single case in which the answer is a negative” (Hooker, 1957, p. 30). All you need is a select group of healthy, youthful tobacco users (the “Marlboro man” cowboy comes to mind) or long-time users who have lived to a ripe old age in apparent good health, to “prove” the null hypothesis about tobacco use, at least in the eyes of the public, whose persuasion is the ultimate objective (Kirk & Madsen, 1990). The scientific point is that if you only need one case, then you could have a study with 29 of 30 tobacco users suffering ill effects from tobacco use compared to only 1 of 30 non-users suffering similar ill effects and you could “reasonably” support your hypothesis, even if the odds ratio in such a case was enormous, as well as statistically significant.
Reality Check for the Wise
There are probably almost no variables in social science that will ever be strong enough to yield 95/5 splits on outcomes. For example, Luntz (2009) reported that “Two-thirds (66 percent) of nonreligious Americans agree with the statement ‘If it feels good, do it,’ despite its selfish, even dangerous undertones. By comparison, fully 71 percent of religious Americans disagree with the concept of instant gratification. What we have here is a chasm between the value systems of these two American camps.” (p. 261). Although with N = 200, such percentage differences would yield an odds ratio of 4.75 (95% CI = 2.61, 8.64; r = .37; effect size, Cohen's d = 0.79), a substantial as well as a statistically significant (p < .001) difference, the result is far from a 95/5 split. Of course such a result should not be used to justify instant gratification as a lifestyle. Kaufman and Schumm (2011) replicated Luntz's research with a sample of 195 university students, finding a significant negative association between intrinsic religiosity and impulsivity.
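The figures Luntz reported can be verified directly. The sketch below assumes, purely for illustration, two groups of 100 each, so that 66 of the nonreligious and 29 of the religious agree with the statement (consistent with the percentages quoted); it recomputes the odds ratio, its 95% confidence interval, and the phi coefficient from the 2 × 2 table, and the d-from-r conversion yields approximately 0.80, in line with the 0.79 quoted above:

```python
import math

# Hypothetical 2x2 table consistent with the percentages quoted
a, b = 66, 34   # nonreligious: agree, disagree
c, d = 29, 71   # religious:    agree, disagree

odds_ratio = (a * d) / (b * c)

# 95% CI via the standard error of the log odds ratio
se = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)

# Phi coefficient (r for a 2x2 table) and Cohen's d converted from r
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
cohens_d = 2 * phi / math.sqrt(1 - phi ** 2)

print(round(odds_ratio, 2), round(lo, 2), round(hi, 2), round(phi, 2))
```

Running this reproduces the published values (odds ratio 4.75, 95% CI 2.61 to 8.64, r = .37), so the reported statistics are internally consistent with the percentages.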
In reality, risky behaviors can lead to instant adverse effects on rare occasions, but more often, the risk accumulates over longer periods of time. For example, texting on a cell phone while driving can lead instantly to a fatal accident but most of the time it happens, there are probably no adverse effects. Despite that “lack of evidence,” many states have legally banned texting while driving—not in the expectation that all texting will cease but that the cumulative risk will be reduced. Furthermore, some would argue—even accurately—that they are such good drivers that they can text and drive without increasing risk to anyone, that they have been doing so for years without an accident. If the existence of a few examples of successful risk takers were sufficient to discredit legislation against risky behavior, probably no legislation regarding any risk could ever be justified.
Back to our example, if 29/30 tobacco users had poor outcomes compared to only 1/30 non-users, that would create astonishing results statistically (odds ratio = 841.0, p < .001; 95% CI = 50.2, 14,097.9; two-sided Fisher's Exact Test, p < .001; r = .93, p < .01, ES > 4.0); and yet, presumably, the result would be considered by the obfuscator (on the “it only takes one case” theory) to represent “proof” of the null hypothesis. One could probably show that drunk drivers are more likely to have accidents and yet at the same time show that not all drunk drivers end up in an accident; thus, you could argue that although drunk drivers were statistically more likely to have accidents and even kill other drivers, being a drunk driver per se was not automatically dangerous or pathological because of the many who never have any accidents. The “one case” argument may sound like good philosophy, but it is not how science is typically done (Schumm, 2010a).
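These statistics can be reproduced from the standard formulas with nothing beyond the Python standard library; a minimal sketch follows (the upper confidence bound differs from the published 14,097.9 only by rounding in intermediate steps):

```python
import math

# 2x2 table from the example: 29/30 users vs 1/30 non-users with poor outcomes
a, b = 29, 1    # tobacco users: poor outcome, no poor outcome
c, d = 1, 29    # non-users:     poor outcome, no poor outcome

odds_ratio = (a * d) / (b * c)   # (29 * 29) / (1 * 1) = 841.0

se = math.sqrt(1/a + 1/b + 1/c + 1/d)
ci_lo = math.exp(math.log(odds_ratio) - 1.96 * se)
ci_hi = math.exp(math.log(odds_ratio) + 1.96 * se)

# Two-sided Fisher's exact test: sum the hypergeometric probabilities of all
# tables (same margins) that are no more probable than the observed table
def hyper_p(x, row1=30, row2=30, col1=30):
    return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(row1 + row2, col1)

p_obs = hyper_p(a)
p_two_sided = sum(hyper_p(x) for x in range(0, 31) if hyper_p(x) <= p_obs)

print(odds_ratio)           # 841.0
print(round(ci_lo, 1))      # ~50.2
print(p_two_sided < .001)   # True
```

Even the lower bound of the confidence interval (about 50) is an enormous odds ratio, which is the point: results this strong would still count as a failure under the “one case” criterion.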
Approach D: Forget to Report or Examine Effect Sizes
Results, especially with small samples, may not be statistically significant even though the effect sizes are medium to large substantively. If you think this might be a problem, it is very useful to “forget” to report means and standard deviations, without which effect sizes cannot be calculated. If you do report these descriptive statistical results, someone (e.g., Schumm, 2010b) might actually calculate effect sizes after the fact. It also helps to “forget” to report sample sizes or t test values, if you can get away with that (Patterson, 2009). Never perform a power analysis with a small sample, as that will alert the reader to the weakness of your analysis (Erich, Kanenberg, Case, Allen, & Bogdanos, 2009, p. 401).
Reality Check for the Wise
Enough information (e.g., means and standard deviations) should always be provided in a published article to permit the calculation of effect sizes. Better yet, in addition to discussions of statistical significance, effect sizes should be reported, along with discussions of their importance to the proper interpretation of the results. A pattern of results consistent with a particular interpretation should also be acknowledged, even if the outcome measures are correlated (Schumm & Crow, 2010). For example, suppose one found that on nine of 10 different measures of respiratory functioning, smokers fared worse than non-smokers; such consistent results should not be dismissed even if those measures were somewhat similar (Wainright & Patterson, 2006).
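When means and standard deviations are reported, readers can recover effect sizes for themselves. A minimal sketch of the calculation, using invented group statistics purely for illustration:

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical data: smokers vs non-smokers on some respiratory measure
d = cohens_d(m1=82.0, sd1=10.0, n1=25, m2=90.0, sd2=10.0, n2=25)
print(round(d, 2))   # -0.8: a large effect, even if p > .05 in so small a sample
```

This is exactly why omitting means and standard deviations is such an effective obfuscation: without them, the reader cannot perform this two-line calculation.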
Approach E: Use Small Samples and Unreliable or Invalid Measures
If research must be done or is discovered in contemporary archives, then one must steer away from large samples and focus on very small samples or, at least, report only the results from very small samples, so that only extremely large effects will ever be detected. Doing many such small studies (Schumm, 2008) is very likely not to support rejecting the null hypothesis (Katz, 2006, p. 138). There is an added advantage, because one can say that many studies have supported a hypothesis of “no differences”—this sounds impressive to the uninformed. If the null hypothesis is occasionally rejected in a small study, attribute the finding to random chance. Citing the benefits of qualitative research is another especially good way to justify continued use of tiny samples.
Sometimes, you might find yourself with an unenviably large sample. One approach is to consider whether oversampling was used to collect more data from typically undersampled populations. If so, instead of correcting the problem by increasing the size of the regular sample, try reducing the size of that rare group. You might start with a group of 200 rarely studied participants and achieve a downsized sample with only 30 such participants. That would greatly reduce the chances of rejecting the null hypothesis. Perhaps you still have a problem—even after downsizing, you still have 150 participants in your group, which would leave you with a much better chance of rejecting the null hypothesis. Break up the group into subgroups, preferably four or five (Cochran & Mays, 2007). That way you can still end up with comparison groups that are very small and unlikely to afford enough statistical power to permit rejection of the null hypothesis. For example, suppose that tobacco users represent 20% of the population. You draw a random sample of 1,000 persons except you oversample tobacco users to get 400, for a final sample of 800 non-users and 400 users. The chances are good of rejecting the null hypothesis with such a large sample. Claiming that you must downsize your user sample to make it “representative,” you reduce it to 200, but that is still too large. Divide the tobacco users into smaller groups, citing the important theoretical differences among them—e.g., chewers, pipe smokers, cigar smokers, cigarette smokers, snuff users—and then divide pipe and cigar users into inhalers and non-inhalers. This will divide the 200 into groups with an average size of less than 30, which will severely reduce the chances of rejecting the null hypothesis. At the same time, you may even be commended for your attention to detail!
Another approach is to use variables that have large amounts of missing data. You might start with a group of 100 participants, but if half of those participants skipped a question on your survey, then your de facto group size is only 50, much smaller and much less likely to yield significant findings, not to mention that you can justify disregarding any significant findings by citing the bias introduced by so much missing data. For example, you might start with 44 participants but lose up to 21 participants for a final N = 23 (Wainright & Patterson, 2008, p. 122).
The more unreliable or invalid a measure, the less likely that it will significantly correlate with anything. Therefore, the more such poor measures are used in a study, the more likely one is to support the null hypothesis, avoiding any finding of statistically significant results. It is especially helpful to avoid discussion of the reliability or validity of the measures used, allowing the reader to merely assume psychometric quality without any evidence. A really poor measure is mostly random error, a wonderful situation!
Reality Check for the Wise:
Studies with fewer than 100 participants that do not discuss the statistical power of their analyses or the effect sizes of their results should be viewed with caution. Studies with small samples can be very useful for developing new ideas and future directions for research but are not very useful for detecting the presence of small effects that would be statistically significant in larger samples. Detecting a small effect difference between two groups with an 80% chance (power) at a significance level of .05 requires a sample size of N = 393 per group (Cohen, 1992, p. 158), while detecting a medium-size effect requires N = 64 per group. Good (2001, pp. 229–232) has presented helpful information on sample size and statistical power. When evaluating the role of subgroups, one should probably look at results for the subgroups separately and then combined. If missing data represent less than 10% of the sample, the “damage” is probably minimal. However, as missing data increase, the risk of selection bias increases and should not be discounted. There are a variety of approaches for dealing with larger amounts of missing data, beyond the scope of this discussion. Some results can be significant by chance alone, so one must be careful to not over-interpret scattered significant findings observed when conducting multiple statistical tests.
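Cohen's (1992) tabled values can be closely approximated from the usual normal-approximation formula, n per group ≈ 2(z_α/2 + z_β)² / d². A sketch using only the Python standard library; note that Cohen's tables rest on the noncentral t distribution, so his medium-effect figure comes out one participant higher (64 rather than 63):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = .80
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(0.2))  # small effect:  393, matching Cohen (1992)
print(n_per_group(0.5))  # medium effect: 63 (Cohen's noncentral-t table gives 64)
```

The formula makes the obfuscator's strategy transparent: with 30 participants per group, only very large effects have any realistic chance of reaching significance.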
Measures of key constructs should have both high reliability and adequate validity. That is, they should be internally consistent and perform similarly over time. They should correlate with measures that you would expect them to correlate with and they should not correlate with measures that you would not expect them to correlate with—expectations, of course, being based on theory. Researchers should routinely provide evidence to support the adequate reliability and validity of their measures. Otherwise, high random variability in either measures or even experimental methods will obscure any true effects of the independent variables or different groups being used to predict outcomes.
Approach F: Minimize Unwelcome Findings
If eventually larger studies are done, especially with random, representative samples, then it is more likely that some effects will be discovered. At this point, the easiest approach is to note their significance but avoid mention of effect sizes, since some of them might be large and very damaging to the idea of a null or “no differences” hypothesis. It sounds much better to mention that five barely significant results were found than to mention that the effect sizes were substantial. If the effects do appear to be large, then the easiest approach would be to deny their clinical significance, which being very subjective, is difficult to discredit. However, those who are not clinicians may not care if the results were “clinically” significant or not, so another way to discredit the research is to observe that the study has not been replicated, at least by scholars you consider to be “credible.” If the results have been replicated, at this point, the best approach is to insist that the results are not meaningful (by your idiosyncratic definition of meaningfulness, of course), even if they are statistically significant, substantial, possibly even clinically significant, and replicated. At the very least, you should argue that statistical significance does not imply clinical significance (Herek & Garnets, 2007, p. 356). In our example, if you were representing tobacco companies, you might argue that even if smoking did appear related to poor health outcomes, everyone has to die of something, so the results don't really mean much. You could also argue that people, being free, can choose not to smoke at any time and thereby minimize any negative health outcomes. Hence, the real issue should be the freedom to smoke (or not) rather than possible health outcomes from smoking. 
At this point, one must be careful to obscure any effects of second-hand smoke or effects on others or society in general from smoking because those issues might be more likely to be viewed as meaningful by neutral parties.
Reality Check for the Wise:
If effect sizes are never mentioned in a research report, one should be wary. Many journals require the reporting of effect sizes. When results are statistically significant and effect sizes medium to large in magnitude, dismissal of those results in connection with “lack of clinical significance” or “meaninglessness” is an indication that the interpreter is grasping at straws. While it is useful to replicate important studies, exact replication is rare. Often one must consider similar studies even if they are not exact replications.
Approach G: Using Group Designs “Creatively”
The ideal in science is to compare two pure groups of approximately equal size with equal variances in which all of the members in each group accurately represent different qualities. Such careful sample selection allows for a greater chance of rejecting the null hypothesis and must be avoided at all costs. One way to be creative is to use data sets in which too few questions were asked to permit accurate identification of the members of pure groups. If the two pure groups really were different, then you can increase the chances of not observing that if you can mix some members of each group into the other group! Perhaps you can arrange for members of group A to be in group B at a rate of 20% and members of group B to be in group A at a rate of 50%. Now whatever differences may exist between the two groups A and B will be largely obscured, reducing the chances of detecting the true differences between the two groups. In our example, perhaps some people smoke tobacco every day while others smoke only once a week; try combining the “once a week” smokers with the “never” smokers before comparing the “non-smokers” to the “smokers.” It may be very helpful to use this approach in combination with undersizing, missing data, or using several subgroups. Another helpful approach would be to drop from your study anyone who died or became seriously disabled from respiratory illnesses; after all, they are in no position to be re-interviewed.
Reality Check for the Wise:
If concepts are not defined clearly or not measured precisely, one can support almost any null hypothesis, in addition to preventing effective replication of the research. One should never use weak operational definitions to form two groups and then assume the results generalize to far more specifically defined groups. For example, persons who chew tobacco and persons who smoke tobacco may be very different, even though both could be defined as “tobacco users.” Women who chew tobacco may be different from men who chew tobacco, even though both groups are “chewers.” Thus, generalizing from chewers to smokers or from men to women would not always be appropriate in spite of superficial similarities.
Approach H: Statistical Controls
The refuge of last resort for the advocate of the null hypothesis must be statistical control. There are at least five excellent ways to justify the null hypothesis with statistical controls.
Measuring control variables but not using them.—It is likely that if you survey tobacco users about tobacco use, “socially desirable” responses will be elicited. A clever way to deal with this problem is to measure social desirability but not use it as a control variable in analyses. That way you get credit for measuring an important control variable without having to endanger your intended acceptance of the null hypothesis by actually using the control variable in your analyses (Erich, et al., 2009, p. 400). Another useful approach is to measure a variable such as educational attainment that might differ between groups but report only means without standard deviations (Erich, Leung, & Kindle, 2005, p. 50) or report only the mean and standard deviation for both groups together rather than for each group separately (Erich, et al., 2005, p. 51). Either way, you must avoid controlling for those variables, especially if they have the potential to act as suppressor variables.
Reality Check for the Wise:
Means and standard deviations for both groups should be reported, especially for variables that are either theoretically important or are shown to differ significantly between the two groups. Measuring an important control variable without using it as a statistical control factor should raise immediate questions—after all, why measure it if it's not important enough to use as a control variable?
“Controlling” for natural or logical outcomes.—Suppose you were a researcher with an agenda to support tobacco companies. Suppose you were studying cancer patients in two groups (Groups A and B, 30 smokers and 30 non-smokers). Let's assume that smokers are more likely to have some form of cancer (40% vs 10%), which will yield an odds ratio of 6.00 (p < .05). That substantial and statistically significant odds ratio would not look good in court, so something must be done about it! You also observe that when participants learned they had cancer, there was a tendency for the patients with cancer (67%, for both smokers and non-smokers) to experience depression; at the same time, there was a higher tendency for smokers without cancer to be depressed (33.3%) than for non-smokers without cancer (11.1%). In this case, if you want to discredit the relationship between smoking and cancer, try controlling for depression! There is a danger that your opponents might observe that depression was a natural outcome of illness and therefore a poor control, but one must hope they are not so astute. Adding depression in our example would save the day, yielding a statistically nonsignificant odds ratio (4.01) for smoking and cancer. Even though cancer would be found to be significantly related to depression, that result can be disregarded as meaningless—“obviously,” cancer leads to depression—or even not reported. Furthermore, now you could go to court and argue that cancer was not a smoker's disease because (1) some non-smokers developed cancer, (2) not all smokers developed cancer, and (3) there was no statistically significant relationship between smoking and cancer once you controlled for the tendency for ill persons (and even healthy smokers) to be more depressed. If the consequence of smoking is cancer and cancer has its own consequences, then controlling for the latter can make it appear that smoking is not harmful at all.
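The arithmetic of this maneuver can be reconstructed from the cell counts implied above (12 of 30 smokers and 3 of 30 non-smokers with cancer, with depression distributed as described). The sketch below computes the crude odds ratio and then a Mantel–Haenszel odds ratio stratified on depression; the stratified estimate is only a rough stand-in for whatever adjustment produced the 4.01 quoted above, but it exhibits the same attenuation:

```python
# Cell counts implied by the example (illustrative reconstruction)
# Each stratum is (a, b, c, d) = (smokers with cancer, smokers without,
#                                 non-smokers with cancer, non-smokers without)
strata = [
    (8, 6, 2, 3),    # among the depressed
    (4, 12, 1, 24),  # among the non-depressed
]

# Crude odds ratio, collapsing over depression
a = sum(s[0] for s in strata)  # 12 smokers with cancer
b = sum(s[1] for s in strata)  # 18 smokers without
c = sum(s[2] for s in strata)  # 3 non-smokers with cancer
d = sum(s[3] for s in strata)  # 27 non-smokers without
crude_or = (a * d) / (b * c)

# Mantel-Haenszel odds ratio "adjusting" for depression
num = sum(ai * di / (ai + bi + ci + di) for ai, bi, ci, di in strata)
den = sum(bi * ci / (ai + bi + ci + di) for ai, bi, ci, di in strata)
mh_or = num / den

print(crude_or)         # 6.0
print(round(mh_or, 2))  # 3.9: "controlling" for the consequence shrinks the OR
```

The smoking–cancer association has not gone anywhere; conditioning on a downstream consequence of the disease has simply absorbed part of it, which is exactly the sleight of hand being recommended.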
Just make sure the other side doesn't have a methodologist who can sort out such issues!
Reality Check for the Wise:
The model is that smoking causes cancer, which in turn causes depression. Controlling for depression violates the causal order of the smoking versus cancer association. It could be argued that depression causes cancer, but such an effect would be indirect (e.g., through a weakened immune system), compared with the more logical direct effect of a potentially terminal illness on one's mental health.
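The arithmetic behind this gambit can be checked directly. The Python sketch below (my own illustration, not from the article) reconstructs plausible cell counts from the stated percentages and computes both the crude odds ratio and a Mantel-Haenszel odds ratio stratified on depression; the Mantel-Haenszel value of 3.90 is close to, though not identical with, the logistic-regression figure of 4.01 cited above.

```python
from fractions import Fraction

# Hypothetical cell counts reconstructed from the percentages in the text
# (N = 60): keys are (smoker, cancer, depressed) -> count.
counts = {
    (1, 1, 1): 8,  (1, 1, 0): 4,   # 12 smokers with cancer, 2/3 depressed
    (1, 0, 1): 6,  (1, 0, 0): 12,  # 18 smokers without cancer, 1/3 depressed
    (0, 1, 1): 2,  (0, 1, 0): 1,   # 3 non-smokers with cancer, 2/3 depressed
    (0, 0, 1): 3,  (0, 0, 0): 24,  # 27 non-smokers without cancer, 1/9 depressed
}

def cell(s, c, d=None):
    """Count for a smoking/cancer cell, optionally within one depression stratum."""
    return sum(v for (si, ci, di), v in counts.items()
               if si == s and ci == c and (d is None or di == d))

# Crude (unadjusted) odds ratio for smoking and cancer
crude_or = Fraction(cell(1, 1) * cell(0, 0), cell(1, 0) * cell(0, 1))

# Mantel-Haenszel odds ratio after "controlling" for depression
num = den = Fraction(0)
for d in (0, 1):
    n_d = sum(cell(s, c, d) for s in (0, 1) for c in (0, 1))
    num += Fraction(cell(1, 1, d) * cell(0, 0, d), n_d)
    den += Fraction(cell(1, 0, d) * cell(0, 1, d), n_d)
mh_or = num / den

print(float(crude_or))  # 6.0
print(float(mh_or))     # 3.9
```

With these counts the crude odds ratio is exactly 6.00, while the depression-adjusted estimate shrinks toward 4, reproducing the attenuation the text describes.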
“Controlling” for natural or logical antecedents.—Another method is to control for a variable that is highly correlated with smoking itself but not with cancer. Let's leave the data set (N = 60) alone except to add a variable labeled dating. Let's suppose that 81.5% of non-smokers without cancer can get dates easily, as can 66.7% of non-smokers with cancer, whereas only 16.7% of smokers can (perhaps bad breath and stained teeth are at work), for both those with and without cancer. Now, when one controls for dating (which does not significantly predict cancer), the relationship between smoking and cancer is no longer significant. Big tobacco could claim that there was some sort of smoker-phobia out there, restricting the social lives of smokers. In other words, it was the prejudices of people who would not date persons with bad breath and stained teeth. Smoking wasn't a direct cause of cancer; more likely, it was social or sexual rejection and unfair dating discrimination causing stress that led to higher vulnerability to development of cancer. Perhaps society had imposed an internalized smoker-phobia on smokers so that they unconsciously hated themselves for smoking and their self-hatred exacerbated their rejection by others. In other words, blame anyone or anything except the person responsible for the risky choice of using tobacco. As a last resort, remember that people probably have genetic predispositions for nicotine cravings or attraction to wafts of beautiful billowing smoke, so even if they make the choice to smoke, it still really isn't their own personal responsibility.
Reality Check for the Wise:
The issue is twofold: (1) can the discrimination variable account for any of the effect, and (2) can it completely explain away the relationship between smoking and cancer? The first proposition might be true while the second is not. Unless both are true, one cannot attribute the association between smoking and cancer entirely to the variable of social discrimination. Furthermore, speculating that a third variable might have explained away the association had it been measured and analyzed may be interesting philosophy, but it is not sound interpretation.
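The mechanism here can also be quantified. The sketch below (my own illustration; the cell counts are reconstructed from the stated percentages) computes the phi correlation between smoking and the hypothetical dating variable and, as a rough linear-model approximation, the resulting variance inflation factor. Controlling for a variable this collinear with smoking inflates the sampling variance of the smoking coefficient by roughly two-thirds, which is how a still-substantial association can be rendered “non-significant.”

```python
import math

# 2x2 table of smoking vs "gets dates easily", from the stated percentages:
# smokers 5/30 (16.7%), non-smokers 24/30 (22 healthy + 2 with cancer).
a, b = 5, 25   # smokers: dates easily / not easily
c, d = 24, 6   # non-smokers: dates easily / not easily

# Phi coefficient (the Pearson correlation between two binary variables)
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Rough linear-model variance inflation factor: "controlling" for a variable
# this collinear with smoking inflates the sampling variance of the smoking
# coefficient by a factor of 1 / (1 - r^2).
vif = 1 / (1 - phi ** 2)

print(round(phi, 3))  # -0.634
print(round(vif, 2))  # 1.67
```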
Controlling for the kitchen sink.—Every time you add a variable, even a totally random variable, to a statistical model, you introduce additional random variance that acts like a smokescreen to obscure any true differences among the variables. It is a well-kept secret that controlling for any set of variables is not a truly empirical enterprise. So, don't worry, most scholars won't catch you at it. Just throw in whatever variables you want, regardless of their theoretical merit or rationale. The more the better, as it will reduce the chance of finding differences between the groups on the important variables in question. You might try controlling for state of residence (there are 50 of them; Rosenfeld, 2010); it will sound so logical and yet water down almost anything you evaluate, especially if your overall sample size is small compared to the total number of variables used (Langbein & Yost, 2009)! This is especially defensible in the U.S. because geography is becoming closely associated with the moral and political values of the populace (Bishop, 2008; Silk & Walsh, 2008; Cahn & Carbone, 2010), so that controlling for state of residence can reduce the apparent effects of virtually any outcome of interest. With respect to tobacco use, one might find that use was higher in Southern states but that Southern states had higher rates of poverty. Thus, if you control for poverty and/or access to high quality health care at the state level, you might find little residual effect of tobacco use on overall health outcomes, thus “proving” the harmlessness of tobacco use.
Reality Check for the Wise:
A non-statistical way of thinking about the above points is to consider that you are making a soup: if you only have three ingredients, you could probably identify each one. If you add ten more, you would probably not be able to identify all of them. Likewise, even if your added ingredient was only water, if you added enough water, you might eventually not be able to taste any of the original ingredients. Statistically, you are adding random variance and eventually that random variance, by chance, matches the true variance of the original variable(s) and may appear to account for or cancel it. Adding a plethora of “control” variables may seem sophisticated methodologically, but unless the controls make theoretical sense, they can have the effect of making acceptance of the null hypothesis more likely simply due to random factors. Miethe (2007, pp. 24–26) discusses the role of proximal and distal factors as well as spurious associations and intervening (developmental) variables. Finkelstein and Levin (2001, pp. 292–296) discuss a variety of issues associated with statistical controls, including under-matching and over-matching, confounding, selection bias, and response bias, as well as interaction effects (pp. 433–436).
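The soup analogy has an exact statistical counterpart: in least-squares regression, R-squared can never decrease when a variable is added, however random it may be. A short simulation (all variable names and values invented for illustration) shows junk “controls” soaking up variance while consuming degrees of freedom:

```python
import numpy as np

# Illustrative simulation: a real effect of smoking on a health score,
# plus 20 purely random "control" variables.
rng = np.random.default_rng(0)
n = 30
smoking = rng.integers(0, 2, n).astype(float)
health = 2.0 * smoking + rng.normal(size=n)

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_alone = r_squared(smoking.reshape(-1, 1), health)
junk = rng.normal(size=(n, 20))  # pure noise "controls"
r2_kitchen_sink = r_squared(np.column_stack([smoking, junk]), health)

# R^2 can never decrease when columns are added, while residual degrees of
# freedom fall from n - 2 = 28 to n - 22 = 8.
print(r2_kitchen_sink >= r2_alone)  # True
```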
“Controlling” for multiple tests.—Another method is to use the Bonferroni procedure and divide the standard .05 level of significance by the number of tests conducted. Suppose you have conducted nine statistical tests and two of them are significant, p < .05. You could divide .05/9 = .006 and set .006 as the new required level for statistical significance, so that fewer of the tests remain significant by the new criterion (Fulcher, Chan, Raboy, & Patterson, 2002, p. 69). This works especially well if you don't mention the adjustment in your report: readers will look at the t values, expect them to be significant (p < .05), and be confused when you report no significant results.
Reality Check for the Wise:
Using Bonferroni procedures is appropriate when you are trying to avoid granting credence to significant results that are likely due to chance alone. If one uses a criterion of p < .05, one will find five apparently significant results, on average, if one conducts 100 statistical tests. If you conduct 20 statistical tests and you are worried that the one significant result you found was a chance result, then using the Bonferroni procedure would be appropriate. However, if you conduct 10 statistical tests and find three significant results, use of the Bonferroni procedure may be too conservative.
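As a sketch of the procedure itself (the p-values below are invented to match the nine-test scenario):

```python
# Nine tests, two nominally significant (all p-values invented for illustration)
p_values = [0.34, 0.02, 0.61, 0.04, 0.18, 0.55, 0.72, 0.09, 0.47]
alpha = 0.05

nominal = [p for p in p_values if p < alpha]                     # per-test criterion
bonferroni = [p for p in p_values if p < alpha / len(p_values)]  # .05/9 = .0056

print(len(nominal))     # 2 results pass the usual criterion
print(len(bonferroni))  # 0 survive the corrected criterion
```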
Integrating multiple methods.—For best results, of course, using all five types of “statistical controls” at the same time is highly recommended. It is very unlikely that anyone will catch all of these abuses of statistics. And, don't forget—it doesn't matter which variables provide you with the desired results, just keep trying different ones until one of them creates the outcome you desire! Keep in mind that you are under no obligation to tell anyone how hard you had to try to find something that worked! You only publish the one result you liked, after all, even if you had to try 200 of them to get that one that fit your preconceived ideas.
Reality Check for the Wise:
One should control statistically for pre-existing differences between the groups in question or for differences that would be expected theoretically. For example, if you found that non-smokers had higher education than smokers, you should control for education when comparing those two groups lest you confound outcomes of smoking with outcomes of higher (or lower) education. On the other hand, if you theoretically expected smokers to be more neurotic than non-smokers and you knew that neuroticism predicted the outcome variable you were using, you would want, for theoretical reasons, to account for neuroticism. In other words, the statistical decisions should be made for sound scientific or theoretical reasons, not just to find the solution you are looking for.
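The cost of this kind of unreported specification searching is easy to compute. Assuming independent tests at the .05 level, the chance of stumbling on at least one “significant” result grows rapidly with the number of attempts:

```python
# Probability of at least one "significant" result when fishing through k
# independent specifications, each tested at alpha = .05.
alpha = 0.05
chances = {k: 1 - (1 - alpha) ** k for k in (1, 10, 20, 200)}
for k, p in chances.items():
    print(k, round(p, 4))
# k=1: 0.05; k=10: 0.4013; k=20: 0.6415; k=200: ~1.0 (a near certainty)
```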
Approach I: Avoidance of Statistical Tests
The best way to support the null hypothesis is to use t tests. Other statistical tests, such as nonparametric tests (in some cases) or multivariate analyses of variance with repeated measures, although more powerful statistically, may carry a greater risk of rejecting the null hypothesis. Moreover, equivalence testing must be avoided at all costs as a method for evaluating your null hypothesis, since it offers perhaps the best chance for rejection of the null hypothesis. A related way to sound profound without doing any genuine statistical tests is to report that, say, 56% of tobacco users developed a problem, such as lung disease, within 17 years, compared to perhaps 40% of non-users over their whole lifetimes (Gartrell & Bos, 2010). All it takes is a little exaggeration of the 40% rate to “approximately half” and forgetting the denominator to make it seem that the outcomes for users and non-users are about the same, as if 56%/17 years were about the same as 40%/lifetime simply because the numerators of the fractions are somewhat similar. After all, 56% is “pretty much the same” as 50%.
Reality Check for the Wise:
Equivalence tests are now considered by some statisticians as the new gold standard for evaluating null hypotheses (Rogers, Howard, & Vessey, 1993; Jones, Jarvis, Lewis, & Ebbutt, 1996; Tryon, 2001; Wellek, 2003, pp. 101–106; Cleophas, Zwinderman, & Cleophas, 2006, pp. 59–65; Tryon & Lewis, 2008). Aside from equivalence testing, there are many other statistical approaches that may be better than using simple t tests (Schumm, 2010b; Schumm & Canfield, 2011). When comparing different outcome rates over time, any differences in timeframe must be accounted for, as well as any differences in outcomes.
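The timeframe sleight of hand in the 56%-versus-40% comparison can be undone by putting both figures on a common scale. The sketch below is purely illustrative: it assumes a constant hazard and an arbitrary 75-year “lifetime” (both assumptions are mine, not the article's) to convert each cumulative risk into an annual rate:

```python
# Convert cumulative risks over different timeframes to comparable annual
# rates, assuming a constant hazard and an assumed ~75-year lifetime for the
# "lifetime" figure (both assumptions are for illustration only).
def annual_rate(cumulative_risk, years):
    # Solve 1 - (1 - r)**years = cumulative_risk for the per-year risk r.
    return 1 - (1 - cumulative_risk) ** (1 / years)

users = annual_rate(0.56, 17)      # 56% within 17 years
non_users = annual_rate(0.40, 75)  # 40% over an assumed 75-year lifetime

print(round(users, 4))              # 0.0471 per year
print(round(non_users, 4))          # 0.0068 per year
print(round(users / non_users, 1))  # 6.9 -- hardly "pretty much the same"
```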
Approach J: Avoid Complex Models
Your null hypothesis might be rejected, or fail to be rejected, despite your best efforts. Either way, one must avoid the development of theoretical models that could be tested statistically. If the null hypothesis was not rejected, there might be mediating or moderating variables of interest. For example, there might be a moderating (interaction or contingency) variable that would show that for one group the null hypothesis was rejected while for another group it was not. Likewise, there might be a mediating variable that was significantly related to both the independent variable and the outcome variable, thus linking the independent variable indirectly to the outcome variable (Schumm, in press). Either way, such a model might provide evidence that there was a significant effect of the independent variable on the outcome, even if indirect or limited to one group. If the null hypothesis was rejected, it's probably wiser to try to dismiss that evidence, or reinterpret it as meaningless, than to explain it away with an actual theoretical model that could be tested statistically. Perhaps, although statistically significant, the outcome is not “clinically” significant (Matthey, 1998). For example, if smoking damages the lungs and damaged lungs are associated with later lung cancer, the claim that smoking was harmless would sound so illogical that your court case might not hold up in the eyes of a jury of ordinary citizens. Therefore, it would be wise to avoid any scientific theory or research that considers damage to the lungs as a valid or useful variable with respect to smoking and possible cancer-related outcomes. In general, it's even better to avoid theoretical models that include any mediating or moderating variables. As a last resort, you might have to try to show that a model you prefer is better than the model the other side prefers, but that is an arduous task, and best avoided.
Fortunately, there is an even better way to deceive a reader or jury in this case! Suppose that tobacco use does cause lung disease at three times the rate for non-users and that lung disease has a high mortality rate. That sounds pretty bad. Here is what you do. Divide your research subjects into four groups, split on use/non-use and on diseased/non-diseased. What will seem to occur is that those with lung disease—user or non-user—will have higher mortality rates compared to those without lung disease (Golombok, Perry, Burston, Murray, Mooney-Somers, Stevens, et al., 2003). This “proves” that tobacco use is not the problem; getting lung disease is the problem. So there is no harm in smoking tobacco, just in getting lung disease. Virtually any high-risk, adverse consequence of tobacco use—or of any other behavioral practice—can be obscured with such statistical and theoretical finesse! Should someone notice the higher rates of lung disease for tobacco users, argue that it is not the risky behavior of smoking that is the problem; it's a lack of health care or something else.
Reality Check for the Wise:
The process of science is not about testing hypotheses once and forgetting about them. The goal is to build valid theory based on multiple tests of reasonable hypotheses. For example, it is unlikely that smokers and non-smokers (or any other set of groups) would be exactly identical as groups on all but one factor, such as smoking. A good theory would explain how they were different prior to becoming smokers (or not) and exactly how smoking (or its antecedents) both benefitted and harmed the smoker or non-smoker and their other family members. A mediating variable explains how one variable influences another variable (e.g., smoking causes genetic damage that increases the risk of getting cancer later in life; genetic damage is the mediating variable). A moderating variable explains how the effect of one variable on another may differ across two or more groups (e.g., smoking causes more genetic damage for women than for men, so that women are more likely to be diagnosed with cancer later in life; gender is the moderating variable). Sometimes control variables can have subset effects as moderators; for example, one might try to control for religion by using affiliation (e.g., Christian vs non-Christian; Langbein & Yost, 2009). However, if some Christians are pro-tobacco and others are anti-tobacco, affiliation will likely not be a very helpful control variable. Religious commitment, as measured by weekly attendance, might be a more useful control variable. If a risky lifestyle or behavior causes problems that lead to other problems, then controlling for the intermediate problems does not negate the adverse consequences of the initial causal factor(s), even if the risky factors are more distal or indirect in their effects. Both mediating and moderating variables should be considered prior to smoking status and afterwards with respect to benefits and/or harms.
For example, at one time, second-hand smoke was considered harmless to children by many, though not all scholars (Cameron, 1972); today, after many years and better research, we know that it is harmful. Had social policy been established on the opinion of no harm, how many children would have been (or were) harmed? Finkelstein and Levin (2001, pp. 433–436) have discussed interactive models that involve moderating or contingency factors.
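The four-group trick described above reduces to simple arithmetic. In the illustration below (all parameters invented for the example), mortality is identical within each disease stratum regardless of smoking, yet because smoking triples the disease rate, smokers' overall mortality is nearly double; this is exactly the total effect that stratifying on the mediator hides:

```python
# All parameters invented for illustration: smoking triples the lung-disease
# rate, while mortality depends only on disease status within each stratum.
p_disease = {"smoker": 0.30, "non-smoker": 0.10}
p_death = {"diseased": 0.50, "healthy": 0.05}

mortality = {}
for group, p_d in p_disease.items():
    # Marginal (total) mortality, combining diseased and healthy pathways
    mortality[group] = p_d * p_death["diseased"] + (1 - p_d) * p_death["healthy"]

print(round(mortality["smoker"], 3))      # 0.185
print(round(mortality["non-smoker"], 3))  # 0.095
```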
Approach K: Dismiss Consideration of Personal Responsibility for Outcomes
This point differs from Approaches A to J in that it is related to courtroom consideration of empirical research and legal technique, not experimental design or use of statistics. However, when all of the above approaches have failed, this approach may yet be successful in courtroom situations, at least in the European Union and the USA. It is also important in applications of research to social policy (e.g., education, welfare, health).
One can dismiss any discussion of morality, responsibility, and personal ethics from legal disputes over risky behaviors by equating any such arguments with the imposition of specific, formal religious practice. The issue can be framed as one of personal, individual rights, the right to choose for oneself regardless of the consequences for self or society in the short or long term. A deft use of the doctrine of separation of church and state in these situations can effectively banish any consideration of personal responsibility from legal, and even from public, discourse. Point to times when formal religion has been twisted to promote racism, incivility, or other evils. Better yet, argue that liberty was meant for the individual's pleasure or choice more than the pursuit of physical, relational, emotional, or social health or cohesion. While some ancient religions (e.g., Druidism; Philips, 2010) have fostered an “individual rights” mindset (including unabashed sexuality and promiscuity), most of the more recent monotheistic religions have not. People who attend formal religious services will often hear statements supportive of individual responsibility and moral behavior. For example, Bunyan (1680, 1969) discussed consequences of good and bad moral behavior in The Life and Death of Mr. Badman. Likewise, the Talmud considered the question of casual sex and was opposed to it (Katz & Schwartz, 1997, pp. 209–211); it (Cohen, 1949) demanded the strictest standards of sexual morality (p. 97). The Talmud also warned against the evils of excessive alcohol consumption (p. 232). The Qur'an (Shakir, 1985) speaks of good deeds being rewarded and bad deeds being punished (6: 160, p. 93).
A cursory examination of the Book of Proverbs reveals two principles at work: one (A) in which a focus on short-term pleasure often leads to long-term adverse consequences (for self and others) and a second (B) in which making short-term sacrifices of one's pleasures, even “rights,” can lead to long-term very positive outcomes (for self and others). Recently, Baumeister and Tierney (2011) have reported, along the lines of Principle A, that “Religious people are less likely than others to develop unhealthy habits, like getting drunk, engaging in risky sex, taking illicit drugs, and smoking cigarettes” (p. 179). Religious people are, along the lines of Principle B, also “more likely to wear seat belts, visit a dentist, and take vitamins” (p. 179). They appear to develop better self-control (McCullough & Willoughby, 2009) because religion improves reason for inhibition and improves monitoring of behavior, as religious persons feel watched by God and by their religious peers (Baumeister & Tierney, 2011, p. 181).
Clearly, if thinking about personal responsibility can be characterized as “rigid, moral” rules, it can be eliminated from public discourse, and half the battle is won for public and legal approval of unrestricted tobacco use or anything else that might be claimed as an individual right, pleasure, or choice (and can be profitably marketed!). Russell (2003) indicated that one could beautifully illustrate “the resilience that characterizes the lives of most” [tobacco-using] youth by noting how one such youth's defense of his risky lifestyle was that “We have all the fun!” (p. 1253). Having “all the fun” with behaviors that pose longer-term risks to one's own health and others' health is indeed a beautiful illustration less of resilience than of Principle A.
Reality Check for the Wise:
While pleasure is a “feel-good” type of affect, other types of affect (satisfaction, happiness, joy) may provide the same positive “feel” while being associated with more positive long-term outcomes, e.g., psychological and physical well-being (Schumm, 1999). An attempt to falsely equate lifestyles based on selfish pursuit with formal moral systems is both counterproductive to the long-term welfare of a society and a way to generate inequality rather than equality. Thomas Jefferson, who was familiar with the potential evils of religion, wrote in Notes on Virginia (1782) that “… it does me no injury for my neighbor to say there are twenty gods or no God. It neither picks my pocket nor breaks my leg.” While his statement clearly supports freedom of religion, it implies that he would rather have neighbors (of any religion) who lived according to Principle B than those who lived according to Principle A. And today it appears that religious persons are far more likely to subscribe to Principle B, and presumably reject living by Principle A, than are those who describe themselves as non-religious (Luntz, 2009). For example, even Langbein and Yost (2009) found that the “percent Christian” (even though a poor measure of general religiosity and not inclusive of all religions) by states of the USA significantly predicted (p < .01) lower rates of divorce, abortion, and out-of-wedlock births.
Approach M: When You Do Want Significant Results
There is one condition under which you would want to report significant results. It would help the cause of tobacco users if you could show that states that had enacted restrictions on tobacco use were also characterized by tobacco users with poorer mental health and greater feelings of being stigmatized or persecuted. Specifically, you are looking for significant interaction effects between state regulations and tobacco use status, such that in states with stricter regulations regarding tobacco use you find greater problems among tobacco users. Of course, even if you don't find significant interactions, you can still report them as if they were meaningful (Hatzenbuehler, 2011, p. 899), providing detailed figures of the non-significant findings (p. 900). If you find significant results, avoid reporting effect sizes, especially if they are small; you might find it useful to compare cell means without testing for interaction effects (Riggle, Rostosky, & Horne, 2009). You might dare to report small effect sizes and hope that no one notices how small they were (Hatzenbuehler, O'Cleirigh, Grasso, Mayer, Safren, & Bradford, 2012). You also probably do not want to highlight a finding such as over 56% of tobacco users in your study having been diagnosed with a psychiatric disorder in the past 12 months compared to less than 35% of non-users (Hatzenbuehler, Keyes, & Hasin, 2009, p. 2278); nor would you want to highlight a finding that tobacco users in states with fewer restrictions were far more likely to have other drug disorders (p. 2279), which might suggest that state restrictions might be beneficial to tobacco users. An easy path to success is to ask tobacco users if they like restrictive state regulations and how those regulations make them feel—surely this will pick up a lot of negativity among users (Rostosky, Riggle, Horne, Denton, & Huellemeier, 2010), even if none of the results are statistically significant.
If you find that the interactions among variables are seldom significant or that tobacco users seldom report discrimination against themselves, downplay those findings (McLaughlin, Hatzenbuehler, & Keyes, 2010). Likewise, if tobacco users recover quickly from the enactment of new tobacco use restrictions, downplay that finding (Rostosky, Riggle, Horne, & Miller, 2009). If you are testing multiple times for interaction effects, do not use Bonferroni procedures because instead of half (8/16) of your tests being significant (.02 < p < .05), you might find none of them so (Hatzenbuehler, et al., 2009, p. 2280; Hatzenbuehler, Keyes, & McLaughlin, 2011, p. 1077). Furthermore, do not discuss the actual pattern of the significant interactions—you might find that tobacco users had fewer problems when they lost their jobs in states with fewer restrictions, a difference that might account for the statistical significance of an interaction term as much as a finding that those who lost jobs in restrictive states had more problems (p. 1077). Also avoid considering the possibility that the mental health of non-tobacco users might improve in states with more restrictions (Hatzenbuehler, McLaughlin, Keyes, & Hasin, 2010); if there are substantially more non-users than users, the overall mental health of the population might improve, even if it worsened for the users, an indication that the restrictions were improving the quality of life for the population as a whole.
Reality Check for the Wise:
Interaction effects are difficult to assess (Jaccard, Turrisi, & Wan, 1990). Non-significant interaction effects should not be interpreted as if they were significant. Significant interaction effects must conform to the hypothesized theory; otherwise, their significance may only provide support for a different pattern of interaction. If multiple interaction effects are tested and only a few are significant, caution should be observed, because some, if not all, of those significant effects might be a product of chance alone.
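Effect sizes for two proportions can be computed directly. The sketch below applies Cohen's h (an arcsine-based effect size for proportions) to the 56% versus 35% diagnosis rates noted above; it uses only the reported figures and is not an analysis from the original studies:

```python
import math

# Cohen's h: an effect size for comparing two proportions via the
# arcsine (variance-stabilizing) transformation.
def cohens_h(p1, p2):
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Applied to the reported 56% vs 35% twelve-month diagnosis rates
h = cohens_h(0.56, 0.35)
print(round(h, 2))  # 0.42 -- between Cohen's small (0.2) and medium (0.5)
```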
Approach N: Emergency Actions
Sadly, in some situations, it may not be possible to avoid evidence that supports a rejection of the null hypothesis. Here it is best to “reframe” the results in a way that supports your case. If your group appears at a disadvantage from the research, blame discrimination or minority status as the cause, so the “bad” results are only a reflection of “bad” things done to them by others. If that's not enough, blame “internalization” of societal discrimination, a gambit that is especially helpful if many research subjects don't believe they have ever been discriminated against. Argue that they are so discriminated against that they are not even aware of it – and that's precisely the most effective and dangerous form of discrimination. If you really get in a pinch, argue that the research itself doesn't matter very much anyway because it doesn't have substantive, clinical, or practical weight or because a civil rights or legal rights issue is at stake. Don't forget to discredit, intimidate, or ridicule any pesky scholar who disagrees with you, remembering Saul Alinsky's precepts (1969, 1972; Kupelian, 2010, pp. 240–241).
If something is published in a law journal (Schumm, 2005) you don't like, argue that most social scientists don't rely upon law journals; if it's published in a social science journal, argue that it is a “lower-tier” or an online journal (Schumm, 2010e). Accuse them of “cooking the data,” “data fudging,” or “data fraud” even if you can't prove it (Miethe, 2007, p. 27); this is especially useful if you have been “cooking the numbers” yourself per the above instructions. Do your best to cover up any biases in social science that might hurt your case (Schumm, 2008, 2010d). Research is your servant, not your master. If it supports your case, use it; if not, discredit or reinterpret it to support your case. Even if scientific “consensus” is incorrect because of methodological flaws, argue that consensus is essentially fact, even though some courts might reject this argument under the Daubert tests (Faigman, et al., 2002, p. 27). Dismiss those who disagree with you as “unscientific” (Hooker, 1996, p. 917) or “discredited” regardless of the facts. Accuse them of bias against smokers, of hating innocent people, of being bigots—eventually, the public and the courts may believe your accusations even if they are false. These guidelines should be helpful to all aspiring professionals who view science, as has at times been the case with religion and law, as means to an end, regardless of the consequences for others.
Reality Check for the Wise:
The Devilish Obfuscator may be outwitted here if keen empiricists expose the weaknesses of his arguments and provide evidence in turn, regardless of whether the issue is using tobacco or some other controversial matter. But where will scholars be found who, as expert witnesses, are “scientists first and expert witnesses (and advocates) second” (Faigman, et al., 2002, p. 53)?
The concluding point is that judges, attorneys, and other professionals need to understand statistics, research methods, and experimental designs well enough to avoid the types of pitfalls “recommended” in this tongue-in-cheek exercise. The Daubert tests allow the exclusion of evidence with a weak scientific foundation, even if it has some “consensus” behind it (Faigman, et al., p. 27). Most peer review takes place after publication (Faigman, et al., p. 37), so courts must weigh the scientific validity of evidence over popularity or mere “body counts” of experts on one side or the other. Expert testimony by itself, without a chance for expert rebuttal, is of little value—both sides of the case should be allowed to discuss the other side's expert opinions. It is hoped that this discussion will encourage further education in those topics both during and after formal professional training. There are numerous resources on the law and statistics for those interested in more in-depth study (e.g., DeGroot, Feinberg, & Kadane, 1986; Finkelstein & Levin, 1990, 2001; Finkelstein, 2009).
