Errors in the Statistical Analysis of Gueguen, N. (2013). Effects of a tattoo on men's behaviour and attitudes towards women: An experimental field study. Archives of Sexual Behavior, 42, 1517-1524.

C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University
cschwarz@stat.sfu.ca

December 27, 2013

Contents
1 Introduction
2 Experiment 1
3 Experiment 2
4 Summary
Abstract

Gueguen (2013) conducted a study to investigate the impact of a tattoo on men's behavior and attitudes towards women. The key flaw in the analyses in this paper is that the author failed to distinguish between the experimental unit (the woman) and the observation unit (the period on the beach or the man questioned), i.e. the author fell prey to the problems of pseudo-replication (Hurlbert, 1984). Fortunately, the results from the two experiments are striking enough that even a poor analysis generally leads to the correct conclusions about the impact of the tattoo on the perceived attractiveness of women. However, the incorrect analyses should be corrected so that future experimenters do not make the same errors.

1 Introduction

Gueguen (2013) [1] did an interesting experiment on the influence of a tattoo on men's behavior and attitudes towards women. This article attracted much media attention, including an article in the Economist [2].

The experimental protocol is presented in the paper. Briefly, 11 women lay on a beach either with or without a fake tattoo on their back. A confederate was nearby and (a) recorded the number of approaches to the woman and the time to the first approach (Experiment 1) and (b) polled nearby men on three questions about the woman (Experiment 2):

- The probability of having a date with the woman if such an opportunity arose, on a 9 point scale from 1 = no probability to 9 = high probability;
- The probability that the woman would agree to have sex on the first date, on the same 9 point scale;
- The physical attractiveness of the woman, on a 9 point scale from 1 = not at all physically attractive to 9 = very physically attractive.

2 Experiment 1

Each of the 11 women participated in 10 sessions with and 10 sessions without a tattoo, for a total of 110 observations under each condition. The raw data were extracted from the paper and are presented in Table 1.
[1] Gueguen, N. (2013). Effects of a tattoo on men's behaviour and attitudes towards women: An experimental field study. Archives of Sexual Behavior, 42, 1517-1524. http://dx.doi.org/10.1007/s10508-013-0104-2
[2] blah blah blah

© 2013 Carl James Schwarz

The author concluded:

"A Chi square goodness-of-fit test was used to analyze our data regarding the frequencies of men's contact in the two conditions (with or without a tattoo). A significant difference between the frequencies of male approaches was found, $\chi^2$(1, N = 37) = 6.08, p = .004, revealing that significantly more men approached the confederates when they exhibited a tattoo. The difference between the two tattooing conditions in the time elapsed before the first man's contact was examined with a Student-Fisher test for unpaired distributions. A significant difference was found, t(35) = 3.01, p = .005, d = 1.02, revealing that men approached the tattooed confederates more promptly."

Table 1: Raw data for Experiment 1 extracted from the paper. A total of 110 observation periods were observed for each condition.

                                      Tattoo   No tattoo
    Number of contacts                    26          11
    Mean time to first contact (min)   23.61       34.78
    SD time to first contact (min)      8.26       14.19

The $\chi^2$ test can be conducted using R:

    # Goodness-of-fit test of equal numbers of contacts in the two conditions
    visits <- c(26, 11)
    chisq.test(visits)

giving

        Chi-squared test for given probabilities

    data:  visits
    X-squared = 6.0811, df = 1, p-value = 0.01366

We obtain the same $\chi^2$ test-statistic value, but a different p-value. I also tried an exact binomial test but was likewise unable to obtain the p-value reported above.

The comparison of means was done using a two-sample t-test (assuming equal variance) rather than the preferred Welch t-test [3].

[3] Ruxton, G.D. (2006). The unequal variance t-test is an underused alternative to Student's t-test and the Mann-Whitney U test. Behavioral Ecology, 17, 688-690. http://dx.doi.org/10.1093/beheco/ark016

    library(BSDA)
    # Welch (unequal variance) t-test from the summary statistics
    tsum.test(mean.x=23.61, s.x=8.26, n.x=26,
              mean.y=34.78, s.y=14.19, n.y=11)
    # Standard (equal variance) t-test, as used in the paper
    tsum.test(mean.x=23.61, s.x=8.26, n.x=26,
              mean.y=34.78, s.y=14.19, n.y=11, var.equal=TRUE)

giving

        Welch Modified Two-Sample t-test

    data:  Summarized x and y
    t = -2.4416, df = 12.966, p-value = 0.02972
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -21.055992  -1.284008
    sample estimates:
    mean of x mean of y
        23.61     34.78

        Standard Two-Sample t-test

    data:  Summarized x and y
    t = -3.0126, df = 35, p-value = 0.004789
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -18.697152  -3.642848
    sample estimates:
    mean of x mean of y
        23.61     34.78

While we were able to reproduce the t-test results in the paper, there is a subtle error in the analysis that will become more apparent in the discussion of the second experiment. The author has fallen prey to pseudoreplication [4] by confusing the experimental unit (the 11 women) with the observational unit (the multiple times on the beach).

[4] Hurlbert, S. H. (1984). Pseudoreplication and the design of ecological field experiments. Ecological Monographs, 54, 187-211. http://dx.doi.org/10.2307/1942661

Rather than using a simple chi-square test, each woman should serve as a block. Each woman's data can be arranged in a 2 × 2 table to compare the number of approaches with and without a tattoo, and the results combined over the women using a Cochran-Mantel-Haenszel test. Because of the sparseness of the data, women with no approaches would be discarded, and an exact test would likely be needed.
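The blocked analysis described above might be sketched in R as follows. The per-woman counts are not given in the paper, so the counts below are invented purely for illustration; they were chosen only so that they sum to the reported totals of 26 and 11. The sketch also assumes that each of the 10 sessions per woman and condition is scored as "approached" or "not approached". The object names (with_tattoo, without_tattoo, tab) are hypothetical.

```r
# HYPOTHETICAL per-woman counts of sessions (out of 10) with an approach.
# Only the totals (26 with a tattoo, 11 without) come from the paper.
with_tattoo    <- c(3, 2, 4, 1, 3, 2, 3, 2, 2, 1, 3)
without_tattoo <- c(1, 1, 2, 0, 1, 1, 2, 1, 1, 0, 1)
sessions <- 10   # sessions per woman under each condition

# Drop women with no approaches under either condition.
keep <- (with_tattoo + without_tattoo) > 0

# One column per woman: tattoo yes/no approached, then no-tattoo yes/no.
counts <- rbind(with_tattoo, sessions - with_tattoo,
                without_tattoo, sessions - without_tattoo)[, keep, drop = FALSE]

# 2 x 2 x K array with one stratum (block) per woman.
tab <- array(counts, dim = c(2, 2, sum(keep)),
             dimnames = list(approached = c("yes", "no"),
                             condition  = c("tattoo", "no tattoo"),
                             woman      = paste0("W", which(keep))))

# Exact conditional Cochran-Mantel-Haenszel test, stratified by woman.
cmh <- mantelhaen.test(tab, exact = TRUE)
cmh$p.value
cmh$estimate   # estimated common odds ratio
```

With the real per-woman counts substituted for the invented ones, the same call would give the stratified test; the numerical result from the invented counts has no bearing on the study.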
For the analysis of the time to approach, the mean time to approach for each woman-tattoo combination should be computed, and then these means compared using a paired t-test (see next section). Because of the likely imbalance of the design, an analysis that has multiple components of variance will be needed.

3 Experiment 2

In the second experiment, the 11 women again lay on the beach with and without a tattoo. The confederate approached 20 men for each woman-tattoo combination and asked the three questions given in the appendix. The author stated:

"To test the possible interaction effect between confederates and tattoo conditions, A 11 (confederate) × 2 (experimental condition) ANOVA with confederates as the between factor and experimental conditions as the within factor was performed for each dependent variables. We found no interaction effect, with probability for a date, F(10, 209) = 1.12, $\eta_p^2$ = .05, probability for sex, F(10, 209) < 1, $\eta_p^2$ = .01, and for physical attractiveness, F(10, 209) < 1, $\eta_p^2$ = .04, so the data were collapsed across confederates. Table 2 shows the mean of the three dependent variables. Differences between the two tattoo conditions were examined with a Student-Fisher independent test. Regarding the participants' estimate of having a date with the confederate, a significant difference was found, t(438) = 8.36, p < .001, d = 0.80, revealing that participants thought they were more likely to have a date with the tattooed confederates. Regarding the participants' estimate of having sex on the first date, a significant difference was found, t(438) = 14.35, p < .001, d = 1.37, revealing again that participants thought that the probability they would have sex with the confederates would be higher with the tattooed confederates. Regarding the physical attractiveness rating, despite the apparent difference between the two groups, no statistical difference was found, t(438) = 1.47, d = 0.14, revealing that the level of physical attractiveness attributed to the confederate was not influenced by the tattoo condition."

All of the above analyses are inappropriate because the author has confused the experimental unit (the 11 women) with the observational unit (the 20 men for each woman-tattoo combination). The experimental factor is the presence/absence of the tattoo, and this experimental factor is applied to the 11 women of the study. The 20 men who were asked for their opinion of the woman-tattoo combination are pseudo-replicates, as they are all measuring the same woman-tattoo combination. The measurements of the 20 men on the same woman-tattoo combination are NOT independent, which violates the assumption required for the above ANOVA and t-tests.

As an analogy, suppose that the 20 men were all asked to measure the woman's height with a ruler. We certainly would not treat the 20 measurements as independent. The only way in which the above analysis would be appropriate is if each man saw a different woman-tattoo combination.

The proper way to analyze these data is to average the pseudo-replicate measurements by the 20 men to get a single number for each woman-tattoo combination. These 11 pairs of measurements can now be
analyzed using a simple paired t-test, and the resulting test statistics will have 10 = 11 - 1 degrees of freedom, representing the 11 experimental units.

This approach would seem to throw away information, as the analysis would look identical regardless of whether 20 or 200 or 2000 men were asked their opinion about each woman-tattoo combination. In fact, no information is lost. The response of each man to a woman-tattoo combination has two components of variation. First, not all women with a tattoo would appear to be identical, and so there will be woman-to-woman variation in the response. Second, not all men would give identical scores to a particular woman-tattoo combination, so there is man-to-man variation in the response. So

$$\mathrm{Var}(Y_{ijk}) = \sigma_w^2 + \sigma_m^2$$

where $Y_{ijk}$ is the score of man $k$ when viewing woman $i$ with tattoo condition $j$; $\sigma_w^2$ is the woman-to-woman variance component; and $\sigma_m^2$ is the man-to-man variance component. The repeated measurements of the same woman-tattoo combination by multiple men provide information only on the man-to-man variance component.

The variance of the average response over the men is then

$$\mathrm{Var}(\bar{Y}_{ij}) = \sigma_w^2 + \frac{\sigma_m^2}{n_{men}}$$

Consequently, when the average over the men is taken, the variability of the average declines as the number of men in the average increases, and so the information is not lost.

The author of this paper was kind enough to provide summary statistics on the mean response over the 20 subjects for each variable. The results of the paired t-test are:

    Variable                Difference    N     Mean   Std Error   t Value   DF   Pr > |t|
    Attractiveness          y - n        11   0.2818      0.1778      1.59   10     0.1440
    Probability of a date   y - n        11   1.3409      0.1741      7.70   10     <.0001
    Probability of sex      y - n        11   1.7545      0.0630     27.83   10     <.0001

Fortunately, the effects are large enough that the inappropriate analysis leads to the same conclusions.
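The contrast between the two analyses can be illustrated with a small simulation in R. This is only a sketch: the raw scores from the study are not available here, so the data are simulated, and the variance components (sigma_w, sigma_m), the baseline score of 5, and the tattoo effect are arbitrary assumed values. Only the design (11 women, 2 conditions, 20 men per combination) is taken from the paper.

```r
# SIMULATED data; all numerical settings below are assumptions for illustration.
set.seed(2013)

n_women <- 11    # experimental units (from the study design)
n_men   <- 20    # men polled per woman-tattoo combination (from the design)
sigma_w <- 0.5   # assumed SD among woman-tattoo combinations
sigma_m <- 1.5   # assumed SD among men rating the same combination
effect  <- 1.0   # assumed true tattoo effect

# One row per man; tattoo is coded 0 = no tattoo, 1 = tattoo.
sim <- expand.grid(man = seq_len(n_men), tattoo = 0:1,
                   woman = seq_len(n_women))

# Random effect shared by all men who rate the same woman-tattoo combination.
combo <- matrix(rnorm(n_women * 2, sd = sigma_w), nrow = n_women, ncol = 2)

sim$score <- 5 + effect * sim$tattoo +
  combo[cbind(sim$woman, sim$tattoo + 1)] +
  rnorm(nrow(sim), sd = sigma_m)

# Naive analysis: treats all 440 men as independent (pseudo-replication).
naive <- t.test(score ~ tattoo, data = sim, var.equal = TRUE)

# Averaging approach: one mean per woman-tattoo combination,
# then a paired t-test on the 11 pairs of means.
avg    <- with(sim, tapply(score, list(woman, tattoo), mean))
paired <- t.test(avg[, "1"], avg[, "0"], paired = TRUE)

naive$parameter    # degrees of freedom of the naive test
paired$parameter   # degrees of freedom of the paired test
```

The naive test has 438 degrees of freedom, as in the t(438) statistics quoted from the paper, whereas the paired test on the averages has 10, one fewer than the number of women.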
Similarly, the author's correlational analysis is also not appropriate because of the lack of independence among the multiple measurements on the same woman-tattoo combination.

4 Summary

The key flaw in the analysis of both experiments is that the author failed to distinguish between the experimental unit (the woman) and the observation unit (the period on the beach or the man questioned). In the first experiment, the consequences are not likely to be severe because of the very sparse data collected.
Fortunately, the results from the second experiment are striking enough that even a poor analysis generally leads to the correct conclusions about the impact of the tattoo on the perceived attractiveness of women. However, the incorrect analyses should be corrected so that future experimenters do not make the same errors.

In the interests of Reproducible Research, it is important that both the raw data and the computer code used to analyze the data be available to readers.