MTS 525-0
 Special Topics Research Seminar

Section 20: Generalizing about Message Effects
 Spring 2020



TOPIC 5:  Interpreting effect size magnitude and variability


5.1 Effect size magnitudes

            5.1.1  Abstract characterizations of effect size magnitude

            5.1.2  Observed average effect sizes

            5.1.3  The null as a range: Equivalence testing and second-generation p-values

5.2  Effect size variability

            5.2.1  Heterogeneity indices (I², Q, Birge’s R, etc.)

            5.2.2  Prediction intervals

5.3  The “replication crisis” revisited




5.1  Effect size magnitudes


5.1.1  Abstract characterizations of effect size magnitude


Funder, D. C., & Ozer, D. J. (2019). Evaluating effect size in psychological research: Sense and nonsense. Advances in Methods and Practices in Psychological Science, 2, 156-168. doi:10.1177/2515245919847202


For further reading: 

            Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

            Abelson, R. P. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97, 129-133. doi:10.1037/0033-2909.97.1.129

            Prentice, D. A., & Miller, D. T. (1992). When small effects are impressive. Psychological Bulletin, 112, 160-164. doi:10.1037/0033-2909.112.1.160

            Pogrow, S. (2019). How effect size (practical significance) misleads clinical practice: The case for switching to practical benefit to assess applied research findings. The American Statistician, 73(S1), 223-234.  doi:10.1080/00031305.2018.1549101 

            Correll, J.,  Mellinger, C., McClelland, G. H., & Judd, C. M. (2020). Avoid Cohen’s ‘small’, ‘medium’, and ‘large’ for power analysis. Trends in Cognitive Sciences, 24(3), 200-207.  




5.1.2  Observed average effect sizes


Rains, S. A., Levine, T. R., & Weber, R. (2018). Sixty years of quantitative communication research summarized: Lessons from 149 meta-analyses. Annals of the International Communication Association, 42, 105-124. doi:10.1080/23808985.2018.1446350 


Schäfer, T., & Schwarz, M. A. (2019). The meaningfulness of effect sizes in psychological research: Differences between sub-disciplines and the impact of potential biases. Frontiers in Psychology, 10, article 813.  doi:10.3389/fpsyg.2019.00813 


For further reading: 

            Haase, R. F., Waechter, D. M., & Solomon, G. S. (1982). How significant is a significant difference? Average effect size of research in counseling psychology. Journal of Counseling Psychology, 29, 58-65.

            Cooper, H., & Findley, M. (1982). Expected effect sizes: Estimates for statistical power analysis in social psychology. Personality and Social Psychology Bulletin, 8, 168-173. doi:10.1177/014616728281026

            Hemphill, J. F. (2003). Interpreting the magnitudes of correlation coefficients. American Psychologist, 58(1), 78-79. doi:10.1037/0003-066X.58.1.78   

            Richard, F. D., Bond, C. F., Jr., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7(4), 331-363. doi:10.1037/1089-2680.7.4.331 

            Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical benchmarks for interpreting effect sizes in research. Child Development Perspectives, 2(3), 172-177.  doi:10.1111/j.1750-8606.2008.00061.x

            Ferguson, C. J. (2009). Is psychological research really as good as medical research? Effect size comparisons between psychology and medicine. Review of General Psychology, 13, 130-136. doi:10.1037/a0015103

            Chen, H., Cohen, P., & Chen, S. (2010). How big is a big odds ratio? Interpreting the magnitudes of odds ratios in epidemiological studies. Communications in Statistics: Simulation and Computation, 39(4), 860-864.  doi:10.1080/03610911003650383 

            Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100(2), 431–449. doi:10.1037/a0038047

            Leucht, S., Helfer, B., Gartlehner, G., & Davis, J. M. (2015). How effective are common medications: A perspective based on meta-analyses of major drugs. BMC Medicine, 13, 253. doi:10.1186/s12916-015-0494-1

            Gignac, G. E., & Szodorai, E. T. (2016). Effect size guidelines for individual differences researchers. Personality and Individual Differences, 102, 74–78. doi:10.1016/j.paid.2016.06.069 

            Paterson, T. A., Harms, P. D., Steel, P., & Credé, M. (2016). An assessment of the magnitude of effect sizes: Evidence from 30 years of meta-analysis in management. Journal of Leadership & Organizational Studies, 23(1), 66-81. doi:10.1177/1548051815614321 

            Lovakov, A., & Agadullina, E. (2017). Empirically derived guidelines for interpreting effect size in social psychology. PsyArXiv manuscript. doi:10.17605/OSF.IO/2EPC4

            Brydges, C. R. (2019). Effect size guidelines, sample size calculations, and statistical power in gerontology. Innovation in Aging, 3(4), igz036. doi:10.1093/geroni/igz036  





5.1.3  The null as a range: Equivalence testing and second-generation p-values


Weber, R., & Popova, L. (2012). Testing equivalence in communication research: Theory and application. Communication Methods and Measures, 6, 190-213. doi:10.1080/19312458.2012.703834 


Blume, J. D., Greevy, R. A., Welty, V. F., Smith, J. R., & Dupont, W. D. (2019). An introduction to second-generation p-values. The American Statistician, 73(S1), 157-167.  doi:10.1080/00031305.2018.1537893


For further reading: 

            Wellek, S. (2010). Testing statistical hypotheses of equivalence and noninferiority (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.

            Goertzen, J. R., & Cribbie, R. A. (2010). Detecting a lack of association: An equivalence testing approach. British Journal of Mathematical and Statistical Psychology, 63, 527–537. doi:10.1348/000711009X475853

            Rainey, C. (2014). Arguing for a negligible effect. American Journal of Political Science, 58, 1083-1091. doi:10.1111/ajps.12102

            Lash, T. L., & Kaufman, J. S. (2015). Seeking persuasively null results. Epidemiology, 26, 449-450. doi:10.1097/EDE.0000000000000318

            Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8, 355-362.  doi:10.1177/1948550617697177

            Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1, 259-269. doi:10.1177/2515245918770963 





5.2  Effect size variability


5.2.1  Heterogeneity indices (I², Q, Birge’s R, etc.)


Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21, 1539-1558. doi:10.1002/sim.1186


For further reading:

            Birge, R. T. (1932). The calculation of errors by the method of least squares. Physical Review, 40 (2nd ser.), 207-227.

            Hall, J. A., & Rosenthal, R. (1991). Testing for moderator variables in meta-analysis: Issues and methods. Communication Monographs, 58, 437-448. doi:10.1080/03637759109376240

            Sánchez-Meca, J., & Marín-Martínez, F. (1997). Homogeneity tests in meta-analysis: A Monte Carlo comparison of statistical power and Type I error. Quality and Quantity, 31, 385-399.

            Engels, E. A., Schmid, C. H., Terrin, N., Olkin, I., & Lau, J. (2000). Heterogeneity and statistical significance in meta-analysis: An empirical study of 125 meta-analyses. Statistics in Medicine, 19, 1707-1728.

            Higgins, J., Thompson, S., Deeks, J., & Altman, D. (2002). Statistical heterogeneity in systematic reviews of clinical trials: A critical appraisal of guidelines and practice. Journal of Health Services Research and Policy, 7, 51-61. doi:10.1258/1355819021927674

            Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. BMJ, 327, 557-560. doi:10.1136/bmj.327.7414.557

            Huedo-Medina, T. B., Sánchez-Meca, J., Marín-Martínez, F., & Botella, J. (2006). Assessing heterogeneity in meta-analysis: Q statistic or I² index? Psychological Methods, 11, 193-206. doi:10.1037/1082-989X.11.2.193

            Rücker, G., Schwarzer, G., Carpenter, J. R., & Schumacher, M. (2008). Undue reliance on I² in assessing heterogeneity may mislead. BMC Medical Research Methodology, 8, 79. doi:10.1186/1471-2288-8-79

            Ioannidis, J. P. A. (2008). Interpretation of tests of heterogeneity and bias in meta-analysis. Journal of Evaluation in Clinical Practice, 14, 951-957. doi:10.1111/j.1365-2753.2008.00986.x

            Pereira, T. V., Patsopoulos, N. A., Salanti, G., & Ioannidis, J. P. A. (2010). Critical interpretation of Cochran's Q test depends on power and prior assumptions about heterogeneity. Research Synthesis Methods, 1, 149–161. doi:10.1002/jrsm.13

            Card, N. A. (2012). Section 8.4: Evaluating heterogeneity among effect sizes. In Applied meta-analysis for social science research (pp. 184-191). New York: Guilford.

            Langan, D., Higgins, J. P. T., & Simmonds, M. (2015). An empirical comparison of heterogeneity variance estimators in 12 894 meta-analyses. Research Synthesis Methods, 6, 195–205. doi:10.1002/jrsm.1140

            Wiernik, B. M., Kostal, J. W., Wilmot, M. P., Dilchert, S., & Ones, D. S. (2017). Empirical benchmarks for interpreting effect size variability in meta-analysis. Industrial and Organizational Psychology, 10(3), 472–479. 




5.2.2  Prediction intervals


Borenstein, M., Higgins, J. P. T., Hedges, L. V., & Rothstein, H. R. (2017). Basics of meta-analysis: I² is not an absolute measure of heterogeneity. Research Synthesis Methods, 8, 5-18. doi:10.1002/jrsm.1230


IntHout, J., Ioannidis, J. P. A., Rovers, M. M., & Goeman, J. J. (2016). Plea for routinely presenting prediction intervals in meta-analysis. BMJ Open, 6, e010247. doi:10.1136/bmjopen-2015-010247 


For further reading:

            Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Chapter 17: Prediction intervals. In Introduction to meta-analysis (pp. 127-133). Chichester, West Sussex, UK: Wiley.

            Spence, J. R., & Stanley, D. J. (2016). Prediction interval: What to expect when you’re expecting … a replication. PLoS ONE, 11, e0162874. doi:10.1371/journal.pone.0162874 

            Partlett, C., & Riley, R. D. (2017). Random effects meta‐analysis: Coverage performance of 95% confidence and prediction intervals following REML estimation. Statistics in Medicine, 36, 301-317. doi:10.1002/sim.7140 

            Borenstein, M. (2018). Chapter 9.3: Prediction intervals. In Common mistakes in meta-analysis and how to avoid them (pp. 85-93). Englewood, NJ: Biostat.

            Nagashima, K., Noma, H., & Furukawa, T. A. (2019). Prediction intervals for random-effects meta-analysis: a confidence distribution approach. Statistical Methods in Medical Research, 28, 1689–1702.  doi:10.1177/0962280218773520 





5.3  The “replication crisis” revisited


Patil, P., Peng, R. D., & Leek, J. T. (2016). What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspectives on Psychological Science, 11, 539-544. doi:10.1177/1745691616646366 


De Boeck, P., & Jeon, M. (2018). Perceived crisis and reforms: Issues, explanations, and remedies. Psychological Bulletin, 144, 757-777.  doi:10.1037/bul0000154


For further reading:

            Hedges, L. V. (1987). How hard is hard science, how soft is soft science? The empirical cumulativeness of research. American Psychologist, 42(5), 443–455.  

            O’Keefe, D. J. (1999). Variability of persuasive message effects: Meta-analytic evidence and implications. Document Design, 1, 87-97. doi:10.1075/dd.1.2.02oke 

            Kaptein, M., & Eckles, D. (2012). Heterogeneity in the effects of online persuasion. Journal of Interactive Marketing, 26, 176-188. doi:10.1016/j.intmar.2012.02.002

            Bahník, Š., & Vranka, M. A. (2017). If it’s difficult to pronounce, it might not be risky: The effect of fluency on judgment of risk does not generalize to new stimuli. Psychological Science, 28(4), 427–436.  

            Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication. The American Statistician, 73(S1), 262-270.  doi:10.1080/00031305.2018.1543137 

            Kenny, D. A., & Judd, C. M. (2019). The unappreciated heterogeneity of effect sizes: Implications for power, precision, planning of research, and replication. Psychological Methods, 24(5), 578-589.  doi:10.1037/met0000209 

            Vivalt, E. (in press). How much can we generalize from impact evaluations? Journal of the European Economic Association. Available at: