Universal screening for social-emotional and behavioral risk
I am especially interested in measurement because those who assess individuals need to be wary of trusting instruments simply because they are available. Izumi & Eklund (2023) published “Universal Screening for Social-Emotional and Behavioral Risk: Differential item functioning on the SAEBRS” in School Psychology. Here’s the edited abstract and impact statement:
Universal screening for social–emotional and behavioral (SEB) risk is one strategy for schools to proactively identify students in need of additional supports and services. As schools serve an increasing number of children from racially and culturally diverse backgrounds, further research is needed to examine the differential functioning of brief behavior rating scales. The present study examined differential item functioning (DIF) on the Social, Academic, and Emotional Behavior Risk Screener (SAEBRS)–Teacher Rating Scale. Participants included 11,496 kindergarten through 12th-grade students. DIF analyses were conducted by race/ethnicity, grade level, and biological sex. Results indicated small-to-large effects of DIF for teacher ratings of Black students compared to their non-Black peers on each item, resulting in a moderate effect at the test level (Total Behavior [TB] expected test score standardized difference [ETSSD] = −0.67). There was a small-to-moderate effect of DIF for teacher ratings of White students compared to their non-White peers at the test level (TB ETSSD = 0.43). There was a small-to-moderate effect of DIF by biological sex, with teachers differentially rating males as higher risk (TB ETSSD = −0.47). There were no significant effects at the test level for differences in ratings by grade level. Future research is needed to identify the factors influencing the interaction between the rater, the student, and the rating scale that could lead to differential functioning.

This study suggests teachers rated students’ social-emotional and behavioral skills differently based on student race/ethnicity and biological sex. Teachers differentially rated Black and male students as demonstrating higher risk compared to White and female students.
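A note on the effect-size metric: the expected test score standardized difference (ETSSD) comes from Meade’s (2010) taxonomy of DIF effect sizes. As a rough sketch of the idea (not the authors’ actual pipeline), assume graded-response-model item parameters have been calibrated separately for the two groups on a common metric; the ETSSD then compares the test scores the focal group would be expected to earn under each set of parameters. Everything below, including the item parameters, scale size, and function names, is invented purely for illustration:

```python
import numpy as np

def grm_expected_item_score(theta, a, b):
    """Expected score on a graded-response item: discrimination a,
    ordered boundary locations b. E[X | theta] = sum_j P(X >= j)."""
    b = np.asarray(b)
    p_ge = 1.0 / (1.0 + np.exp(-a * (np.asarray(theta)[:, None] - b[None, :])))
    return p_ge.sum(axis=1)

def expected_test_score(theta, items):
    """Sum of expected item scores; items is a list of (a, b) pairs."""
    return sum(grm_expected_item_score(theta, a, b) for a, b in items)

def etssd(theta_focal, items_focal, items_reference):
    """Expected test score standardized difference: mean signed gap in
    the focal group's expected test scores, in pooled-SD units."""
    ets_f = expected_test_score(theta_focal, items_focal)
    ets_r = expected_test_score(theta_focal, items_reference)
    pooled_sd = np.sqrt((ets_f.var(ddof=1) + ets_r.var(ddof=1)) / 2.0)
    return (ets_f - ets_r).mean() / pooled_sd

# Toy example: three 4-point items; under the focal calibration a student
# needs a higher trait level to earn the same ratings (uniform DIF).
rng = np.random.default_rng(0)
theta = rng.normal(size=5000)
reference = [(1.5, [-1.0, 0.0, 1.0])] * 3
focal = [(1.5, [-0.6, 0.4, 1.4])] * 3
print(f"ETSSD = {etssd(theta, focal, reference):.2f}")  # negative, as in the study
```

A negative value, like the −0.67 reported for Black students, means the focal group is expected to score lower than the reference parameters would predict at the same underlying trait level, which on a screener where lower scores signal greater risk translates into differentially higher rated risk.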
This is a large sample with what seems to me to be a thorough analysis. Differential item functioning (DIF) analysis is generally defined as an analytic method for identifying potentially biased items in assessments. The Mailman School of Public Health at Columbia University also distinguishes benign from adverse DIF as follows:
An important distinction is between “benign” and “adverse” DIF (Breslau et al., 2008). Benign DIF occurs when the groups differ in their probabilities of endorsing an item because the item taps a dimension of the underlying trait or attribute measured in the scale that manifests differently between the groups. Adverse DIF occurs when groups differ in their probabilities of endorsing an item because of artifactual elements in the measurement process, such as different understandings of a word or phrase used in the item. Benign DIF is not a form of measurement error, whereas adverse DIF is. That is, benign DIF reflects real group differences in the manifestation of the underlying trait or attribute whereas adverse DIF reflects biases in the measurement process.
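To make concrete what a DIF analysis actually tests, here is a minimal sketch of one standard approach, the logistic-regression method of Swaminathan and Rogers (1990): regress an item response on the matching total score, group membership, and their interaction, and compare nested models. This is not the IRT-based method behind the study’s ETSSD results; the data are simulated and every variable name and parameter value is invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: one dichotomized screener item, a matching total score,
# and a group indicator with built-in uniform DIF against the focal group.
rng = np.random.default_rng(1)
n = 2000
total = rng.normal(50.0, 10.0, n)          # matching variable (total score)
group = rng.integers(0, 2, n)              # 0 = reference, 1 = focal
logit = -0.08 * (total - 50.0) + 0.6 * group
item = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
df = pd.DataFrame({"item": item, "total": total, "group": group})

# Nested models: matching only, + group (uniform DIF), + interaction (nonuniform DIF).
m0 = smf.logit("item ~ total", df).fit(disp=0)
m1 = smf.logit("item ~ total + group", df).fit(disp=0)
m2 = smf.logit("item ~ total * group", df).fit(disp=0)

# Likelihood-ratio tests between nested models flag each kind of DIF.
print(f"uniform DIF:    LR chi2(1) = {2 * (m1.llf - m0.llf):.1f}")
print(f"nonuniform DIF: LR chi2(1) = {2 * (m2.llf - m1.llf):.1f}")
```

Note what the test does and does not deliver: a significant group effect tells you the item functions differently across groups at the same total score, but not whether that difference is benign or adverse in the sense above.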
This is important work in highlighting the fact that ‘quick and dirty’ measures are often especially dirty. This is the kind of work that makes me worry about several possibilities. The most obvious is teacher expectancy: teachers expect boys and Black students to be “trouble” in the classroom and, shockingly, their students fulfill the script they’re given. A second possibility is that girls and White students who are at risk may be neglected in intervention efforts. In addition, the distinction between benign and adverse DIF matters here: the DIF results by themselves cannot tell us whether these gaps reflect real group differences in how risk manifests or bias in the rating process. I will be looking for more research on this topic.