Developing and validating automated scoring for an audio constructed response simulation
2023
We evaluated the effectiveness of machine learning (ML) and natural language processing (NLP) for automatically scoring a simulation requiring audio-based constructed responses. We administered the simulation to 3,174 recent professional-level hires at a large multinational technology company. Human subject matter experts (SMEs) scored each response using behaviorally anchored rating scales of interpersonal and decision-making skills; we then used these scores to train Bidirectional Encoder Representations from Transformers (BERT) NLP models to generate computer scores of the same skills. Results demonstrate evidence of convergent validity between human and computer scores (correlations ranging from .66 to .74), criterion-related validity against supervisor ratings of incumbent job performance (uncorrected correlations ranging from .08 to .21; corrected correlations ranging from .17 to .25), and incremental validity (2.8% additional variance explained) beyond the organization's existing assessments. Computer scores showed subgroup differences similar to human scores and exhibited no predictive bias.
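To illustrate the modeling approach described above, the sketch below fine-tunes a BERT model as a regressor that maps a transcribed audio response to a continuous SME rating. It is a minimal sketch, not the paper's actual pipeline: it assumes responses have already been transcribed to text, and the checkpoint (`bert-base-uncased`), column contents, and hyperparameters are illustrative assumptions, not the study's configuration.

```python
# Minimal sketch: fine-tune BERT to predict a continuous human (SME) rating
# from a transcribed response. Assumes `torch` and `transformers`; all data,
# checkpoint, and hyperparameter choices are illustrative, not the paper's.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertForSequenceClassification

class ResponseDataset(Dataset):
    """Pairs each transcribed response with its SME rating."""
    def __init__(self, texts, ratings, tokenizer, max_len=256):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(ratings, dtype=torch.float)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# num_labels=1 with float labels puts the model in regression mode (MSE loss).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=1)

# Hypothetical training data: one transcript paired with one SME score.
texts = ["I would first ask the customer to describe the issue in detail..."]
ratings = [3.5]
loader = DataLoader(ResponseDataset(texts, ratings, tokenizer),
                    batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # MSE against the human ratings
        loss.backward()
        optimizer.step()
```

At inference, the model's single output logit serves as the computer score; convergent validity of the kind reported above could then be checked by correlating those scores with held-out human SME ratings.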