Final Paper Number 414.081
Background: Although publishing the transcription protocol and annotator reliability is recommended for data generated from a speech corpus, most studies do not report that information. Re-transcribing and directly comparing transcripts is costly and labor-intensive. As an alternative to word-by-word re-transcription, Mean Length of Utterance in Morphemes (MLUM) can be easily calculated from transcripts and used as a reliability measure for intra- and inter-transcriber comparisons.
Objectives: Develop a computational method to determine transcriber consistency across a large speech corpus, beginning with an initial assessment of intra-annotator reliability.
Methods: Participants were recruited for an fMRI study at OHSU. Inclusion criteria were age 7 to 17 years, IQ ≥ 70, fluent/phrase speech, and being a native English speaker. The main data used in this study come from the Autism Diagnostic Observation Schedule (ADOS-2) Module 3, administered at baseline. All administrations were recorded and transcribed according to transcription guidelines based on the conventions of the Systematic Analysis of Language Transcripts (SALT) software. Transcripts for three tasks were selected: Emotions, Social Difficulties and Annoyance, and Friends and Marriage Conversations. Each activity for each child was split into two parts by selecting even and odd lines, and MLUM was calculated for each part. Intra-annotator agreement was evaluated using the intraclass correlation coefficient (ICC) from a two-way mixed-effects model, single measurement, absolute agreement.
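The split-half procedure described in the Methods can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, the morpheme count is a simplification (one morpheme per whitespace-separated word plus one per "/" bound-morpheme marker, with no handling of mazes or unintelligible segments), and the ICC uses the standard ANOVA-based formula for the single-measurement, absolute-agreement case.

```python
import numpy as np


def count_morphemes(utterance):
    """Count morphemes in one utterance (simplified, assumed convention):
    each whitespace-separated word is one morpheme, plus one for every
    '/' marking a bound morpheme (e.g. 'walk/ed' counts as 2)."""
    return sum(1 + word.count("/") for word in utterance.split())


def mlum(utterances):
    """Mean length of utterance in morphemes; NaN when there are no utterances."""
    if not utterances:
        return float("nan")
    return sum(count_morphemes(u) for u in utterances) / len(utterances)


def split_half_mlum(utterances):
    """Return (MLUM of odd-numbered lines, MLUM of even-numbered lines),
    where lines are numbered from 1."""
    odd_lines = utterances[0::2]   # lines 1, 3, 5, ...
    even_lines = utterances[1::2]  # lines 2, 4, 6, ...
    return mlum(odd_lines), mlum(even_lines)


def icc_absolute_single(scores):
    """Single-measurement, absolute-agreement ICC for a two-way model.

    `scores` is an (n_subjects, k_measurements) array; here k = 2, the
    odd-line and even-line MLUM for each child on one task.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()  # between halves
    ss_error = ((x - grand_mean) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


if __name__ == "__main__":
    # Toy transcripts, invented purely for illustration (not study data).
    toy_transcripts = [
        ["I walk/ed home", "dog/s bark", "yes", "he is run/ing fast"],
        ["we play/ed", "it was fun", "I like/ed it", "no"],
        ["my friend/s come over", "we talk", "she laugh/s", "okay then"],
    ]
    halves = [split_half_mlum(t) for t in toy_transcripts]
    print("Split-half MLUMs:", halves)
    print("ICC:", icc_absolute_single(halves))
```

In practice, each row of the score matrix would hold one child's odd-line and even-line MLUM for a single task, and the ICC would be computed separately for each task. The point estimate from this formula is the same under the two-way mixed and two-way random models; confidence intervals are not computed in this sketch.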
Results: The sample comprised 57 children with ASD (mean age: 11.3 years; 78.9% male; mean IQ: 101.7) and 60 controls without ASD (mean age: 11.4 years; 56.7% male; mean IQ: 111.9). Across tasks and groups, mean MLUMs ranged from 5.6 to 6.7 and the mean number of utterances ranged from 43.2 to 72.8. The ICC between the even and odd MLUM was 0.714 for Emotions (95% CI: 0.613 < ICC < 0.792), 0.624 for Annoyance (95% CI: 0.5 < ICC < 0.723), and 0.708 for Friends (95% CI: 0.604 < ICC < 0.788), indicative of moderate to good levels of reliability (see Figure). Paired t-tests between the MLUM halves were all non-significant, consistent with good within-task intra-rater agreement between the two split-half MLUM estimates. When examined by ASD status, ICC did not differ significantly between the two groups, although reliability was higher in the ASD group than in controls on all three tasks. Using age and IQ median splits (11.3 years and 110, respectively), we further established that age had no discernible effect on ICC across tasks, and that there was a trend toward lower reliability on two tasks among subjects with higher IQ. However, all ICCs across age and IQ groups remained in the moderate range (> 0.50).
Conclusions: Calculating MLUM for two split halves within each activity, while not ideal, provides an efficient and valid measure of intra-annotator reliability. Comparing this measure across an entire transcription team offers a substitute when a full inter-annotator reliability assessment is not feasible. We emphasize that all research conducted on speech corpora should report associated reliability data.