This system analyzes speech prosody using unique suprasegmental features and automatically scores language proficiency with machine learning classifiers.
About
Second-language learners often take proficiency examinations administered by human evaluators. In these examinations, the evaluator assesses the learner’s speech and, in some systems, scores it against criteria such as fluency to determine the learner’s proficiency. An example of a conventional assessment is the Test of English as a Foreign Language (TOEFL), administered by Educational Testing Service (ETS) of Princeton, New Jersey.
Detecting prosody in human speech goes beyond conventional automatic speech recognition (ASR), which is the conversion of spoken words into text. Some conventional ASR systems are trained by having an individual speaker read sections of text into the system. These systems analyze the person’s specific voice and use it to fine-tune recognition of that person’s speech, resulting in more accurate transcription. ASR applications include voice user interfaces such as voice dialing (e.g., “Call Betty”), call routing (e.g., “I would like to make a collect call”), domestic appliance control (e.g., “Turn the TV on”), search (e.g., “Find a song where particular words were sung”), simple data entry (e.g., entering a social security number), preparation of structured documents (e.g., medical transcription), speech-to-text processing (e.g., word processors or emails), and aircraft control (usually termed Direct Voice Input). Conventional ASR technology recognizes speech by considering the most likely sequence of phones, phonemes, syllables, and words, as constrained by a particular language’s grammar and syntax.
Listeners use the prosody of a speaker’s speech, along with segmental features, to assess language proficiency. Incorrect use of prosody is what causes a non-native speaker who knows the correct grammar and vocabulary to still be perceived by a native speaker as having an accent.
A speaker’s prosody may be assessed in two ways: 1) text-dependent models and 2) text-independent models. Conventional text-dependent models use specifically prompted words, phrases, or sentences for assessment. Text-independent models, such as the one in this system, use unstructured monologues from the speaker. Where a system can accurately use the prosody of speech to improve recognition of the words spoken, the models it creates and uses can also accurately assess a non-native speaker and provide feedback, for example through computer-aided language learning (CALL).
Figure 1 compares the processes used in conventional systems and in the present system. The ASR portion of the present system focuses on generating a set of phones from the speaker’s utterance. The system then calculates various fluency-based and intonation-based suprasegmental measures from that set of phones. Finally, a machine learning classifier, a form of artificial intelligence, assesses the phones and the suprasegmental measures and calculates a speaking proficiency score on a 1-4 scale. Because it focuses on the prosody of the speech rather than only on the words themselves, this system has the potential to assess the speaker’s actual language proficiency more accurately.
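The middle step of this pipeline, deriving suprasegmental measures from a timed phone sequence, can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the `Phone` record, the `sil` silence label, the 0.25-second pause threshold, and the particular measures chosen are all assumptions made for illustration, and only fluency-style measures are shown because intonation measures would also require pitch data.

```python
from dataclasses import dataclass
from typing import Dict, List

SILENCE = "sil"       # hypothetical label for a silent interval in the ASR output
MIN_PAUSE_S = 0.25    # assumed minimum silence duration that counts as a pause


@dataclass
class Phone:
    """One recognized phone with start and end times in seconds (assumed format)."""
    label: str
    start: float
    end: float


def fluency_measures(phones: List[Phone]) -> Dict[str, float]:
    """Compute a few illustrative fluency measures from a timed phone sequence."""
    if not phones:
        raise ValueError("phone sequence is empty")

    total = phones[-1].end - phones[0].start
    speech = [p for p in phones if p.label != SILENCE]
    phonation = sum(p.end - p.start for p in speech)
    # Only silences at or above the threshold are treated as pauses.
    pauses = [
        p.end - p.start
        for p in phones
        if p.label == SILENCE and (p.end - p.start) >= MIN_PAUSE_S
    ]

    return {
        # Phones per second over the whole utterance, pauses included.
        "phones_per_second": len(speech) / total if total > 0 else 0.0,
        # Phones per second of actual speaking time, silences excluded.
        "articulation_rate": len(speech) / phonation if phonation > 0 else 0.0,
        # Share of the utterance spent speaking rather than silent.
        "phonation_time_ratio": phonation / total if total > 0 else 0.0,
        "pause_count": float(len(pauses)),
        "mean_pause_s": sum(pauses) / len(pauses) if pauses else 0.0,
    }


if __name__ == "__main__":
    # Toy utterance: two phones, a half-second pause, then two more phones.
    example = [
        Phone("dh", 0.0, 0.1),
        Phone("ax", 0.1, 0.2),
        Phone(SILENCE, 0.2, 0.7),
        Phone("k", 0.7, 0.8),
        Phone("ae", 0.8, 1.0),
    ]
    for name, value in fluency_measures(example).items():
        print(f"{name}: {value:.3f}")
```

In a full pipeline, a feature dictionary like this one would be passed, together with intonation measures, to a trained classifier that outputs the 1-4 proficiency score. The classifier is omitted here because the source does not specify which one is used.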
Key Benefits
Specific advantages of this system include:
• Utilizes advancements in artificial intelligence and computing technology
• ASR recognizes 60 phones more accurately than thousands of words
• Suprasegmental measures have been found to account for 50% of the variance in oral proficiency ratings
• Machine learning performs better than multiple regression analysis
• More consistent and equitable scoring than speech assessment by human scorers
Applications
The primary application of this technology is language proficiency testing. However, other applications of automated speech analysis, such as speaker identification and forensic phonetics, may be envisioned for both the defense and commercial sectors.