Help, Student Tested Average on ALL Standardized Tests but is Still Struggling

Recently, there has been an uptick of posts and comments from parents and professionals on social media with the following message: “The student tested average on all standardized tests but still struggles with XYZ deficits characterized by (insert language-based difficulties here).” So today, I wanted to delve a little deeper into what it means for school-aged children to test in the average range on tests of intelligence, language, education, and reading. But before we begin, let’s take a look at some of the commonly administered tests in these respective areas.

Comprehensive Language Tests:

Clinical Evaluation of Language Fundamentals-Fifth Edition (CELF-5)
Comprehensive Assessment of Spoken Language – Second Edition (CASL-2)
Oral and Written Language Scales Second Edition (OWLS-II)
Receptive, Expressive & Social Communication Assessment–Elementary (RESCA-E)
Test of Language Development-Intermediate: 5 (TOLD-I:5)
Test of Language Development-Primary: 5 (TOLD-P:5)
Test of Integrated Language and Literacy (TILLS)

Tests of Cognition:

Wechsler Intelligence Scale for Children – Fifth Edition (WISC-V)
Wechsler Preschool and Primary Scale of Intelligence – Fourth Edition (WPPSI-IV)
Stanford-Binet Intelligence Scales – Fifth Edition (SB-5)
Differential Ability Scales – Second Edition (DAS-II)

Educational Tests:

Woodcock-Johnson IV Tests of Achievement (WJ IV-ACH)
Woodcock-Johnson IV Tests of Oral Language (WJ IV-OL)
Kaufman Test of Educational Achievement Third Edition (KTEA-3)
Wechsler Individual Achievement Test Fourth Edition (WIAT-4)

Tests of Reading:

Feifer Assessment of Reading (FAR)
Phonological Awareness Test-2: Normative Update (PAT-2: NU)
Comprehensive Test of Phonological Processing-2 (CTOPP-2)
Rapid Automatized Naming and Rapid Alternating Stimulus Test (RAN/RAS)
Test of Silent Word Reading Fluency (TOSWRF-2)
Test of Silent Contextual Reading Fluency (TOSCRF-2)
Gray Oral Reading Tests – Fifth Edition (GORT-5)
Test of Reading Comprehension – Fourth Edition (TORC-4)

The above list of tests is by no means exhaustive; it merely reflects the more popular tests used in clinical practice for diagnostic purposes. Now that we have identified some of the commonly used tests, it is important to outline what makes an assessment a solid testing instrument for disorder identification purposes. For this purpose, knowing the discriminant accuracy of a particular test is vitally important.

Discriminant accuracy refers to the sensitivity and specificity of assessment instruments (Dollaghan, 2007). Sensitivity is the degree to which the assessment accurately identifies those students who truly have a language/reading disorder as having a disorder. Specificity is the degree to which the assessment accurately identifies those students who truly do not have any disorders as typical. Together, sensitivity and specificity determine the test’s degree of discriminant accuracy, or its ability to distinguish the presence of a disorder from its absence. In 1994, Vance and Plante established criteria for discriminant accuracy, or accurate identification of a disorder. Accuracy of 90% or higher is considered good and 80% to 89% is considered fair, while below 80% “misidentifications occur at unacceptably high rates” and lead to “serious social consequences” for misidentified children (p. 21). Discriminant accuracy constitutes the most important information about the assessment. If the test has low sensitivity and specificity, or if that information is missing from the test manual, OTHER psychometric properties simply do not matter!
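For readers who like to see the arithmetic, here is a minimal sketch of how sensitivity and specificity are computed from a validation study and then rated against the Vance and Plante bands quoted above. The counts are made up for illustration and do not come from any test manual, and treating "90% or higher" as the good band is my reading of the criteria.

```python
from dataclasses import dataclass


@dataclass
class ValidationCounts:
    true_pos: int   # disordered children the test flagged as disordered
    false_neg: int  # disordered children the test scored as typical
    true_neg: int   # typical children the test scored as typical
    false_pos: int  # typical children the test flagged as disordered


def sensitivity(c: ValidationCounts) -> float:
    """Proportion of truly disordered children the test identifies."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ValidationCounts) -> float:
    """Proportion of truly typical children the test identifies as typical."""
    return c.true_neg / (c.true_neg + c.false_pos)


def vance_plante_rating(value: float) -> str:
    """Rate a proportion using the bands quoted in the text (assumed cut points)."""
    if value >= 0.90:
        return "good"
    if value >= 0.80:
        return "fair"
    return "unacceptable"


# Hypothetical study: 50 children with a disorder, 50 typical children.
counts = ValidationCounts(true_pos=37, false_neg=13, true_neg=46, false_pos=4)
sens = sensitivity(counts)
spec = specificity(counts)
print(f"sensitivity={sens:.2f} ({vance_plante_rating(sens)})")
print(f"specificity={spec:.2f} ({vance_plante_rating(spec)})")
```

In this invented study the test misses 13 of 50 disordered children, so even with strong specificity it would not meet the criteria for diagnostic use.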

Furthermore, if an assessment is missing information pertaining to discriminant accuracy because it was never determined during the standardization process, then that assessment cannot be used for diagnostic purposes. That is because the administration of such tests cannot determine the presence or absence of a language or reading disorder. That means that even those students who attained average performance scores on such tests can be very significantly impaired, as reflected in daily social and academic functioning and low/struggling school performance.

Discussing discriminant accuracy means understanding cut scores. These are numerical boundaries between what is considered typical and disordered. Deriving a cut score requires the mean and standard deviation of both clinical and non-clinical samples, and an estimate of the score at which subjects have a greater probability of belonging to a clinical sample rather than a non-clinical sample. Cut scores are test specific and vary not just from test to test but also by age. The problem is that cut scores are often applied based on arbitrary guidelines to determine the presence or absence of a disorder, without reference to how the students actually scored on specific tests (Spaulding, Plante, & Farinella, 2006).
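To make the idea of a cut score concrete, here is a rough sketch and not the procedure any particular test publisher uses. It assumes both groups are normally distributed, that a child is equally likely to come from either group before testing, and that the clinical mean sits below the typical mean. The group means and standard deviations are hypothetical. The function searches between the two means for the score at which a child becomes more likely to belong to the clinical sample than to the non-clinical one.

```python
from statistics import NormalDist


def equal_likelihood_cut_score(clinical: NormalDist, typical: NormalDist,
                               iterations: int = 60) -> float:
    """Bisect between the group means for the score where both densities match.

    Assumes clinical.mean < typical.mean and that the clinical density is
    higher at the clinical mean and lower at the typical mean.
    """
    lo, hi = clinical.mean, typical.mean
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if clinical.pdf(mid) > typical.pdf(mid):
            lo = mid  # still more likely clinical here, so the boundary is higher
        else:
            hi = mid  # more likely typical here, so the boundary is lower
    return (lo + hi) / 2


# Hypothetical samples on a standard-score scale (not from any manual).
clinical_sample = NormalDist(mu=75, sigma=12)
typical_sample = NormalDist(mu=100, sigma=15)

cut = equal_likelihood_cut_score(clinical_sample, typical_sample)
z = (cut - typical_sample.mean) / typical_sample.stdev
print(f"cut score = {cut:.1f} ({z:.2f} SD relative to the typical sample)")
```

Change the hypothetical means or standard deviations and the cut score moves with them, which is exactly why a single arbitrary cutoff cannot be applied across tests and ages.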

So let’s take a look at the discriminant accuracy of some of the commonly used comprehensive language tests to see how they compare. Let’s start with the CELF-5. According to the manual, its normative sample of 3000 children included 23% of children with language-related disabilities. However, disordered students should not be included in the standardization norms because doing so lowers the mean, increases the standard deviation, and shifts the cut scores, which makes it less likely that impaired students will be identified (it “normalizes the disorder”). The overlap between disordered and typical becomes too great, and it’s much harder to reliably identify those with an impairment. According to the CELF-5 authors, “Based on CELF-5 sensitivity and specificity values, the optimal cut score to achieve the best balance is -1.33 (standard score of 80). Using a standard score of 80 as a cut score yields sensitivity and specificity values of .97.” However, the CELF-5 sensitivity group was made up of only 67 children between the ages of 5;0 and 15;11 who scored more than 1.5 standard deviations below the mean on any standardized language test. All of these students could have had severe disabilities, making it very easy for the CELF-5 to identify this group as having language disorders with extremely high accuracy. Because the information on the breakdown of abilities of those included in the sensitivity group is missing from the manual, the validity of a .97 sensitivity measure is highly questionable.
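Here is a small worked example of why mixing disordered children into the norms matters. It is only a sketch: the 23% proportion comes from the paragraph above, but the group means and standard deviations are hypothetical values on an underlying ability scale, not figures from the CELF-5 manual. The function pools two groups and returns the mean and standard deviation of the combined sample.

```python
import math


def pooled_norms(p_disordered: float,
                 disordered_mean: float, disordered_sd: float,
                 typical_mean: float, typical_sd: float) -> tuple[float, float]:
    """Mean and SD of a norming sample that mixes two groups."""
    p, q = p_disordered, 1 - p_disordered
    mean = p * disordered_mean + q * typical_mean
    variance = (p * (disordered_sd ** 2 + (disordered_mean - mean) ** 2)
                + q * (typical_sd ** 2 + (typical_mean - mean) ** 2))
    return mean, math.sqrt(variance)


# Hypothetical groups: typical children at 100 (SD 15), disordered at 75 (SD 12).
mixed_mean, mixed_sd = pooled_norms(0.23, 75, 12, 100, 15)

# Where a -1.33 SD cutoff lands under each set of norms.
cut_typical_only = 100 - 1.33 * 15
cut_mixed = mixed_mean - 1.33 * mixed_sd

print(f"mixed norms: mean={mixed_mean:.1f}, SD={mixed_sd:.1f}")
print(f"-1.33 SD cutoff: typical-only={cut_typical_only:.1f}, mixed={cut_mixed:.1f}")
```

Under these assumptions the pooled mean drops to roughly 94, the standard deviation grows to roughly 18, and the same -1.33 SD cutoff lands at roughly 71 instead of 80 on the underlying scale, so fewer impaired children fall below it.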

How about the CASL-2? Well, it has a sensitivity of 74% at -1 SD, making it unacceptable as per the guidelines set forth by Vance & Plante (1994). Both the OWLS-II and the RESCA-E are actually missing discriminant accuracy information from their respective manuals. Chapter 6 of the OWLS-II manual states specifically: “Sensitivity and specificity studies were not conducted for the OWLS-II. Individuals with diagnosed disabilities were included in the standardization sample as long as they spent most of their school day in a regular classroom.” Similarly, the authors of the RESCA-E (which included 292 children with a variety of language-impacting disabilities) did not conduct sensitivity and specificity studies. In fact, the authors of this test specifically cautioned clinicians that this test cannot be used for diagnostic purposes of disorder identification.

What about both versions of the TOLD tests? With respect to the TOLD-P:5, cut scores of 85 or 90 are adequate for composites, but below a cut score of 85, caution needs to be exercised depending on the composite. Very importantly, on the TOLD-P:5, diagnostic categories appear to be arbitrary in nature, as language disorders were artificially separated by different monikers, all of which signified the same disorder (e.g., Developmental Language Disorder, Language Impairment, etc.). Furthermore, the specificity and sensitivity appear to be too unreliable to diagnose most, if not all, of these conditions as per the numbers provided in Table 6.15 of the manual. The TOLD-I:5 has the same issues as the primary version, only these are detailed in Table 6.14 of its manual.

Now let’s move on to the TILLS, which presently has the strongest psychometric properties of all the comprehensive language tests on the market. “The TILLS does not include children and adolescents with language/literacy impairments (LLIs) in the norming sample. Since the 1990s, nearly all language assessments have included children with LLIs in the norming sample. Doing so lowers overall scores, making it more difficult to use the assessment to identify students with LLIs” (Westby, 2016, p. 11). The sensitivity levels by age on the TILLS range from the 80s to the 90s, all within acceptable ranges as per Vance & Plante. Similarly, the identification scores based on cut scores for all three age ranges of the TILLS are in the acceptable ranges as well.

But is that enough? If a student receives solidly average composite scores on the TILLS, does it mean that the presence of a language and literacy disorder can be ruled out? The answer is a resounding NO, and here’s why.

All standardized language tests possess limitations, no exceptions! These tests assess discrete skills and not necessarily skills integration. Skills integration refers to the student’s ability to meaningfully use language to tell stories, comprehend grade-level text by answering verbal reasoning questions and summarizing text, as well as write grade-level personal, fictional, persuasive, and expository compositions. Even when tests attempt to assess skills integration, they generally tend to do so in a shallow manner. To illustrate, the Phonological Awareness subtest of the TILLS assesses only initial sound deletion from nonwords, rather than reading-relevant skills such as sound blending, segmentation, and substitution (Kilpatrick, 2012). The Reading Fluency subtest of the TILLS assesses the student’s ability to read phonologically simple texts in an untimed manner. The Reading Comprehension subtest of the TILLS allows the student to provide only 3 types of responses, “Yes”, “No”, and “Maybe”, instead of open-ended answers to verbal reasoning questions. Now that does not mean that the test has poor utility, not at all! The TILLS is a fine instrument, but it requires functional assessment supplementation relevant to the student’s reported deficit areas.

Now, let’s briefly jump to the tests of intellectual functioning. Here’s the deal: research indicates that IQ and learning disabilities are independent of each other. A student can have an average or superior IQ and be very significantly learning disabled. So high performance on IQ tests absolutely does not rule out language, social, or academic deficits.

What about the commonly used educational tests listed above? Bad news! These tests were developed to rank children within the range of the general population. There’s absolutely no mention of sensitivity and specificity in their respective technical manuals. This means that average performance on those tests absolutely does not guarantee average academic functioning or an absence of literacy deficits in the areas of reading, spelling, and writing.

How about the commonly used reading tests listed above? They have to have discriminant accuracy values, right? True, 5 out of 8 actually do. So let’s see how they “measure up” with respect to the Vance and Plante criteria. Let’s start with the FAR. Unfortunately, its sensitivity of 68% falls far below acceptable identification criteria. Given that it is also based on the unsubstantiated “dyslexia subtypes” and actually purports to help determine an individual’s specific subtype of reading impairment, it is definitely not without very significant limitations. Similarly, both the TOSWRF-2 and the TOSCRF-2 have unacceptable sensitivity of less than 80%, so average performance on these two tests absolutely does not rule out the presence of significant reading challenges. The TORC-4 sensitivity is provided in the context of criterion comparisons with other existing tests, but the problem is that the tests it is compared with (e.g., WJ-III, TOWL-4) also have unacceptable diagnostic accuracy. Finally, the GORT-5 has acceptable sensitivity of 82% at a cut score of 90, but its reading comprehension questions are highly factual and can be easily guessed by students who have a decent amount of background knowledge. As such, this test is far better at establishing basic rather than grade-level reading abilities.

So now that we have reviewed the information relevant to the limitations of common standardized tests, here are some practical suggestions on how professionals can confirm that the student’s language abilities are truly average. In order to make this claim with 100% certainty, the following supplementary assessments need to be performed:

Clinical Assessment of Pragmatics (CAPS) – Because children with language and literacy deficits display overt or subtle language deficits which affect their ability to retell stories, comprehend the subtleties of grade-level text, and write competent essays on a variety of subjects.
Narrative/Discourse Assessment – Because poor discourse and narrative abilities place children at risk for learning and literacy-related difficulties, including reading problems (McCabe & Rosenthal-Rollins, 1994). Narrative weaknesses significantly correlate with social communication deficits (Norbury, Gemmell & Paul, 2014), which in turn cause poor reading comprehension and written composition abilities even in the presence of relatively intact non-word reading and reading fluency skills.
Grade Level Reading Assessment – Because a basic reading assessment is not enough, and reading needs to be clinically assessed on a deep rather than a shallow level.
Grade Level Writing Assessment – Because solid writing abilities allow students to write competent essays, book reports, and subject-specific projects and prepare them for college entrance. As with deficits in reading, writing deficits will result in adverse academic outcomes. Once again, missing deficits in these areas will result in poor intervention gains, since the students will not effectively master their therapy goals. This will in turn result not just in poor post-secondary (college) outcomes but also in poor vocational (job market) outcomes.

So what is the best way to perform these assessments? Below are some useful links on this subject along with a few clinical observations.

How to Clinically Assess Narrative and Discourse Abilities?

How to Clinically Assess Pragmatic Abilities?

How to Clinically Assess Reading Abilities?

The most effective way is to ask a student to read a one-page, grade-level expository text and then ask them to do the following:

Identify the passage’s main idea
Summarize the passage
Answer abstract verbal reasoning questions
Define literate vocabulary words

How to Clinically Assess Writing Abilities?

For this purpose, I highly recommend the use of persuasive writing prompts for upper elementary and middle school students and the use of expository prompts for high-school-aged students.

So if your student is performing in the average range on standardized assessments but still continues to struggle in school, please don’t automatically assume that their language and literacy abilities are in the average range even if a plethora of standardized assessments have purportedly found them to have average abilities. Dig deeper! Perform quality clinical assessments with a focus on skills integration and make sure the findings are analyzed appropriately by cross-referencing the student’s performance with grade/age level expectations. You would be surprised what kind of covert deficits good quality clinical assessments can unearth. Always remember, at the end of the day, all roads lead to language!
