Reliability & Validity

After the test draft has been completed and reviewed, the National Spanish Examination Review Committee rates the content appropriateness of each test through a systematic review. Because this review relies heavily on human judgment, the teachers chosen for the committee are experts in standards-based curriculum, instruction, and assessment.

Post Assessment Analysis

After the tests have been administered, the AATSP Exams office analyzes the question details reports (QDR), which show how a specific sampling of students performed on each item. A validity panel of trained professionals who have taught across all levels confirms through the QDR that all test items have face and content validity.

Analyzing answer results is imperative for any test to remain valid and reliable. AATSP Exams analyzes test items each year as the tests are being created. Questions are changed, moved, or removed immediately if any of the following occurs:

  • Distractor effectiveness: a distractor is chosen by fewer than 5% of students (if the question is otherwise good, only that distractor is changed)
  • Item difficulty: fewer than 50% of students answered correctly (the question is often moved up to the next level)
  • Item difficulty: more than 95% of students answered correctly (the question is often moved down to a lower level)
  • The question is outdated or otherwise has an issue (it mentions watching a movie on a VCR, for example)
  • Discrimination index: the value is lower than 0.20
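The numeric criteria above can be sketched as a small screening function. This is only an illustration, not the procedure AATSP Exams actually runs: the function name `flag_item`, its parameters, and the reason labels are hypothetical, and the discrimination index is assumed to have been computed beforehand, since the text does not say which method is used. The "outdated question" criterion needs human review and is not covered.

```python
def flag_item(option_counts, correct_option, discrimination):
    """Return the review reasons that apply to one multiple-choice item.

    option_counts:  dict mapping option label -> number of students choosing it
    correct_option: label of the keyed (correct) answer
    discrimination: precomputed discrimination index (method is an assumption)
    """
    total = sum(option_counts.values())
    if total == 0:
        raise ValueError("no responses recorded for this item")

    reasons = []

    # Item difficulty: share of students who chose the keyed answer.
    difficulty = option_counts[correct_option] / total
    if difficulty < 0.50:
        reasons.append("too hard")
    elif difficulty > 0.95:
        reasons.append("too easy")

    # Distractor effectiveness: any wrong option chosen by fewer than 5%.
    for option, count in option_counts.items():
        if option != correct_option and count / total < 0.05:
            reasons.append(f"weak distractor {option}")

    # Discrimination index below the 0.20 threshold.
    if discrimination < 0.20:
        reasons.append("low discrimination")

    return reasons


# Hypothetical item: 200 students, keyed answer "A".
print(flag_item({"A": 120, "B": 50, "C": 26, "D": 4}, "A", 0.35))
```

In this made-up example only option D is flagged, because it was chosen by 2% of students; the item itself would otherwise be kept.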

The questions at any given level also span a range of difficulty. The more difficult the question, the more points it is worth (1-3 points per question on average). The item difficulty is the number of students who answered the question correctly divided by the total number who answered it. Norm-referenced tests (NRTs) such as the NSE are contest based, with students' percentiles viewed by state, and are designed to have item difficulty indexes between 0.4 and 0.6.

Each item's difficulty index is determined, in part, by the item analysis carried out after testing.
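As a worked illustration of the item difficulty definition above, the index can be written as a simple ratio. The symbol p and the figures (120 correct answers out of 200 students) are hypothetical and chosen only to show the arithmetic:

```latex
p = \frac{\text{number of students who answered the item correctly}}
         {\text{number of students who answered the item}},
\qquad
\text{e.g. } p = \frac{120}{200} = 0.60
```

An item with p = 0.60 would sit at the upper edge of the 0.4 to 0.6 range described for norm-referenced tests.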

Analyzing test results every year after testing, and revising questions to make them more valid, helps AATSP Exams produce a consistently reliable and valid test each year. AATSP Exams currently uses the Kuder-Richardson 21 formula to calculate reliability. The average coefficient for a test is 0.90, and the highest possible value is 1.0. The coefficients of each test in 2022 were between 0.98 and 0.99, showing that the National Spanish Exam is highly reliable. Results for 2022 are available online.
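For readers unfamiliar with it, the Kuder-Richardson 21 estimate can be sketched as follows. This is a minimal illustration of the textbook formula, not the calculation AATSP Exams performs: the function name `kr21` and the sample scores are hypothetical, the formula assumes items scored right or wrong with similar difficulty, and the use of the population variance is an assumption (some treatments use the sample variance instead).

```python
from statistics import mean, pvariance


def kr21(total_scores, num_items):
    """Kuder-Richardson 21 reliability estimate.

    KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * var))
    where k is the number of items, M the mean total score, and
    var the variance of total scores (population variance assumed here).
    """
    if num_items < 2:
        raise ValueError("KR-21 needs at least two items")
    m = mean(total_scores)
    var = pvariance(total_scores)
    if var == 0:
        raise ValueError("KR-21 is undefined when all scores are identical")
    k = num_items
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))


# Hypothetical total scores of five students on a 10-item test.
print(round(kr21([2, 4, 6, 8, 10], 10), 2))
```

With these made-up scores the mean is 6 and the variance is 8, so the estimate comes out at about 0.78. Coefficients closer to 1.0 indicate more consistent measurement.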


Center for Teaching and Learning. (1990). Improving Multiple Choice Questions. For Your Consideration, CTL 8, 1-4. University of North Carolina at Chapel Hill. https://www.smu.edu/-/media/Site/Provost/assessment/Resources/MultipleChoices/Improving-Multiple-Choice-QuestionsUNCCH.pdf?la=en&hash=E8167D388358BFCCB9FD7ECFB154770FCEE73FEF

Frost, J. Content Validity: Definition, Examples & Measuring. Statistics by Jim. https://statisticsbyjim.com/basics/content-validity/

Hoover, R. (2008). Test Reliability and Validity Defined. 2000 & 2008 Research Studies on Ohio’s School Accountability Tests. https://rlhoover.people.ysu.edu/OAT-OGT/