Reliability & Validity

After the test draft has been completed and reviewed, the National Spanish Examination Review Committee rates the content appropriateness of a given test by performing a systematic review. This review relies heavily on human judgment; therefore, the teachers chosen for this committee are experts in standards-based curriculum, instruction, and assessment.

Post Assessment Analysis

After the tests have been administered, the AATSP Exams office analyzes the Question Details Reports (QDR), which provide information on how a specific sampling of students has performed on a specific item. A validity panel of trained professionals who have taught across all levels confirms through the QDR that all test items have face and content validity.

Analyzing test answer results is imperative for any test to remain valid and reliable. AATSP Exams analyzes test items each year as the tests are being created. A question will be immediately changed, moved, or removed if any of the following occur:

  • Distractor effectiveness: an outlier answer choice is selected by fewer than 5% of students (if the question is otherwise good, only that choice is changed)
  • Item difficulty: fewer than 50% of students answered correctly (the question is often moved up to the next level)
  • Item difficulty: more than 95% of students answered correctly (the question is often moved down to a lower level)
  • The question is outdated or otherwise has an issue (it mentions watching a movie on a VCR, for example)
  • Discrimination index: the item's discrimination index is lower than 0.20
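
The numeric criteria in the list above can be sketched as a simple check. This is only an illustration, not the AATSP Exams office's actual tooling: the function name `review_flags`, the flag labels, and the sample responses are hypothetical, and the discrimination index is passed in as a precomputed value because the text does not say how it is calculated.

```python
from collections import Counter


def review_flags(responses, key, options, discrimination):
    """Return the review criteria that one multiple-choice item trips.

    responses: list of the option each student chose (hypothetical data)
    key: the correct option
    options: all answer options offered for the item
    discrimination: precomputed discrimination index (method is an assumption)
    """
    n = len(responses)
    counts = Counter(responses)
    flags = []

    # Distractor effectiveness: a wrong option chosen by fewer than 5%.
    for opt in options:
        if opt != key and counts[opt] / n < 0.05:
            flags.append(f"weak distractor {opt}")

    # Item difficulty: fewer than 50% or more than 95% answered correctly.
    p = counts[key] / n
    if p < 0.50:
        flags.append("too hard")
    elif p > 0.95:
        flags.append("too easy")

    # Discrimination index lower than 0.20.
    if discrimination < 0.20:
        flags.append("low discrimination")

    return flags


# Hypothetical sample: 100 students, correct answer "B".
responses = ["A"] * 3 + ["B"] * 70 + ["C"] * 15 + ["D"] * 12
print(review_flags(responses, "B", "ABCD", discrimination=0.35))
# prints ['weak distractor A']
```

In this made-up sample only option A falls under the 5% threshold, so, following the first criterion, just that answer choice would be a candidate for rewriting.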

The test questions in any given level will also be a mix of difficulty ranges. The more difficult the question, the more points it is worth (1-3 points per question on average). The item difficulty is the number of students who answered correctly divided by the total number who answered the question. Norm-referenced tests (NRTs) such as the NSE are contest-based, with students' percentiles viewed by state, and are designed to have difficulty indexes between 0.4 and 0.6.

This is determined, in part, by the post-test item analysis.
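
The item difficulty index described above is a simple proportion, and a small sketch may make the arithmetic concrete. The function names and the sample counts below are hypothetical; only the formula (correct answers divided by total answers) and the 0.4 to 0.6 design range come from the text.

```python
def item_difficulty(num_correct, num_answered):
    """Proportion of students who answered the item correctly."""
    if num_answered <= 0:
        raise ValueError("num_answered must be positive")
    return num_correct / num_answered


def in_target_band(p, low=0.4, high=0.6):
    """True if the difficulty index falls inside the stated design range."""
    return low <= p <= high


# Hypothetical item: 84 of 150 students answered correctly.
p = item_difficulty(84, 150)
print(round(p, 2), in_target_band(p))
# prints 0.56 True
```

A higher index means an easier item, since a larger share of students answered it correctly.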