
OnTarget Common Assessment Analysis Reporting

1.Agenda #

OnTarget Training Agenda

  • Welcome and Introductions
  • Report Overview
  • File Center Upload Instruction
  • Assessment Analysis
  • Questions
  • Closing

 

2.Introduction #

The OnTarget Assessment Analysis Report is used to assist with determining the reliability and validity of locally developed assessments. Statistical analyses are conducted on the student performance data submitted by the Local Education Agency (LEA) for each item. Educators in schools and districts routinely develop and administer classroom, benchmark, or common assessments throughout the school year to gauge student mastery of relevant material. Typically, these tests undergo minimal or no statistical review for reliability and validity. Through the evaluation provided in this Assessment Analysis Report, teachers, schools, and districts may significantly improve the quality of internally developed assessments and obtain higher-quality, more reliable test results. In some instances, the analyses listed in this report are taken directly from the analyses conducted on state assessments such as STAAR, STAAR Alternate 2, and TELPAS, and the descriptions are located on the Texas Education Agency (TEA) website (2018-2019 TEA Technical Digest, Ch. 3 Standard Technical Process) (TEA, Technical Digest 2018-2019).

These analyses are used to gauge the level of difficulty of the item, examine the degree to which the item
appropriately distinguishes between students of different proficiency levels, and assess the item for potential bias.

3.Student Information #

This portion of the report shows the number of students taking the assessment and the overall percentage of students “passing” the assessment based on a passing standard set by the district. Unlike other reports contained in the OnData Suite System, this report does not contain a means to drill down to the individual student. After all, this report is about the reliability and validity of the assessment, not student performance.

 

4.Assessment Frequency Distribution #

This graph from OnTarget outlines the number of students who scored at each raw score point. The “red” line identifies the mean, or average raw score, of the students who took this assessment. The average is calculated by summing all of the values in a data set and dividing by the number of cases. Each “yellow” dotted line marks one standard deviation from the mean.

The standard deviation is the average distance by which raw scores deviate from the mean. It essentially tells how much spread, or variability, there is in a set of scores. If the standard deviation is small, the scores fall relatively close to the mean. As the standard deviation grows larger, the scores scatter farther from the mean. For a roughly normal distribution of scores, about 68% of the population falls within one standard deviation (1 SD) on either side of the mean.

In the example below, the average raw score is 22.56 with a standard deviation of ±5; therefore, about 68% of the students who took this assessment scored between (22.56 – 5), or 17.56 raw score points, and (22.56 + 5), or 27.56 raw score points, on a 34-question test.
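For readers who want to reproduce these figures, below is a minimal sketch in Python (using numpy). The raw scores are hypothetical, not the data pictured in the report; OnTarget performs this calculation internally.

```python
import numpy as np

# Hypothetical raw scores (0-34) for students who took a 34-question assessment.
raw_scores = np.array([22, 18, 25, 30, 14, 22, 27, 19, 23, 26, 21, 24, 17, 28, 22])

mean = raw_scores.mean()       # average raw score (the "red" line)
sd = raw_scores.std(ddof=0)    # standard deviation (the "yellow" lines)

# For roughly normal score distributions, about 68% of scores fall within 1 SD of the mean.
lower, upper = mean - sd, mean + sd
within_1sd = np.mean((raw_scores >= lower) & (raw_scores <= upper))

print(f"mean = {mean:.2f}, SD = {sd:.2f}")
print(f"1-SD band: {lower:.2f} to {upper:.2f}; {within_1sd:.0%} of these scores fall inside it")
```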

5.Assessment Reliability #

Most tests are composed of many items. The reason for having many items, rather than just one, is to increase the precision (reliability) and the validity of an assessment. A key concept of classical test theory (CTT) is reliability. As the term implies, reliability has connotations of repeatability, consistency, and predictability, and it is a desirable property (Andrich & Marais, 2019). It is akin to the reliability you think of when starting your car: if the car starts every time you turn the key, you believe it is reliable; if it fails to start about half the time, you believe it is unreliable. This judgment is not made the first time the car fails to start, but only after repeated attempts yield mixed results. Whether a car is a reliable starter can only be determined after many repetitions of starting the car.

The concept of reliability is based on the idea that repeated administrations of the same assessment should generate consistent results. Reliability is a critical technical characteristic of any measurement instrument because unreliable scores cannot be interpreted in a valid way. The reliability of test scores must be demonstrated before issues such as validity, fairness, and interpretability can be discussed. There are many different methods for estimating test score reliability. Some methods require multiple assessments to be administered to the same sample of students; however, obtaining these types of reliability estimates is burdensome on schools and students. Therefore, reliability estimation methods that require only one test administration have been developed and are commonly used for large-scale assessments, including STAAR, STAAR Alternate 2, TELPAS, and TELPAS Alternate. This portion of the report compares the reliability coefficient for the STAAR test (if one was taken) and the common assessment being reviewed (TEA, Technical Digest 2018-2019).

Reliability is the study of error, or score variance, over two or more testing occasions; it estimates the extent to which a change in measured score is due to a change in true score. Theoretically, a perfectly reliable measure would produce the same score over and over again, assuming that no change in the measured outcome is taking place. In other words, a highly reliable test, simply stated, is one on which a student who takes the same exact assessment at two different times scores the exact same raw score each time. Below is an image that helps to explain reliability (Andrich & Marais, 2019).

 

1. If the test score is the target and every test taker scores in the same area but not close to the center of the target, the test is said to be reliable, but not valid.

a. Low Accuracy – did not hit the center of the target (i.e., did not test what it intended to test)
b. High Consistency – all scores are clustered in the same area
c. High Repeatability – scores yield similar results

2. If the test score is the target and the test takers’ scores are scattered widely but centered on the target, the test is said to be valid, but not reliable.

a. High Accuracy – hit close to the center of the target (i.e., tested what it intended to test)
b. Low Consistency – scores are scattered around the center
c. Low Repeatability – scores do not yield similar results

3. If the test score is the target and the test takers’ scores are scattered widely and are not close to the center of the target, the test is said to be neither valid nor reliable.

a. Low Accuracy – did not hit the center of the target (i.e., did not test what it intended to test)
b. Low Consistency – scores are scattered all over
c. Low Repeatability – scores do not yield similar results

4. If the test score is the target and every test taker scores in the same area, close to the center of the target, the test is said to be both reliable and valid.

a. High Accuracy – hit the center of the target (i.e., tested what it intended to test)
b. High Consistency – all scores are clustered in the same area
c. High Repeatability – scores yield similar results

6.Reliability Coefficient (alpha) #

Reliability coefficients based on one test administration are known as internal consistency measures because they measure the consistency with which students respond to the items within the test. A reliability coefficient expresses the relationship between the scores of the same individuals on the same instrument at two different times. It does NOT provide any information about what is actually being measured by a test; it only indicates that WHAT is being measured by the test is being assessed in a consistent, precise way.

Whether the test is actually assessing what it was designed to measure is addressed by looking at the test’s validity. Other approaches, such as the Rasch model (implemented in software such as Winsteps), provide additional item analysis indices; however, the reliability coefficient used by OnTarget to measure internal consistency is Cronbach’s Alpha.

The image above depicts two separate results. In the result on the left, the scores are highly scattered and therefore not consistent, so the measure is not reliable. The scores on the right are closer together and trend in a positive direction; therefore, the scores on the right are considered consistent, or more reliable.

7.Factors that Affect Reliability of an Assessment #

There are many factors that can affect the reliability of an assessment including the number of questions being assessed, guessing by the students, mismarking an answer key, skipping questions, and even misinterpreting the instructions. It is important that the user reviews the responses prior to submitting them for analysis to mitigate the number of factors that may negatively affect the reliability coefficient score.

  • Number of items – reliability generally increases as the number of items increases.
  • Discrimination of items – the greater the discrimination of the items, the greater the reliability.
  • Independence of items from one another – each item is treated as a replication of each other item, so we assume independence of the responses. This implies, for example, that one item does not artificially relate to any other item. An example of a violation of independence in tests of proficiency is when the answer to one item implies, or gives a clue to, the answer to another item (Mehrens & Lehman, 1991).

8.Reliability Coefficient (Cronbach’s Alpha) #

Reliability coefficients based on one test administration are known as internal consistency measures because they measure the consistency with which students respond to the items within the test. As a general rule, reliability coefficients from 0.70 to 0.79 are considered adequate, those from 0.80 to 0.89 are considered good, and those at 0.90 or above are considered excellent (TEA, Technical Digest 2018-2019). However, what is considered appropriate might vary in accordance with how assessment results are used (e.g., for low-stakes or high-stakes purposes). The reliability coefficient becomes larger (increased reliability) as error variance gets smaller, and it equals 1.0 when there is no error variance; it becomes smaller (decreased reliability) as error variance gets larger.
Cronbach’s Alpha is considered an indicator of overall test reliability and ranges from 0 to 1, with 0 indicating NO test reliability and 1 indicating the HIGHEST test reliability (Varma, N.D.); a small computational sketch follows the thresholds below.

  • .70 to .79 = Adequate
  • .80 to .89 = Good
  • .90 and above = Excellent
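As a point of reference, here is a minimal sketch (not OnTarget’s implementation) of how Cronbach’s Alpha can be computed from a scored response matrix. The matrix, sample size, and function name are hypothetical.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a students x items matrix of 0/1 item scores."""
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total raw scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 0/1 response matrix: 6 students x 4 items.
responses = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
])
print(f"alpha = {cronbach_alpha(responses):.2f}")
```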

9.OnTarget Assessment Reliability Comparison #

9.1.STAAR #

The following image depicts the Reliability Coefficient (alpha) calculated by the Texas Education Agency, available on the TEA website (2019 STAAR Mean P-Values and Internal Consistency Values by Reporting Category and Content Area), for the STAAR assessment selected for comparison in OnTarget, if one is available. If a STAAR assessment was not taken for the content area being analyzed, this area will be blank.

Additional reliability estimates, such as those for the TSIA2, ACT, SAT, and other norm-referenced assessments, will eventually be included. The table provides the following:

  • Reliability Coefficient Alpha, which is further described in the Reliability Coefficient module,
  • Average Raw Score for all of the students who took this assessment in the state of Texas,
  • The standard deviation (SD), which is further explained in the Assessment Frequency Distribution module, and
  • The Mean p-Value, which is further explained in the p-Value module.

9.2.Locally Developed Assessment #

A Locally Developed Assessment is any assessment that has been developed by a local education agency using either a purchased test/item bank, teacher-made questions, items from a purchased curriculum, items from an adopted textbook or reference materials, or any other item included on a test administered by a local education agency.

The following image depicts the Reliability Coefficient (alpha) that has been calculated by OnTarget for the assessment being analyzed in this report. The table provides the following:

  • Reliability Coefficient Alpha, which is further described in the Reliability Coefficient module,
  • Average Raw Score for all of the students who took this locally developed assessment,
  • Standard Deviation, which is further explained in the Assessment Frequency Distribution module, and
  • The Mean p-Value, which is further explained in the p-Value module.

While this alpha (.76) is lower than that of the STAAR test, it still falls within the “Adequate” range identified in the Reliability Coefficient (alpha) module. It is possible to increase the reliability coefficient alpha by removing items whose individual point-biserial correlation is lower than the minimum criteria. More information on this is presented in the Question Quality (Point-Biserial Correlation) module.
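As a rough illustration of that idea, the sketch below recomputes alpha with each item removed (an “alpha if item deleted” check); items whose removal raises alpha are candidates for review or removal. The data and helper names are hypothetical, and this is not OnTarget’s own routine.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    k = scores.shape[1]
    return (k / (k - 1)) * (1 - scores.var(axis=0, ddof=1).sum()
                            / scores.sum(axis=1).var(ddof=1))

def alpha_if_deleted(scores: np.ndarray) -> list[tuple[int, float]]:
    """Alpha recomputed with each item removed, one result per item (0-based index)."""
    return [(i, cronbach_alpha(np.delete(scores, i, axis=1)))
            for i in range(scores.shape[1])]

# Hypothetical 0/1 responses: 8 students x 5 items.
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1],
])
baseline = cronbach_alpha(responses)
for item, alpha in alpha_if_deleted(responses):
    note = "consider reviewing this item" if alpha > baseline else ""
    print(f"drop item {item + 1}: alpha {alpha:.2f} (baseline {baseline:.2f}) {note}")
```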

 

10.OnTarget Item Analysis #

An Item Analysis is a method of statistically reviewing items on a test to ensure that every question meets minimum quality-control criteria. This statistical analysis is conducted after the test has been administered and real-world data are available for analysis. The objective of this statistical analysis is to aid in the identification of problematic items on a test due to one or more of the following reasons:

  • Items may be poorly written causing students to be confused when responding to them.
  • Graphs, pictures, diagrams, or other information accompanying the items may not be clearly depicted or
    may be misleading.
  • Items may not have a clear correct response, and a distractor may also qualify as a correct response.
  • Items may contain distractors that most students can see are obviously wrong, increasing the odds of a student guessing the correct answer.
  • Items may represent a different content area than that measured by the rest of the test.
  • Bias for or against a gender, ethnic, or other group may be present in the item or the distractors.

OnTarget uses a statistical analysis based on classical test theory to analyze the data submitted under Common Assessment. Item analyses are conducted for the purpose of reviewing the quality of items. The statistics generated for each item are the p-value and the point-biserial correlation (Varma, N.D.).

Question Difficulty (p-Value)

One aspect of item analysis is concerned with the difficulty of items relative to the population of persons administered the test. It is essential that an item is not so difficult that a person cannot engage with it, nor so easy that it is trivial. Item difficulty, simply stated, is the proportion of persons who answered the item correctly; this is called the facility of an item and is usually denoted by the letter p (Andrich & Marais, 2019).

The p-value indicates the proportion of the total group of students answering a multiple-choice or gridded-response item correctly. When multiplied by 100, the p-value converts to the percentage of students who got the item correct. The p-value statistic ranges from 0 to 1. An item’s p-value shows how difficult the item was for the students who answered it. An item with a high p-value, such as 0.90 (meaning that 90 percent of students correctly answered the item), is a relatively easy item. An item with a low p-value, such as 0.30 (meaning that only 30 percent of students correctly answered the item), is a relatively difficult item. High p-values should not be taken as indicative of item quality (Varma, N.D.).

It is important to have enough “stretch” on the test in terms of rigor. The test should have a range of difficulty, or p-value, to demonstrate a wide range of rigor. According to the research, the optimum question difficulty, or p-value, ranges from .3 through .7 (Andrich & Marais, 2019).
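A minimal sketch of the p-value calculation follows, assuming a hypothetical 0/1 scored response matrix; it also flags items whose difficulty falls outside the .3 to .7 range cited above. It is illustrative only and not OnTarget’s implementation.

```python
import numpy as np

# Hypothetical scored responses: rows are students, columns are items (1 = correct, 0 = incorrect).
responses = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
])

p_values = responses.mean(axis=0)   # proportion answering each item correctly

for item, p in enumerate(p_values, start=1):
    in_range = 0.3 <= p <= 0.7      # optimum difficulty range cited above
    note = "" if in_range else "outside optimum range"
    print(f"Question {item}: p = {p:.2f} {note}")
```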

The following image depicts the Question Difficulty (p-value) table that has been calculated by OnTarget for the assessment being analyzed in this report. The table provides the following:

  • Individual p-value for each question (Percentage of students answering correctly)

The highest p-value, 0.91, is associated with question 26, meaning that 91% of the students who took this assessment got this question correct. The higher the p-value, the easier the item; low p-values indicate a difficult item. In general, tests are more reliable when p-values are spread across the entire 0.0 to 1.0 range with a larger concentration toward the center, around 0.50 (Varma, N.D.).

The following image depicts the Question Difficulty (p-value) using a frequency distribution that has been calculated by OnTarget for the assessment being analyzed in this report. The frequency distribution depicts the range of item difficulty or the “rigor” of the test. This graph allows the reviewer to more quickly analyze the range of the entire assessment by providing the number of questions that fall within that specific p-value range.

This frequency distribution depicts the following for this 34-question test:

  • 1 question that falls between p-value of .20 and .29,
  • 3 questions that fall between p-value of .30 and .39,
  • 1 question that falls between p-value of .40 and .49,
  • 7 questions that fall between p-value of .50 and .59,
  • 3 questions that fall between p-value of .60 and .69,
  • 7 questions that fall between p-value of .70 and .79,
  • 9 questions that fall between p-value of .80 and .89,
  • 2 questions that fall on a p-value of .90 and above.

While the 11 questions that fall at a p-value of .80 or higher are not bad questions, they do represent roughly ⅓ of the overall assessment and may not have the “rigor” expected to demonstrate complete mastery of the curriculum being assessed by this instrument. This does not make these questions “bad”; however, since so many (11 of 34) fall outside of the optimum range, it is recommended that the level of rigor, or the Depth of Knowledge (DOK), required by these questions be reviewed. Additionally, high p-values should not be taken as indicative of item quality; only the point-biserial should be used to judge item quality (Varma, N.D.).
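To show how such a frequency distribution can be tallied, here is a small sketch using hypothetical p-values; the bin ranges mirror those listed above, and the counts are not the report’s actual counts.

```python
import numpy as np

# Hypothetical p-values for a 34-item test (one value per question).
rng = np.random.default_rng(0)
p_values = rng.uniform(0.25, 0.95, size=34).round(2)

# Bin edges matching the report's ranges (.20-.29, .30-.39, ..., .90 and above).
# Items with p below .20 would fall outside these bins.
edges = np.arange(0.2, 1.01, 0.1)
counts, _ = np.histogram(p_values, bins=edges)

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"p-value {lo:.2f}-{hi - 0.01:.2f}: {n} question(s)")

high = (p_values >= 0.80).sum()
print(f"{high} of {p_values.size} questions have p >= .80 and may warrant a rigor/DOK review")
```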

10.1.OnTarget Question Quality (Point-Biserial Correlation) #

The point-biserial correlation describes the relationship between a student’s performance on a multiple-choice or gridded-response item (scored correct or incorrect) and performance on the assessment as a whole. A high point-biserial correlation indicates that students who answered the item correctly tended to score higher on the entire test than those who missed the item. The point-biserial is a special type of correlation between a dichotomous variable (a multiple-choice item where 1 is correct and 0 is incorrect) and a continuous variable such as the overall raw score. Because it is a correlation between the individual student’s overall score (total raw score) and their score on an individual item (1 correct or 0 incorrect), it cannot have a value less than −1 or greater than +1. In general, the greater the point-biserial correlation, the better (Andrich & Marais, 2019).

A low or negative point-biserial correlation implies that students who got the item incorrect tended to score high on the test overall, while students who got the item correct tended to score low on the test overall. Therefore, items with low point-biserial values need further examination. Something in the wording, presentation, or content may explain the low point-biserial correlation. However, even if nothing appears faulty with the items, it is recommended that they be removed from future testing. Removal of problematic items increases the overall test reliability. When evaluating items, it is helpful to use a minimum threshold value for point-biserial correlation (Varma, N.D.).

In general, point-biserial correlations less than 0.20 indicate a potentially weaker-than-desired relationship. Note that the point-biserial correlation may be weak on items with very high or very low p-values. For example, if nearly all students get an item correct (or incorrect), that item does not provide much useful information for distinguishing between students with higher performance and students with lower performance on the entire test. A point-biserial value of at least .15 is recommended; however, research has shown that “good” items have a point-biserial value of .25 and above. Additionally, there is no real relationship between the point-biserial correlation and the p-value statistic: problematic items will always show low point-biserial correlations, but the accompanying p-value may be low or high. The point-biserial value should be used to assess item quality; p-values should be used to assess item difficulty (Varma, N.D.).
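The following sketch shows one common way to compute the point-biserial correlation, as the Pearson correlation between a 0/1 item score and the total raw score, on a hypothetical response matrix. OnTarget’s exact computation may differ (for example, some implementations exclude the item itself from the total score).

```python
import numpy as np

def point_biserial(item: np.ndarray, total: np.ndarray) -> float:
    """Pearson correlation between a 0/1 item score and the total raw score
    (equivalent to the point-biserial correlation for a dichotomous item)."""
    return float(np.corrcoef(item, total)[0, 1])

# Hypothetical 0/1 responses: 8 students x 4 items.
responses = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
])
totals = responses.sum(axis=1)

for i in range(responses.shape[1]):
    r = point_biserial(responses[:, i], totals)
    flag = "review (below 0.20)" if r < 0.20 else "acceptable"
    print(f"Question {i + 1}: point-biserial = {r:.2f} -> {flag}")
```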

The following image depicts the Question Quality (Point-Biserial Correlation) table that has been calculated by OnTarget for the assessment being analyzed in this report. The table provides the following:

  • Individual Point-Biserial Correlation for each question

The following image depicts the Question Quality (Point-Biserial Correlation) using a frequency distribution that has been calculated by OnTarget for the assessment being analyzed in this report. The frequency distribution depicts the range of item “quality” on the test. This graph allows the reviewer to more quickly analyze the range of the entire assessment by providing the number of questions that fall within each specific Point-Biserial Correlation range.

This frequency distribution depicts the following for this 34-question test:

  • 6 questions that fall between a point-biserial correlation of .0 and .09 – (Poor)
  • 9 questions that fall between a point-biserial correlation of .1 and .19 – (Marginal)
  • 6 questions that fall between a point-biserial correlation of .2 and .29 – (Fairly Good)
  • 4 questions that fall between a point-biserial correlation of .3 and .39 – (Good)
  • 9 questions that fall between a point-biserial correlation of .4 and above – (Very Good)

Overall, 19 of the 34 questions fall at a point-biserial correlation value of .2 or higher and range from “Fairly Good” to “Very Good” questions. Another useful role of the point-biserial correlation is validating the multiple-choice scoring key: items with incorrect keys will show point-biserial values close to or below zero. As a general rule, items with a point-biserial value below 0.10 should be examined for a possible incorrect key (Varma, N.D.).


11.Validity #

Validity refers to the extent to which a test measures what it is intended to measure. When test scores are used to make inferences about student achievement, it is important that the assessment supports those inferences. In other words, the assessment should measure what it was intended to measure in order for any uses and interpretations about test results to be valid. Texas follows national standards of best practice and collects validity evidence annually to support the interpretations and uses of the STAAR test scores. The Texas Technical Advisory Committee (TTAC), a panel of national testing experts created specifically for the Texas assessment program, provides ongoing input to TEA about STAAR validity evidence. Validity evidence for an assessment can come from a variety of sources, including test content, response processes, internal structure, relationships with other variables, and analysis of the consequences of testing (TEA, Technical Digest 2018-2019).

The results of STAAR, including STAAR Spanish and STAAR Alternate 2, are used to make inferences about how well students know and understand the Texas Essential Knowledge and Skills (TEKS) curriculum. When test scores are used to make inferences about student achievement, it is important that the assessment supports those inferences. In other words, the assessment should measure what it was intended to measure in order for inferences about test results to be valid. For this reason, test makers are responsible for collecting evidence that supports the intended interpretations and uses of the scores (Kane, 2006). Evidence that supports the validity of interpretations and uses of test scores can be classified into the following categories:

  • evidence based on test content – refers to evidence of the relationship between tested content and the construct that the assessment is intended to measure.
  • evidence based on response processes – refers to the cognitive behaviors that are required to respond to a test item. Multiple-choice items are developed so that students must apply what they have learned about the content, thereby supporting an accurate measurement of the construct being assessed.
  • evidence based on internal structure – Texas collects evidence that shows the relationship of students’ responses between items, within reporting categories of items, and within the full tests to verify that the elements of an assessment conform to the intended test construct. Texas conducts annual internal consistency studies to gather evidence based on internal structure.
  • evidence based on relations to other variables – Another method Texas uses to provide validity evidence for the STAAR assessments is analyzing the relationship between performance on STAAR and performance on other assessments, a process that supports what is referred to as criterion-related validity. Evidence can be collected to show that the empirical relationships are consistent with the expected relationships. Numerous research studies were conducted as part of the development of STAAR to evaluate the relationships between scores on the STAAR assessments and other related variables.
  • evidence based on consequences of testing – Another method of providing validity evidence is documenting the intended and unintended consequences of administering an assessment. The collection of consequential validity evidence typically occurs after a program has been in place for some time and on a regular basis (TEA, Technical Digest 2018-2019).

12.Validity Evidence #

Validity evidence based on test content supports the assumption that the content of the test adequately reflects the intended construct. For example, the STAAR test scores are designed to help make inferences about students’ knowledge and understanding of the statewide curriculum standards, TEKS. Therefore, evidence supporting the content validity of the STAAR assessments, including STAAR Spanish, maps the test content to the TEKS. Validity evidence supporting Texas’ test content comes from the established test development process and the judgments of content experts about the relationship between the items and the test construct. The test-development process started with a review of the TEKS by Texas educators. The educators then worked with TEA to define the readiness
and supporting standards in the TEKS and helped determine how each standard would best be assessed. A test blueprint was developed with educator input, which maps the items to the reporting categories they are intended to represent. Items were then developed based on the test blueprint. Below is a list of steps in the test development process that are followed each year to support the validity of test content in Texas (TEA, Technical Digest 2018-2019).

Since this is the process supported by the Texas Education Agency, it, therefore, makes sense that it is a process that would be followed by districts to the extent possible with locally developed assessments.

  • Develop items based on the TEKS curriculum standards and item guidelines.
  • Review items for appropriateness of item content and difficulty and to eliminate potential bias.
  • Build tests to predefined criteria.
  • Have university-level experts review high school assessments for accuracy of the advanced content.

It is imperative that each test item be reviewed for alignment, appropriateness, adequacy of student preparation, and any potential bias. The statistical analysis for each item should be reviewed and a recommendation should be made on whether the item should be reused as written, revised, re-coded to a different TEKS, or rejected.

Below are guidelines released by the Texas Education Agency that could be followed:

Item Review Guidelines (TEA, Technical Digest 2018-2019)

 

12.1.OnTarget Question Analysis #

The following image depicts the OnTarget Question Analysis that has been calculated by OnTarget for the assessment being analyzed in this report.

The table provides the following:

  • Question Number
  • Question Difficulty (p-Value)
    • .94 = Outside the Optimum Range
    • 94% of the students got this question correct.
    • The higher the p-value, the easier the item. Low p-values indicate a difficult item. In general, tests are more reliable when p-values are spread across the entire 0.0 to 1.0 range with a larger
      concentration toward the center, around 0.50 (Varma, N.D.).
  • Question Quality (Point-Biserial Correlation)
    • 0.01 = Poor
    • A point-biserial value of at least .15 is recommended; however, research has shown that “good” items have a point-biserial of .25 and above.
    • As a general rule, items with point-biserial values below 0.10 should be examined for a possible incorrect key (Varma, N.D.).

Our sample data matrix contains an item that appears to have “conflicting” p-value and point-biserial statistics: item 2 has a low Question Quality (point-biserial correlation of .01) but a high Question Difficulty (p-value of .94). It is imperative that each test item be reviewed for alignment, appropriateness, adequacy of student preparation, and any potential bias. The statistical analysis for each item should be reviewed, and a recommendation should be made on whether the item should be reused as written, revised, re-coded to a different TEKS, or rejected. Even if a qualitative review of the item does not reveal any obvious reason for the low point-biserial, it is often advisable that the item be removed from future testing (Varma, N.D.).
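To make the review criteria concrete, the sketch below flags items with “conflicting” statistics such as item 2 above (high p-value, low point-biserial), using the thresholds cited earlier. The per-item numbers are hypothetical, not the report’s data.

```python
import numpy as np

# Hypothetical per-item statistics like those shown in the OnTarget Question Analysis table.
p_values = np.array([0.62, 0.94, 0.48, 0.81, 0.33])
pbis_values = np.array([0.34, 0.01, 0.28, 0.18, 0.41])

for q, (p, pbis) in enumerate(zip(p_values, pbis_values), start=1):
    notes = []
    if not 0.3 <= p <= 0.7:
        notes.append("p-value outside optimum range")
    if pbis < 0.15:
        notes.append("point-biserial below recommended minimum")
    if p >= 0.80 and pbis < 0.10:
        notes.append("conflicting statistics: easy item that does not discriminate; review or remove")
    summary = " -> " + "; ".join(notes) if notes else ""
    print(f"Question {q}: p = {p:.2f}, point-biserial = {pbis:.2f}{summary}")
```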

The following images depict the OnTarget Question Analysis that has been calculated by OnTarget for the assessment being analyzed in this report.

The table provides the following, directly related to the Texas Formative Assessment Resource (TFAR) guidelines provided by the Texas Education Agency:

1. TEKS Alignment

Does the item . . .

  • align to the Texas Essential Knowledge and Skills (TEKS) student expectation (SE)?
  • align to the depth of knowledge (DOK) and skill (identified by the cognitive verb) in the SE (identify, describe, compare, analyze, etc.)?

2. Bias and Sensitivity

Is the item . . .

  • free of bias and stereotypes (racial, gender, ethnic, religion, socioeconomic, political, environmental, etc.)?
  • free of sensitive, emotionally charged issues?
  • accessible and fair for students of diverse backgrounds so that students of one group do not have an unfair advantage over students of another group? (consider gender, rural/urban, race/ethnicity, etc.)

3. Language and Vocabulary

Is the item . . .

  • written using the appropriate verb tense throughout?
  • free of grammatical and spelling errors?
  • clear and concise, avoiding wordy, ambiguous, vague, irrelevant and unnecessary information and verbiage?
  • free of inappropriate colloquial and idiomatic language?
  • in active voice rather than passive voice (unless passive voice is necessary or easier to understand)?
  • free of vocabulary and academic language that are not grade-level appropriate?
  • free of words with multiple meanings?
  • free of unnecessary or unclear pronouns?
  • free of complex, lengthy clauses and sentences?
  • using consistent language when referring to the same object or concept?

4. Structure & Context

Does the item . . .

  • have a question, task, instructions, etc., that will be clear to students?
  • use technology in a way that closely aligns to the item’s content/skills SEs?*
  • avoid any clueing which may inappropriately influence a student’s response to an item?
  • have parallel structure so that the stem and answer choices make sense and answer choices are similar in length, language, and structure?
  • have a context that is clear, grade-level appropriate, and free from unnecessary complexity?

5. Answer Choices

Do the answer choices . . .

  • include distractors in MC items that are plausible errors or misconceptions yet incorrect?
  • include distractors that are based on content that students for this grade level are expected to know?
  • avoid distractors that are so close to the correct answer that they are likely to confuse or trick students who really do know the answer, or that can be considered outliers?
  • have only one correct answer (MC items) and is it marked as correct?

Does the item writer…

  • include rationales, where appropriate, that are complete (sufficiently explained Key/distractors) and written
    in acceptable style?

6. Visuals

Does the item . . .

  • include only art/table that provides support for the student to demonstrate proficiency of the standard?
  • include enough information for the item to be answered?
  • include art/table that is legible, clear, and free from visual clutter?

Texas Formative Assessment Resource (TFAR) Item Guidelines

Texas Education Agency (TEA) Teacher Incentive Allotment (TIA) Guidelines on Writing Effective Assessment Items

These items should then be reviewed by the creator of the assessment along with other members of an assessment review committee identified at the district, campus, or grade level. The following image depicts the OnTarget Question Analysis that was a direct result of that review. (Note: OnTarget does NOT complete this process; it is recommended that the process be completed by the assessment creator and campus-, district-, grade-, or teacher-level review teams.)

This process is completed with a copy of the test in hand so that all of the items can be reviewed by content-level experts.

The notes can be saved for posterity so that a longitudinal, historical comparison can be conducted.

This process can continue for every question or only for those questions identified as “Poor” or “Marginal” by the calculation developed by OnTarget.

13.References #

Andrich, D., & Marais, I. (2019). A course in Rasch measurement theory: Measuring in the Educational, Social, and Health Sciences. Springer. Retrieved from https://www.springer.com/gp/book/9789811374951

Kane, M. (2006). Content-Related Validity Evidence in Test Development. Routledge. Retrieved from https://books.google.com/books?hl=en&lr=&id=ed-NAgAAQBAJ&oi=fnd&pg=PA131&dq=kane+2006+validity&ots=FtVo7QW-Vq&sig=hei6v8oHyPG2iHDZ_0aP5O3Pi9w#v=onepage&q&f=false

Mehrens, W. A., & Lehman, I. J. (1991). Measurement and Evaluation in Education and Psychology. Holt, Rinehart and Winston. Retrieved from https://www.google.com/books/edition/Measurement_and_Evaluation_in_Education/kDTMAQAACAAJ?hl=en

Texas Education Agency. (2019a). STAAR 2018-2019 Technical Digest. Austin, Texas: Texas Education Agency. Retrieved from https://tea.texas.gov/student-assessment/testing/student-assessment-overview/technical-digest-2018-2019

Texas Education Agency. (2019b). 2019 STAAR Mean P-Values and Internal Consistency Values by Reporting Category and Content Area. Austin, Texas: Texas Education Agency. Retrieved from https://tea.texas.gov/sites/default/files/digest19-appendB-STAAR-7-Stats%20and%20Dist.pdf

Texas Education Agency. (2019c). 2019 STAAR Score Distributions and Statistics by Content Area and Grade. Austin, Texas: Texas Education Agency. Retrieved from https://tea.texas.gov/sites/default/files/digest19-appendB-STAAR-7-Stats%20and%20Dist.pdf

Texas Education Agency. (2021a). Texas Formative Assessment Resource Guidelines. Austin, Texas: Texas Education Agency. Retrieved from https://drive.google.com/file/d/1MMnT9YjIB3n8UoOG7A24_wjt7FgGfIHw/view?usp=sharing

Texas Education Agency. (2021b). Guidelines on Writing Effective Assessment Items. Austin, Texas: Texas Education Agency. Retrieved from https://drive.google.com/file/d/1LQaPYlZk-G3lwml5liqt3Ry77k3CTtbq/view?usp=sharing

Varma, S. (N.D.). Preliminary Item Statistics Using Point-Biserial Correlation and P-Values. Educational Data Systems. Retrieved from https://eddata.com/wpcontent/uploads/2015/11/EDS_Point_Biserial.pdf

Wright, B. D., & Stone, M. A. (1979). Best Test Design. MESA Press. Retrieved from https://research.acer.edu.au/measurement/1/

14.OnTarget Question Analysis Printing #

OnTarget offers several options for printing, including:

This option may be selected after a digital review of the assessment has been completed and the committee would like a PDF version of their work for their own reference. It can then be submitted to an assessment department so that the changes can be made to the assessment. However, it is important to note that as the assessment is reviewed digitally, all notes and reviews can be saved electronically and stored in OnTarget for posterity.
