The art of measurement would do away with the effect of appearances and, showing the truth, would fain teach the soul at last to find rest in the truth, and would thus save our lives.
Plato
a. Content/logical validity
b. Concurrent validity
c. Predictive validity
Data (from tests, instruments, observation, etc.) are good when they are relevant, clean (reflect what they are supposed to reflect), and reliable (produce accurate measures). Assessing the validity of data involves determining the extent to which the data are clean and relevant.
When test scores are found to be valid for one purpose, they will not necessarily be valid for another purpose. Validity also typically does not generalize across groups with differing characteristics.
The statistic used to estimate the validity of data is a correlation coefficient. The particular coefficient selected depends on the type of variables you are working with.
Determined by obtaining qualitative evidence that the content areas of an instructional unit have been sampled in a representative fashion on the test. A written test is content valid if it assesses level of performance/achievement of what was taught.
There is no statistic to calculate. Evidence is qualitative.
Concurrent validity (a quantitatively determined criterion-related coefficient) is assessed when you want to know whether a test you want to administer can be used in place of another test (perhaps better, but less efficient in terms of time or resources) that is already deemed to produce valid scores. The test already established as producing valid scores is called the criterion measure. To assess concurrent validity you compare your test results with the criterion. This is done by correlating your test scores (x) with the criterion measures (y). The correlation coefficient used depends on the type of variable the criterion is. Three common criterion measures are:
1. Scores from another test (typically a continuous variable)
You would administer your test and the criterion test to the same group then correlate the two sets of scores using the PPMC.
2. Scores from an expert (typically a continuous variable)
You would administer your test to a group and have an expert observe the same group and score their performance without reference to your test. You then correlate your test scores with the expert's scores using the PPMC.
3. Skill level - using mutually exclusive criterion groups (expert/novice which means y in this case is dichotomous)
You would administer your test to a group of experts (highly skilled) and administer your test to a novice group (inexperienced) then correlate your test scores with the group designation using the point biserial correlation.
Note: If a criterion measure exists, why not just administer that test? Because obtaining the criterion measures can often be too expensive, take too long, or be too complex to be feasible.
Predictive validity (a quantitatively determined criterion-related coefficient) is assessed when you want to know whether test scores can be used to predict performance. The variable you are trying to predict (y) can be called the criterion measure. To assess predictive validity you see how strong the relationship is between your predictor variable and what you are trying to predict. This is done by correlating your test scores (x) with criterion measures (y). The correlation coefficient used depends on the type of variable the criterion is. The PPMC coefficient is the most common statistic employed in this situation, though it is possible to encounter situations requiring the point biserial correlation.
It is useful to follow up with an estimate of how much error will be present in predicting y from x. The standard error of estimate, SEE = Sy√(1 - r²), can be used to quantify prediction error.
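As a sketch of that follow-up step in Python (the function name and the example values are mine, not from the text), the standard error of estimate is the criterion's standard deviation scaled by √(1 - r²):

```python
import math

def standard_error_of_estimate(s_y, r_xy):
    """Error expected when predicting y from x: SEE = Sy * sqrt(1 - r^2)."""
    return s_y * math.sqrt(1 - r_xy ** 2)

# Hypothetical values: criterion SD = 5.0, validity coefficient r = .80
see = standard_error_of_estimate(5.0, 0.80)
print(round(see, 2))  # 3.0
```

The stronger the validity coefficient, the smaller the prediction error; with r = 0 the SEE equals the criterion's standard deviation.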
Note: When you choose to assess criterion-related validity (concurrent, predictive, construct) it does not take the place of content/logical validity, especially in an educational setting. Validity should be examined both qualitatively and quantitatively whenever feasible.
When the purpose of a test is to classify people as masters/non-masters based on one cut score, traditional techniques for estimating reliability and validity do not apply. Data, and subsequent classifications, from a mastery test are valid when relevant, and when they produce correct and consistent classifications of individuals.
Determined by gathering qualitative evidence that the test measures the fundamental knowledge needed for entrance, exit, or classification purposes.
There is no statistic to calculate. Evidence is qualitative.
In this new context, validity is defined as the correct classification of people into mastery states. The most common technique then for assessing the concurrent validity of a mastery test is to examine the test's sensitivity to instruction. That is, if mastery test classifications are valid, an instructed group should be classified as masters and an uninstructed group non-masters when they take the test.
Steps:
1. Set a cut score (most difficult element in mastery testing)
2. Administer mastery test and obtain criterion classifications (could be from another mastery test already known to produce valid classifications, skill level groups, or an expert's classification) and record results
3. Set up a 2X2 table
                                            Criterion Classification
                                            Master      Non-Master
Mastery Test Classification   Master          a             b
                              Non-Master      c             d
4. Calculate the Phi Coefficient: φ = (ad - bc) / √[(a + b)(c + d)(a + c)(b + d)]
Example:
                                            Criterion Classification
                                            Master      Non-Master
Mastery Test Classification   Master          7             2
                              Non-Master      1             6
Here is an example using expert classification as the criterion. Assume an expert observed a group and their classification of individuals served as the criterion measure.
                                            Expert Classification
                                            Master      Non-Master
Mastery Test Classification   Master          4             1
                              Non-Master      2             5
Another approach is to use mutually exclusive criterion groups as the criterion measure. For example, assume you administered a mastery test of throwing accuracy to two groups, one expert and the other novice, and recorded these results:
                                            Group
                                            Expert      Novice
Mastery Test Classification   Master          8             2
                              Non-Master      3            12
After selecting a cut score, as you set out to estimate the validity of classifications from a mastery test (its sensitivity to instruction), you must take care to use 'clean' criterion groups. Explicit, carefully considered criteria must be used to decide who belongs in the expert/instructed and novice/uninstructed groups.
Note: if the criterion measure is continuous, then you would use the point biserial correlation coefficient to examine concurrent validity.
A trials-to-criterion test is one where a success criterion (Rc) is set (number of successful attempts to be accumulated) and examinees continue testing until they reach that set criterion. The test score then is the number of trials taken to reach the criterion and a low score reflects good performance.
Or, if you want to express performance as a proportion you can calculate:
p = (Rc - 1) / (T - 1)
where
p = test score expressed as a proportion (on a scale of 0 to 1) - high score is good
Rc = success criterion
T = number of trials taken
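A minimal sketch of the proportion conversion (the function name is mine):

```python
def ttc_proportion(rc, t):
    """Express a trials-to-criterion score as a proportion: p = (Rc - 1) / (T - 1)."""
    return (rc - 1) / (t - 1)

# A success criterion of 3 reached in 9 trials
print(ttc_proportion(3, 9))  # 0.25
```

Reaching the criterion in the minimum possible number of trials (T = Rc) gives p = 1, the best possible score.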
Advantages:
- Easy to administer
- Focus is on the positive
- Less time to administer when the distribution of ability is negatively skewed
- Practice proportional to ability takes place

Disadvantages:
- Not feasible for a group whose distribution of ability is positively skewed
- When administered as a mastery test it is difficult to set the cut score
- Determination of a success criterion is difficult
Since a traditional TTC test is useful only for motor skills testing, logical, concurrent, and predictive validity are all relevant to consider.
If the test assesses performance of what was taught without being confounded with other variables the test scores are logically valid.
There is no statistic to calculate; evidence is qualitative. Since the protocol is simply repeated measures of the same task, a table-of-specifications approach is not possible. So the qualitative evidence is typically obtained by soliciting the opinion of competent 'judges'. In this situation that amounts to having someone review and evaluate your test and procedures, comparing what is being tested to what they deem to be the fundamental skills needed given the purpose of the test.
To assess concurrent validity you compare your TTC test results with a criterion measure of the same skill. This is done by correlating your test scores (x) with criterion measures (y). The correlation coefficient used depends on the type of variable the criterion is.
Steps:
For example, assume the scores below came from administration of a TTC serve test with a success criterion of 3 to a group of expert and novice tennis players.
TTC test score    Group
 9                Novice
 4                Expert
15                Novice
 3                Expert
 7                Expert
13                Novice
 8                Expert
To determine the concurrent validity of the TTC test scores the point biserial correlation coefficient is calculated:
Note: Since the mean for the experts was entered first in the numerator the measures are considered inversely related and so the result of -.83 reflects good concurrent validity for the TTC scores.
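The point biserial calculation for the serve-test data above can be sketched in Python (variable names are mine; the expert mean is entered first, as in the note):

```python
import math
import statistics

# Scores and group designations from the example table
scores = [9, 4, 15, 3, 7, 13, 8]
groups = ["Novice", "Expert", "Novice", "Expert", "Expert", "Novice", "Expert"]

expert = [s for s, g in zip(scores, groups) if g == "Expert"]
novice = [s for s, g in zip(scores, groups) if g == "Novice"]

p = len(expert) / len(scores)   # proportion of examinees in the expert group
q = len(novice) / len(scores)   # proportion in the novice group
s = statistics.pstdev(scores)   # standard deviation of all scores

# Expert mean entered first in the numerator
r_pb = (statistics.mean(expert) - statistics.mean(novice)) / s * math.sqrt(p * q)
print(round(r_pb, 2))  # -0.83
```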
Predictive validity is assessed when you want to know whether TTC scores can be used to predict performance on some other variable. The variable you are trying to predict (y) can be called the criterion measure. To assess predictive validity you see how strong the relationship is between your TTC scores and what you are trying to predict. This is done by correlating your TTC scores (x) with measures of the variable you are trying to predict (y). The correlation coefficient used depends on the type of variable the criterion is. The PPMC coefficient is the most common statistic employed in this situation, though it is possible to encounter situations requiring the point biserial correlation.
Steps:
For example, assume the scores below came from administration of a TTC test with a success criterion of 3 and measures from another variable (where high score good) to be predicted by the TTC test scores.
TTC test score    Variable to Predict
11                4
 4                8
15                4
 3                9
 7                8
13                6
 8                7
To determine predictive validity for the TTC scores you calculate the Pearson Product Moment Correlation coefficient and find it to be -.89 so the TTC test scores can be considered to have good predictive validity.
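A sketch of the PPMC calculation for the data above (variable names are mine):

```python
import math

x = [11, 4, 15, 3, 7, 13, 8]   # TTC test scores
y = [4, 8, 4, 9, 8, 6, 7]      # variable to predict

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))   # co-deviation sum
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)   # Pearson Product Moment Correlation
print(round(r, 2))  # -0.89
```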
When the purpose of a TTC test is to classify people as masters/non-masters based on one cut score, traditional techniques for estimating reliability and validity do not apply. Data, and subsequent classifications, from a TTC mastery test are valid when relevant, and when they produce correct and consistent classifications of individuals.
Following administration of a TTC test, mastery classifications are made based on a predetermined cut score. Since a low score is good for TTC tests, those scoring below the cut score are masters and those above, non-masters.
Following classification of individuals, the process of estimating the validity of the classifications is identical to that involved with any mastery test.
Determined by gathering qualitative evidence that the test measures the fundamental skills needed for entrance, exit, or classification purposes.
There is no statistic to calculate. Evidence is qualitative.
In this context validity is defined as the correct classification of people into mastery states. The most common technique then for assessing the concurrent validity of a TTC mastery test is to examine the test's sensitivity to instruction. That is, if TTC mastery test classifications are valid, an instructed group should be classified as masters and an uninstructed group non-masters when they take the test.
Steps:
                                       Criterion Classification
                                       Master      Non-Master
TTC Test Classification   Master         a             b
                          Non-Master     c             d
a. Internal consistency
b. Stability
As a test administrator it is important to identify and eliminate as many sources of error as possible in order to enhance reliability. To estimate the amount of measurement error present in observed scores, the standard error of measurement (SEM) can be calculated following calculation of a reliability coefficient: SEM = S√(1 - R), where S is the standard deviation of the observed scores and R is the reliability coefficient.
The SEM is a band (±) you place around a person's observed score that estimates measurement error.
It is possible to have reliable data that are invalid. Data that are valid should also be reliable. However, reliability does not ensure validity.
Reliability is typically assessed in one of two ways: internal consistency and stability.
To estimate reliability you need 2 or more scores per person. If a test is given just once the most common way of getting 2 scores per person is to split the test in half - usually by odd/even trials or items.
Once you have 2 comparable scores per person the question is how consistent, overall, the scores were. The inference here is that if two sets of scores are consistent there is likely little measurement error, so the scores are likely to be accurate reflections of true scores, and so the observed scores are considered reliable.
In the past, reliability has been estimated using the Pearson Product Moment Correlation coefficient. This is not appropriate since (1) the PPMC is meant to show the relationship between two different variables - not consistency of two measures of the same variable, and (2) the PPMC is not sensitive to fluctuations in test scores.
X1    X2
10    18
12    19
15    25
17    27
From the two sets of scores above, the PPMC between X1 and X2 is essentially 1.00, yet the scores are clearly not consistent, so the PPMC has overestimated reliability. The PPMC is an interclass coefficient; what is needed is an intraclass coefficient. Pearson is appropriately used to estimate validity, not reliability.
The intraclass statistics that can be used are the intraclass R, calculated from values in an analysis of variance (ANOVA) table, and coefficient alpha. They are equally acceptable, though the ANOVA table on which the Intraclass R is based conveys additional information unavailable with coefficient alpha.
This way of examining reliability requires that you give the test once to one group then split the test at least in half to get at least two scores per person. From these two or more scores per person an analysis of variance table (ANOVA) is constructed (typically via software). From this table an Intraclass R can be calculated to assess how consistent the two or more measures per person were.
The reason behind using an ANOVA table is that since you expect variability between students but not variability across measures you should be able to estimate reliability by comparing the variances found in an ANOVA table.
Intraclass R formula: R = (MSB - MSe) / MSB
MSB = mean square between (from ANOVA table)
MSe = mean square error = (SSw + SSi) / (dfw + dfi) when scores come from comparable parts
SSw = sum of squares within (from ANOVA table)
SSi = sum of squares interaction (from ANOVA table)
dfw = degrees of freedom within (from ANOVA table)
dfi = degrees of freedom interaction (from ANOVA table)
ANOVA Table
Source of Variance    Degrees of Freedom    Sums of Squares    Mean Squares
Between               N - 1                 Given              SSb/dfb
Within                k - 1                 Given              SSw/dfw
Interaction           (N-1)(k-1)            Given              SSi/dfi
Total                 N(k) - 1
N = number of students
k = number of scores per student (not the number of trials/items)
Example: Assume the information in the ANOVA table below came from splitting a 50 item cognitive test administered once. To get two scores per person (number minimally needed to examine reliability) the number correct from the odd and even numbered items was recorded.
Source         df    SS       MS
Between        24    6800     283.33
Within          1     450     450
Interaction    24    3100     129.17
Total          49   10350
This estimate of reliability is for a test half as long (25 items) as the one administered. Since test length affects reliability, and the intention was to determine the reliability of scores from the 50 item test, one more step needs to be taken: use the Spearman-Brown Prophecy formula.
Note: Any time test length has been altered, or you are considering altering it, the Spearman-Brown formula can be used to estimate what reliability will be, provided the items/trials added or deleted are similar to the rest of the test.
For this example, since test length was split in half, m = 2.
This is the estimate of the reliability of scores from the full length test.
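The whole computation for this example can be sketched in Python, assuming the standard Intraclass R and Spearman-Brown formulas (function names are mine):

```python
def intraclass_r(msb, ssw, ssi, dfw, dfi):
    """Intraclass R = (MSB - MSe) / MSB, with MSe = (SSw + SSi) / (dfw + dfi)."""
    mse = (ssw + ssi) / (dfw + dfi)
    return (msb - mse) / msb

def spearman_brown(r, m):
    """Projected reliability when test length is multiplied by m."""
    return (m * r) / (1 + (m - 1) * r)

# Values from the split-half ANOVA table above
r_half = intraclass_r(msb=283.33, ssw=450, ssi=3100, dfw=1, dfi=24)
r_full = spearman_brown(r_half, m=2)
print(round(r_half, 2), round(r_full, 2))  # 0.5 0.67
```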
Cronbach's alpha (coefficient alpha) can be used to estimate the reliability of data when you have two or more comparable scores per person. These multiple scores can come from splits of a test administered once (internal consistency) or from multiple administrations of a test (stability).
Once you have 2 scores per person the question is how consistent, overall, the scores were. The inference here is that if two sets of scores are consistent there is likely little measurement error, so the scores are likely to be accurate reflections of true scores, and so the observed scores are considered reliable.
Example: Consider a 60 second sit up test administered only once. To get two scores per person you record the number of sit ups completed in the first 30 seconds and the number completed in the second 30 seconds.
First 30 s    Second 30 s    Total
15            18             33
26            22             48
20            23             43
18            18             36
25            21             46
26            24             50
20            19             39
1. Get the standard deviations for each column
2. Square the standard deviations
3. Use coefficient alpha: α = [k / (k - 1)] [1 - (ΣSp²) / Sxt²]
where
Sxt = standard deviation of the total column (created by you)
Sp = standard deviations of the two or more measures per student
k = number of scores per person
4. Since test length directly influences reliability, it is necessary to boost this reliability coefficient: it tells you only the reliability of a test half as long (30 seconds) as the one you gave, yet you set out to establish the reliability of the 60 second test. The statistic to help out is called the Spearman-Brown Prophecy formula. It can be employed any time you manipulate test length or want to hypothesize what would happen to reliability if test length were altered. The formula is: r = (m × r) / (1 + (m - 1) × r)
Where m is the amount by which you wish to boost (or diminish) test length. In this case, since you split the test in half, m = 2 boosts reliability back up to the full length test. So, for the example above, the reliability of the full length test is r = (2 × .79) / (1 + .79) ≈ .88.
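Steps 1 through 4 for the sit-up data above can be sketched in Python (variable names are mine; statistics.variance returns the squared standard deviation):

```python
import statistics

# Sit-up counts from the example: first 30 seconds, second 30 seconds
first = [15, 26, 20, 18, 25, 26, 20]
second = [18, 22, 23, 18, 21, 24, 19]
total = [a + b for a, b in zip(first, second)]

k = 2  # number of scores per person (two halves)
sum_part_vars = statistics.variance(first) + statistics.variance(second)
alpha_half = (k / (k - 1)) * (1 - sum_part_vars / statistics.variance(total))

# Spearman-Brown boost back to the full 60-second test (m = 2)
alpha_full = (2 * alpha_half) / (1 + alpha_half)
print(round(alpha_half, 2), round(alpha_full, 2))  # 0.79 0.88
```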
Reminder:
As a test administrator it is important to identify and eliminate as many sources of error as possible in order to enhance reliability. To estimate the amount of measurement error present in observed scores, the standard error of measurement (SEM) can be calculated following calculation of a reliability coefficient: SEM = S√(1 - R), where S is the standard deviation of the observed scores and R is the reliability coefficient.
The SEM is a band (±) you place around a person's observed score that estimates measurement error.
Important Note:
The standard deviation to be used in the SEM formula above is the standard deviation of the total column. The reason is you want a standard deviation that reflects the spread of scores on the full length test.
Interpretation:
An estimate of measurement error is interpreted relative to the standard deviation of observed scores. The smaller the SEM relative to the standard deviation the more accurate the measures are.
This way of examining reliability requires that you give the test at least twice to one group to get at least two scores per person. From these two or more scores per person an analysis of variance table (ANOVA) is constructed (typically via software). From this table an Intraclass R can be calculated to assess how consistent the two or more measures per person were.
The reason behind using an ANOVA table is that since you expect variability between students but not variability across measures you should be able to estimate reliability by comparing the variances found in an ANOVA table.
Intraclass R formula: R = (MSB - MSe) / MSB
MSB = mean square between (from ANOVA table)
MSe = mean square error = (SSw + SSi) / (dfw + dfi)
SSw = sum of squares within (from ANOVA table)
SSi = sum of squares interaction (from ANOVA table)
dfw = degrees of freedom within (from ANOVA table)
dfi = degrees of freedom interaction (from ANOVA table)
ANOVA Table
Source of Variance    Degrees of Freedom    Sums of Squares    Mean Squares
Between               N - 1                 Given              SSb/dfb
Within                k - 1                 Given              SSw/dfw
Interaction           (N-1)(k-1)            Given              SSi/dfi
Total                 N(k) - 1
N = number of students
k = number of scores per student (not the number of trials/items)
Assume a test has been given twice and from the 2 scores per person the following ANOVA table is constructed.
ANOVA Table
Source         df    SS      MS
Between        20    4000    200
Within          1     500    500
Interaction    20    1000     50
Total          41    5500
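A sketch of the calculation for this test-retest table, assuming the standard Intraclass R formula (the function name is mine):

```python
def intraclass_r(msb, ssw, ssi, dfw, dfi):
    """Intraclass R = (MSB - MSe) / MSB, with MSe = (SSw + SSi) / (dfw + dfi)."""
    mse = (ssw + ssi) / (dfw + dfi)
    return (msb - mse) / msb

# Values from the ANOVA table above
r = intraclass_r(msb=200, ssw=500, ssi=1000, dfw=1, dfi=20)
print(round(r, 2))  # 0.64
```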
This way of looking at reliability requires that you give the whole test twice to one group. If the measures are reliable they will be stable over the time between the two administrations and scores will be fairly consistent across the group (provided no significant changes take place between administrations).
Example: Consider a 60 second sit up test administered twice:
Day 1    Day 2    Total
52       50       102
41       43        84
40       38        78
34       36        70
38       40        78
40       42        82
1. Get the standard deviations for each column
2. Square the standard deviations
3. Use coefficient alpha: α = [k / (k - 1)] [1 - (ΣSp²) / Sxt²]
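The three steps for the day 1/day 2 data above can be sketched in Python (variable names are mine; no Spearman-Brown boost appears in the steps, since each administration was a full-length test):

```python
import statistics

# Sit-up counts from the two administrations in the example
day1 = [52, 41, 40, 34, 38, 40]
day2 = [50, 43, 38, 36, 40, 42]
total = [a + b for a, b in zip(day1, day2)]

k = 2  # number of scores per person
sum_part_vars = statistics.variance(day1) + statistics.variance(day2)
alpha = (k / (k - 1)) * (1 - sum_part_vars / statistics.variance(total))
print(round(alpha, 2))  # 0.96
```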
Reminder:
As a test administrator it is important to identify and eliminate as many sources of error as possible in order to enhance reliability. To estimate the amount of measurement error present in observed scores, the standard error of measurement (SEM) can be calculated following calculation of a reliability coefficient: SEM = S√(1 - R), where S is the standard deviation of the observed scores and R is the reliability coefficient.
The SEM is a band (±) you place around a person's observed score that estimates measurement error.
Important Note:
The standard deviation to be used in the SEM formula above is the standard deviation of an average of the two (or more) scores per person. The reason is you want a standard deviation that reflects the spread of scores on one administration of the test. When you have two or more full length test scores per person, the best estimate of their ability is an average.
Day 1    Day 2    Average
52       50       51
41       43       42
40       38       39
34       36       35
38       40       39
40       42       41
Note: An estimate of measurement error is interpreted relative to the standard deviation of observed scores. The smaller the SEM relative to the standard deviation the more accurate the measures are.
Reliability is now defined as the accurate (determined by examining consistency) classification of people into mastery states. Again the two perspectives from which reliability can be examined are internal consistency and stability. A mastery test's scores and subsequent classifications based on a cut score are reliable if classifications are consistent over time (stability) or the probability of consistent classification is high (internal consistency).
                                            Mastery Re-test Classification
                                            Master      Non-Master
Mastery Test Classification   Master          a             b
                              Non-Master      c             d
For example, assume the scores below were from a mastery skills test and the cut score used was 6.
Test Score    Test Classification    Re-test Score    Re-test Classification
12            (M)                      8              (M)
 6            (M)                      4              (NM)
 5            (NM)                     5              (NM)
 4            (NM)                     6              (M)
 7            (M)                      9              (M)
 8            (M)                     10              (M)
                      Re-test
                      Master    Non-Master
Test    Master          3           1
        Non-Master      1           1
Turning the counts above into proportions:
                      Re-test
                      Master    Non-Master
Test    Master         .50         .17
        Non-Master     .17         .17
So, the proportion of agreement is: .50 + .17 = .67 meaning that 67% of the classifications made across two administrations of this mastery test were in agreement.
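A sketch of the proportion-of-agreement calculation (the function name is mine):

```python
def proportion_of_agreement(a, b, c, d):
    """P = (a + d) / n for a 2x2 table; a and d are the agreement cells."""
    n = a + b + c + d
    return (a + d) / n

# Counts from the test/re-test example: a = 3, b = 1, c = 1, d = 1
print(round(proportion_of_agreement(3, 1, 1, 1), 2))  # 0.67
```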
Kappa: This statistic, though fairly unstable when group size is small, can take chance into account. In a research setting it is best to report both Kappa and the Proportion of Agreement. In other cases, if you want to report just one, report the Proportion of Agreement: it is easily understood and does give information on stability.
Steps:
Example:
                      Re-test
                      Master    Non-Master
Test    Master          5           0
        Non-Master      1           2
In the example above, the chance component would be: Pc = (5/8 × 6/8) + (3/8 × 2/8) = .47 + .09 = .56
The proportion of agreement would be
Pag = .63 +.25 = .88
Calculate Kappa: K = (Pag - Pc) / (1 - Pc). Using the unrounded values from the data above, K = (.875 - .5625) / (1 - .5625) ≈ .71.
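A sketch of the Kappa calculation for the example counts, computing chance agreement from the table marginals (the function name is mine):

```python
def kappa(a, b, c, d):
    """Chance-corrected agreement for a 2x2 classification table."""
    n = a + b + c + d
    p_a = (a + d) / n                                      # observed agreement
    p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (p_a - p_c) / (1 - p_c)

# Counts from the example: a = 5, b = 0, c = 1, d = 2
print(round(kappa(5, 0, 1, 2), 2))  # 0.71
```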
A trials-to-criterion test is one where a success criterion (Rc) is set (number of successful attempts to be accumulated) and examinees continue testing until they reach that set criterion. The test score then is the number of trials taken to reach the criterion and a low score reflects good performance.
Or, if you want to express performance as a proportion you can calculate:
p = (Rc - 1) / (T - 1)
where
p = test score expressed as a proportion (on a scale of 0 to 1) - high score is good
Rc = success criterion
T = number of trials taken
Reliability in this context is related to the accuracy of the TTC test scores. Again, minimizing measurement error is the key to enhancing reliability. The statistics available for estimating reliability include the Intraclass R, coefficient alpha, and the UEV (for internal consistency only).
To estimate the reliability of scores from a TTC skills test administered once using the Intraclass R, you make use of the multiple scores per person produced when administering a TTC test and from those scores construct an ANOVA table from which you calculate the Intraclass R.
Intraclass R formula: R = (MSB - MSe) / MSB
MSB = mean square between (from ANOVA table)
MSe = mean square error = (SSw + SSi) / (dfw + dfi)
SSw = sum of squares within (from ANOVA table)
SSi = sum of squares interaction (from ANOVA table)
dfw = degrees of freedom within (from ANOVA table)
dfi = degrees of freedom interaction (from ANOVA table)
ANOVA Table
Source of Variance    Degrees of Freedom    Sums of Squares    Mean Squares
Between               N - 1                 Given              SSb/dfb
Within                k - 1                 Given              SSw/dfw
Interaction           (N-1)(k-1)            Given              SSi/dfi
Total                 N(k) - 1
N = number of students
k = number of scores per student (in this case it is the value of the success criterion)
Example: Assume the information in the ANOVA table below came from a TTC test administered once with a success criterion of 5.
Source         df     SS      MS
Between        25    7075     283
Within          4     300      75
Interaction   100    9000      90
Total         129   16375
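A sketch of the calculation for this TTC example, assuming the standard Intraclass R formula (the function name is mine):

```python
def intraclass_r(msb, ssw, ssi, dfw, dfi):
    """Intraclass R = (MSB - MSe) / MSB, with MSe = (SSw + SSi) / (dfw + dfi)."""
    mse = (ssw + ssi) / (dfw + dfi)
    return (msb - mse) / msb

# Values from the TTC ANOVA table above (success criterion of 5, so k = 5)
r = intraclass_r(msb=283, ssw=300, ssi=9000, dfw=4, dfi=100)
print(round(r, 2))  # 0.68
```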
To estimate the reliability of scores from a TTC skills test administered once using UEV you work with the TTC test scores' mean and standard deviation.
For example, assume the information below came from a group of 30 who took a TTC badminton serve test where a success criterion of 6 was used.
Mean = 16
Standard Deviation = 10
To estimate the reliability of scores from a TTC skills test administered once using coefficient alpha you make use of the multiple scores per person produced when administering a TTC test and examine the consistency of those scores.
For example, assume the scores below came from a TTC skills test administered once with a success criterion of 4. A zero represents an unsuccessful attempt at the skill and a one a successful attempt. For every student, trials continue to be taken until 4 successes have been accumulated. The test score then is the number of trials taken to reach the success criterion of 4.
Student    Trials to      Trials to      Trials to      Trials to      TTC test
           1st success    2nd success    3rd success    4th success    score
1          0001           00001          00001          00001          19
2          01             001            1              01              8
3          1              1              1              1               4
4          001            01             0001           001            12
5          1              01             1              1               5
6          0001           001            1              001            11
7          1              1              001            01              7
8          01             01             001            01              9
Reminder:
As a test administrator it is important to identify and eliminate as many sources of error as possible in order to enhance reliability. To estimate the amount of measurement error present in observed scores, the standard error of measurement (SEM) can be calculated following calculation of a reliability coefficient: SEM = S√(1 - R), where S is the standard deviation of the observed scores and R is the reliability coefficient.
The SEM is a band (±) you place around a person's observed score that estimates measurement error.
Important Note:
The standard deviation to be used in the SEM formula above is the standard deviation of the TTC test score column. The reason is you want a standard deviation that reflects the spread of scores on the full length TTC test.
Interpretation:
An estimate of measurement error is interpreted relative to the standard deviation of observed scores. The smaller the SEM relative to the standard deviation the more accurate the measures are.
This way of examining reliability requires that you give the TTC test at least twice to one group to get at least two scores per person. From these two or more scores per person an analysis of variance table (ANOVA) is constructed (typically via software). From this table an Intraclass R can be calculated to assess how consistent the two or more measures per person were.
For example, assume you administered a TTC test with a success criterion of 6 to a group then retested them three more times. From these four scores per person you construct an ANOVA table from which you calculate the Intraclass R statistic to estimate reliability.
Source         df     SS       MS
Between        27    6870     254.44
Within          3    1200     400.00
Interaction    81    9340     115.31
Total         111   17410
To estimate the reliability of scores from a TTC skills test administered at least twice using coefficient alpha you use the TTC test score from each administration of the TTC test for each individual.
For example, assume you administered a TTC test with a success criterion of 5 to a group then retested them two more times. From these three scores per person you assess reliability by examining the consistency across the 3 sets of scores using coefficient alpha.
TTC Test    TTC Re-test 1    TTC Re-test 2    Total
18          19               17               54
12          15               13               40
 7          10                8               25
10           9                9               28
12          10               10               32
15          14               13               42
Reminder:
As a test administrator it is important to identify and eliminate as many sources of error as possible in order to enhance reliability. To estimate the amount of measurement error present in observed scores, the standard error of measurement (SEM) can be calculated following calculation of a reliability coefficient: SEM = S√(1 - R), where S is the standard deviation of the observed scores and R is the reliability coefficient.
The SEM is a band (±) you place around a person's observed score that estimates measurement error.
Important Note:
The standard deviation to be used in the SEM formula above is the standard deviation of an average of the two or more scores per person. The reason is you want a standard deviation that reflects the spread of scores on one administration of the test. When you have two or more full length test scores per person, the best estimate of their ability is an average.
Test    Re-test 1    Re-test 2    Average
18      19           17           18
12      15           13           13.33
 7      10            8            8.33
10       9            9            9.33
12      10           10           10.67
15      14           13           14
Note: An estimate of measurement error is interpreted relative to the standard deviation of observed scores. The smaller the SEM relative to the standard deviation the more accurate the measures are.
A student's TTC score is now converted to a classification (master or non-master) based on a predetermined cut score. As in any other mastery testing framework determining a good cut score is the most difficult element. One approach however, is to determine (from a philosophical perspective) what you believe the probability of success should be for a master then convert that probability to a cut score. The conversion is done by:
TTC cut score = R / p
R = success criterion
p = probability of success, expressed as a proportion
For example, assume the probability of success you would expect from a master of some skill is .65 and the success criterion you want to use in your TTC test is 7. The TTC cut score you would use to classify examinees as masters or non-masters would be:
TTC cut score = 7 / .65 = 10.77
So, the TTC cut score is 10.77. Remember that a low score is good on a TTC test. Therefore, individuals with a TTC score below 10.77 are classified as masters and those above 10.77 are classified as non-masters.
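The conversion and classification rule above can be sketched in Python (function names are illustrative):

```python
def ttc_cut_score(success_criterion, p_master):
    """Cut score = R / p: the number of trials a master should need."""
    return success_criterion / p_master

def classify(ttc_score, cut):
    # a low TTC score is good: fewer trials to reach the criterion
    return "master" if ttc_score < cut else "non-master"

cut = ttc_cut_score(7, 0.65)
print(round(cut, 2))     # → 10.77
print(classify(9, cut))  # → master
print(classify(12, cut)) # → non-master
```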
Methods
 | TTC Mastery Re-test Classification |
TTC Mastery Test Classification | Master | Non-Master
Master | a | b
Non-Master | c | d
Kappa: This statistic, though fairly unstable when the group size is small, takes chance agreement into account. In a research setting it is best to report both Kappa and the Proportion of Agreement. In other cases, if you want to report just one, report the Proportion of Agreement: it is easily understood and does give information on stability.
Steps:
6. Calculate Kappa
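The Proportion of Agreement and Kappa can both be computed from the a, b, c, d cells of the 2 x 2 table above. A minimal sketch, using hypothetical counts (not from the text) for illustration:

```python
def agreement_stats(a, b, c, d):
    """Proportion of agreement and kappa from a 2x2 test/re-test table.

    a = classified master both times, d = non-master both times,
    b and c = inconsistent classifications.
    """
    n = a + b + c + d
    pa = (a + d) / n  # proportion of agreement
    # chance agreement from the marginal totals
    pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (pa - pc) / (1 - pc)
    return pa, kappa

# hypothetical counts for illustration
pa, kappa = agreement_stats(a=12, b=3, c=2, d=8)
print(round(pa, 2), round(kappa, 2))  # → 0.8 0.59
```

Note how Kappa is lower than the raw Proportion of Agreement because it removes the agreement expected by chance alone.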
Whenever measures have a strong subjective component to them it is essential to examine objectivity. Subjectivity itself is a source of measurement error and so affects reliability and validity. Therefore, objectivity is a matter of determining the accuracy of measures by examining consistency across multiple observations (multiple judges on one occasion, or repeated measures over time from one evaluator) that typically involve the use of rating scales.
To examine objectivity you need to have either multiple evaluators assessing performance/knowledge on one occasion, or one evaluator assessing the same performance (videotaped)/knowledge twice. The two or more measures per examinee collected are then used to construct an ANOVA table from which the Intraclass R can be calculated.
If the measures (typically from a rating scale) are objective they will be consistent across the two or more measures per examinee.
Example: Consider a skills test assessing Tennis serve technique. One qualified observer scores each person in a group using a rating scale that assesses the execution of the tennis serve. That same observer rates the videotaped serves from each person a second time without reference to the first ratings. Objectivity is then estimated by using these two scores per examinee to construct an ANOVA table and calculate the Intraclass R.
Source | df | SS | MS
Between | 29 | 4350 | 150
Within | 1 | 40 | 40
Interaction | 29 | 1450 | 50
Total | 59 | 5840 |
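Formulations of the Intraclass R differ in how they treat the error terms; the sketch below uses the common formulation equivalent to coefficient alpha, which treats the subject-by-trial interaction as error. This is an assumption about the intended formula, not a statement of the only correct one:

```python
# mean squares from the ANOVA table above
ms_between = 150     # between-subjects mean square
ms_interaction = 50  # subject-by-trial interaction, treated as error

# Intraclass R (alpha-equivalent formulation)
R = (ms_between - ms_interaction) / ms_between
print(round(R, 2))  # → 0.67
```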
This way of examining objectivity requires that you have either multiple evaluators assessing performance/knowledge on one occasion, or one evaluator assessing the same performance (videotaped)/knowledge twice. If the measures (typically from a rating scale) are objective they will be consistent across the two or more measures per examinee.
Example: Consider a skills test assessing cartwheel technique. Three qualified observers score each person in a group using a rating scale that assesses the execution of the components of a cartwheel. Objectivity can then be estimated using coefficient alpha.
Student | Judge 1 | Judge 2 | Judge 3 | Total
1 | 12 | 10 | 8 | 30
2 | 8 | 10 | 11 | 29
3 | 3 | 4 | 5 | 12
4 | 14 | 10 | 11 | 35
5 | 10 | 10 | 12 | 32
6 | 8 | 8 | 9 | 25
7 | 7 | 8 | 9 | 24
8 | 10 | 10 | 9 | 29
9 | 12 | 10 | 10 | 32
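Coefficient alpha for the three judges' ratings above can be computed the same way as for repeated test scores, treating each judge as a trial:

```python
from statistics import variance  # sample variance (n - 1 denominator)

# ratings from the table above, one list per judge
judge1 = [12, 8, 3, 14, 10, 8, 7, 10, 12]
judge2 = [10, 10, 4, 10, 10, 8, 8, 10, 10]
judge3 = [8, 11, 5, 11, 12, 9, 9, 9, 10]

k = 3  # number of judges
totals = [sum(scores) for scores in zip(judge1, judge2, judge3)]
judge_var_sum = variance(judge1) + variance(judge2) + variance(judge3)
alpha = (k / (k - 1)) * (1 - judge_var_sum / variance(totals))
print(round(alpha, 2))  # → 0.87
```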
Reminder:
As a test administrator it is important to identify and eliminate as many sources of error as possible in order to enhance reliability. To estimate the amount of measurement error present in observed scores, the standard error of measurement (SEM) can be calculated following calculation of an objectivity coefficient.
The SEM is a band (±) placed around a person's observed score that estimates measurement error: SEM = s√(1 − r), where s is the standard deviation of the scores and r is the objectivity coefficient.
Important Note:
The standard deviation to be used in the SEM formula above is the standard deviation of the averages of the multiple scores per person. The reason is that you want a standard deviation that reflects the spread of scores for the best measure of ability. When you have two or more test scores per person, the best estimate of their ability is the average.
Student | Judge 1 | Judge 2 | Judge 3 | Average
1 | 12 | 10 | 8 | 10
2 | 8 | 10 | 11 | 9.67
3 | 3 | 4 | 5 | 4
4 | 14 | 10 | 11 | 11.67
5 | 10 | 10 | 12 | 10.67
6 | 8 | 8 | 9 | 8.33
7 | 7 | 8 | 9 | 8
8 | 10 | 10 | 9 | 9.67
9 | 12 | 10 | 10 | 10.67
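The SEM for the averaged ratings can then be sketched as follows; the objectivity coefficient of .87 is assumed for illustration (e.g., an alpha computed across the three judges):

```python
import statistics
import math

# average of the three judges' ratings per student (table above)
averages = [10, 9.67, 4, 11.67, 10.67, 8.33, 8, 9.67, 10.67]

r = 0.87  # assumed objectivity coefficient, for illustration
sd = statistics.stdev(averages)
sem = sd * math.sqrt(1 - r)  # SEM = s * sqrt(1 - r)
print(round(sem, 2))  # → 0.81
```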
Note: An estimate of measurement error is interpreted relative to the standard deviation of observed scores. The smaller the SEM relative to the standard deviation the more accurate the measures are.