I. Introduction
In the fields of education and psychology, concern about the judgments of raters has been growing. Rater effects such as severity/leniency and central tendency are commonly viewed as a source of method variance, that is, as a source of systematic variance in observed ratings that is associated with the raters and not with the examinees (Cronbach, 1995; Eckes, 2005; Engelhard & Wind, 2013; Toffoli et al., 2016). Since examinees naturally vary in their abilities, we do not expect all of them to receive the same rating; rather, we expect the ratings of examinees' abilities to vary from one examinee to another. Any variability in the ratings of examinees that is due to dependable differences in their abilities is desirable. However, ratings are also influenced by extraneous factors such as items, raters, occasions, and sub-categories. There are two general approaches to identifying the effects of raters and other facets: Generalizability Theory (GT) and Many-Facet Rasch Measurement (MFRM; Linacre, 1989).
Estimation of reliability in GT concerns itself with discovering how similar the observed raw scores might be to any other raw scores the examinees might obtain under very similar circumstances. Its aim is to estimate the error variance associated with examinee raw scores, not to adjust any examinee's raw score for the particular raters and items that the examinee encountered (Cronbach et al., 1972). In general, the variance of observed scores is decomposed into a universe score variance and variances associated with the multiple sources of error and their interactions.
Estimation of reliability in GT proceeds in two stages. First, a Generalizability study (G-study) is performed to obtain estimates of variance components for the universe of admissible observations. In the second stage, one or more Decision studies (D-studies) use the estimated variance components from the G-study to estimate variance components for alternative measurement designs (Smith & Kulikowich, 2004).
Proper use of Rasch models allows for the separability of parameter estimates (van der Linden & Hambleton, 1997). This means that the ability estimates of examinees are not influenced by the distributional properties of the particular items attempted or the particular raters who rate the performance. MFRM concerns itself with obtaining for each examinee a measure from which the details of the examinee's particular raters, items, and tasks have been removed. From the MFRM perspective, the analytic purpose is to transform non-linear raw scores into a linear measure, adjusting it for the specific items, raters, or tasks (Linacre, 1993). Smith and Kulikowich (2004) discussed several specific advantages of MFRM over GT. Thus, these two alternative approaches address two different, but related, problems.
In the educational measurement field, there is a sizeable literature on rater effects using both GT and MFRM (Kim, 2005; Kim & Wilson, 2009; Linacre, 1995; Lunz & Schumacker, 1997; MacMillan, 2000; Marcoulides, 1999). These studies mainly compared GT and MFRM and explained how the two measurement techniques can inform future assessment construction or data collection. They also attempted to detect and correct for rater variability using both methods. However, very few studies have compared the GT and MFRM approaches with a focus on interaction factors. The interaction effect is used in both GT and MFRM, but in slightly different ways. In GT, the interaction is defined as in factorial analysis of variance: the analysis reports one variance component for each interaction of two or more main effects. For example, the variance component for the interaction of persons and raters describes the extent to which persons were rank-ordered differently by different raters (Shavelson & Webb, 1991). In the MFRM framework, interactions between facets represent differential facet functioning, analogous to Differential Item Functioning (DIF) (Sudweeks et al., 2005). An interaction study helps to identify unusual interaction patterns among facets, that is, patterns that point to consistent deviations from what is anticipated on the basis of the specified model (Eckes, 2005).
This paper reports the results of a pilot study of a mathematics creative problem solving test administered in Korea in 2010. In the context of rater-mediated, performance-based assessment such as a mathematics creative problem solving test, raters and tasks are two major sources of score variability and measurement error. When new types of task such as creative problem solving assessments are used, it is important to examine the rating processes and the impact of the new task types through statistical or measurement methods. To date, however, very few papers have directly compared the GT and MFRM approaches for a mathematics creative problem solving test. The purposes of this study are:
- To determine the variability in the ratings that is due to inconsistencies between raters with items, criteria, and the interactions among these variables (G-study).
- To estimate how many conditions of each facet are required to reach a suggested goal level of generalizability (D-study).
- To investigate how raters differ in severity or leniency in the mathematics creative problem solving test.
- To compare and contrast the use of GT vs. MFRM, focusing on interaction effects.
II. Review of the Methods
Overviews of the essential features of GT were provided by Feldt and Brennan (1989) and Shavelson and Webb (1991). In-depth descriptions of the concepts and methods of GT were introduced by Cronbach et al. (1972) and Brennan (2001).
GT uses an analysis of variance approach based on raw scores to provide estimates of scoring variation due to raters, items, tasks, or other facets. By calculating the magnitude of the variance components, the sources of the largest measurement error can be found (Kim & Wilson, 2009). In the GT framework, the error term can be partitioned into systematic error and random error. Here, the systematic error represents facet variability that can be further partitioned depending on the number of facets involved in the research design and can be used in determining the dependability of a measurement (Cronbach et al., 1972). Similar to variables having values, facets are comprised of levels that can be defined as random or fixed (Shavelson & Webb, 1991). Random facets include levels that are exchangeable with others from the universe of generalization. Conceptually, a random facet indicates that the levels included in the analysis are an unbiased sample of the levels that could be drawn from the universe of generalization (Cronbach et al., 1972).
In general, the point of a G-study is to obtain estimates of variance components associated with a universe of admissible observations. These estimates can be used to construct a measurement design for operational use and to provide information for making substantive decisions efficiently (Brennan, 2001). The D-study then focuses on the specification of a universe of generalization, which is the universe to which the stakeholder needs to generalize based on the results of a measurement procedure (Brennan, 2001).
For example, Equations (1) and (2) respectively represent the relative and absolute error variances of a fully crossed design with rater and item facets. Here, the relative error variance is the sum of all variance components that represent an interaction between the object of measurement and one of the facets, and the absolute error variance is the sum of all variance components except the variance component for students.

$$\sigma^2(\delta) = \frac{\sigma^2(pr)}{n_r} + \frac{\sigma^2(pi)}{n_i} + \frac{\sigma^2(pri,e)}{n_r n_i} \qquad (1)$$

$$\sigma^2(\Delta) = \frac{\sigma^2(r)}{n_r} + \frac{\sigma^2(i)}{n_i} + \frac{\sigma^2(pr)}{n_r} + \frac{\sigma^2(pi)}{n_i} + \frac{\sigma^2(ri)}{n_r n_i} + \frac{\sigma^2(pri,e)}{n_r n_i} \qquad (2)$$

where
$\sigma^2(r)$: the rater facet variance component
$\sigma^2(i)$: the item facet variance component
$\sigma^2(pr)$: the person by rater interaction variance component
$\sigma^2(pi)$: the person by item interaction variance component
$\sigma^2(ri)$: the rater by item interaction variance component
$\sigma^2(pri,e)$: the person by rater by item interaction variance component confounded with random error
$n_r$: the number of raters to be used in the study
$n_i$: the number of items to be used in the study
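As a computational illustration (not part of the original analysis), the following Python sketch evaluates Equations (1) and (2) for a set of hypothetical variance components; all numeric values are placeholders, not estimates from this study.

```python
# Illustrative sketch of Equations (1) and (2) for a fully crossed p x r x i design.
# The variance components below are hypothetical placeholders, not values from this study.

var_components = {
    "r": 0.010,      # rater main effect
    "i": 0.060,      # item main effect
    "pr": 0.020,     # person x rater interaction
    "pi": 0.140,     # person x item interaction
    "ri": 0.005,     # rater x item interaction
    "pri,e": 0.110,  # person x rater x item interaction confounded with random error
}

def relative_error_variance(vc, n_r, n_i):
    """Equation (1): sum of components involving persons and a facet."""
    return vc["pr"] / n_r + vc["pi"] / n_i + vc["pri,e"] / (n_r * n_i)

def absolute_error_variance(vc, n_r, n_i):
    """Equation (2): sum of all components except the person variance."""
    return (vc["r"] / n_r + vc["i"] / n_i
            + vc["pr"] / n_r + vc["pi"] / n_i
            + vc["ri"] / (n_r * n_i) + vc["pri,e"] / (n_r * n_i))

if __name__ == "__main__":
    n_r, n_i = 4, 5  # numbers of raters and items assumed for this illustration
    print("relative error variance:", relative_error_variance(var_components, n_r, n_i))
    print("absolute error variance:", absolute_error_variance(var_components, n_r, n_i))
```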
In GT, two types of reliability or dependability are considered: relative and absolute. Relative dependability (the G-coefficient, $E\rho^2$) refers to the consistency with which examinees can be ranked based on their performance. Absolute dependability ($\Phi$) refers to the consistency with which scores occur around a particular scale point; it is therefore possible to determine the consistency with which ratings from different raters occur around a specific quality point of performance (Shavelson & Webb, 1991). Using Equations (3) and (4), relative and absolute dependability coefficients for specific measurement designs can be estimated.

$$E\rho^2 = \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\delta)} \qquad (3)$$

$$\Phi = \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\Delta)} \qquad (4)$$

where
$\sigma^2(p)$: the person (universe score) variance component
$\sigma^2(\delta)$: the relative error variance
$\sigma^2(\Delta)$: the absolute error variance
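Continuing the illustration above, a minimal sketch of Equations (3) and (4), again with hypothetical values rather than the estimates reported later in this paper:

```python
# Equations (3) and (4): relative (G) and absolute (Phi) dependability coefficients.
# All inputs are hypothetical placeholders used only to show the arithmetic.

sigma2_p = 0.150        # person (universe score) variance
sigma2_delta = 0.0385   # relative error variance, e.g. from Equation (1)
sigma2_Delta = 0.0533   # absolute error variance, e.g. from Equation (2)

g_coefficient = sigma2_p / (sigma2_p + sigma2_delta)    # Equation (3)
phi_coefficient = sigma2_p / (sigma2_p + sigma2_Delta)  # Equation (4)

print(f"G coefficient:   {g_coefficient:.3f}")
print(f"Phi coefficient: {phi_coefficient:.3f}")
```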
Since the 1990s, performance-based language assessment has been essential for testing students' linguistic knowledge and thinking skills. Accordingly, the MFRM has been used to analyze individual raters' characteristics and their detailed influence on the scoring process (McNamara & Knoch, 2012), for example raters' leniency/severity, scoring consistency, and rater training effects. Nystrand et al. (1993) and Weigle (1999) investigated the effects caused by tasks and test types, as well as their interaction and relationship with rater characteristics. Gyagenda and Engelhard (2009) reported on the reliability of raters' assessments of students' writing ability. Sudweeks, Reeve, and Bradshaw (2005) studied biases and interactions among elements that were systematic error sources in university students' essays. Johnson and Lim (2009) investigated the influence of raters' first language on their assessments of English as a second language proficiency. Recently, applications of the MFRM have appeared not only in traditional education but also in other research fields, for example in studies of creative writing (Bardot et al., 2012), creativity (Hung et al., 2012), job analysis scales (Wang & Stahl, 2012), food behavior analysis (Vianello & Robusto, 2010), and medical performance assessment (McManus et al., 2013).
The MFRM is derived from the Rasch family of models for polytomous items. The partial credit model of Masters (1982) is a generalization of the rating scale model in which each item has its own rating scale structure, allowing greater flexibility in how items are modeled. Equation (5) presents the partial credit MFRM with four facets (examinees, items, raters, and categories), introduced by Linacre and Wright (2002), which allows each item to have its own scale of classification:

$$\ln\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k \qquad (5)$$

where
$P_{nijk}$: the probability of category k being observed
$P_{nij(k-1)}$: the probability of category k − 1 being observed
$B_n$: the ability of person n
$D_i$: the difficulty of item i
$C_j$: the severity of judge j
$F_k$: the difficulty of being rated in category k rather than category k − 1
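To make Equation (5) concrete, the following sketch converts a set of facet parameters into category probabilities. The parameter values are hypothetical and the function is illustrative only; the actual estimation in this study was carried out with the FACETS program.

```python
import math

def mfrm_category_probs(B_n, D_i, C_j, F):
    """Category probabilities under the four-facet Rasch model in Equation (5).

    B_n : person ability (logits)
    D_i : item difficulty (logits)
    C_j : rater severity (logits)
    F   : list of step difficulties F_1..F_m for categories 1..m (category 0 is the base)
    Returns probabilities for categories 0..m.
    """
    # Cumulative sums of (B - D - C - F_k) give the unnormalized log-odds of each category.
    logits = [0.0]
    for F_k in F:
        logits.append(logits[-1] + (B_n - D_i - C_j - F_k))
    expvals = [math.exp(x) for x in logits]
    total = sum(expvals)
    return [v / total for v in expvals]

# Hypothetical illustration: an average-ability examinee, a moderately hard item,
# a slightly severe rater, and three step difficulties for a 0-3 rating scale.
probs = mfrm_category_probs(B_n=0.5, D_i=0.6, C_j=0.3, F=[-1.0, 0.2, 0.8])
print([round(p, 3) for p in probs])
```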
The MFRM fulfills the same requirement of objectivity as the other Rasch models. The test scores are sufficient statistics for estimating each parameter, and the parameters of each facet are estimated independently of the other facets. Thus, the examinees' ability measures are independent of the particular items and raters (Linacre & Wright, 2002).
Brennan (2001), Linacre (1993, 1995, 2001), and Kim and Wilson (2009) discussed the comparison of GT and MFRM in terms of major research questions, statistical models, design issues, methods of data collection, standard results, and the limitations of the two approaches. They found it useful to conduct the GT analysis first to get an overview of how the assessment was performing and then to use the MFRM to understand more of the details.
III. Methods
Data for this study were the scores of 172 10th-grade students on a mathematics creative problem solving test. The test was administered for 50 minutes during July 2010 in a high school located in an urban area of Korea. It was composed of five open-ended questions (see Appendix 1) developed based on Nam (2007) and Shin et al. (1999). Four raters (two mathematics teachers and two mathematics education experts) scored all of the students' responses with a scoring rubric (Sheffield, 2006). Sheffield's rubric includes seven criteria for assessing mathematical creativity: depth of understanding, fluency, flexibility, originality, elaboration, generalizations, and extensions. The scoring rubric modified for this study is composed of four criteria: fluency, flexibility, originality, and elaboration (see Appendix 2).
A sample of answer sheets written by the 172 students was selected for analysis, with no missing values in the data. Four raters, all full- or part-time instructors in the mathematics department, rated each of the 172 answer sheets. Because each rater rated all answer sheets, the design for the study is a fully crossed, four-factor design: person by item by rater by criteria.
For the G-study, a fully crossed (172 participants, 5 items, 4 raters, and 4 criteria) random effects model was specified (see Figure 1). The G-study was conducted using the GENOVA program (Crick & Brennan, 1983). Variance components were estimated for each of the 15 sources of variability in the three-facet (items, raters, and criteria), fully crossed design (p×i×r×c). D-studies were also conducted for three different design structures. Error variances and reliability coefficients for relative and absolute decisions were calculated for each design, as well as for varying numbers of items and raters. The three D-study designs analyzed were the following:
- Fully crossed design, p × I × R: each rater rates all examinees on all items
- Persons crossed with (items nested within raters), p × (I:R): each rater rates all examinees on selected items
- (Raters nested within persons) crossed with items, (R:p) × I: each rater rates selected examinees on all items
The MFRM analysis was conducted using the FACETS program (Linacre, 2010). Four facets were analyzed: (a) 172 examinees, (b) 5 items, (c) 4 raters, and (d) 4 criteria. Once the parameters of each facet were calibrated from the four-facet main-effects model, ten interaction analyses (or bias analyses), including all two-way and three-way interactions among the facets, were performed to identify unusual patterns of rating performance across the person, item, or criteria facets that deviate from the expectations of the underlying model. The standardized residual, the standardized difference between the expected and observed ratings, is represented together with a logit-scaled bias measure, and interaction patterns with an absolute z-score greater than 2.00 were considered significantly biased. Fixed chi-square tests for each bias term were used to investigate whether the set of interactions can be considered acceptable after allowing for measurement error (Linacre, 2010).
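The following sketch illustrates, in simplified form, the logic of such a bias screen: residuals between observed and model-expected ratings are pooled within each interaction cell and standardized, and cells with |z| ≥ 2 are flagged. This is a simplified assumption, not the exact FACETS algorithm, and the data are hypothetical.

```python
import numpy as np

def bias_z_scores(observed, expected, variances, groups):
    """Simplified two-way bias screen in the spirit of a FACETS bias analysis.

    observed, expected, variances : per-response arrays (expected scores and model
        variances would normally come from the main MFRM calibration).
    groups : labels identifying the interaction cell (e.g. "rater2-item1") of each response.
    Returns a dict of z-scores per cell; |z| >= 2 is treated here as a flag for possible bias.
    """
    observed, expected, variances = map(np.asarray, (observed, expected, variances))
    groups = np.asarray(groups)
    z = {}
    for g in np.unique(groups):
        mask = groups == g
        resid_sum = np.sum(observed[mask] - expected[mask])
        info = np.sum(variances[mask])  # summed model variance (statistical information)
        z[g] = resid_sum / np.sqrt(info) if info > 0 else 0.0
    return z

# Hypothetical toy data for two rater-by-item cells.
obs = [3, 2, 3, 1, 0, 1]
exp = [2.4, 2.1, 2.6, 2.0, 2.1, 1.9]
var = [0.8, 0.7, 0.8, 0.9, 0.8, 0.9]
cells = ["r1-i1"] * 3 + ["r2-i1"] * 3
flags = {c: round(float(z), 2) for c, z in bias_z_scores(obs, exp, var, cells).items()
         if abs(z) >= 2}
print(flags)  # cells flagged for possible rater-by-item bias
```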
IV. Results
The estimated variance component for each of the 15 sources of variation in the ratings is reported in Table 1. The variance component attributed to persons represents variation due to individual differences. Ideally, the variance component for persons should be larger than any of the others. The estimated person variance component (0.152) indicates that examinees differ in their performance on the test. All remaining variance components represent sources of measurement error. The variance attributed to items (0.061) may be interpreted as implying that some items reflect more problem solving skill than others. The relatively small variance component for raters (0.010) means that raters do not differ much in their ratings when averaged over the other facets. The residual variance indicates that, even after accounting for the main effects and the two- and three-way interactions among the sources of error, 15.82% of the variance remained unaccounted for. Notably large variance components include the main effect for items (8.62%), the two-way person by item interaction (19.92%), and the three-way person by item by criteria interaction (7.34%). The large person by item component means that the rank order of the examinees differed across the five items. This may be due to the small number of items; only five items were used for the mathematics creative problem solving test. This provides evidence of the need for more items in this test.
From the variance components, the reliability for relative decisions about students' performance is 0.580 and for absolute decisions 0.488. Table 2 shows how the reliability of the ratings is likely to vary with different numbers of items and raters (criteria are not varied in this D-study).
The pattern in Table 2 shows that varying the number of items has a greater effect on reliability than varying the number of raters. To obtain a generalizability coefficient of at least .70, it would be necessary to use at least ten items if the number of raters is larger than four. A p×I×R design assumes that all examinees are rated by every rater on every item. Since this design is not feasible for a very large number of examinees or items, further D-studies were performed to project the effect of using other, more feasible designs (p × (I:R) and (R:p) × I). In these two designs, the pattern of the coefficients is very similar to that of the p×I×R design.
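In principle, Table 2-style projections for the fully crossed design can be reproduced by sweeping the D-study formulas over candidate numbers of items and raters, as in the sketch below; the variance components shown are hypothetical placeholders (not the GENOVA estimates from Table 1), and the nested designs would require the correspondingly redefined D-study components.

```python
# D-study projection sketch for the fully crossed p x I x R design.
# Variance components are hypothetical placeholders for illustration only.

vc = {"p": 0.150, "r": 0.010, "i": 0.060, "pr": 0.020,
      "pi": 0.140, "ri": 0.005, "pri,e": 0.110}

def g_coefficient(vc, n_r, n_i):
    """Equation (3) with the relative error variance of Equation (1)."""
    rel_err = vc["pr"] / n_r + vc["pi"] / n_i + vc["pri,e"] / (n_r * n_i)
    return vc["p"] / (vc["p"] + rel_err)

# Print a small grid of projected G coefficients by numbers of items and raters.
print("n_items  " + "  ".join(f"n_r={r}" for r in (1, 2, 4, 8)))
for n_i in (5, 10, 15, 20):
    row = "  ".join(f"{g_coefficient(vc, r, n_i):.3f}" for r in (1, 2, 4, 8))
    print(f"{n_i:7d}  {row}")
```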
The FACETS program places the students, raters, items, criteria, and rating scale onto a common interval scale, creating a single frame of reference for interpreting the results of the analysis (Eckes, 2009) (see Figure 3). Overall model fit can be assessed by examining responses that are unexpected given the assumptions of the model. According to Linacre (2010), satisfactory model fit is indicated when about 5% or less of the absolute standardized residuals are equal to or greater than 2, and about 1% or less are equal to or greater than 3. There were 13,760 valid responses included in this analysis. Of these, 100 responses (0.72%) had absolute standardized residuals equal to or greater than 2, and 55 responses (0.40%) had absolute standardized residuals equal to or greater than 3. These findings indicate satisfactory model fit for this analysis.
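A minimal sketch of this model-fit screen, assuming a vector of standardized residuals is available (here simulated rather than taken from FACETS output):

```python
import numpy as np

def residual_fit_summary(std_residuals):
    """Percentage of absolute standardized residuals >= 2 and >= 3,
    the rough model-fit screen described by Linacre (2010)."""
    z = np.abs(np.asarray(std_residuals, dtype=float))
    n = z.size
    return {"pct_ge_2": 100 * np.sum(z >= 2) / n,
            "pct_ge_3": 100 * np.sum(z >= 3) / n}

# Hypothetical standardized residuals, simulated only to illustrate the calculation.
rng = np.random.default_rng(0)
print(residual_fit_summary(rng.normal(size=13760)))
```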
The estimated abilities of the 172 examinees ranged from -3.30 to 1.23 logits. Fit statistics for each element within each facet report the extent to which the observed ratings differ from those expected by the model, given the estimated parameters. These fit statistics are reported as mean squares, which are simply chi-square statistics divided by the appropriate degrees of freedom (Smith & Kulikowich, 2004). Plausible ranges for these fit statistics depend on the testing situation, but one suggested range of acceptable values is from 0.5 to 1.5 (Engelhard, 1992). In this analysis, twenty-two of the 172 examinees had infit and outfit mean squares of 1.50 or above; however, these values were below 2.0, a range in which they would not be expected to distort the overall results.
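For illustration, the following sketch computes infit and outfit mean squares for a single element from observed ratings, model-expected ratings, and model variances; the inputs are hypothetical and the formulas follow the usual Rasch definitions rather than FACETS output.

```python
import numpy as np

def infit_outfit(observed, expected, variances):
    """Infit and outfit mean squares for one element (e.g. one examinee):
    outfit is the average squared standardized residual,
    infit is the information-weighted version."""
    obs, exp, w = (np.asarray(a, dtype=float) for a in (observed, expected, variances))
    resid = obs - exp
    outfit = np.mean(resid ** 2 / w)
    infit = np.sum(resid ** 2) / np.sum(w)
    return infit, outfit

# Hypothetical ratings for one examinee across a few item-rater combinations.
infit, outfit = infit_outfit([3, 2, 0, 1], [2.2, 2.6, 1.1, 1.4], [0.8, 0.7, 0.9, 0.8])
print(f"infit MS = {infit:.2f}, outfit MS = {outfit:.2f}")
```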
Table 3 shows the relative difficulty of the five items in this test. Positive values indicate items that were difficult relative to the other items, while negative values indicate items that were easier. Item 1 is the most difficult, with a measure of .64, and item 3 is the easiest, with -1.01. The fit statistics of the items range from .88 to 1.12.
The severity measures for each of the four raters are also reported in Table 3. Positive values indicate raters who were severe relative to the other raters, while negative values indicate raters who tended to assign relatively lenient ratings (Sudweeks et al., 2005). Rater 3 is the most severe, with a measure of .31, and rater 1 is the most lenient, with −.45. The infit and outfit statistics for the four raters are within the acceptable range of 0.5 to 1.5 (0.82 to 1.17).
Table 3 also provides information about the relative difficulty of the four criteria: flexibility, originality, elaboration, and fluency. Flexibility is the most difficult criterion, with a measure of .59, and fluency is the easiest, with −1.07. The fit statistics of the criteria range from .71 to 1.31.
Six sets of two-way bias analyses, four sets of three-way bias analyses, and residual analyses were performed. Once the MFRM main analysis is finished using the base model, the interaction analysis is carried out on the residuals of the main analysis, with the facet parameters from the main analysis fixed (Linacre, 2010). Here, the residuals between the observed and expected scores are calculated for each combination of elements, and these residuals are converted into logit measures and standardized z-scores (Lee & Kantor, 2015). Table 4 lists the total number of combinations of facet elements considered in each interaction analysis, the percentage of absolute z-scores equal to or greater than 2, the minimum and maximum z-scores, and their means and standard deviations. Z-scores over an absolute value of 2.0 are held to indicate significant interaction (Linacre, 2010). In this study, the percentages for the Person × Rater, Person × Criteria, Person × Item × Criteria, and Person × Rater × Criteria interactions were generally low. More than 40% of the combinations in the Item × Rater and Item × Criteria interactions were associated with significant differences between observed and expected ratings. This means that, in these interactions, the items involved respond to particular raters or criteria in ways that differ consistently from the other items.
Table 5 summarizes the interaction (bias) analyses for each source of variation in the ratings. The largest percentage of significant interactions was found for the Person × Item interaction (26.23%). Four three-way bias analyses were also performed; the two highest percentages of significant three-way interactions were for Person × Item × Rater (6.56%) and Person × Item × Criteria (6.56%). The pattern of interactions is similar to the results of the G-study. A comparison of the interaction-effect results from GT and MFRM is displayed in Table 6. The GT results show a relatively large variance component for the person by item interaction (19.92%), which means that items function differently across students in this assessment. The variance components related to rater interactions are small: raters applied similar standards across all students. The MFRM results indicate that each of the interaction effects (person by item, item by rater, item by criteria, rater by criteria, person by item by rater, and item by rater by criteria) is statistically significant.
V. Discussion
The purpose of this paper was to investigate rater effects in a mathematics creative problem solving test. The GT and MFRM analyses generally agree on which facets of the model generate the greatest proportion of variability. In both the GT and MFRM results, the person by item interaction is relatively large, indicating substantial variability. In particular, the MFRM interaction analyses revealed that about 26% of the Person × Item combinations, and about 6.5% of the Person × Item × Rater and Person × Item × Criteria combinations, produced unexpectedly large deviations from model expectations. Results from both methods also indicated that the variance due to raters and the interactions involving raters were relatively small. However, a few discrepancies were found between the GT and MFRM interaction analyses. In contrast to GT, which found a relatively large variance estimate for the person by item by criteria interaction, MFRM indicated significantly biased ratings in the item by rater and item by rater by criteria interactions.
The reliability of the mean rating for each examinee based on five items, four raters, and four rating criteria using a fully crossed design was 0.58 (G-coefficient) and 0.49 (phi coefficient). These values are lower than might be expected for a measure of reliability. However, the D-study provided guidelines for obtaining a more acceptable reliability coefficient: at least ten items would be needed. The use of a nested design in the D-study yielded reliability coefficients that differed by less than 3% from those of the fully crossed design. This finding means that considerable resources could be saved, with minimal loss in generalizability, by employing such a design.
To sum up, the findings of this study support the complementary roles that GT and MFRM play in the analysis of performance assessments. Depending on the purpose of a particular study, either GT or MFRM may be the appropriate measurement technique to use. As noted in previous research (Linacre, 1993, 1995, 2001; Kim & Wilson, 2009), GT is useful for providing group-level information (the internal consistency of the test and inter-rater agreement at the task level), and particularly for making overall decisions about test design. In other words, it allows us to gauge the relative influence of each factor on a measure of the target, and researchers can estimate how many conditions of each facet are needed to reach a suggested goal level of generalizability. MFRM provides more specific information, which can be fed into the test development and improvement process at many points. MFRM analysis therefore enables us to investigate individual scores after controlling for the facets.
This study has limitations that should be considered in future research. Although the results provide empirical evidence for the possible existence of rater effects in a mathematics creative problem solving test, we did not take into account any factors other than statistical or measurement properties. This limited our ability to suggest more practical implications, such as which characteristics of the participants, raters, or items may lead to rater effects.