Reliability Research

Human error affects the reliability of content analysis scoring when technicians code the content of a clause in one scoring category of a scale rather than another, or fail to code a clause that others do code. Variations in such coding can occur because of ambiguities in the formulation of the content categories, idiomatic or unusual expressions used by the speaker, variations among technicians in their understanding of the content categories, and variations in the mental state of the technicians. Such coding errors can only be minimized, not eliminated, by the developer of content analysis scales. The user has to introduce personal controls, in the form of scorer training and the design of scoring procedures, in order to reduce error variance arising from this source to a degree suitable for the user's purposes.

In the development of the Gottschalk-Gleser Content Analysis scales, considerable effort went into clarifying the definition of each content category to reduce ambiguity. Content categories of a new scale were given preliminary trials to determine whether the descriptions and definitions of each category were readily understandable and could be distinguished from other categories in the scale. Content categories that contained ambiguities resulting in poor consensus between scorers were reworded or eliminated.

A large number of studies were carried out examining the error variance in scoring the Gottschalk-Gleser Content Analysis scales (Gottschalk & Gleser, 1969; Gottschalk, 1995). For the Anxiety scale, the reliability coefficient for scoring the Total Anxiety scale ranged from 0.84 to 0.93 for the average of two codings, and the error variance of such averages ranged from 0.03 to 0.10. In general, the generalizability of scores on the Total Anxiety scale as coded by a single technician was 0.80, and that for the average of two independent scorings was about 0.90. For the Hostility scales, practically no variation in the general level of scores was found over a 2-year period or for any individual scorer. Scoring technicians were found to vary somewhat in their coding of a specific protocol from time to time, but the largest variance arose from differences in their perception of the scoring categories as applied to a particular protocol and in their mental set at any one time in coding a specific protocol. Thus, the use of the average scoring by two coders reduced the largest sources of error and efficiently increased generalizability. The estimated generalizability of the average score for Total Hostility Outward was 0.84. The Overt and Covert Hostility Outward subscales were scored slightly less reliably.
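
The relation between the single-technician figure (0.80) and the two-coder figure (about 0.90) quoted above can be illustrated with the Spearman-Brown formula for the reliability of an average of independent codings. This is only a sketch: the text does not state that this formula was the one used, and the function name spearman_brown and its parameters are chosen here for illustration.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Projected reliability of the mean of n_raters independent codings.

    Illustrative assumption: coders are interchangeable and their errors
    are independent, so averaging divides error variance by n_raters.
    """
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Single-technician generalizability reported for the Total Anxiety scale.
single = 0.80

# Projected value for the average of two independent scorings.
two_coders = spearman_brown(single, 2)
print(f"{two_coders:.2f}")  # prints 0.89, consistent with "about 0.90"
```

Under these assumptions, a third coder would add less than the second did, which is consistent with averaging two codings being described as an efficient way to increase generalizability.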

For the Social Alienation-Personal Disorganization scale, the reliability of single scoring was found to be 0.84, which compared favorably with the reliability of scoring the Gottschalk-Gleser Affect scales. English-language studies conducted in Australia and Canada found interscorer reliability coefficients in a range similar to those obtained in the United States, namely 0.80 and above (Viney, 1983).
