Understanding Reliability and Validity


The challenges of achieving reliability and validity are among the most difficult faced by researchers. In this section, we offer commentaries on these challenges.

Difficulties of Achieving Reliability

It is important to understand some of the problems concerning reliability which might arise. It would be ideal to reliably measure, every time, exactly those things which we intend to measure. However, researchers can go to great lengths and make every attempt to ensure accuracy in their studies, and still deal with the inherent difficulties of measuring particular events or behaviors. Sometimes, and particularly in studies of natural settings, the only measuring device available is the researcher's own observations of human interaction or human reaction to varying stimuli. As these methods are ultimately subjective in nature, results may be unreliable and multiple interpretations are possible. Three of these inherent difficulties are quixotic reliability, diachronic reliability and synchronic reliability.

Quixotic reliability refers to the situation where a single manner of observation consistently, yet erroneously, yields the same result. It is often a problem when research appears to be going well. This consistency might seem to suggest that the experiment was demonstrating perfect stability reliability. This, however, would not be the case.

For example, if a measuring device used in an Olympic competition always read 100 meters for every discus throw, this would be an example of an instrument consistently, yet erroneously, yielding the same result. However, quixotic reliability is often more subtle in its occurrences than this. For example, suppose a group of German researchers doing an ethnographic study of American attitudes ask questions and record responses. Parts of their study might produce responses which seem reliable, yet turn out to measure felicitous verbal embellishments required for "correct" social behavior. Asking Americans, "How are you?" for example, would in most cases, elicit the token, "Fine, thanks." However, this response would not accurately represent the mental or physical state of the respondents.

Diachronic reliability refers to the stability of observations over time. It is similar to stability reliability in that it deals with time. While this type of reliability is appropriate to assess features that remain relatively unchanged over time, such as landscape benchmarks or buildings, the same level of reliability is more difficult to achieve with socio-cultural phenomena.

For example, in a follow-up study one year later of reading comprehension in a specific group of school children, diachronic reliability would be hard to achieve. If the test were given to the same subjects a year later, many confounding variables would have impacted the researchers' ability to reproduce the same circumstances present at the first test. The final results would almost assuredly not reflect the degree of stability sought by the researchers.

Synchronic reliability refers to the similarity of observations within the same time frame; it is not about the similarity of things observed. Synchronic reliability, unlike diachronic reliability, rarely involves observations of identical things. Rather, it concerns itself with particularities of interest to the research.

For example, a researcher studies the actions of a duck's wing in flight and the actions of a hummingbird's wing in flight. Despite the fact that the researcher is studying two distinctly different kinds of wings, the action of the wings and the phenomenon produced is the same.

Comments on a Flawed, Yet Influential Study

An example of the dangers of generalizing from research that is inconsistent, invalid, unreliable, and incomplete is found in the Time magazine article, "On A Screen Near You: Cyberporn" (De Witt, 1995). This article relies on a study done at Carnegie Mellon University to determine the extent and implications of online pornography. Inherent to the study are methodological problems of unqualified hypotheses and conclusions, unsupported generalizations and a lack of peer review.

Ignoring the functional problems that manifest themselves later in the study, it seems that there are a number of ethical problems within the article. The article claims to be an exhaustive study of pornography on the Internet, (it was anything but exhaustive), it resembles a case study more than anything else. Marty Rimm, author of the undergraduate paper that Time used as a basis for the article, claims the paper was an "exhaustive study" of online pornography when, in fact, the study based most of its conclusions about pornography on the Internet on the "descriptions of slightly more than 4,000 images" (Meeks, 1995, p. 1). Some USENET groups see hundreds of postings in a day.

Considering the thousands of USENET groups, 4,000 images no longer carries the authoritative weight that its author intended. The real problem is that the study (an undergraduate paper similar to a second-semester composition assignment) was based not on pornographic images themselves, but on the descriptions of those images. This kind of reduction detracts significantly from the integrity of the final claims made by the author. In fact, this kind of research is commensurate with doing a study of the content of pornographic movies based on the titles of the movies, then making sociological generalizations based on what those titles indicate. (This is obviously a problem with a number of types of validity, because Rimm is not studying what he thinks he is studying, but instead something quite different. )

The author of the Time article, Philip Elmer De Witt writes, "The research team at CMU has undertaken the first systematic study of pornography on the Information Superhighway" (Godwin, 1995, p. 1). His statement is problematic in at least three ways. First, the research team actually consisted of a few of Rimm's undergraduate friends with no methodological training whatsoever. Additionally, no mention of the degree of interrater reliability is made. Second, this systematic study is actually merely a "non-randomly selected subset of commercial bulletin-board systems that focus on selling porn" (Godwin, p. 6). As pornography vending is actually just a small part of the whole concerning the use of pornography on the Internet, the entire premise of this study's content validity is firmly called into question. Finally, the use of the term "Information Superhighway" is a false assessment of what in actuality is only a few USENET groups and BBSs (Bulletin Board System), which make up only a small fraction of the entire "Information Superhighway" traffic. Essentially, what is here is yet another violation of content validity.

De Witt is quoted as saying: "In an 18-month study, the team surveyed 917,410 sexually-explicit pictures, descriptions, short-stories and film clips. On those USENET newsgroups where digitized images are stored, 83.5 percent of the pictures were pornographic" (De Witt 40).

Statistically, some interesting contradictions arise. The figure 917,410 was taken from adult-oriented BBSs--none came from actual USENET groups or the Internet itself. This is a glaring discrepancy. Out of the 917,410 files, 212,114 are only descriptions (Hoffman & Novak, 1995, p.2). The question is, how many actual images did the "researchers" see?

"Between April and July 1994, the research team downloaded all available images (3,254)...the team encountered technical difficulties with 13 percent of these images...This left a total of 2,830 images for analysis" (p. 2). This means that out of 917,410 files discussed in this study, 914,580 of them were not even pictures! As for the 83.5 percent figure, this is actually based on "17 alt.binaries groups that Rimm considered pornographic" (p. 2).

In real terms, 17 USENET groups is a fraction of a percent of all USENET groups available. Worse yet, Time claimed that "...only about 3 percent of all messages on the USENET [represent pornographic material], while the USENET itself represents 11.5 percent of the traffic on the Internet" (De Witt, p. 40).

Time neglected to carry the interpretation of this data out to its logical conclusion, which is that less than half of 1 percent (3 percent of 11 percent) of the images on the Internet are associated with newsgroups that contain pornographic imagery. Furthermore, of this half percent, an unknown but even smaller percentage of the messages in newsgroups that are 'associated with pornographic imagery', actually contained pornographic material (Hoffman & Novak, p. 3).

Another blunder can be seen in the avoidance of peer-review, which suggests that there was some political interests being served in having the study become a Timecover story. Marty Rimm contracted the Georgetown Law Review and Time in an agreement to publish his study as long as they kept it under lock and key. During the months before publication, many interested scholars and professionals tried in vain to obtain a copy of the study in order to check it for flaws. De Witt justified not letting such peer-review take place, and also justified the reliability and validity of the study, on the grounds that because the Georgetown Law Review had accepted it, it was therefore reliable and valid, and needed no peer-review. What he didn't know, was that law reviews are not edited by professionals, but by "third year law students" (Godwin, p. 4).

There are many consequences of the failure to subject such a study to the scrutiny of peer review. If it was Rimm's desire to publish an article about on-line pornography in a manner that legitimized his article, yet escaped the kind of critical review the piece would have to undergo if published in a scholarly journal of computer-science, engineering, marketing, psychology, or communications. What better venue than a law journal? A law journal article would have the added advantage of being taken seriously by law professors, lawyers, and legally-trained policymakers. By virtue of where it appeared, it would automatically be catapulted into the center of the policy debate surrounding online censorship and freedom of speech (Godwin).

Herein lies the dangerous implication of such a study: Because the questions surrounding pornography are of such immediate political concern, the study was placed in the forefront of the U.S. domestic policy debate over censorship on the Internet, (an integral aspect of current anti-First Amendment legislation) with little regard for its validity or reliability.

On June 26, the day the article came out, Senator Grassley, (co-sponsor of the anti-porn bill, along with Senator Dole) began drafting a speech that was to be delivered that very day in the Senate, using the study as evidence. The same day, at the same time, Mike Godwin posted on WELL (Whole Earth 'Lectronic Link, a forum for professionals on the Internet) what turned out to be the overstatement of the year: "Philip's story is an utter disaster, and it will damage the debate about this issue because we will have to spend lots of time correcting misunderstandings that are directly attributable to the story" (Meeks, p. 7).

As Godwin was writing this, Senator Grassley was speaking to the Senate: "Mr. President, I want to repeat that: 83.5 percent of the 900,000 images reviewed--these are all on the Internet--are pornographic, according to the Carnegie-Mellon study" ( p. 7). Several days later, Senator Dole was waving the magazine in front of the Senate like a battle flag.

Donna Hoffman, professor at Vanderbilt University, summed up the dangerous political implications by saying, "The critically important national debate over First Amendment rights and restrictions of information on the Internet and other emerging media requires facts and informed opinion, not hysteria" (p.1).

In addition to the hysteria, Hoffman sees a plethora of other problems with the study. "Because the content analysis and classification scheme are 'black boxes,'" Hoffman said, "because no reliability and validity results are presented, because no statistical testing of the differences both within and among categories for different types of listings has been performed, and because not a single hypothesis has been tested, formally or otherwise, no conclusions should be drawn until the issues raised in this critique are resolved" (p. 4).

However, the damage has already been done. This questionable research by an undergraduate engineering major has been generalized to such an extent that even the U.S. Senate, and in particular Senators Grassley and Dole, have been duped, albeit through the strength of their own desires to see only what they wanted to see.

« Previous
Continue »