Undergraduate Survey Data Biases
We had our survey vetted by hundreds of students at the outset (now hundreds of thousands), and had a few psychological researchers evaluate the questions for imparted "researcher bias" (or experimenter bias). Some of the experimenter bias is deliberate, which we will discuss later, but right now I want to explore a little of the response bias and intrinsic bias in the data collected so far (2015). Contentions that evaluators have with online survey data surround the incorrect bipolar bias assumption (at least for our data), and neglect some of the more positive traits of online surveys -- namely lower procedural bias, at least around completion pressure. (Though survey layout and rendering on different devices now introduces another bias.) Without a controlled experimentation environment, one can never be 100% certain what the biases are, but the lack of completion pressure reveals itself in the length of the commentary responses. One can never be 100% certain what the biases are in a controlled environment either, but the desire to finish (and get out to eat, use the restroom, see the sun...) definitely exists there. My point is, they are different.
Selected Undergraduate Surveys
We filtered 100,000+ undergraduate surveys down to 67,589. We eliminated duplicates, all-zeros responses, all-very-high ratings matching a "certain" (undisclosed)*1 profile, spam, erroneous university selections, surveys that users reported or challenged (and that were subsequently invalidated), and surveys that our statistical analyzer judged less valid. Surely some noise remains.
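The disclosed parts of that filtering can be sketched roughly as below. This is a minimal illustration, not the actual pipeline: the column names (`university`, the rating columns) and the rating scale are hypothetical, and the undisclosed profile filter is deliberately omitted.

```python
import pandas as pd

def filter_surveys(df, rating_cols, valid_universities, max_rating=10):
    """Rough sketch of the disclosed survey filters: drop duplicate
    submissions, all-zeros rows, all-very-high rows, and surveys with
    an invalid university selection."""
    df = df.drop_duplicates()
    ratings = df[rating_cols]
    keep = (
        ~(ratings == 0).all(axis=1)             # not all-zeros
        & ~(ratings == max_rating).all(axis=1)  # not all-very-high
        & df["university"].isin(valid_universities)
    )
    return df[keep]
```

Computing every mask on the same deduplicated frame before indexing keeps the boolean series aligned, which avoids the index-mismatch errors that arise from filtering in successive steps.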
The survey data consists of a number of individual control fields (Gender, Self-rated Intellect, Instate/Out-of-state/International, ACT/SAT, Graduation Year), and "dependent" school ratings such as:
- Education Quality
- Whether Schoolwork is Useful
- Whether Schoolwork encourages Creativity
- Whether Academic Success is dependent upon knowledge and material mastery
- Competitiveness of classmates
- How much their mind was challenged
- How much they expected to be challenged
- Classes taught by TAs
- Faculty Accessibility
- If they are treated as a valued individual
- the Social Life
- Extra Curricular Activities
- The Surrounding City
- Campus Safety
- Campus Beauty
- Apparent Funding Use
- and if they would return Again if given the choice
You will notice that StudentsReview questions are biased heavily towards educational quality and light on sports/extracurriculars, with 9 (about half) of the 19 field questions having to do with knowledge gaining, classwork, and interactions with faculty. This is by design, as we've felt that strong (or weak) athletics programs are quite apparent, but the classroom experience is not. Six of the remaining questions reflect the social life, and three describe interactions with the physical campus and facilities -- although Campus Safety could be considered both facilities and social life. Finally, we ask a single "sum up" question -- whether they would return again if given the chance -- which can also act as a catch-all for missed questions.
Neutrality of Respondent Bias
First we wish to determine if there are any large-scale correlations between the control variables and the dependent variables, by looking at the maximum magnitude of the pairwise correlation coefficients: $\max(|\operatorname{cor}(\text{control}, \text{dependent})|)$
(Controls examined: Intellect, ACT, SAT, Gender, From Area, Graduation Year.)
The maximum magnitude correlation is between ACT and Extra Curricular Activities, at 0.137. This is an extremely low correlation, and it is still the largest in the data set. Variables with such a low correlation would generally be considered unrelated. The other control variables that StudentsReview collects correlate even less -- indicating that across the data set, Intelligence, Gender, Graduation Year, ACT/SAT, and where a student comes from are not "predictive"*2 of his or her ratings at a given school. On a first pass, this shows a certain neutrality in the data that might be unexpected.
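That check can be sketched with pandas' built-in correlation matrix: compute Pearson r for every (control, dependent) pair and take the largest absolute value. The column names here are illustrative placeholders, not the actual survey schema.

```python
import pandas as pd

def max_abs_correlation(df, control_cols, dependent_cols):
    """Return the (control, dependent) pair with the largest |Pearson r|,
    along with that magnitude."""
    corr = df.corr(numeric_only=True)                   # full correlation matrix
    sub = corr.loc[control_cols, dependent_cols].abs()  # control x dependent block
    col = sub.max().idxmax()                            # dependent with the largest max
    row = sub[col].idxmax()                             # control achieving it
    return row, col, sub.loc[row, col]
```

If even this maximum is near zero, as reported above, the controls carry essentially no linear signal about the ratings.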
Sensibility of Data
Here are a few sanity tests. What are the correlations of some of the dependent variables, and are they as expected?
| Variable 1 | Variable 2 | Correlation | Expected? |
|---|---|---|---|
| Education Quality | Academic Success | high (0.7301764) | yes |
| Education Quality | Again | medium (0.5803594) | interesting |
| Academic Success | Useful Schoolwork | high (0.7063110) | yes |
| Academic Success | Surrounding City | low (0.3160694) | yes |
| Academic Success | Mind Expectations | low (0.1359635) | surprising |
| Campus Beauty | Maintenance | medium (0.5886384) | yes |
First, most chosen values are exactly as expected -- rated educational quality has a high correlation with rated academic success. Campus Beauty is correlated, but not highly, with Campus Maintenance. This of course makes sense: an unmaintained campus, no matter how "beautiful", will not stay beautiful for long, while a well-maintained ugly campus is still ugly. But at least it's clean!
Two values are noteworthy. Educational Quality is only moderately correlated with whether a student would return. This means that education quality on its own does not fully account for a student's satisfaction; not everyone is as exclusively motivated by classroom learning as I apparently am. We'll have to see whether our survey captures enough features. Second, a student's rating of "academic success" (being dependent upon mastery) is almost completely uncorrelated with how much they expected to be challenged by their coursework coming into the school. This is surprising, because one might be inclined to think that a person with high Mind Expectations would be critical of the academic grading system. Or at least, I might be.
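Sanity checks like these reduce to computing Pearson r for a handful of hand-picked column pairs. A small helper, again with illustrative column names rather than the real schema:

```python
import pandas as pd

def pairwise_correlations(df, pairs):
    """Pearson r for selected (column, column) pairs -- a sanity-check
    helper in the spirit of the table above."""
    return {(a, b): df[a].corr(df[b]) for a, b in pairs}
```

Spot-checking a few pairs whose expected sign and rough magnitude are known is a cheap way to catch data-loading or scale errors before trusting the full correlation matrix.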
See Figure 1 for a visualization of the correlation matrix.
* "Qualifications" *3
- Undisclosed - Some of the filtering mechanism is undisclosed to keep malicious users from exploiting the survey
- Predictive - "predictive" in the non-causal sense, across a closed data set. This is statistics, after all -- "linearly descriptive" is probably a better term, but it is not the conventional one.
- Qualifications - This is the qualifications section, which everyone seems to have to include these days.