Data Biases -- Is online survey data bipolar?
One of the common misconceptions concerning survey data on websites is that it is inherently bipolar -- that is, that only people who are really positive or really negative are motivated to go online and review. This is extremely simplistic... in the sense that it is simply false. The idea is so pervasive that we believed it ourselves for the longest time, until we learned otherwise. The theory makes sense, though, which is why it has held on so firmly, gets published in the New York Times, and gets propagated by other review sites as well. The idea is predicated on an assumption, though, that once stated becomes apparent for its trivial incorrectness. That assumption is:
Nobody would want to take time out of their busy schedule to write a critical or thoughtful or discriminating review of something that they are in contact with every day.
You see, the idea is based on a very personal lack of motivation: because I myself am not motivated to do something, there must be something fundamentally flawed about those who are. But consider: some people are motivated by altruism. Some people use a service, think it's great, but can see how it might not work for others. Some people read a review, are struck by a "missed point" or other inaccuracy, and want to add their own observation. Others just have some time to kill and decide to make a contribution.
There is also a second problem with the assumption: it implicitly assumes only two possible states for a review -- that it is either positive or negative. From the human condition, this makes sense; we all want quick "yes" or "no" answers so we can go about our lives. But if you read data with only two possibilities in your head, then you will see only two possibilities in the results -- you are already closed to the idea that there is much, much more. So it helps to read data with an open mind, or, as my MIT advisor pointed out to me, there's no substitute for looking at the raw data.
So without further ado, let's look at some quick survey data. I grabbed about 70,000 surveys and filtered out the all-zeros and all-10s surveys, which we generally do by default anyway (though there are exceptions -- for instance, when we believe such a survey is accurate for other reasons). I didn't try very hard to make the filtering precise, trusting that the law of large numbers would smooth things out anyway.
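That filtering step can be sketched in a few lines. This is a hedged sketch, not the actual pipeline: it assumes each survey is simply a list of numeric responses on a 0-10 scale, which may not match the real survey schema.

```python
# Hypothetical sketch: drop surveys whose answers are all 0s or all 10s.
# Assumes each survey is a list of numeric responses on a 0-10 scale;
# the real data format and filtering rules may differ.
def filter_extremes(surveys):
    """Keep surveys that are not uniformly 0 or uniformly 10."""
    return [s for s in surveys
            if not (all(v == 0 for v in s) or all(v == 10 for v in s))]

surveys = [
    [0, 0, 0],     # all zeros -> dropped
    [10, 10, 10],  # all tens -> dropped
    [7, 9, 10],    # mixed -> kept
]
print(len(filter_extremes(surveys)))  # 1
```

As noted above, the filtering doesn't need to be perfect; with tens of thousands of surveys, a coarse rule like this is enough.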
Figure 1: Undergraduate Survey Data
In Figure 1, from the undergraduate surveys, you can see the distribution of "Educational Quality" from A+ to F. I labeled A+/A as "positive" and D+ through F as "negative". First you'll notice that the bins are not the same size, and second, you may disagree with A- being considered "neutral". I feel the point of the "positive" vs. "negative" argument is actually about the lack of discrimination. That is, to give a school an A-, the user had to choose not to give it an A or an A+, and he or she probably had a reason for that distinction. I took having that reason as an indicator of the information content of the survey, and the discrimination itself as a closer indication of a "neutral" type of reviewer. Choosing either A- or B+ as the split point doesn't really affect the results much, and I think A- is the most appropriate labeling for now, so let's continue. Later I'll label the bins evenly.
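The labeling scheme can be made concrete with a small sketch. The grade strings and the `neutral_start` parameter are my own illustrative choices, not the actual analysis code; the scheme just mirrors the split described above (A+/A positive, D+ through F negative, everything between neutral).

```python
# Hypothetical sketch of the labeling described above; the real
# analysis code and grade encodings may differ.
GRADES = ["A+", "A", "A-", "B+", "B", "B-",
          "C+", "C", "C-", "D+", "D", "D-", "F"]

def label(grade, neutral_start="A-"):
    """Map a letter grade to positive/neutral/negative."""
    i = GRADES.index(grade)
    if i < GRADES.index(neutral_start):
        return "positive"
    if i >= GRADES.index("D+"):
        return "negative"
    return "neutral"

print(label("A"))   # positive
print(label("B-"))  # neutral
print(label("D"))   # negative
```

Moving the split to B+ is just `label(grade, neutral_start="B+")`, which reproduces the alternate labeling discussed here.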
Figure 2: Labeled Positive/Neutral/Negative counts
In Figure 2, you can see the total labeled counts of positive, negative, and neutral. It is obvious that with the chosen A- split point, neutral surveys dominate positive ones, which in turn dominate the negative ones. If the neutral split point is changed to B+, then positive and neutral switch, but still remain close. The important point, though, isn't which one wins or loses, but by how much. These counts are not far apart, so it is hard to see how the hypothesis that "only positive or negative readers take the survey" is supportable in any statistically significant way. Depending upon our significance level, we might hope to see a 10:1 or a 20:1 ratio between the combined positive+negative count and the neutral count. Instead, we are closer to being able to claim that "positive + neutral" reviewers take the survey than "positive + negative" ones. Even that claim, though, is a long way off.
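The ratio argument can be made concrete with a couple of lines. The counts below are illustrative placeholders, not the actual Figure 2 numbers; the point is only that the bipolar hypothesis predicts a polar-to-neutral ratio around 10:1 or more, and comparable counts fall far short of that.

```python
# Illustrative only: placeholder counts, not the real Figure 2 data.
# Under the "bipolar" hypothesis, (positive + negative) should dwarf
# neutral -- a 10:1 or 20:1 ratio. Comparable counts refute it.
def polarity_ratio(pos, neg, neu):
    """Ratio of polar (positive + negative) to neutral responses."""
    return (pos + neg) / neu

ratio = polarity_ratio(pos=20_000, neg=15_000, neu=30_000)
print(f"{ratio:.2f}")  # 1.17 -- nowhere near the 10:1 we'd need
print(ratio >= 10)     # False: bipolar hypothesis unsupported
```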
Evenly Labeled Positive, Negative, and Neutral data counts
Here we've labeled the survey data as positive, negative, and neutral, more or less evenly. There are 13 bins, so we could either discard the "F" data or include it and make the "negative" column a little heavier. With even labeling, you can still see that the counts do not support the idea of a bipolar "positive and negative only" dataset: the neutral category is about as large as the negative one.
Figure 3: Self labeled undergraduate+alumni commentary
In Figure 3, we take a different tack: let's look at self-labeled comments from undergraduate and alumni reviewers. In self-labeled comments, reviewers are asked to choose whether their comment is positive, negative, advice, or neutral. I've grouped advice and neutral together, because I believe "advice" is inherently about being discriminating about the service provided and how to get the best of it, though there can be mis-categorized "advice" such as "don't go here". I also believe that a truly neutral review is likely to be closer to an advice categorization than to, say, "meh". Looking at the counts, you can see that positive leads negative, which in turn leads advice+neutral, but only slightly. Again, what matters is the ratio. Here the distribution is even closer to uniform than the raw survey data itself. There's no 10x or 20x difference between any of the counts, or even between groups of counts.
More Interesting Questions
- Do the survey data means match the "true value"? That is, how closely does the sample of those who took the survey match the sentiment on campus, or the average sentiment of the typical student? Are they consistent with other sources? Even if the data were bipolar and the survey asked a single yes/no question about the school, the true-value question would still exist. We also take a different tack: we want to know whether the survey data mean matches the value a prospective student is likely to experience. Features such as gender, major, and in-state/out-of-state status suddenly become more relevant.
- How much does the survey itself drive the data values? What are the psychological biases imparted by the survey?
- Do the overall survey data match the student body distribution across the country? Which students are more likely to take the survey, compared with the actual student distribution? For instance, if only women take our survey, but the student body across the country is 50-50 male:female (it isn't), then the self-selected group imparts biases onto all of the results that aren't otherwise apparent.
- Suppose that "neutral" were only a C+. Would seeing (and thus analyzing) the data set as containing only positive and negative data support the idea that it contains only a marginal, forgettable amount of neutral?
In conclusion, I think that if people look at online survey data more closely, and without their own biases -- particularly the biases that allow easy dismissal -- they will find the data to be more robust, interesting, and informative than they had assumed. I also think that if other companies look at their own data more closely, they will find the "bipolar" assumption to be false as well. Surely Amazon would disagree with it.
Basically, if one assumes a priori that there is no knowledge to be gained, then one cannot gain knowledge.