On the complexity of product ratings...

Imagine the following scenarios... Say you need to buy a set of earphones and you have access to how other people have rated different earphones on a scale of one to five. How would these ratings influence your decision? Or, say you send your manuscript to an editor, who forwards it to a set of friendly reviewers. A day later, you get the reviews back and they are mostly positive. What is the chance that the reviewers' opinions reflect the true value of your manuscript? Here is another example... Would you go to see a movie in the theater just because a colleague recommended it? How would that change if you had known this colleague for 15 years?

These questions represent situations where one could take the opinions of others into account during a decision process in which estimating the true value of a product (earphones, manuscript, movie) is essential. When we decide to buy product A instead of product B, we believe that the intrinsic value of A is higher than that of B. However, what we have is just an estimate; the true value is most often unknown, or uncertain. By using the opinions of other people, for example by observing the ratings associated with products, we can attempt to reduce this uncertainty. While these ratings can potentially be very informative, it is not clear how customers should use them. The question I am wondering about, therefore, is: how should a rational agent that relies exclusively on the information provided by ratings actually act?

Earphones sorted by Amazon's Average Customer Rating.

Let's stick to the example with products. Product A has an average of 4.1 stars (out of 5) from 100 raters, whereas product B has 5 stars from 12 raters. Assuming that a customer initially has no prior preference for either A or B, and no other source of information than these ratings, how should he/she choose?

I would personally go with product A, because it seems to me that achieving 4.1 stars across 100 raters provides a more reliable source of information about the quality of the product. Product B's average of 5 stars, based on only 12 raters, could simply be a streak of luck. That is why I believe 4.1 is close to the true value of product A, whereas product B's average of 5 is a riskier bet. This is just an intuition, and I think the question is actually an empirical one that could be tested quite straightforwardly in the laboratory.
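This intuition can be captured with a simple "shrunken average": blend the observed average with a prior mean, weighting the prior as if it were a fixed number of extra ratings. The prior mean of 3.0 and the prior weight of 20 below are illustrative assumptions, not established values, and a weaker prior can flip the comparison.

```python
def shrunk_average(avg, n, prior_mean=3.0, prior_weight=20):
    """Blend the observed average rating with a prior mean, treating
    the prior as `prior_weight` imaginary ratings of `prior_mean` stars.
    The fewer real ratings there are, the stronger the pull to the prior.
    (Both prior values here are illustrative choices.)"""
    return (prior_weight * prior_mean + n * avg) / (prior_weight + n)

score_a = shrunk_average(4.1, 100)  # 4.1 pulled slightly toward 3.0
score_b = shrunk_average(5.0, 12)   # 5.0 pulled strongly toward 3.0
```

With these assumed prior values, product A ends up with the higher score: its 100 ratings anchor the average, while product B's 12 ratings are heavily discounted.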

What about the situation where two products have exactly the same average rating and the same number of raters? Would you randomly pick one of them and consider them equal? Intuition says that in such situations one would need to consider the distribution of ratings in order to glean further insight. Ratings could pile up around a single value (i.e. be uni-modally distributed), or alternatively could consist of lots of 5s and 2s present simultaneously. While both scenarios return the same average rating, I would personally avoid the second product, as the bimodal distribution would indicate a significant cluster of customers who had a bad experience with the product.
Hamburger Restaurants sorted by Yelp's Highest Rated.
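To make the point concrete, here is a small sketch with two made-up histograms that share the same average rating but differ in spread. The counts are invented for illustration; the standard deviation is one simple summary that separates the peaked distribution from the polarized one.

```python
import math

# Two hypothetical products with the same average rating (3.5 stars)
# but very different rating distributions (counts of 1..5-star ratings).
unimodal = {1: 0, 2: 5, 3: 45, 4: 45, 5: 5}    # ratings clustered around 3-4
bimodal  = {1: 0, 2: 40, 3: 5, 4: 20, 5: 35}   # a split between 2s and 4s/5s

def mean_and_std(hist):
    """Mean and standard deviation of a star-rating histogram."""
    n = sum(hist.values())
    mean = sum(star * count for star, count in hist.items()) / n
    var = sum(count * (star - mean) ** 2 for star, count in hist.items()) / n
    return mean, math.sqrt(var)

m_uni, s_uni = mean_and_std(unimodal)  # mean 3.5, small spread
m_bi, s_bi = mean_and_std(bimodal)     # mean 3.5, large spread
```

The averages are identical, yet the polarized histogram has roughly twice the standard deviation, which is exactly the signal the average alone hides.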

Both Yelp and Amazon provide distributions of ratings in the form of simple histograms, which I often rely upon to select products. However, neither Yelp nor Amazon provides intuitive summary metrics telling customers what to pay close attention to in these histograms. Furthermore, it is not at all clear how a person without a quantitative understanding of histograms should interpret this data, which has a fair level of complexity. Even assuming that the average customer has a good understanding of data presented in the form of a histogram, there remains the question of finding the best way to integrate this information into a rational/optimal decision process. I think at this point many customers are left alone with their own choices and their own interpretations, and they are heavily biased toward using the average number of stars as the single metric for reaching a decision.

This lack of guidance could also result from strategic thinking (I honestly don't know). While providing clear metrics on how to interpret histograms might help customers make better choices and thus increase their long-term satisfaction, it is also possible that different customers have different tolerance levels for uncertainty about the quality of a product. In that case, the lack of support may be a good choice, as it leaves individual customers with their own decision metrics, however they arrive at them, and it certainly gives customers the possibility of integrating other sources of information about the product.

From the perspective of companies relying on crowd-sourced knowledge, the one place where they can assert their own opinion is when they present products in the form of a sorted list. Even if the customer sorts products based on the ratings of other customers, it is still largely under the company's control how this sorting metric is computed. It is therefore in the interest of the company to use a metric that actually reflects the uncertainty about the quality of the products; this would avoid pushing risky or uncertain products higher up in the ranking. I think it could even help to sort products differently for different users, to balance out the associated uncertainty about product quality.
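One simple uncertainty-aware sorting metric is to rank products by a pessimistic estimate of their true mean rating rather than by the raw average. The sketch below uses the lower end of an approximate 95% confidence interval; the products, their averages, standard deviations, and rating counts are all hypothetical, and the normal approximation is a crude stand-in for a proper treatment.

```python
import math

def pessimistic_score(avg, std, n, z=1.96):
    """Lower end of an approximate 95% confidence interval for the
    true mean rating. Products with few ratings receive a larger
    penalty. (A crude normal approximation, for illustration only.)"""
    return avg - z * std / math.sqrt(n)

# Hypothetical products: (name, average rating, std of ratings, number of ratings)
products = [("X", 4.2, 0.8, 200), ("Y", 4.6, 0.9, 10)]

ranked = sorted(products, key=lambda p: pessimistic_score(*p[1:]), reverse=True)
```

Under this metric the hypothetical product X outranks Y despite its lower average, because X's average rests on twenty times as many ratings.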

In this article I presented a few ideas at the level of intuition. How could we translate them into mathematical language? I believe a Bayesian framework can help here. In the Bayesian perspective, we can detach the true value (i.e. intrinsic quality) of a product from its observations (i.e. user ratings) and treat the observed ratings as data generated from an underlying (unknown) probability distribution that characterizes the product's true value. Using Bayesian machinery, we can work backwards and infer the parameters of the distribution most likely to have generated the observed rating scores. Finally, once we know these parameters, we can incorporate the remaining uncertainty into the way different products are listed or presented. This will be the topic of the next article.
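As a small preview of this idea, here is a minimal sketch, assuming the five star levels follow a categorical distribution with a uniform Dirichlet prior. Each posterior sample represents one plausible "true" rating distribution; the spread of the sampled mean ratings quantifies how uncertain we still are. The rating counts for the two products are hypothetical, chosen to match the earlier 4.1-from-100 and 5.0-from-12 examples.

```python
import random

def posterior_mean_samples(counts, n_samples=2000, seed=0):
    """Sample plausible 'true' mean ratings given observed counts of
    1..5-star ratings, under a uniform Dirichlet(1,...,1) prior over
    the five star levels. (An illustrative sketch, not a full treatment.)"""
    rng = random.Random(seed)
    alphas = [c + 1 for c in counts]  # posterior Dirichlet parameters
    samples = []
    for _ in range(n_samples):
        # A Dirichlet sample is a normalized vector of Gamma draws.
        gammas = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(gammas)
        probs = [g / total for g in gammas]
        # Expected star value under this sampled rating distribution.
        samples.append(sum((star + 1) * p for star, p in enumerate(probs)))
    return samples

# Hypothetical counts of 1..5-star ratings for the two example products:
a = posterior_mean_samples([2, 5, 13, 41, 39])  # average 4.1 from 100 ratings
b = posterior_mean_samples([0, 0, 0, 0, 12])    # average 5.0 from 12 ratings
```

The samples for product B spread much wider than those for product A, and B's posterior mean sits noticeably below its raw 5.0 average, because the uniform prior still carries weight against only 12 observations.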