
## Bayesian formula to calculate average ratings

Heya all,

I've been reading for a while, trying to find a formula that gives a weighted representation of link ranking based on both the average rating for a link and the number of votes. Obviously, just using average rating * number of votes doesn't work: with that formula a link with 300 votes averaging 2 will score higher than a link with 100 votes averaging 5. That's not the result I want, and it's a bad statistical representation.

In my case, currently I would like to use the Review Ratings (not the link votes, although I will probably tweak any global for that in the future) and the number of reviews written (again, not the number of votes per link) to come up with a ranking value. I think I've found a good formula. Who wants to give the global a go?

I'm borrowing from The Internet Movie Database and some of the explanation from WOW Web Designs. I'd like to use the same "true Bayesian estimate" formula the Internet Movie Database uses to calculate its weighted average ratings.

Code:
weighted rank (WR) = (v / (v+m)) * R + (m / (v+m)) * C

where:
R = average rating for the design (mean) = (Rating)
v = number of votes for the design = (votes)
m = minimum votes required to be listed in the top 10
C = the mean vote across all designs

This formula normalizes scores: it pulls a particular score (R) toward the mean (C) when the number of votes is not well above m. In other words, if a particular design has only a few votes more than the minimum required to be listed in the top 10 (m), its average score is decreased a little if it is above the mean, or increased a little if it is below the mean.

Here is an example:
Code:

WR = (6 / 10) * 5.33 + (4 / 10) * 7.18 = 6.07
      v   v+m    R       m   v+m    C

The formula normalizes the average rating of this relatively low-rated design from 5.33 up to 6.07, since the number of votes (v=6) is only slightly above the minimum required votes (m=4) and the mean across all designs (C=7.18) is quite high. If, in the future, this particular design gets more votes, the weighted rank will move closer to its raw average R, since the influence of C shrinks as v grows. The idea is that the more votes there are, the more representative the average rating is.
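The worked example above is easy to sanity-check by plugging the numbers straight into the formula; here is a minimal sketch (Python used purely for illustration):

```python
def weighted_rank(v, m, R, C):
    """True Bayesian estimate: pull the raw average R toward the
    overall mean C when the vote count v is not well above m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# The example: v=6 votes, m=4 minimum, R=5.33 average, C=7.18 overall mean
print(round(weighted_rank(6, 4, 5.33, 7.18), 2))  # 6.07
```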

So (and I've been told I ask for much!) I'm looking for a global that will apply this formula across a set of links, only list links with v >= m in the final output, but use all reviews to calculate C. It probably makes sense to return the results to 2 decimals and, if there's a tie between some records, to sort them (second level) by the original average rating per link.
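The requested behaviour can be sketched outside of Links SQL; the following Python illustration uses made-up field names ('rating', 'votes'), whereas a real global would pull the Review Rating and review counts from the Links/Review tables:

```python
def rank_links(links, m):
    """Apply the Bayesian weighted-rank formula across a set of links.

    links: list of dicts with 'id', 'rating' (R, average review rating)
           and 'votes' (v, number of reviews) -- illustrative field names.
    m:     minimum number of reviews required to be listed.
    """
    # C is the mean across ALL reviews, including reviews of links
    # that will not make the final listing.
    total_reviews = sum(l['votes'] for l in links)
    C = sum(l['rating'] * l['votes'] for l in links) / total_reviews

    ranked = []
    for l in links:
        v, R = l['votes'], l['rating']
        if v < m:
            continue  # only links with v >= m appear in the output
        wr = (v / (v + m)) * R + (m / (v + m)) * C
        ranked.append({**l, 'wr': round(wr, 2)})

    # Sort by weighted rank, breaking ties on the original average rating.
    ranked.sort(key=lambda l: (-l['wr'], -l['rating']))
    return ranked

demo = [{'id': 1, 'rating': 5.0, 'votes': 100},
        {'id': 2, 'rating': 2.0, 'votes': 300},
        {'id': 3, 'rating': 4.0, 'votes': 1}]
print(rank_links(demo, 5))  # link 3 is dropped; link 1 outranks link 2
```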

Here's a global where Laura has done some related calculations already.

Safe swoops
Sangiro

#### Last edited by:

sangiro: Oct 7, 2004, 3:28 AM
Hi,

Not sure whether you're still interested in this but I've had a go at it (I was doing something similar for Jag). Note that I haven't tested this to see whether the results are sensible. It relies on you adding a new field to the Links table called New_Rating.

Laura.
The UK High Street
Hi,

That looks very interesting! I've been trying to find some way of getting an overall average on my site. I'm interested in this because my site is built around articles (detailed.html) more than links, so the current rating wasn't really reflecting the interest people have in an article that may not even contain a web site link. When people read such an article they never generate a click or hit value, even though they have obviously gone to the trouble of clicking through to the detailed page, which the present LinksSQL rating doesn't count at all.

Thus I'm using the detailed page plug-in from Pugdog that adds an extra field where detailed page click values are entered.

That's mainly why I had asked Laura if there was some way of getting an average I could use with her plug-ins that reflects this better.
See http://www.gossamer-threads.com/...i?post=274375#274375

How have you taken into account the fact that people read articles on your web site when you calculate your average ? (If I remember correctly you have articles and links ?)

Cheers, John
Significant Media

Two questions about voting and weighting, betraying my ignorance of (and unfamiliarity with) both the voting system and the way the search engine works.

1. Input: What does the distribution of voting look like, for ratings of a large number of entries? That is, if you had a scale from 0 to 10, does the average rating end up being about 5, with a normal distribution? (Like Barbarella with Jane Fonda, http://www.imdb.com/title/tt0062711/ratings ) Or does it tend to bulge at 7.5, with a Boltzmann look? (Like The Last Seduction with Linda Fiorentino, http://www.imdb.com/title/tt0110308/ratings ) Given the optimistic and positive nature of people, I suspect the latter.

2. Output: What works best in terms of retrieving results? I'm not sure this is a question that can be answered, because it's hard to measure "what works best". But it's probably possible to define a weighting scale that is awful, terrible, worse than useless. Again, if a person uses a scale from 0 to 10, does a weighting of 1 mean a result is unlikely ever to be returned, no matter how good the keyword match? That could mean that weighting factors should range from 4 to 6 to be useful. Does a weighting change from 6 to 7 produce too small an effect to notice? Is the weighting scale geometric or linear?

In pharmacology (and biochemistry), the logarithm of the dose of a drug is related to its effect, and the shape of the curve is "sigmoid" (an integrated normal distribution).

Hopefully the questions are somewhat coherent.
If you want real, statistically meaningful data, you need to count the number of votes of each kind. A 2 and a 10 both average out to a 6, but two 6's would be statistically more meaningful than a 2 and a 10.
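That point can be illustrated numerically: the two samples share a mean, but the spread tells you how much to trust it (Python standard library, purely for illustration):

```python
from statistics import mean, pstdev

# A 2 and a 10 average out to the same 6 as two 6's...
print(mean([2, 10]), mean([6, 6]))      # both are 6
# ...but the spread differs hugely, so the second average is
# far more representative of the individual votes behind it.
print(pstdev([2, 10]), pstdev([6, 6]))  # 4.0 vs 0.0
```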

I have released code that keeps track of each rating a user makes, and allows them to vote once on a link. Again, statistically meaningful would mean one person/user, one vote.

If you look at:

http://postcards.com/...iled_page.cgi?ID=403

You'll see that I migrated the site over to a new rating system, and keep counts of the new ratings, while merging in the mix of old/new as best I could come up with. My new sites use the new ratings from the get-go, but Postcards.com has been on-line for about 10 years, has used Links for the past 6 or so, and has gone through loads of changes, periods of not tracking hits, etc. It's getting a facelift soon, for its 99th incarnation ;) But the point is, the stats package I developed was designed, initially, to preserve old stats while tracking new stats in more detail.

You could take the function that calculates the overall rating, and modify it to use a weighted formula, based on each of the types of votes it gets.

A site with 12 "10" ratings would probably rank higher than a site with a similar number of votes and ratings but no "10"s, as an example.

good luck on figuring out how to do it <G> If you can, I can plug the formula into this code, and re-calculate all the averages pretty easily (the averages are recalculated every time a vote is added, so it's fairly trivial to do an overall update).

PUGDOG® Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Oh... it uses several extra fields to preserve the old ratings, the new rating, etc. The individual ratings are kept in a separate table, and the users' votes are tracked. I think I have two tables, not one, but it's been a while since I actually looked at the code.

Username/LinkID/Rating

and

a table with LinkID/Average/Count/1/2/3/..../10 etc.

Though, I might have figured out a different way to do it.

Point is, since a user's vote on a specific link is tracked (for all votes), you can also track the statistical validity of a user's ratings (which is what I had originally thought you were driving at), e.g. a user who only gives 10's or 5's, or a user who is always off the mean/median in one direction, etc.

You can give a user a "karma" or "expert" score in that manner, and use that to weight their votes for a specific resource, in addition to just the number and scale of the rankings.

PUGDOG® Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Hi Pugdog,

At the moment I'm using two different versions of your detail_page.cgi, and they both add to Detailed_Hits when a user clicks on one or the other. I have one for the article and another that actually allows users to view details about the article, like date published, modified etc.

I was wondering whether there is any way to distinguish between unique hits and general hits?

Thanks, John
Significant Media
Hi,
What do you mean?
The original release tracked "Detailed Hits" and "Site Visits".
Detail_page.cgi tracked hits to the detailed pages, and jump.cgi tracked the site jumps.

It is possible to run multiple copies of the detail_page script/plugin, as you are doing.

I've been thinking about allowing other tracking fields: you'd set them in the admin, and by passing a ?track= field to the script you'd override the default field and use the tracking field instead.

To prevent errors, and hacking, you'd have to hard-code an allowed list of tracking fields (in the admin area, maybe) and use the default field (or generate an error/hack alarm) if an invalid field was passed in.

You could do the same thing, and get rid of the ?track= field, with a bit of creative hacking.

In the detail_page.cgi (which is just a shell of a script now), you would set a parameter that is passed in when the detail_page.pm is called. So, you'd call different scripts:

detail_page.cgi
article.cgi
profile.cgi

and all they really are is detail_page.cgi with a parameter in them like:

$IN->param('Track_Field') = "Article"

This would be checked in Detail_Page.pm, and if "Article" was a valid tracking field, the default tracking field from the plugin.cfg would be set to $IN->param('Track_Field') after the plugin defaults routine was called.

If this doesn't make any sense, I might have a little time to hack this into the detail_page.cgi plugin, since it _shouldn't_ be too tough. The validity check would make sure you actually created the columns in the Links table, and if not, it would generate a trapped error. Also, adding the type of "hit" to the Hits table might not be a bad idea, so you can generate more accurate stats.

The advantage of doing it this way, is there is one plugin, but several ways to call it.
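The hard-coded allow-list idea can be sketched independently of the plugin; the field names below are hypothetical, and this is just the shape of the check, not Pugdog's actual code:

```python
# Hypothetical column names -- in practice these would come from the
# plugin's admin configuration and must exist in the Links table.
ALLOWED_TRACK_FIELDS = {"Detailed_Hits", "Article_Hits", "Profile_Hits"}
DEFAULT_TRACK_FIELD = "Detailed_Hits"

def resolve_track_field(requested):
    """Return the column to increment: the requested field if it is on
    the hard-coded allow-list, otherwise the default. This way an
    arbitrary ?track= value can never name an unexpected column."""
    if requested in ALLOWED_TRACK_FIELDS:
        return requested
    return DEFAULT_TRACK_FIELD  # or raise a "hack alarm" here instead

print(resolve_track_field("Article_Hits"))   # allowed field passes through
print(resolve_track_field("DROP TABLE x"))   # anything else falls back
```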

PUGDOG® Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Hi Pugdog,

I have done what you suggested: I have a slightly modified detail_page.cgi that inserts images in the article, and then a details.cgi script that is identical to the original detail_page.cgi and uses details.html to display details about the article. However, each time I click on either, the total is incremented, even if I refresh the page. I agree that's normal for "Hits", but I was wondering how/if it is possible to get a "Unique Hits" total that would not be incremented when the same visitor looks at a page 'n + X' times. Is there a routine in LinksSQL that can differentiate unique hits from hits?

If I understand correctly, it would then be a question of having, say, Detailed_Hits_Unique as a new column to be incremented on a unique-hit basis, whereas the Detailed_Hits column would just increment normally per hit on detail_page.cgi or a similar script?

I'm a bit lost on how to differentiate between unique and non unique hits though in order to populate the two different fields in the database.

I agree about the script, I'm really happy with my detail_page.cgi and it's children LOL, they're great for what they do ;)

John
Significant Media
Hi,
You must be using an older release, or one of the ones that had the hit_track code removed.

This is the modified part, that tracks by IP/Time. If your database only tracks logged in hits, you can track by Username.

I don't run under mod_perl, and Laura said there is a problem using date_set_format with mod_perl. I think she posted a workaround.

What this does is look for an entry in the click_track table OLDER than the delete_by date. If it finds one, it freshens it and updates the Links table. If it doesn't find one, it inserts one and updates the Links table. If it finds an entry newer than the delete_by date, it just ignores it and updates nothing.

Code:

my $ip = $ENV{REMOTE_ADDR} || 'None';
my $click_db = $DB->table('ClickTrack');
my $DETAILED_CFG = Links::Plugins::get_plugin_user_cfg('Detailed_Page');
my $column_name = $DETAILED_CFG->{'column_name'};
my $delete_by   = $DETAILED_CFG->{'delete_by'};
Links::init_date();
GT::Date::date_set_format('%yyyy%%mm%%dd%%HH%%MM%%ss%');
$delete_by = GT::Date::date_get( time() - 18000 ); # 5 HOURS

## do some housekeeping on old click-track records,
## since these never seem to be deleted.
$click_db->delete( GT::SQL::Condition->new('Created', '<', $delete_by) );

my $click_sth = $click_db->select( {
    LinkID    => $id,
    IP        => $ip,
    ClickType => 'Hits'
} ) or return { error => $GT::SQL::error, LinkID => $id };

if ($click_db->hits) {
    my $record  = $click_sth->fetchrow_hashref;
    my $created = $record->{'Created'};
    if ($created < $delete_by) {   ## the record is older than delete_by
        $click_db->update(
            { Created => "NOW()" },                            ## refresh the date
            { LinkID => $id, IP => $ip, ClickType => 'Hits' },
            { GT_SQL_SKIP_INDEX => 1 }
        ) or return { error => $GT::SQL::error, LinkID => $id };

        $LINKDB->update(
            { "$column_name" => "$column_name + 1" },  ## set
            { ID => $id },                             ## where
            { GT_SQL_SKIP_INDEX => 1 }                 ## cut overhead
        ) or return { error => $GT::SQL::error, LinkID => $id };
    }
}
else {
    ## No record was found: update Links and insert a click_track row.
    $LINKDB->update(
        { "$column_name" => "$column_name + 1" },
        { ID => $id },
        { GT_SQL_SKIP_INDEX => 1 }
    ) or return { error => $GT::SQL::error, LinkID => $id };
    $click_db->insert( {
        LinkID    => $id,
        IP        => $ip,
        ClickType => 'Hits',
        Created   => "NOW()"
    } ) or return { error => $GT::SQL::error, LinkID => $id };
}

PUGDOG® Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
I've just found the post where you discussed this with sangiro and Laura:

http://www.gossamer-threads.com/...i?post=273819#273819

Quote:
Personally I would use
date_get (time, '%ddd% %mmm% %dd% %yyyy%');
but if you want to use date_set_format you really need to set the format back at the end of the sub using
GT::Date::date_set_format($CFG->{'date_db_format'});

Setting the date back needs to be done before this:
Quote:
return ($link);

Anyway, I'll give it a whirl, but this explains why it isn't in the release version, since it can potentially create issues with a mod_perl installation. I haven't got mod_perl turned on at the moment but will be aiming to.

Thanks for the help

I'm not that much of a wiz with Perl like you, but I see that it picks up the IP with $ENV{REMOTE_ADDR}, and I follow most of it. What I can't figure out is which field it will be updating, compared to presently where it updates the Detailed_Hits field. Do I need to create a new field for unique hits, or does this mod mean that it will now just log unique hits and not all hits?

This 'should' be the correct version, using date_get (as Laura suggested) to avoid the mod_perl issues she discussed:

Code:
my $ip = $ENV{REMOTE_ADDR} || 'None';
my $click_db = $DB->table('ClickTrack');
my $DETAILED_CFG = Links::Plugins::get_plugin_user_cfg('Detailed_Page');
my $column_name = $DETAILED_CFG->{'column_name'};
my $delete_by   = $DETAILED_CFG->{'delete_by'};
Links::init_date();
## Per Laura's advice, pass the format straight to date_get instead of
## calling date_set_format (which would need resetting under mod_perl).
## Note: calling date_get in void context does nothing -- the format has
## to go on the call whose result we actually use, and it must stay in
## the sortable %yyyy%%mm%%dd%... form so the comparison against the
## Created column still works.
$delete_by = GT::Date::date_get( time() - 18000, '%yyyy%%mm%%dd%%HH%%MM%%ss%' ); # 5 HOURS

## do some housekeeping on old click-track records,
## since these never seem to be deleted.
$click_db->delete( GT::SQL::Condition->new('Created', '<', $delete_by) );

my $click_sth = $click_db->select( {
    LinkID    => $id,
    IP        => $ip,
    ClickType => 'Hits'
} ) or return { error => $GT::SQL::error, LinkID => $id };

if ($click_db->hits) {
    my $record  = $click_sth->fetchrow_hashref;
    my $created = $record->{'Created'};
    if ($created < $delete_by) {   ## the record is older than delete_by
        $click_db->update(
            { Created => "NOW()" },                            ## refresh the date
            { LinkID => $id, IP => $ip, ClickType => 'Hits' },
            { GT_SQL_SKIP_INDEX => 1 }
        ) or return { error => $GT::SQL::error, LinkID => $id };

        $LINKDB->update(
            { "$column_name" => "$column_name + 1" },  ## set
            { ID => $id },                             ## where
            { GT_SQL_SKIP_INDEX => 1 }                 ## cut overhead
        ) or return { error => $GT::SQL::error, LinkID => $id };
    }
}
else {
    ## No record was found: update Links and insert a click_track row.
    $LINKDB->update(
        { "$column_name" => "$column_name + 1" },
        { ID => $id },
        { GT_SQL_SKIP_INDEX => 1 }
    ) or return { error => $GT::SQL::error, LinkID => $id };
    $click_db->insert( {
        LinkID    => $id,
        IP        => $ip,
        ClickType => 'Hits',
        Created   => "NOW()"
    } ) or return { error => $GT::SQL::error, LinkID => $id };
}

I'll try it out with and without mod_perl and let you know how it goes.

John
Significant Media
Hi, just found this link (after 20 minutes of sifting through search results). It's version 2.1.1 and has some of the code referred to in the above discussions. I suppose I'd be better off uninstalling the version I have and re-installing this one. I'm pretty sure the version I have was downloaded from the GT plug-in area, and it's down as being:
# Version : 2.1.0
# Updated : Oct 2 2002

The version from the link below doesn't have a version header, but was posted on Nov 4 2002:

http://www.gossamer-threads.com/...i?post=221901#221901

Thanks, John
Significant Media