Archive for the ‘ranking’ Category

Costs and Transparency in Ranking Systems

Monday, March 12th, 2007

Can a ranking system be transparent, inclusive, and successful? That was the topic of a long conversation last week with Lijit’s senior developer Derek Greentree. We kept coming back to questions about transparency and the cost of acquiring votes. And in the end we decided that this is some sort of rule:

The maximum success possible for a system is a function of the transparency of the algorithms and the cost of acquiring votes.

Consider this rough chart:

System | Transparency | Cost/Exclusivity
Online
Digg | Low | Low – Pass a CAPTCHA.
Google | Low | Lower – Create a web page with links.
SomethingAwful.com Forums | High | Med – $10 cover charge.
Offline
Political Democracy | High | High – Become a citizen.
Academy Awards | High | High – Become a member of the Academy.
American Idol | High | Med – Cost of a text message.

I hear you asking, “Why do Digg and Google get ‘Low’ marks for transparency?”

Digg ranks news stories by the number of members who vote for (“digg”) each candidate. It’s pretty much a pure democracy, with an added time component: old articles are worth less. Google, on the other hand, ranks pages by a more complicated algorithm known as PageRank, which treats links on web pages as “votes” for other pages, with some pages’ votes worth more than others. It’s a bit like the electoral college, with an added semantic component: pages not related to the search query are worth less.
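
To make the “pure democracy with a time component” description concrete, here is a minimal Python sketch. The exponential decay, the 24-hour half-life, and the sample stories are all assumptions chosen for illustration; this is not Digg’s actual formula.

```python
def decayed_score(votes: int, age_hours: float, half_life_hours: float = 24.0) -> float:
    """Vote count discounted by story age.

    Hypothetical formula: a story loses half of its weight every
    `half_life_hours`. The real decay rule is not known.
    """
    return votes * 0.5 ** (age_hours / half_life_hours)


# (title, votes, age in hours): made-up sample data
stories = [
    ("old favorite", 400, 72.0),
    ("fresh story", 120, 2.0),
]

# Rank stories by decayed score, highest first.
ranked = sorted(stories, key=lambda s: decayed_score(s[1], s[2]), reverse=True)
for title, votes, age in ranked:
    print(f"{title}: {decayed_score(votes, age):.1f}")
```

Under these assumptions the fresh story outranks the older one even though it has fewer raw votes, which is the “old articles are worth less” effect.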

Do those descriptions sound about right? The thing is, neither is true these days. PageRank is now only one small ingredient of a page’s search ranking. Anyone who pays attention to their page in search listings is familiar with the “Google Dance,” when rankings can change unpredictably and sometimes unfairly. Google has become a black box. Digg’s newfound popularity has it struggling to deal with spammers, and it has also begun to shroud its algorithms in secrecy. The most recent issue of Wired magazine has an article, “Herding the Mob,” that quotes Digg founder Kevin Rose as saying there are antihacking techniques that he can’t talk about.

Jay and Kevin said they couldn’t explicitly detail how Digg’s ranking algorithm works because it would be used by those who want to game the system (the aiding the enemy defense is popular these days), but they gave enough information to understand the basics of how Digg’s version of a democracy works.

So what we see is that these two popular online ranking systems began with public algorithms, but have retreated into secrecy. 

[Image: woman showing finger after Iraq vote]

On the other hand, systems like the US election process remain part of the public record. Of course, in a democracy it costs a lot to get a vote. For one thing, you have to be born. And if you really want to cheat, you have to mess around with getting the IDs of dead people and other very messy activities. In the recent Iraq elections they took the extra measure of dipping each voter’s finger in permanent ink to prevent double voting.

Is this trend necessary? What are the underlying principles?

The trend seems to be that to thwart spammers in popular systems, transparency must go down or cost must go up. And in the online world, costs are dropping so low that transparency is being forced down as well.

The web has seen a lot of systems that begin with low costs and high transparency. That’s the very definition of openness. But as the systems experience success, they have 3 choices:

  • Raise the costs. E.g. SomethingAwful.com added a $10 cover charge to participate in voting. Metafilter added a $5 cover charge.
  • Obscure the algorithms. E.g. Digg added secret “anti-gaming” algorithms.
  • Become irrelevant. E.g. Usenet forums were overrun with spammers.

The most popular choice seems to be obscuring the algorithms.

Should we be alarmed at this? Imagine if the US government took the same approach: they will tell us who won the election, but the exact algorithm used to determine the winner can’t be revealed! One can argue that getting on the front page of a Google search or the front page of Digg is not nearly as important as an election. But the value of such positioning is only increasing, and the bad guys are already trying to rig these elections!

I would argue that low transparency is a form of editing. When Digg or Google says that they must keep their algorithms secret, they are in effect saying “Our algorithms are fair, but we can’t tell them to you. You can trust us.” But do we really trust them? Should we? If some quirk of Google’s algorithms somehow helps a company they have a partnership with, how motivated will they be to fix it? 

Anyways, those are some beginning thoughts on the subject. Any ideas from you would be appreciated, as I feel there is a lot more to explore here.

Preliminary results from Blogroll Ranking

Monday, February 26th, 2007

Who are the influential bloggers? Which blogs matter? What metrics would you use to even begin to answer these questions?

I’ve been exploring alternate methods of ranking over the past few months. The best results are coming from examining blogrolls. When you think about it, blogrolls comprise the links in a huge implicit trust network. For now I’m calling the calculated score “PeopleRank”. It’s kinda like PageRank, in that blogroll links from higher PeopleRank-ed blogs count more. E.g. if Om Malik has you on his blogroll, that counts a lot more for your ranking than the blogroll of your niece on Livejournal. (No offense to your niece.)
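
For readers who want to see the general idea in code, here is a minimal PageRank-style sketch in Python. It is an illustration under assumptions, not the actual PeopleRank calculation: the damping factor of 0.85, the fixed iteration count, the handling of blogs without a blogroll, and the toy blog names are all hypothetical, and the scores it produces sum to 1 rather than matching the scale of the numbers in the table.

```python
def people_rank(blogrolls, damping=0.85, iterations=50):
    """PageRank-style power iteration over blogroll links (illustrative).

    `blogrolls` maps each blog to the list of blogs on its blogroll.
    Returns a dict of blog -> score; the scores sum to 1.
    """
    blogs = set(blogrolls)
    for targets in blogrolls.values():
        blogs.update(targets)
    n = len(blogs)
    rank = {blog: 1.0 / n for blog in blogs}

    for _ in range(iterations):
        # Every blog starts each round with a small baseline score.
        new_rank = {blog: (1.0 - damping) / n for blog in blogs}
        for blog in blogs:
            targets = blogrolls.get(blog, [])
            if targets:
                # A blog splits its current score among its blogroll entries.
                share = damping * rank[blog] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a blog with no blogroll spreads its score evenly.
                share = damping * rank[blog] / n
                for other in blogs:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy network with made-up names: "you" and "friend" each appear on
# exactly one blogroll, but "you" is listed by a widely listed blog.
toy_blogrolls = {
    "reader_a": ["big_blog"],
    "reader_b": ["big_blog"],
    "reader_c": ["big_blog"],
    "big_blog": ["you"],
    "niece": ["friend"],
}
scores = people_rank(toy_blogrolls)
for blog, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{blog}: {score:.4f}")
```

In this toy network both “you” and “friend” have a blogroll count of 1, yet “you” scores higher because the one blog listing it is itself listed by several others.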

So here are the top 50 blogs as ranked by the preliminary algorithm: (Commentary and caveats follow)

Blog Name | URL | PeopleRank | Blogroll Count
TechCrunch (Arrington & Friends) | http://www.techcrunch.com/ | 16.88550 | 74
Fred Wilson | http://www.avc.blogs.com | 13.65663 | 59
Om Malik | http://www.gigaom.com/ | 10.90295 | 51
Subscribe to Posts (RSS) | http://feeds.feedburner.com/ | 10.35721 | 58
Battelle, John | http://www.battellemedia.com/ | 9.43316 | 36
kottke | http://www.kottke.org/ | 9.30745 | 23
Micro Persuasion | http://www.micropersuasion.com/ | 9.05083 | 35
dooce | http://www.dooce.com/ | 8.75597 | 24
CNNMoney.com | http://money.cnn.com/ | 8.24951 | 14
Advertise on this blog | http://money.cnn.com/services/mediakit/ | 8.24951 | 14
Creating Passionate Users | http://headrush.typepad.com/creating_passionate_users/ | 8.05627 | 51
Instapundit | http://www.instapundit.com/ | 8.01555 | 30
Brad Feld – Feld Thoughts | http://www.feld.com/blog/ | 7.76376 | 57
BuzzMachine | http://www.buzzmachine.com/ | 7.68799 | 31
Seth’s Blog | http://sethgodin.typepad.com/seths_blog/ | 7.64178 | 44
Full Content | http://www.gizmodo.com/index.xml | 7.39462 | 10
Comments | http://www.gizmodo.com/xml/comments | 7.39462 | 10
How to Change the World | http://blog.guykawasaki.com/ | 7.36782 | 39
Read/WriteWeb | http://www.readwriteweb.com/ | 7.32572 | 27
Canuckflack | http://www.canuckflack.com/ | 7.25962 | 11
Slashdot | http://www.slashdot.org/ | 7.22526 | 32
Gizmodo | http://www.gizmodo.com/ | 7.22314 | 19
Movable Type | http://www.movabletype.org/ | 6.92314 | 15
David Jones/PR Works | http://www.prworks.ca/ | 6.67162 | 11
GestureBank | http://blogs.zdnet.com/ | 6.61738 | 20
Hugh Macleod | http://www.gapingvoid.com/ | 6.58896 | 19
Michelle Malkin | http://www.michellemalkin.com/ | 6.53256 | 28
New World Notes | http://secondlife.blogs.com | 6.47961 | 6
Bad Astronomy | http://www.badastronomy.com/ | 6.34440 | 9
Talking Points Memo: by Joshua Micah Marshall | http://www.talkingpointsmemo.com/ | 6.30786 | 23
James Governor | http://www.redmonk.com/jgovernor/ | 6.11552 | 23
Three Kid Circus | http://www.threekidcircus.com/threekidcircus/ | 6.10842 | 109
Sweetney | http://www.sweetney.com/ | 6.08445 | 107
Rain City Real Estate Guide | http://www.raincityguide.com/ | 6.06087 | 11
Fussy | http://www.fussy.org/ | 6.00416 | 16
SpiffyJapan | http://www.spiffyjapan.com/ | 5.97301 | 5
Jottings By An Employer's Lawyer | http://employerslawyer.blogspot.com | 5.95257 | 7
VentureBlog | http://www.ventureblog.com/ | 5.91916 | 24
Joho the Blog | http://www.hyperorg.com/blogger/ | 5.85586 | 23
Jeneane Sessum – Allied | http://allied.blogspot.com | 5.73544 | 91
Her Bad Mother | http://www.badladies.blogspot.com | 5.73306 | 108
George’s Emplt | http://www.employmentblawg.com/ | 5.71551 | 7
B.L. Ochman's Weblog | http://www.whatsnextblog.com/ | 5.69226 | 11
Captain's Quarters | http://www.captainsquartersblog.com/mt/ | 5.65295 | 28
Techdirt (Mike Maznick) | http://www.techdirt.com/ | 5.64693 | 21
Venture Chronicles | http://jeffnolan.com/wp/ | 5.63134 | 33
This Blog Sits at the | http://www.cultureby.com/trilogy/ | 5.50986 | 9
Shel Holtz | http://blog.holtz.com/ | 5.49340 | 10

Caveats of this calculation:

  • These results are based on a crawl of roughly 5,000 blogs.
  • Blogroll Count = Number of blogrolls this blog appears on = How many people publicly admit to reading this blog.
  • The interesting datapoints are where the PeopleRank ordering puts a blog higher in the list than one with a higher blogroll count — those fewer subscribers must be “more important”.
  • This crawl took Lijit user blogs as the starting seeds, giving an overall tech bias.
  • However, there was a period when the crawler went unchecked into what can only be called “The Mommy-o-sphere”, so there is an overrepresentation of Mom-blogs in the dataset.
  • Our blogroll detector algorithm still gets false positives, thus the high rank for “Subscribe to Feedburner” and the multiple ColoradoStartups.com listings.
  • Some blogs use a Blogrolling widget for a “Web Ring” functionality, thus erroneously appearing as blogrolls. This explains most of the 100+ blogroll counts.
  • We need better de-duping. Several blogs appeared under multiple URLs, reducing their overall scores.
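
The “interesting datapoints” mentioned in the caveats can be picked out mechanically. The following Python sketch uses a handful of rows copied from the table above and lists every pair where one blog has the higher PeopleRank but the lower blogroll count. The function name and the choice of rows are mine, for illustration only.

```python
# (blog name, PeopleRank, blogroll count): rows copied from the table above
rows = [
    ("kottke", 9.30745, 23),
    ("Creating Passionate Users", 8.05627, 51),
    ("Bad Astronomy", 6.34440, 9),
    ("Three Kid Circus", 6.10842, 109),
    ("SpiffyJapan", 5.97301, 5),
    ("Her Bad Mother", 5.73306, 108),
]


def outranks_with_fewer_links(rows):
    """Return (higher, lower) pairs where the higher-scored blog
    appears on fewer blogrolls than the lower-scored one."""
    pairs = []
    for name_a, score_a, count_a in rows:
        for name_b, score_b, count_b in rows:
            if score_a > score_b and count_a < count_b:
                pairs.append((name_a, name_b))
    return pairs


for higher, lower in outranks_with_fewer_links(rows):
    print(f"{higher} outranks {lower} despite a lower blogroll count")
```

Keep the web ring caveat in mind when reading the output: pairs involving the blogs with 100+ blogroll counts may say more about the detector than about influence.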

So how is this different from existing rankings? Until now, the most common methods have fallen into one of two camps:

  1. Number of subscribers. I.e. a pure democracy. Use some combination of Feedburner (for RSS readers) and some web analytics (for web readers) to count the raw number of people reading a blog.
  2. Raw number of incoming links (citations). This is similar, except that links are counted instead of subscribers.

Note that neither method discriminates between the blogs “casting the votes”. It doesn’t matter if that 24th reader of your blog happens to be Scoble. Nor does it matter if those 3 citations to your blog in the last month (Technorati defines this as “very low authority”) came from Seth Godin, Fred Wilson, and Guy Kawasaki.
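
Here is a small Python sketch of that difference. The link data is made up, and the weighting is a deliberately simplified one-step version of the idea (each vote counts as much as the voter’s own score), not the full calculation. The three named weights are the PeopleRank values from the table above for Seth’s Blog, Fred Wilson, and How to Change the World; the default weight of 1.0 for everyone else is an assumption.

```python
from collections import Counter

# Hypothetical citations as (source, target) pairs.
links = [
    ("Seth Godin", "your blog"),
    ("Fred Wilson", "your blog"),
    ("Guy Kawasaki", "your blog"),
] + [(f"reader {i}", "other blog") for i in range(5)]

# Camp 2 above: every incoming link counts the same.
raw_counts = Counter(target for _, target in links)

# Weighted alternative: a link counts as much as the linking blog's score.
voter_weight = {
    "Seth Godin": 7.64178,
    "Fred Wilson": 13.65663,
    "Guy Kawasaki": 7.36782,
}
weighted = Counter()
for source, target in links:
    weighted[target] += voter_weight.get(source, 1.0)  # assumed default weight

print("raw:", dict(raw_counts))
print("weighted:", {blog: round(score, 2) for blog, score in weighted.items()})
```

With raw counts, “other blog” wins five links to three. Once the voters are weighted, “your blog” comes out well ahead.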

Initial results are encouraging, and I hope to do more analysis this week. What do you think? If you have any suggestions or ideas, please get in touch with me.

