Archive for the ‘search’ Category

Search Engine Education and Testing

Thursday, January 24th, 2008

Last week I had the pleasure of spending some time in Maine with elderblogger Ronni Bennett. We talked (amidst solving all the world’s other problems) about the importance today of teaching students the art of good search queries, and the importance of recognizing trustworthy information.

She suggested a good idea: A test where students are required to find 3 articles about a topic: One that is well-written and authoritative, one that is poorly written and/or not authoritative, and one that would take more time to decide if it is authoritative or not. The student would document what search terms they used and why they chose them, and what aspects of the articles established trustworthiness.

A further thought was how search engines may be in the same place that desktop calculators were 30 years ago. Some people wondered if the easy answers given by these new machines would mean the end of math education. How can you learn when the answers come too easily? Of course, now we know that calculators are just a tool which, when you know how to use one, makes you a better mathematician. Today people worry about plagiarism and the ease with which you can find information about anything online. But in the future, surely we will see search engines as indispensable tools for writing and research? Teaching the skills of how to use one (including how to evaluate what you find) will be regarded as an essential part of education.

I’ve searched for anyone proposing a similar idea to Ronni’s, but haven’t found anything. Maybe I could use a refresher course myself!

Facebook enters the third phase of internet search

Friday, July 6th, 2007

Facebook is using their social network information to optimize search results for users. This is exciting stuff! It is the third phase of internet search, much like what we are building here at Lijit. What is this “third phase”?

I wrote a post detailing the 3 phases back in 2005 as a grad student. This graphic sums it up:

(Note the funny disclaimer at the end explaining why I couldn’t build it myself at that time. Now I’ve got a company and an incredible team helping me make that a reality. How cool is that?!)

Facebook is doing true social network search (as opposed to social democracy like Swicki or 2005-era Wink). This was explained, with impressive stats, in a post on Facebook’s blog today.

Facebook search results are sorted by an approximation of social graph distance. People closer to you in the graph—your friends and people in your networks—are likely to be more relevant to you and thus are ranked higher. We also use this concept of “social proximity” to order results within applications like groups and events.
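To make “sorted by an approximation of social graph distance” concrete, here is a minimal sketch in Python. It assumes the friend graph fits in a dictionary and that distance is a plain hop count found by breadth-first search. The names `graph_distances`, `rank_by_proximity`, and the sample `friends` data are hypothetical, and this is not Facebook’s actual implementation, which the quote does not describe.

```python
from collections import deque


def graph_distances(graph, start):
    """Breadth-first search: hop count from `start` to every reachable person."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph.get(person, ()):
            if friend not in distances:
                distances[friend] = distances[person] + 1
                queue.append(friend)
    return distances


def rank_by_proximity(graph, searcher, results):
    """Sort results so that people closer to `searcher` come first."""
    distances = graph_distances(graph, searcher)
    # People with no path to the searcher sort last.
    return sorted(results, key=lambda person: distances.get(person, float("inf")))


# Hypothetical friend graph (adjacency lists).
friends = {
    "me": ["ann", "bob"],
    "ann": ["me", "cat"],
    "bob": ["me"],
    "cat": ["ann", "dan"],
    "dan": ["cat"],
}

ranked = rank_by_proximity(friends, "me", ["dan", "zed", "cat", "bob"])
print(ranked)  # ['bob', 'cat', 'dan', 'zed']
```

A direct friend (bob) outranks a friend of a friend (cat), and someone outside the graph entirely (zed) lands at the bottom.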

How does this compare to Lijit? Most importantly, we operate in the “wild and free” web. The data is not nearly so structured out here. Network relationships overlap and can even contradict each other: MyBlogLog, LinkedIn, Blogrolls and more.

But on the other hand, we find data everywhere. The web is a big place, and the stuff you’re looking for probably isn’t always in Facebook!

Foxmarks does search, finds spring of trustworthy metadata

Tuesday, June 26th, 2007

TechCrunch reports today that Mitch Kapor’s Foxmarks will become a search engine, and that it wowed everyone at Foo Camp.

All search engines need algorithms for ranking pages. Google ranks pages based on links from other pages. Foxmarks will instead look at links in bookmarks.

No search engine has done this before, so it seems quite natural that there is a “wow” factor to the results. They’ve found an unpolluted spring, a fount of trustworthy metadata. That’s the way HTML links were back in 1996. That’s before there was money to be made from having the right constellation of links pointing to your site.

This is not the first time a search engine has been created using bookmarks. I wrote this in 2005 about Zniff:

A new search engine, Zniff, takes a step in the right direction by using publicly available social bookmarks as indicators of worth. Paradoxically, this approach is doomed to fail if it enjoys any success. If it becomes popular, it would be all too easy for tricksters to create false bookmarks for the sole purpose of inflating the ranks of chosen pages. It’s the same lesson that Google is learning now with googlebombing: You can never trust random pages on the internet. Not even social bookmark pages.

I sure was surly back then!

How solid is the “doomed to fail if it enjoys any success” principle? Even though links are no longer so trustworthy, Google still serves up good results. Will Foxmarks be able to continue wowing people once the SEO bad guys start making fake bookmarks?
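The fake-bookmark worry is easy to show with a toy ranker. The sketch below assumes the simplest possible scheme, ranking a page by how many distinct users bookmarked it. That is an assumption for illustration only; neither Foxmarks nor Zniff has published its algorithm here, and the function name and sample data are hypothetical.

```python
from collections import Counter


def rank_by_bookmarks(bookmarks):
    """Rank URLs by how many distinct users bookmarked them, most popular first.

    bookmarks: iterable of (user, url) pairs.
    """
    # set() collapses repeat bookmarks of the same URL by the same user.
    counts = Counter(url for _user, url in set(bookmarks))
    return [url for url, _count in counts.most_common()]


# Hypothetical honest users.
honest = [
    ("ann", "good.example"),
    ("bob", "good.example"),
    ("cat", "good.example"),
    ("ann", "ok.example"),
]

# A spammer registers five throwaway accounts and bookmarks one page.
fake = [(f"sock{i}", "spam.example") for i in range(5)]

print(rank_by_bookmarks(honest))         # ['good.example', 'ok.example']
print(rank_by_bookmarks(honest + fake))  # ['spam.example', 'good.example', 'ok.example']
```

Five throwaway accounts are enough to take the top spot, which is the whole problem: when every account counts equally, creating accounts is the cheapest way to win.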

The Game of Searching

Friday, May 4th, 2007

Search is a game. It’s a language game, it’s a coordination game.

This thought came up over coffee and oatmeal this morning with my economist friend Paul Ramer. He told me how he was searching for how to do something in Excel. We decided there are 2 important steps here.

  1. Is this something someone else would have encountered and written about?
  2. What words would they have used in their account?

Anyone who’s been on the web for more than an hour has realized that the answer to question 1 is always “yes”. The challenge and the skills are in question 2.

This got me thinking of Wittgenstein’s Language Game, the concept that language is a sort of game where the players both “win” when they agree on standard words. If I refer to this thing as a “stuhl” and you call it a “chair”, we’re going to have a hard time communicating about those things. We both lose. (Especially if I was looking to buy one and you were looking to sell one!)

Perhaps a more accurate description is to call it a Coordination Game. In a famous study in 1958, Thomas Schelling asked a group of university students a simple question: “You have to meet someone in New York City at a certain time, but you don’t know where, and you can’t contact them ahead of time. Where would you go to meet them?” Amazingly, over half of the students chose the exact same place: the information booth at Grand Central Station. Out of all the millions of places in New York, a majority chose one place.

So when Paul was looking for help with Excel, the game unfolded thus: He had to think, “What would other people write if they were talking about this problem?” That’s what is so damn fascinating! Having to project yourself into the shoes (and minds) of everyone else and guess what they would do. The game is easy for things like “britney spears pictures”, but harder for things like “thomas schelling meeting place grand central study”. (That’s the search term that won it for me after about 20 iterations.)

So it’s no accident that Google’s magnificent Image Labeler Project is in fact a game. (A game so addictive that they’ve put in stops that prevent someone from playing for longer than 10 hours, according to some folks at the recent ICWSM conference!)

One last point on the money side: this is also the game which [malicious] keyword advertisers seek to subvert. When you buy a popular keyword (e.g. “britney spears”) but your content has nothing to do with her, then the searcher loses. They lose because they had to pay attention (which is a real cost) to an ad that wasn’t relevant. One beauty of AdSense is that it too is a game, with rules that penalize such ads. Ads that don’t generate enough clickthroughs won’t get shown as often. Yahoo ads didn’t have this feature until the recent release of Panama.
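One simple way such a rule could work is to rank ads by bid multiplied by observed click-through rate, so that an irrelevant ad has to pay far more to hold its position. The sketch below assumes exactly that scoring. It is a simplification for illustration, not the actual formula of any ad system, and the function name and numbers are made up.

```python
def expected_value(bid, clicks, impressions):
    """Expected revenue per impression: bid times observed click-through rate."""
    ctr = clicks / impressions if impressions else 0.0
    return bid * ctr


# Hypothetical ads competing for the same keyword.
ads = [
    {"name": "unrelated keyword squatter", "bid": 1.00, "clicks": 2, "impressions": 1000},
    {"name": "relevant fan site", "bid": 0.20, "clicks": 50, "impressions": 1000},
]

ranked_ads = sorted(
    ads,
    key=lambda ad: expected_value(ad["bid"], ad["clicks"], ad["impressions"]),
    reverse=True,
)
print([ad["name"] for ad in ranked_ads])
# ['relevant fan site', 'unrelated keyword squatter']
```

The squatter bids five times as much but gets clicked 25 times less often, so the relevant ad wins the slot.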

That’s a lot of ideas from one breakfast!

If you’re around: I’ll be at the Boulder Denver New Technology Meetup tonight. This month we’re at the new Denver Art Museum, an awesome structure designed by Daniel Libeskind. Come by and say hello!

Social Search: Democracy or Network?

Tuesday, April 3rd, 2007

A nice article at SearchEngineLand today about The Impending Social Search Inflection Point. Good that they realize that the 3rd phase of search will be social, as I pointed out 2 years ago. I realized, however, that there are two very distinct ideas emerging about what “social” means in this context. And people are always mixing them without realizing it.

  1. Social Democracies: Wikipedia and Digg are social in that everyone can have a say in the final outcome. But in the end, there is only one outcome.
  2. Social Networks: MySpace, Facebook, LinkedIn are social in that people form connections and these connections determine your rights within the system. Information available to you depends greatly on these connections.

The SearchEngineLand article renews my frustration that many see Social Search as automatically being a Social Democracy. From the article:

What is social search? To paraphrase Microsoft’s Ramez Naam, it’s like every human being is a neuron, and humanity as a whole is one giant brain, smarter as a connected whole. (emphasis added)

And later:

The wisdom of crowds – so well articulated by James Surowiecki – is at the root of emerging information retrieval tools.

Of course there are contexts in which you’d like to have every human being casting a vote. I think the stuff on Digg is pretty entertaining, and I often consult Wikipedia. But when I want to buy an HDTV, or I want a good restaurant in Boulder, or I want some expert writings on term sheets…I do not want the whole world chiming in! I don’t want the “wisdom of the crowds”, I want the “wisdom of my crowd.”

In game theory language, I want the cost of influencing my search results to be based on social connections, not on the mere fact that someone can pass a CAPTCHA. (See Costs and Transparency in Ranking Systems)

And that’s what Social Networks are all about. It’s about filtering out the noise and finding what you need based on trusted relationships. Why is it so hard to see how this applies to search?? This is what we all did before the advent of the internet: Curious about cars? Ask your car friend. Curious about TVs? Ask your local home electronics guru. A friend is getting into MLM? Don’t ask them for product tips anymore.

In fact, people are already “searching” using social networks. But the tools suck and are pretty much limited to business contacts and dates. (Facebook’s bookmarking feature may be a sign that they are waking up to more possibilities.)

Given that everyone uses social networks in the real world, it’s ironic how hard it is to explain this “Social Network Search” of Lijit. I guess it’s like trying to explain English grammar to a native speaker: they know it so well and instinctively that they hardly realize there is a technique to it.

(Sorry for the lack of posts lately. I’ve been uber-busy and then uber-sick. Normal posting shall resume shortly!)

G-Day is coming.

Friday, December 15th, 2006

Imagine a time in the future when keychain USB sticks hold petabytes of data and you can download every movie ever made to your cell phone in a matter of seconds. All it takes is one disgruntled Google employee with the right connections, and the complete record of every Google search ever made is now available. Maybe all emails in Gmail too. It’s AOL all over. In less than a day this archive shows up on hacker sites and has been downloaded millions of times. Pundits call it “G-day”. You can’t put this genie back in the bottle.

I was thinking about absolute transparency, lifetime storage, and probabilities over the distant future…and it occurred to me that this scenario is not only possible, but inevitable.

It is almost certain that all of our searches will some day be made public.

To deny this, you have to believe that at least one of the following statements is true:

  1. It will never be possible to quickly copy/transfer terabytes of data.
  2. At some point in the future, Google (or subsequent acquirers of Google) will decide to thoroughly delete all search records.
  3. There will never be a security breach at Google in which search records are copied or transferred out of Google’s control.
  4. No one in the future would be interested in obtaining a copy of all Google search records.

Of course, the same goes for every other search engine, every other aggregator of data. One file with all of Hotmail, one with MySpace, one with every video from YouTube, etc… The ramifications of this forced-radical-transparency are huge. Your grandkids, if they want, will probably be able to see every search you’ve made, every email you’ve written. Will presidential candidates have to explain their search records?

What do you think? Is it really inevitable, or will it happen too far in the future for it to matter, or will our guardians be able to keep that data secure for all time?

Search overload

Tuesday, October 17th, 2006

The search overlay business is getting crowded.

In January of 2005 I had the idea to overlay Outfoxed reports onto Google search results. It wasn’t the prettiest integration in the world, but it really got people’s attention. (early screenshots)

SiteAdvisor was the first commercial company to launch with this feature, in December of that year. It was acquired by McAfee just a few months later. But now it seems that just about everyone has figured out this trick, and I fear the situation will get out of control.

As far as I know, the companies currently offering search overlays are: SiteAdvisor, StumbleUpon, Compete, and of course Lijit (which is somehow both the new kid on the block and the granddaddy of them all!). I also know of several other companies which are planning to roll out this feature.

So with all of those installed, my search result looks like this (click for fullscreen view).

What a mess!

So what will be the future of search overlays? A passing fad or a wave of the future? It’s a powerful concept: using the power of browser plugins to forcibly add functionality to a website.

[Stan puts on his sales guy hat...]

And of course I’m biased, but I think the Lijit approach has the most staying power. Because Lijit can include data from any RSS stream, it can absorb the functionality of any other overlay. E.g., if SiteAdvisor or StumbleUpon published an RSS feed of their URL evaluations, any Lijit user could add them to their trusted network and have these reports included in their Google results without requiring more plugins.

Lijit pumps your RSS subscriptions into your search results. That’s pretty cool.

UPDATE: Sami points out a new service, WOT, which also does search overlays and provides reputation information.

Scoble: it’s the humans who “optimize” the Web

Saturday, August 5th, 2006

Talking about search, Scoble expounds that it is other humans we care about, not corporations.

When I search on “Office Furniture” why is the first thing I see stores? I don’t wanna see freaking corporate info. I wanna know what HUMANS like to use in their offices.

None of the big search companies have figured out that it’s the humans who “optimize” the Web.

I’ll be looking for who lets me get to the other humans the fastest.

Here, let’s try this. If I can spend less than $500 for an office chair, which one is best?

Optimize that!

That’s it. And not just ANY humans, but those from our circle of friends. If you’re looking for the best office chair, do you really want everyone in China having an equal footing with your friends?

What Scoble really wants is the third phase of internet search:

Phase 1

User searches for “Thailand”, and the page containing “Thailand” the most times is chosen by the search engine.

Phase 2

User searches for “Thailand”, and the page with the most incoming links is chosen by the search engine.

Phase 3

User searches for “Thailand”, and the page containing photos of a friend’s Thailand vacation is chosen by the search engine.

(For best results, this would require complete integration with a search engine’s database. That’s not possible at the moment, so the current Outfoxed can only re-order the search results that are returned by a search.)
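Re-ordering the results that a search engine has already returned can be sketched in a few lines of Python. This is an illustration under assumptions, not Outfoxed’s actual code: it assumes each result is a URL and that the user’s network is summarized as a count of trusted contacts vouching for each URL. The names `reorder_results` and `endorsements`, and the sample data, are hypothetical.

```python
def reorder_results(results, endorsements):
    """Move results vouched for by the user's network to the top.

    results: URLs in the order the search engine returned them.
    endorsements: dict mapping URL -> number of trusted contacts vouching for it.
    """
    # sorted() is stable, so results with equal endorsement counts
    # keep the search engine's original relative order.
    return sorted(results, key=lambda url: -endorsements.get(url, 0))


# Hypothetical results for "Thailand", in the engine's own order.
engine_results = [
    "travel-megasite.example/thailand",
    "encyclopedia.example/Thailand",
    "friend-blog.example/thailand-photos",
    "coworker.example/bangkok-tips",
]
network = {
    "friend-blog.example/thailand-photos": 2,
    "coworker.example/bangkok-tips": 1,
}

for url in reorder_results(engine_results, network):
    print(url)
```

The friend’s photo page moves to the top, the coworker’s tips come second, and the two unendorsed pages keep the engine’s order beneath them. A page the engine never returned cannot appear at all, which is the limitation the note above describes.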

Google & Co. as the new DNS

Monday, July 24th, 2006

The domain name situation is more grim than I ever imagined. If you don’t believe me, try going to Instant Domain Search and start typing o’s. You’ll find everything up to and including ooooooooooooooooo is taken. Try any word from English, Spanish, German, Swahili, or Hindi. All taken. I even resorted to trying to buy an expired domain.

On Saturday I saw a TV ad that said simply “Google ‘Denver Ford’” for more information. Part of this is surely that the car dealership couldn’t get www.denverford.com (it’s for sale). But the more important point is that the search textbox is replacing the browser URL textbox.

No one types URLs into their browser anymore. Most people don’t know how. This is why so many people enter “amazon” into Google rather than typing “amazon.com” into the browser textbox. (I can’t find the article about Google that gave the statistic. Anyone know the one I’m thinking of?)

The main points:

  1. The search textbox is replacing the browser textbox.
  2. Domain names, especially short names, don’t matter so much and the ones for sale are certainly overvalued.
  3. Search engines are becoming the new DNS.

blog searching and authority

Wednesday, May 31st, 2006

A post on TechCrunch today about blog search.

There is a big need for the equivalent of Google Page Rank for blog search relevance. Link analysis on a post just doesn’t work - the content is too fresh to develop meaningful link analysis results.

Didn’t Mike get the memo? PageRank doesn’t work anymore. It’s been gamed for so long that entire industries have sprouted up around it: SEO, comment spam, splogging, googlebombing, etc…

He goes on to list the 3 main strategies being used for blog searching. The problem as I see it is that all of them give each blog an absolute authority value. (Or in the case of Sphere, multiple absolute authority values: different values for different topics.)

This is based on the old mindset that the media is just “out there” and we readers somehow find it. But the truth is that everyone and every blog is connected, and it’s these connections that matter. Is a blog authoritative for me about the Israeli-Palestinian conflict? That depends very much on who I am and whether people that I trust would consider the blog to be authoritative.

Bottom line: We’re all connected to each other (and to every blog post, magazine article, or video clip) by fewer than 6 degrees of separation. That’s where we will find a reliable measure of a blog’s authority for a given reader.