Wednesday 30 July 2008

How cuil is Cuil?

Cuil is "a more comprehensive and efficient way to scour the Internet." So says Cuil developer Anna Patterson, who reckons her new search engine covers "a wider swath of the Web with far fewer computers than Google." [Hat tip Gus van Horn]

Check it out for yourself.  I did, by the time-honoured method of searching for myself.  Looks like an over-reliance on Wikipedia to me.  How uncuil.

4 comments:

Dave Mann said...

Yeah, it's crap. I searched for my own business and it came up with a garbled reference under my name to my competitors, and a photo that wasn't relevant to any of them.

I think this Cuil site is over-hyped. Hype is the way sales and marketing often work, of course, but Cuil is transparently bullshit.

Anonymous said...

This is hype. Big, big hype. They're not going to unseat Google. It is typical that these so-called former Google employees claim to have invented something that Google's army of PhDs (in mathematics, computer science, computational linguistics, statistics) missed. The job of the R&D armies at Google (and Microsoft and Yahoo) is to scour the literature and take note of any newly published theory related to search technology. They either implement those published algorithms directly or modify them to perform better (i.e. faster, or more accurate in search relevancy/retrieval) than the originals they were adapted from.

There is nothing in Cuil's technology that the R&D armies of the likes of Google/Microsoft/Yahoo are not aware of. Even if their algorithm is new, it won't take long for researchers around the world to independently discover the same thing, which would definitely end up being published in the literature. This sort of thing happens all the time.

I went to try Cuil and I am not impressed at all. I'll stick to Google. But the hype can only draw the suckers in, such as those VCs (venture capitalists) who will give you money if you can just convince them that you can topple Google. No questions asked, no independent verification needed to check the claim (technology-wise); just say you can and then voilà, you've got a $35 million seed fund to start with.

The VCs will regret it when this $35 million goes down the drain in about 6 months or so, as Cuil struggles to get surfers to use its search engine.

BTW: I am familiar with various search engine algorithms, some of which I use for data-mining purposes.

Anonymous said...

I found this link to Anna Patterson from Wikipedia. It looks to me as though her main expertise is page/document indexing, a process where a collection of documents, web pages, etc. is gathered and processed for storage in a database that makes it easy for the search engine to match and retrieve items via a query (i.e. search terms). Indexing is a different beast from search algorithms. Sure, a search engine needs an indexed database, but the crown jewel of search technology lies mainly in the ability of the search algorithm to retrieve relevant information rather than in how big the indexed database is; Cuil claims to have indexed over 120 billion pages. A huge collection with poor retrieval is useless compared to a small collection with highly relevant retrieval.
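
To make the indexing step concrete, here is a minimal sketch of an inverted index in Python. It is a toy under stated assumptions, not how Cuil or any real engine works: the documents and the function names `build_index` and `lookup` are made up for illustration. Note that the lookup only says which documents match; it says nothing about which match is most relevant, which is the separate ranking problem.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical three-document collection, invented for this example.
docs = {
    1: "cuil is a new search engine",
    2: "google is a search engine",
    3: "a warehouse stores items",
}
index = build_index(docs)
print(sorted(lookup(index, "search engine")))  # [1, 2]
```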

A simple analogy: it would be useless for a warehouse to store everything salable on the planet if customers were left circling round and round trying to find items (suppose they're not labeled), compared to a smaller store where, as soon as a customer asks the shopkeeper whether an item is available, the shopkeeper replies immediately with its price and availability.

In search engine technology, what matters is the speed and relevancy of the retrieved information, not how many items (pages/documents) have been indexed in the database.

I am sure that Anna Patterson has probably hired some smart PhD mathematicians/computer scientists/physicists/statisticians, etc. to spearhead Cuil's search technology development, but I am amazed at the hype, with the mainstream media also taken for a ride. Even the New York Times ran a story on Cuil. Probably because she is a so-called former Googler.

Anonymous said...

I've also found Anna Patterson's former web page at Stanford.

I just had a quick scan through her list of 10 peer-reviewed publications and it confirmed what I stated before. None of the paper titles (abstracts are also available) has anything to do with search algorithms. Search algorithms are mainly based on numeric computation (number crunching), while her publications are on symbolic reasoning topics, i.e. the computer manipulation of symbols to deduce a conclusion from given facts (inputs).

Google/Yahoo/Microsoft all use numeric-based search algorithms. There is a big move now among all search vendors to include Natural Language Processing (NLP) capability, and whichever gets there first will unseat Google, i.e. the dominant search vendor in the near future will be the one that incorporates both numeric-based algorithms and NLP. None of the current search vendors has done that yet.

It seems to me (this is inferred) that the poor results of Cuil's retrieval may be because their search engine is a purely symbolic, NLP-based one, since symbolic-based search engines are inferior to numeric-based ones built on algorithms such as Google's PageRank or Microsoft's (Live Search) Block-Level algorithm.
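
As an illustration of what "numeric-based" means here, this is a minimal sketch of the textbook, simplified form of PageRank computed by power iteration. It is not Google's production algorithm; the three-page link graph, the damping factor of 0.85 and the iteration count are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Every link target must also appear as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The whole computation is arithmetic on link counts; nothing in it looks at the grammar or meaning of a page.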

The main advantages of numeric-based search are its speed and high relevancy. The drawbacks of symbolic-based search and NLP are slow retrieval, lower accuracy and the curse of dimensionality (an explosion of possibilities).

For example, take a query/search such as the following:

It may take a moment for your comment to appear

In Google or any of those numeric-based search engines, the phrase is broken up into:

{appear, comment, moment, take}

and used as the query. The query terms are bundled up in alphabetical order, and the other terms (a, it, may, for, your, to) are discarded because they don't add any semantics to the query. In breaking up the query phrase this way, the natural meaning of the original query sentence is lost.
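
The break-up described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual code of any search engine; the stop-word list contains just the six words named above, and the function name `to_query_terms` is made up.

```python
# Only the words this comment lists as discarded; real stop-word lists are longer.
STOP_WORDS = {"a", "it", "may", "for", "your", "to"}


def to_query_terms(phrase):
    """Lowercase the phrase, drop stop words, return unique terms in alphabetical order."""
    words = phrase.lower().split()
    return sorted({word for word in words if word not in STOP_WORDS})


print(to_query_terms("It may take a moment for your comment to appear"))
# ['appear', 'comment', 'moment', 'take']
```

Once the terms are reduced to a sorted set, word order and the discarded words cannot be recovered, which is the loss of meaning referred to above.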

In NLP and symbolic-based search, nothing is discarded. The whole query phrase is taken together, and an exhaustive match is run to find out whether each verb, noun, adjective and other part of speech follows the one before it correctly, in the natural, human way of the English language. This means that it scans the whole phrase from left to right, term after term, and checks whether proper English is being adhered to. Every term it encounters has the potential to branch out into many possibilities, each of which must be searched to see which ones semantically follow a natural order. This is time-consuming and slow, and also has a high error rate.
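
To give a rough feel for the branching, here is a toy count in Python of how many part-of-speech readings the same ten-word query could have if every ambiguous word is allowed each of its possible roles. The lexicon `POSSIBLE_TAGS` is a made-up assumption for illustration only, and real NLP parsers prune most of these readings rather than enumerating them all.

```python
from math import prod

# Hypothetical mini-lexicon: the parts of speech each word might take.
POSSIBLE_TAGS = {
    "it": ["pronoun"],
    "may": ["modal verb", "noun"],
    "take": ["verb", "noun"],
    "a": ["determiner"],
    "moment": ["noun"],
    "for": ["preposition", "conjunction"],
    "your": ["determiner"],
    "comment": ["noun", "verb"],
    "to": ["preposition", "particle"],
    "appear": ["verb"],
}


def count_readings(phrase):
    """Multiply the number of candidate tags per word to get the number of tag sequences."""
    return prod(len(POSSIBLE_TAGS[word]) for word in phrase.lower().split())


print(count_readings("It may take a moment for your comment to appear"))  # 32
```

Five two-way ambiguities already give 2^5 = 32 candidate readings for one short sentence, and the count multiplies with every further ambiguous word.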

The first person to solve those problems in NLP will be the next Brin & Page.