Thursday 10 August 2006

What have you been searching online?

Russell Brown's been browsing the internet searches of half a million people, and he's pulled out a few plums to titillate. Sample:

User 6760296, a 14 year-old, tells quite a sweet story: from "what a girlfriend should do" and "rulebook for dating" to "how to change my password on myspace" then "ways a 14 year old can earn money" and "can 14 year olds mystery shop" and, optimistically, "get paid to look at peoples myspace". We eventually find out what the job hunting is in aid of: "cheap ipods".

A long string of searches for Biblical topics from account 1347872 is twice interrupted by a little spate of searching for "breasts" and "small breasts". (The collision of God and - sometimes quite deviant - sexual themes actually seems to be a common characteristic of the search histories. Go figure.)

Thinks he's invading people's privacy? Well, AOL don't: it's their logs he's been browsing. "Earlier this week," explains Russell, "some people at America Online did something blindingly stupid... In a misguided attempt to reach out to the research community, AOL placed a big chunk of its customers' data online."

Oh dear. So what do you have to confess about your search habits?

LINK: Contender: Worst mistake ever - Hard News (Russell Brown)

RELATED: Geek Stuff, Blog

7 comments:

Berend de Boer said...

People who looked for "libertarian" also looked for "groups that worship marijuana", "rituals", "aztecs and mushrooms", "effects of mushrooms on the brain", "getting alcohol underage" and "adolescent depression".

Go figure.

Peter Cresswell said...

Is that you confessing your search habits, Berend? Don't worry, your secret's safe with us. ;^)

Anonymous said...

I often find that people who have no clue about technology & research, such as Russell Brown, are the ones who moan about this thing or that thing, etc. What AOL has done is perfectly normal in the research communities for machine learning, statistics, data-mining, artificial intelligence and signal processing. Data are donated to the research communities in those disciplines by private sector businesses & government agencies in health, retail, banking and financial institutions, and so forth. Indeed all data is anonymized, in which case it does not break privacy law (sorry, I am not a lawyer but I assume that there is such a law). The donors of the data have to anonymize it as they have no choice, because if they don't, they face legal action (again, I only assume as I am not a lawyer). Perhaps a lawyer in this forum can clarify that. Anonymized donated data usually has the names, ages and addresses stripped. Some only strip the names, as age & address give vital clues about how certain attributes of the data relate to the subject of investigation.

One of the most popular data centres hosting donated data is the University of California at Irvine Machine Learning Repository.

"UCI Machine Learning Repository"
http://www.ics.uci.edu/~mlearn/MLRepository.html

In this repository, researchers can download data to test their pattern recognition algorithms on. It is important that commercial algorithms are tested on real data before deployment. There are tons of donated datasets at UCI, ranging from 'breast cancer' and 'fraudulent insurance claim' to 'Amazon customer transactions' and so on, along with data from many other domains.

There is commercial software available today that detects insurance scams, software that detects fraudulent banking transactions, and much more.

You may have experienced a credit card transaction where a request suddenly came for you (the holder) to provide further validation info such as 'What is your father's name?' or 'Which school did you go to?', etc, before the transaction could proceed. If so, you should understand that the software sitting on the credit company’s server has been trained on such fraudulent patterns using publicly donated data such as that available from the UCI repository. The second validation usually arises when the algorithm recognises an unusual (anomalous) pattern in comparison to your past history of transactions. Only the legitimate holder will know the answer to a second validation question such as 'name of school', etc. The thief might have got your other card details but have no clue which school you went to. An unusual pattern might be a huge spike in the amount of cash to be withdrawn (say about $1000) at an out-of-town retailer, where the pattern recognition algorithm can recall from its memory that you have no history of such a large request, and your average transaction per day might be around, say, $150. Obviously, the jump from a $150 per day average to $1000 is unusual, so second validation is needed. This protects customers and the banks from losing money.
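
To make the idea concrete, here is a minimal sketch of that kind of check. It is not the method any bank actually uses: it assumes a simple z-score test against the holder's past amounts, and the function name `needs_second_validation`, the threshold of 3 standard deviations and the sample history are all hypothetical.

```python
from statistics import mean, stdev

def needs_second_validation(history, amount, threshold=3.0):
    """Return True if `amount` is unusually large compared with past amounts.

    Assumption: "unusual" means more than `threshold` sample standard
    deviations above the mean of the holder's history.
    """
    if len(history) < 2:
        # Assumption: with too little history to judge, ask for validation.
        return True
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past amounts identical: anything larger counts as unusual.
        return amount > mu
    return (amount - mu) / sigma > threshold

# Hypothetical daily amounts averaging $150, as in the example above.
history = [140, 155, 160, 150, 145, 150]

print(needs_second_validation(history, 1000))  # the $1000 spike
print(needs_second_validation(history, 165))   # an ordinary day
```

With this history the $1000 request is flagged and the $165 one is not. Real systems would look at far more than the amount (location, merchant, time of day), so treat this as an illustration only.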

Now, donated data is very important for researchers to validate their algorithms and see if they are robust before commercial deployment. So what AOL has done is a good thing; as I have mentioned, it is normal practice in the machine learning, data-mining, artificial intelligence and statistics communities. Obviously, the critics are nutters who want to do good for others, but actually, donating anonymized data does more good than bad. It helps software diagnose breast cancer better, it helps banks to automatically detect fraudulent transactions and prevent theft, etc, etc.

I frequently download data from the UCI repository to test my 'data-mining' algorithms. I know for sure that the machine learning & data-mining community would love to see AOL donate their data to the UCI repository.

Anonymous said...

Falafulu Fisi said...
[Now, donated data is very important for researchers to validate their algorithms, and see if they are robust before commercial deployment.]

I have just started consulting to the biggest retail e-commerce website in Australasia, and my role is to design & develop the search algorithms & pattern recognition algorithms so that the vendor can understand the buying behaviour of their online customers, which is good for marketing purposes. There are around 1.5 million products currently available, with registered customers in the thousands and still growing. The number of products available from this site is expected to reach 2 million by the end of the year or early 2007. This is a massive dataset to mine. The objective is to incrementally implement algorithms similar to those of the Amazon online bookstore. I know the type of algorithms that Amazon is implementing, but I will be looking at more recent, advanced algorithms which have become available in the computing literature. Amazon is good at cross-selling, that is, 'Customers who bought X also bought Y', 'Buy X & Y together for a discount of 15%', etc, etc. Such recommendations are mined from the database of the past buying behaviour of all customers to find what items have been bought together. Two or more such items are recommended by the pattern recognition algorithm on the fly if a customer searches for any one of them. Volume of sales will increase if more products are recommended to the buyer at a discount (such as X & Y together).

Obviously, I am one of those who would be very happy if AOL donated their huge dataset to the UCI Machine Learning Repository. I need to test algorithms using a huge dataset, as they are going to be deployed for analysis of 1.5 million products. The largest dataset available from UCI contains about 60,000 rows. That is a bit small for testing something that is expected to work on a massive dataset that could go up to 2 million products. It is my loss that AOL doesn’t make their data available.
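
As a rough illustration of 'bought X also bought Y', the sketch below simply counts how often pairs of items appear in the same order. This is an assumption for the sake of example, not Amazon's actual algorithm or the one used on the site mentioned; the function names and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how often each unordered pair of items was bought together."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket
        # and always stored in the same (alphabetical) order.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts

def also_bought(item, counts, top_n=3):
    """Return the items most often bought together with `item`."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [other for other, _ in related.most_common(top_n)]

# Hypothetical past orders.
baskets = [
    ["ipod", "headphones", "case"],
    ["ipod", "headphones"],
    ["ipod", "case"],
    ["headphones", "charger"],
    ["ipod", "headphones", "charger"],
]

counts = co_purchase_counts(baskets)
print(also_bought("ipod", counts))
```

Counting every pair like this grows quickly with basket size and catalogue size, which is one reason a large test dataset matters before deploying against millions of products.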

Berend de Boer said...

Falafulu, we can say a lot of things about Russell Brown and his political views, but to say he doesn't have a clue about technology is the only sentence you have to write to prove conclusively you don't have a clue.

This was a very, very stupid move.

Peter Cresswell said...

"It is my loss that AOL doesn’t make their data available."

Isn't the point here that they have made the info available? It's apparently a >300MB file, but it is apparently available.

Anonymous said...

dontdelete.com is the best one... they have the randomizer that is hilarious and scary