Thursday, September 22, 2005

Data display and visual data mining:

Before you ask: I will occasionally put up updates about Hurricane Rita, but since hundreds of people are already doing that, I will keep to my regular format.

When many people hear the words data mining they think it is something really complex, but if it is done right it can be so simple that it is amazing. (For those who know me, the vagueness that follows should tell you that this is only the surface of a much bigger idea. I think I might have to start a protected site for the really cool stuff, since I could make entering the site equivalent to a digital NDA, which would protect me from getting ripped off and protect my international patent rights. Actually, I have considered starting an idea clearinghouse, sort of a low-cost distributed think tank, but more on that some other time.)
There is no greater pattern recognition engine than the human mind. I can see everything from how the tile is laid on the floor in a mall to ancient stream beds under vegetation in satellite photos (thank you, Keyhole), and so can you if you know what to look for. However, to do this the information must be displayed properly, and that is the really hard part. My job is biomarker discovery from massive noisy data sets, so I know how hard it is to display data so that it has high informational content and is intuitive. First off, tables aren’t the answer. Let’s say I was running data mining for Amazon.com: oh sure, I could handle it as tables, but that would be intractable. So now let’s say I do it as a tree so I can have 3 to 5 dimensions of data. That is still not all that useful, since you need to see the clusters in context. If you have ever been on the Amazon site you have been mined: “people who bought x also bought y and z” is simple data mining. However, to do it right you need more variables: time of year, location of buyer, and several other parameters. To best predict for the year in general you have to exclude all purchases made for school supplies and holiday gift giving, since those are exceptions that will bias your results. Ok, I have strayed a little from the point, and that is what I get for making up examples as I go.
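To make the “also bought” idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the order records, the category labels, and the function names `also_bought` and `suggestions` are assumptions, and this is certainly not how Amazon does it. It only counts how often two items show up in the same order, after skipping the seasonal categories mentioned above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order records: (items bought together, purchase category).
orders = [
    ({"novel", "bookmark"}, "regular"),
    ({"novel", "reading lamp"}, "regular"),
    ({"novel", "bookmark", "tea"}, "regular"),
    ({"notebook", "pencils"}, "school supplies"),
    ({"novel", "gift wrap"}, "holiday gift"),
]

# Categories treated as seasonal exceptions (an assumption for this sketch).
EXCLUDED = {"school supplies", "holiday gift"}


def also_bought(orders, excluded=EXCLUDED):
    """Count how often each pair of items appears in the same order."""
    pair_counts = Counter()
    for items, category in orders:
        if category in excluded:
            continue  # skip purchases that would bias the year-round picture
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def suggestions(item, pair_counts, top=3):
    """Items most often bought alongside `item`, most frequent first."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [name for name, _ in related.most_common(top)]


pair_counts = also_bought(orders)
print(suggestions("novel", pair_counts))
```

Adding time of year or buyer location would just mean keeping a separate counter per season or region, which is where the display problem starts to bite.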
Let’s say you wanted to provide the best page design for a website like Amazon when it had just started. So you start out with a standard site and a simple linear structure. Now you watch the site logs for a while and want to mine the results. Ok, the raw logs on their own provide no information that you can extract, and trying to mine the data blindly for trends won’t work because your customer base is too diverse. Now say you put the location of all the website hits on a map. You can quickly see the areas that have clusters and subdivide them, since winter in Chicago is different from winter in Key West, so now you know that winter clothes should have two sections, light and heavy. Next you look at gender. Sure, it is obvious that men and women buy different clothes, but women will consider men’s jackets while men are less likely to buy a woman’s jacket, so you divide your suggestions accordingly. You also notice that women buy more from your mulled spice selection and men from the snow blowers, so you make it easy to link from common items like winter clothes to these items you might be able to “talk” them into. If you aren’t selling but providing information, you should mine for the webpages that are viewed by the most groups. Now, a machine can do this by looking at usage stats, but a person still has to see that the branch point for headaches is migraine versus stress, and that by altering the structure the information can be conveyed more efficiently so that people can quickly move to the right information.
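The map step can be roughed out in a few lines, even without drawing anything. This is only a sketch under stated assumptions: the hit records are invented, the coordinates are approximate, and the 5-degree grid and the names `grid_cell` and `sections_by_cell` are my own choices. It bins hits into coarse grid cells and tallies what each cell looked at, which is the table a real map display would be drawn from.

```python
from collections import Counter

# Hypothetical hits: (latitude, longitude, product section viewed).
hits = [
    (41.9, -87.6, "heavy winter clothes"),  # roughly Chicago
    (41.8, -87.7, "heavy winter clothes"),
    (41.7, -87.9, "snow blowers"),
    (24.6, -81.8, "light winter clothes"),  # roughly Key West
    (24.5, -81.7, "light winter clothes"),
]


def grid_cell(lat, lon, size=5.0):
    """Snap a coordinate to the south-west corner of a size-degree grid cell."""
    return (int(lat // size) * size, int(lon // size) * size)


def sections_by_cell(hits, size=5.0):
    """Tally which sections were viewed from each grid cell."""
    cells = {}
    for lat, lon, section in hits:
        cells.setdefault(grid_cell(lat, lon, size), Counter())[section] += 1
    return cells


cells = sections_by_cell(hits)
for cell, counts in sorted(cells.items()):
    print(cell, counts.most_common())
```

The tally is the easy part. Seeing that the two clusters mean “split winter clothes into light and heavy” is still the human’s job.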
Hmm, this one is proving very hard to write, because the examples are either too complex or too easy. Ok, last try: let’s say I wanted to break up my site so you can find info faster. I would first look at the logs and see that foreign visitors came in to look at the Rita stuff, people from cities with major universities looked at the science ideas, and the most widely distributed group looked mostly at personal advice. Before anyone freaks, I don’t have log files that can give me this much info. I would then break my site up so all the hurricane news was front and center. I would have ways to read all the science ideas or the advice about men separately, with an anchored keyword search. To generate maximum revenue with minimum cheapening of my site, I would place the Google ad engine on each search-generated page, since the user would have selected a keyword set they were interested in. That way even the none-too-bright Google ad engine could lock on to such a huge target and find ads that you might click on. Why not put the Google ads on every page? Since I write on such a varied list of subjects, no keyword(s) would dominate, so you would see ads for the most general things it could find, or nothing at all, and so no clicks.
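The “no keyword would dominate” point can be checked with a toy word count. This is purely an illustration and says nothing about how the Google ad engine actually works: the function name `dominant_keyword`, the tiny stopword list, and both sample texts are assumptions made up for the sketch. It reports the most frequent word on a page and what share of the words it accounts for.

```python
import re
from collections import Counter

# A deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = frozenset({"the", "a", "and", "of", "to", "in", "is", "for", "on"})


def dominant_keyword(text, stopwords=STOPWORDS):
    """Return (most frequent word, its share of all non-stopword words)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
    if not words:
        return None, 0.0
    word, count = Counter(words).most_common(1)[0]
    return word, count / len(words)


# Hypothetical page texts: one mixed-topic page, one search-result page.
mixed = "cold fusion Baltimore apple jobless hydrogen free yellow oil Texas joule thin"
focused = "hurricane Rita track hurricane evacuation hurricane landfall"

print(dominant_keyword(mixed))
print(dominant_keyword(focused))
```

On the mixed page every word carries the same tiny share, so there is nothing to lock on to. On the search-result page one word stands out, which is the “huge target” described above.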

Exercise 1:
Try it with your Gmail account. Email this to yourself and see what ads you get: cold fusion, Baltimore, apple, penis, jobless, hydrogen, free, yellow, Allah, oil, Texas, joule, JFK, thin. Ok, I tried it and the ad engine failed to find anything, so when you try it just pick ~4 words.

I could further assist the reader (and the ad engine) by having a section entitled “Ladies, things you need to make your man read” or other stuff like that, to cater to each interest group. If I were truly bold and had lots of free time, I would get an RSS reader and pull down news and other people’s blogs, then mine that to determine what was on people’s minds and how I could find a different angle so as to be different and more interesting. I will stop here since I am sure you can follow the applications of this method.
What is the point of this? With simple data visualization tools and your mind you can mine complex datasets very easily.
