My Simple Minded World

The untapped potential of search

Posted in The Interweb by Omar Ismail on May 6, 2008

The shopping world is run with faceted taxonomies. You know, the standard drill of choosing narrower and narrower categories, and then adding filters on things such as price, manufacturers and features. All of this navigation is powered and made possible through highly structured data with explicit relationships stored in a database or equivalent.

The data is structured, with hard links and concepts, but that also makes it inflexible. If I want to create new attributes to filter by, I have to modify the database, create new entries, and establish the relationships. In reality, taxonomy designers end up spending a lot of time at the beginning of development figuring out the best hierarchical structure, because they know it won’t change much in the future.

Now, what if you could accomplish the same drill-down and filtering use cases without storing hard database relationships?

It turns out that you can. With a BUT.

Main point: Text indexing is a superset of structured taxonomies.

Let me say that again: text indexing is a superset of structured taxonomies.


Remember that a service like Google indexes everything. You can search every page against any character sequence. Well, if you place your taxonomy information on the page in a text format then it gets indexed just like everything else.

Taxonomy information... indexed? Isn’t that the same as hard database relationships? EXACTLY! Without the database! Or rather, Google’s index IS the database.

Here’s the huge boost from this: If the search index is an isomorphism of your explicit taxonomy, then it’s also an isomorphism of unknown taxonomies that you haven’t even thought of.

As long as you put as much information as possible on the page, Google will index it, and voilà, every kind of taxonomy you can think of is created and buried inside the search index. What this means is that you can go back later and create new taxonomies without any loss of information!

In the structured approach, if I want to filter on HDTVs that have a 120Hz mode, I have to create a new facet called “Refresh Rate” and then go back and add the 120Hz attribute to every television it applies to.

In the unstructured approach I just write down in the text somewhere that the television supports 120Hz, alongside the contrast ratio, and all the other specifications that may or may not be important. Now, I can just search for those features and I’ll have the filter applied automatically. Beautiful!
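Here is a minimal sketch of that idea, assuming a toy in-memory inverted index rather than anything Google actually runs. The page ids and spec text are made up for illustration, and `tokenize`, `build_index` and `search` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Hypothetical product pages; ids and spec text are invented for this example.
PAGES = {
    "tv-a": "42 inch LCD HDTV. Supports 120Hz mode. Contrast ratio 10000:1.",
    "tv-b": "50 inch plasma HDTV. Contrast ratio 30000:1.",
    "tv-c": "46 inch LCD HDTV with 120Hz mode and three HDMI inputs.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each token to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results

INDEX = build_index(PAGES)

# Nobody declared a "Refresh Rate" facet, but the filter works anyway.
print(sorted(search(INDEX, "120Hz")))
print(sorted(search(INDEX, "plasma")))
```

Each extra word in the query acts like one more facet being applied, and no schema had to change to support it.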

Now for the problems.

A raw text search for “120Hz” doesn’t differentiate between “does have 120Hz” and “doesn’t have 120Hz”. Also, there’s no way to apply your own sorting, and Google doesn’t handle ranges well. And this is why there is untapped potential. Google just announced that they’re creating a supplemental index for Custom Search, so why not add some extra extensions?
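To make the first problem concrete, here is a tiny sketch using a naive substring match. The two page snippets and the `naive_filter` helper are hypothetical.

```python
# Hypothetical page snippets: one TV has the feature, the other explicitly lacks it.
pages = {
    "tv-a": "Supports a 120Hz mode.",
    "tv-b": "Does not support 120Hz.",
}

def naive_filter(pages, term):
    """Return the sorted ids of pages whose text mentions the term at all."""
    term = term.lower()
    return sorted(pid for pid, text in pages.items() if term in text.lower())

# Both pages come back, even though tv-b says it lacks the feature.
print(naive_filter(pages, "120Hz"))
```

The match only knows that the characters appear on the page, not what the sentence around them claims.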

As the webmaster of ProductWiki I know the structure of the page better than a bot ever will. If I can provide search hints to say “THIS PART OF THE PAGE IS MORE IMPORTANT” that would be nice.
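As a rough sketch of what such a hint could mean in practice, imagine the page split into named zones that the webmaster assigns weights to. This is purely hypothetical: the zone names, the weights, and the `score` function are assumptions for illustration, not an existing Custom Search feature.

```python
import re

# Hypothetical weights a webmaster might assign to regions of a product page.
ZONE_WEIGHTS = {"specs": 3.0, "reviews": 1.0, "comments": 0.5}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(page, term):
    """Sum the term's occurrences per zone, scaled by that zone's weight."""
    term = term.lower()
    return sum(
        ZONE_WEIGHTS.get(zone, 1.0) * tokenize(text).count(term)
        for zone, text in page.items()
    )

page_a = {"specs": "Refresh rate 120Hz", "comments": "Nice TV"}
page_b = {"specs": "Refresh rate 60Hz",
          "comments": "I wish it had 120Hz like my old 120Hz set"}

# One mention in the spec sheet outranks two mentions in the comments.
print(score(page_a, "120Hz"), score(page_b, "120Hz"))
```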

Also, these search companies need to handle date and numeric ranges a lot better. I should be able to search for $1000..$2000 and get back everything priced anywhere in that range, whether the page writes it as $1103.23 or $1,500. Same with dates: let me put them in any of a variety of formats (which one isn’t even that important) and have the parser understand what to look for.
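For the price case, the parsing itself isn’t much work. Below is a minimal sketch that assumes US-style dollar amounts and the `low..high` query syntax from the example above; `extract_prices`, `parse_range` and `in_range` are hypothetical helpers, not part of any search API.

```python
import re
from decimal import Decimal

# Matches "$1103.23", "$1,500" or "$2,499.00": grouped or plain digits, optional cents.
PRICE_RE = re.compile(r"\$(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?")

def extract_prices(text):
    """Return every dollar amount found in the text as a Decimal."""
    return [
        Decimal(m.group(1).replace(",", "") + (m.group(2) or ""))
        for m in PRICE_RE.finditer(text)
    ]

def to_amount(value):
    """Convert a string such as "$1,000" to a Decimal."""
    return Decimal(value.strip().lstrip("$").replace(",", ""))

def parse_range(query):
    """Split a query such as "$1000..$2000" into (low, high) Decimals."""
    low, high = query.split("..")
    return to_amount(low), to_amount(high)

def in_range(text, query):
    """True if any price mentioned in the text falls inside the queried range."""
    low, high = parse_range(query)
    return any(low <= price <= high for price in extract_prices(text))

print(in_range("On sale for $1103.23 this week", "$1000..$2000"))
print(in_range("Only $999.99 while supplies last", "$1000..$2000"))
```

Dates would need the same treatment with a more forgiving parser, which is exactly the kind of thing a search engine is better placed to do once than every webmaster separately.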

Now, I can do this kind of expansion of terms myself, but dammit, this is their core competency.

In conclusion: I finally realize the power of unstructured search. It really does become the Database of Everything, and that’s really friggin cool. Now, with that power comes great responsibility, so search companies, let’s step things up a notch and get some more advanced query handling happening.
