Livin’IT – The ‘Retrophone’ Experiment

25 07 2009

Returning to your favourite things is potentially a dangerous game to play.

Speaking as someone with an obsession with collecting vinyl records, I’m often strangely compelled to dig out something from the racks that I’ve not played in years and give it a spin. Sometimes this provides proof of the quality of your recollection (Cookie Crew’s ‘Got To Keep On’ is still a fine tune); on other occasions it reminds you that your subsequent experiences have rendered your memory inaccurate in the extreme (Overlord X’s ‘14 Days In May’ is not the masterpiece that I remembered it as being, lyrically worthy as it still is).

The last couple of months I’ve been travelling all over the UK (and touching down a couple of times in Southern Europe too) with work and my trusty HTC Tytn II Windows Mobile smartphone has done me proud. It might not exactly be cutting edge anymore – a feeling I know only too well – but despite the battering it has taken over the last two years, it’s remained about the most trusted piece of hardware I own.

We’ve been adjusting our corporate phone contracts recently, which means that I have to separate my personal number from a (new) work number, and so need a new phone for one of the two. Now, aside from the fact that I have never understood why twin SIM handsets never became more than a briefly glimpsed niche product (as that would be ideal), I started to look at the alternatives.

I knew I needed my new work phone to do everything that my Tytn II could do as a bare minimum, and whilst browsing through the handsets available on our new network it occurred to me that there really wasn’t anything there which added anything new to the party.

Going back a few years, the first MS Mobile phones I used suggested the adage ‘…as a phone, they make a good PDA…’ and subsequent Blackberries only confirmed that experience. Today, things are somewhat better: MS Mobile is very usable and iPhone OS is gradually bridging the professional/consumer smartphone market with every release. Android looks like it will develop into something interesting, but support for Exchange seems limited right now to 3rd party apps (like Touchdown) and that is a primary requirement for any work device in my current job. In short – despite some interesting upcoming HTC handsets – there seemed no reason to migrate to anything new. So, unlock the Tytn II and swap the SIM. Problem solved…. sort of.

Now, I had the reverse issue. The existing SIM had to live somewhere. I’ve had the same mobile number since 1998 (and my original Motorola brick complete with a mighty 15 mins of talk time per month) and it is still the primary way in which people know they can reach me. After a similar browse through the consumer end of the market for something appropriate (most of which do seem to be hybrid MP3 players or cameras first, phones second), I decided to do something radical. Or rather, not radical, but regressive.  I’d take a step back and remove myself from the arms race. I’d go ‘Retrophone’.

So, on the first day of my two-week break from work, I put on my circa ‘99 pair of Levis ‘Engineered’ jeans and Adidas ‘Stan Smith Comfort’ (both of which had seen better days) and went to the post office to reunite myself with an old member of the family. One whose birth date matches the era of those (now battered) items of clothing.

The Ericsson T28 'Retrophone' charging on the kitchen worktop.

Y’see, recalling that the aforementioned Motorola was a complete dog of a device, I went for my 2nd ever handset, the mighty Ericsson T28. Actually, mighty isn’t exactly accurate, given its diminutive dimensions and weight (only 81g complete with the onboard Lithium Polymer battery, a first for a mass market device back then). A few minutes on eBay got me a reconditioned example, complete with a charger and two batteries for a whole £15 delivered.

Aside from the fact it took 48 hours for the package to arrive from Hong Kong to my local Royal Mail office, who then lost it for 3 weeks, the first thing that struck me was how little there was to the package. None of the ephemera that you get in a modern phone box (CDs, cables, headphones….), just a charger. You could at the time of launch buy a serial cable for attaching it to your PC, but in the spirit of ‘keeping it retro’, I didn’t have one first time around and I wasn’t going to have one this time either.

My experiment was going to be a simple one. Could I cope without all the smartphone functionality ? OK, I would still have all that on my work device, but this was a consumer test. How much of the smart stuff on the HTC would I miss by paring my mobile computing down to the bare minimum ? No mobile web, no GPS, no keyboard or predictive texting, no desktop syncing.

5 days in and so far there are a few things that are immediately glaringly obvious;

  • I’d forgotten what a pain that little aerial was. Literally. In the aforementioned jeans (so far only trousers tested), I now recall the unerring ability for the aerial to lodge itself in my groin on each occasion I sit down with the phone in my pocket. I do appreciate why I had removed that from my consciousness. Minus 1 for ‘Retrophone’.
  • The battery life is spectacular. Even using the (obviously fake) battery supplied on a first charge, the ‘Retrophone’ was still well and truly alive after 5 days. Now, I know it’s not actually doing very much in comparison to that in the Tytn II which does tend to exhaust itself within 12 hours, but still….. Plus 1 for ‘Retrophone’.
  • Texting is not easy. This is a device from before the days of predictive texting (which in itself is still a consistent partial fail), so you are back to the days of hitting ‘2’ once for ‘A’ or ‘9’ four times for ‘Z’ (plus * if you want to change case). Additionally, you’re also dealing with single-deck messages limited to 160 characters…. so if someone sends you a multi-deck message, you get two separate messages that don’t always arrive in order. Mind you, I have managed to recall some of the limited texting skills I had back in the day and whilst I miss my soft keyboard, currently …. A score draw.
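For anyone who has mercifully forgotten how multi-tap texting works, here is a minimal Python sketch of it. The key map is the standard 12-key layout rather than anything lifted from the T28 manual, and the function names are my own invention, so treat it as an illustration only.

```python
# Standard 12-key letter layout (an assumption; not taken from the T28 manual).
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}

# Invert the layout: letter -> (key, number of presses needed).
PRESSES = {
    letter: (key, index + 1)
    for key, letters in KEYPAD.items()
    for index, letter in enumerate(letters)
}


def multitap(word):
    """Return the key presses for each letter of `word` (letters A-Z only)."""
    return [key * count for key, count in (PRESSES[ch] for ch in word.upper())]


def total_presses(word):
    """Count every individual key press needed to type `word`."""
    return sum(len(group) for group in multitap(word))


if __name__ == "__main__":
    print(multitap("Saints"))
    print(total_presses("Saints"))
```

Anything other than a plain letter raises a KeyError here, which is roughly how it felt at the time.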

I’ve still got a full week of holiday before I return to work, so I’m hoping to shakedown any more obvious flaws in the next few days. Once we’re back into the working cycle, we’ll see how well ‘Retrophone’ copes alongside the Tytn II in the daily grind.

Stay tuned.

Matt Mullen is an Industry Consultant at Nstein Technologies [http://www.nstein.com]. He promises to keep mentions of his groin to the bare minimum in future. Apologies.





Orgdata – The Attributes of Football

15 06 2009

There are many things that define us as people, over which we have no control. Some of these are obviously decided at a genetic level; colour of your eyes, skin tone…. the fact that I had dead straight hair until I was about 15 and then it went inexplicably super-curly almost overnight. We’re just born this way and there is no way to fight it. Even with ‘Frizz Ease’.

For many of us, the same goes for sporting allegiance.

As soon as I was old enough to be able to pick out colour and shape, it was pretty clear what life had in store for me. I was dressed in red and white stripes and from that moment, I was to be a Saints fan. Getting on for 40 years later, and save for a single fleeting day of glory back in the mid-70s that I can barely remember, I can almost count my football genes as being as unsuccessful as my skintone genes (10 minutes outside on a sunny day and I’m generally sporting burns of the same palette as one of those aforementioned stripes).

Right now in the UK, we’re supposed to be in the football ‘Close Season’. All the leagues are done for another year, silverware distributed and players off rapidly gaining weight on their summer holidays whilst keeping half an ear for the mobile call from their agent to alert them to a pre-season move elsewhere.

The reality of course is that as far as News goes, there is no such thing as a football close season. Arriving back at a London mainline station one afternoon this week to begin the final leg of my journey back to the coast, the giant TV News screen screamed the headline ‘£80m!’, the fee agreed for Cristiano Ronaldo’s transfer from Manchester United to Real Madrid. This, just days after the same buyers had agreed to pay A.C Milan £56m for the services of Kaka.

The next day, I watched a solitary Saints player walking to the ground – at the end of my street – for his preseason fitness tests. Tests which, unless something concrete changes in the next few weeks, might be somewhat redundant. Saints were forced into financial administration at the tail end of last season, an event triggered initially by exceeding a banking overdraft facility by £5k.

There are times when I forget how dense the information is surrounding specialist areas of knowledge and football is a perfect example. Growing up as a kid, we used to collect the football stickers produced by the Italian company ‘Panini’ and try to get complete collections of the players, grounds and badges of all the top teams stuck into our albums. Of course in the process you gained an increasingly detailed and somewhat arcane knowledge of the subject…. even now I don’t even have to think about these things, so ingrained are they in my consciousness.

Sometimes you’ll overhear a football conversation on a train. Someone will mention ‘City’. I can’t help myself wondering which ‘City’ they are talking about. Manchester ? Leicester ? Norwich ? Then you’ll hear something else that helps ‘… down at Dean Court…’. Ahh, ok so they’ve been to A.F.C. Bournemouth, which makes them much more likely to be Lincoln City fans. Or Chester City, Exeter City…. and it’s not until you turn to see them and see Maroon & Gold and know right away that it was actually Bradford City after all. They were Bantams*.

Last time, I talked about Geodata, adding descriptive information specifically related to ‘Places’ that might be found in text (for example geographic coordinates) and some of the opportunities that present themselves when you mix them cleverly into the user experience.

These additional bits of information we can call ‘Attributes’. For example, the city of Southampton, might look like this when described at a geographical level;

Southampton
<Latitude = "N 50° 54′ 0″">
<Longitude = "W 1° 24′ 0″">

That information in itself is enough to plot it onto a mapping application. However, there’s obviously more that we can add. For example, population.

Southampton
<Latitude = "N 50° 54′ 0″">
<Longitude = "W 1° 24′ 0″">
<Population = "246,201">

Now we can plot this onto a map and also weight the location pin by the size of the city. Of course those vested in the subject will correctly recognise the rather facile nature of the above example and rightly point out that it is massively oversimplified. Mapping information is something that is so well covered across the globe (for example in repositories like Geonames and by organisations like OS in the UK), that maintaining this sort of detailed data at a local level (‘Curating’) is just not necessary.
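To make the idea of attributes a little more concrete, here is a rough Python sketch of the Southampton example held as a simple record, with a helper that sizes a map pin by population. The dictionary layout, the pin_radius name and its scaling constants are all assumptions of mine for illustration, not how any particular mapping product does it.

```python
import math

# Attribute values copied from the Southampton example above; the dictionary
# layout itself is an illustrative assumption.
southampton = {
    "name": "Southampton",
    "latitude": "N 50° 54′ 0″",
    "longitude": "W 1° 24′ 0″",
    "population": 246201,
}


def pin_radius(population, base=2.0, scale=1.5):
    """Size a map pin by the order of magnitude of the population.

    `base` and `scale` are arbitrary values chosen for the sketch.
    """
    return base + scale * math.log10(max(population, 1))


if __name__ == "__main__":
    print(southampton["name"], round(pin_radius(southampton["population"]), 2))
```

Using a logarithm keeps a big city from swamping the map while still letting it stand out from a village.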

In his recent exemplary article, my colleague Chris Scott posted the question ‘Semantic Web ? What’s in it for me ?’ and whilst I don’t intend to retread what he describes in great detail, there is much in there that will help us here, as we’re beginning to make the journey towards the world of ‘Linked Data’.

What ‘Linked Data’ is beginning to do, to a greater or lesser degree, is to almost commoditise very high-level generic factual knowledge. Any of us can hook up applications to ‘The Cloud’ and get hold of ‘attribute’ information which will help us improve the user experience of our sites. All we need to do is to hold the linkage between us and it, the ‘Uniform Resource Identifier’ (‘URI’), and we can call the data whenever we need it.

However, publishers hold another very precious thing within their organisations; their own specialist information, their own highly valuable ‘Knowledgebase’. For example, what does the average UK newspaper hold in terms of specialist data on football ? Far more than exists currently within recognised Cloud resources for sure.

Looking back at the early paragraphs of this post, it is packed with footballing information, both ‘entities’ (in this case ‘People’ and ‘Organisations’), but also what we could refer to as ‘attribute’ data of the entities themselves.

I’m a Saints fan. In that case ‘The Saints’ could be said to be a ‘Nickname’ attribute for the entity ‘Southampton Football Club’, the same way as ‘The Bantams’ is of ‘Bradford City Football Club’. When we can categorise an article as being about ‘Football/England’ and we identify ‘Saints’ as a term within the text, we are able to use the presence of that term to suggest that it is also about the parent term, even if that is not actually present in the text directly. This collection of terms we can call ‘Orgdata’, and something like club nicknames is barely scratching the surface of the attribute data that can be described against an entity like a football club.
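As a rough illustration of that nickname-to-parent step, here is a small Python sketch. The two clubs and their nicknames come from the paragraph above; the ORGDATA structure, the suggest_entities name and the naive substring matching are hypothetical and bear no relation to how TME actually does it.

```python
# A tiny hand-curated 'Orgdata' knowledgebase. The nicknames come from the
# text above; the structure and function name are hypothetical.
ORGDATA = {
    "Southampton Football Club": {
        "category": "Football/England",
        "nicknames": {"Saints", "The Saints"},
    },
    "Bradford City Football Club": {
        "category": "Football/England",
        "nicknames": {"Bantams", "The Bantams"},
    },
}


def suggest_entities(text, category):
    """Suggest parent entities whose nickname appears in a categorised text.

    The category acts as the context: 'Saints' only points at the football
    club when the article is already known to be about English football.
    """
    lowered = text.lower()
    return sorted(
        entity
        for entity, data in ORGDATA.items()
        if data["category"] == category
        and any(nick.lower() in lowered for nick in data["nicknames"])
    )


if __name__ == "__main__":
    sentence = "A solitary Saints player walked to the ground."
    print(suggest_entities(sentence, "Football/England"))
```

The same sentence filed under a different category suggests nothing, which is the point: the attribute only becomes useful once the article has some ‘Aboutness’ to anchor it.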

This presents publishers with both choices and opportunities.

By locally curating their own knowledge – and adding their own specialist terms and attributes to the extraction, normalisation and knowledge management capabilities of Text Mining tools such as Nstein’s TME – they are able to add additional flavour and richness to that data when they present it to web users. This not only helps build the overall user experience, but also helps enrich the actual content itself, greatly assisting the ability to package data, for example for syndication.

The opportunities of course do not end there. For many areas of specialist knowledge, there is not right now a ‘de facto’ trusted source, no Geonames or IMDB to refer to.

In the Semantic Web where Linked Data is an essential component part, is being that trusted source the next critical step towards making content pay its keep ?

Matt Mullen is an Industry Consultant at Nstein Technologies [http://www.nstein.com].

(* Yes, I realise you’d probably be able to disambiguate in that case by accent. Assuming you’re good at that sort of thing, natch)





Geodata – The Properties Of Property

7 06 2009

Every so often, as regular as the arrival of yet another UK non-Summer, comes news of another apparent government IT project ‘failure’.

Much in the same way as people seem to be greeting the current travails of the Newspaper industry with some sadistic relish, news that a few million has been spiffed on a grand project that has failed to do what it was supposed to, delivers a similar public response.

Everybody feigns surprise at the news at first and then has a good old moan about how all technology is useless, misuse of public funds and modern life being rubbish. And then promptly forgets about it until the next time it happens.

The first government IT project in this country was not exactly a roaring success either. The government saw a fantastic demo given by an enthusiastic entrepreneur, bought into the panacea described and dropped a fair wodge of public cash with a start-up, who promptly burned through the cash in a few months and delivered absolutely nothing back in return.

Another take of modern life being rubbish ? Not really. This was 1822. The entrepreneur ? Charles Babbage. The product ? The Difference Engine.

When Microsoft by-lined Bing as ‘The Decision Engine’, I winced slightly with Babbage in mind. Last weekend, before it was formally available, I wrote of my hope that we might finally see a proper search ‘product’ that utilised some of the best practice you can find in the vertical search market. Then it launched. My shoulders slumped after a few minutes of play and I went back to hoping again.

‘Google Squared’ – another beta arrival from Google’s big house of endless betas – popped up in the second half of the week and was at least entertaining in its inability to handle fairly basic complex linguistic searches (especially if you add a geographical element to the search query).

This week’s crop of betas has not been all uninspiring though. From the folks at MySociety (and paid for by Channel 4’s 4ip investment fund) comes ‘Mapumental’. Whilst it’s in invite-only beta right now, there is an introductory video to watch whilst you wait for an invitation to turn up in your inbox.

The principle is pretty simple. Let’s say you’re considering a new job: simply enter the postcode of the location you will be working at (UK only I’m afraid) and, by selecting a map location, it will tell you the estimated journey time by public transport (using public data).

The map itself can be manipulated in a couple of other ways. By setting a maximum time that you are prepared to drag yourself out of bed in the morning, the map will show you the places that you can realistically live and still be at your desk by 9am. You can then set a property value – based on the average sale price, itself public data again – and the map will further refine the areas showing you where you can afford to live.

Finally, to gild the lily further, you can set a final control called ‘Scenicness’, which can be set to show only those places remaining that have been scored as having varying degrees of ‘prettiness’ (via MySociety’s photo scoring site ‘Scenic’). Ok, this bit isn’t especially scientific, especially when the places voted as being most scenic tend to be those without actual houses or places of work. Still, nice idea in principle.
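Stripped of the map, the three controls described above amount to successive filters over the same set of places. A minimal Python sketch follows, assuming each area already carries a journey time, an average sale price and a scenicness score. The Area class, the shortlist function and the sample figures are all made up for illustration, and I have no idea how Mapumental implements this internally.

```python
from dataclasses import dataclass


@dataclass
class Area:
    name: str
    commute_minutes: int  # journey time to the workplace by public transport
    average_price: int    # average sale price in pounds
    scenicness: float     # crowd-sourced score, assumed here to run 0-10


def shortlist(areas, max_commute, max_price, min_scenicness):
    """Keep only the areas that pass all three slider settings."""
    return [
        area.name
        for area in areas
        if area.commute_minutes <= max_commute
        and area.average_price <= max_price
        and area.scenicness >= min_scenicness
    ]


if __name__ == "__main__":
    # Entirely made-up sample data.
    areas = [
        Area("Area A", 35, 180_000, 4.0),
        Area("Area B", 50, 150_000, 7.5),
        Area("Area C", 95, 120_000, 9.0),
    ]
    print(shortlist(areas, max_commute=60, max_price=200_000, min_scenicness=5.0))
```

Each slider only ever removes places, which is why the map shrinks as you get fussier.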

The tool itself is a great deal of fun to play with and the results are a useful guide… however, if MySociety are suggesting that it is possible to commute into Central Southampton from Sandown, IOW in 2 hours by public transport, then I suggest that they’ve never attempted that journey (train, ferry, train) for real. However good the tool is, it’s still relying on the quality of the data that supplies the abstraction.

Given our national obsession with property and property prices, it is no wonder that some of the best vertical search tools of this type come from the UK. Globrix – a UK-based property search tool – is a good example of how using the detailed data that can be extracted from text can provide a high quality of user experience.

In my recent search trilogy I discussed the idea of ‘Aboutness’, basically the understanding of what information a piece of content contains, and how we can use that information (‘Metadata’) to drive the sort of user experiences that keep people on our sites longer. Globrix uses some of these ideas – for instance finding linguistic ‘concepts’ within the property detail text – and allows those to be used to refine search. Location information via Postcode drives a map abstraction of these results produced in real time. It’s an impressive effort.

Obviously this sort of mapping and the use of ‘geodata’ is a key for organisations for whom property is their key business, but how can other content-heavy organisations like Newspapers use some of this technology to help develop their own user experience ?

News content is often heavy with geographical information. Many of the most persuasive ‘keywords’ – those terms we use intuitively to make our split-second decisions on whether to read or ignore – are those that tell us how ‘close’ this content is to us. This is not just the high-level terms like country, but more distinct…. city, town, district, even landmark.

As previously discussed, manually adding relevant tags to content quickly becomes difficult when you are dealing with large scale operations. Add into that the requirements of geographic information and the task becomes even more daunting. What are these requirements ?

Well, with a normal tag, it is just the descriptive term itself that might be required (e.g. simple : ‘politics’ or complex : ‘council elections’). If we want to start utilising geographical tag data, then this needs to be far more distinct, especially if we want to start tying this together with mapping applications. Aside from the ‘disambiguation’ that we discussed before with reference to ‘People’ (which is equally valid with ‘Places’, as placenames are far from unique), to be able to map the stories we need to know where they actually are, with a great deal of precision.

In Nstein’s Text Mining Engine (TME), we automatically apply additional geodata to those ‘Places’ that TME automatically detects within text. This geodata is supplied in two forms:

- Co-ordinates : The traditional longitude/latitude ‘Sexagesimal’ information.
- WGS-84 : The same coordinate system that your car’s GPS / Sat Nav uses to build maps, and that is used by almost all online mapping applications.
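The two forms describe the same point, and moving from one to the other is simple arithmetic: degrees plus minutes over 60 plus seconds over 3600, negated for South and West. Here is a minimal Python sketch of that conversion into the signed decimal degrees that mapping APIs generally expect. The function name and the accepted input format are my own assumptions, not TME’s output format.

```python
import re


def sexagesimal_to_decimal(value):
    """Convert a string such as 'N 50° 54′ 0″' to signed decimal degrees.

    Expects a hemisphere letter followed by degrees, minutes and seconds
    separated by any non-digit characters (an assumed input format).
    """
    match = re.fullmatch(
        r"\s*([NSEW])\s*(\d+)\D+(\d+)\D+(\d+(?:\.\d+)?)\D*", value
    )
    if match is None:
        raise ValueError(f"unrecognised coordinate: {value!r}")
    hemisphere, degrees, minutes, seconds = match.groups()
    decimal = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative in signed decimal degrees.
    return -decimal if hemisphere in "SW" else decimal


if __name__ == "__main__":
    print(sexagesimal_to_decimal("N 50° 54′ 0″"))
    print(sexagesimal_to_decimal("W 1° 24′ 0″"))
```

For the Southampton figures used in an earlier post, that works out at roughly 50.9 North and 1.4 West.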

Adding this information turns your ‘tags’ into ‘geotags’. Once you have this information, you can display your content not only by subject maps (like our previously discussed ‘Topic Pages’), but also by geographical maps. For a demonstration using Nstein’s WCM product, we built a simple widget which showed editorial staff at a glance the geographical ‘Aboutness’ spread of their content, using the geotags generated by TME, displayed on a Google Maps globe.

Google Maps of course is just a start. For example if Google Latitude starts to get real user adoption  – the upcoming iPhone version will surely help that process – then ‘News content about where I am right now’ will be a viable option for mobile users. Ally this with (user opt-in) advertising content…. and there are some interesting applications on the horizon.
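The ‘News content about where I am right now’ idea boils down to measuring the distance between the reader and each geotagged story. Here is a rough Python sketch using the haversine great-circle formula. The story records, field names and radius are invented for illustration, and a real service would of course take the reader’s position from the device rather than hard-coding it.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in decimal degrees (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stories_near(stories, lat, lon, radius_km):
    """Return headlines of geotagged stories within radius_km, nearest first."""
    measured = [
        (distance_km(lat, lon, story["lat"], story["lon"]), story["headline"])
        for story in stories
    ]
    return [headline for dist, headline in sorted(measured) if dist <= radius_km]


if __name__ == "__main__":
    # Invented stories with approximate coordinates.
    stories = [
        {"headline": "South coast story", "lat": 50.9, "lon": -1.4},
        {"headline": "Yorkshire story", "lat": 53.8, "lon": -1.75},
    ]
    print(stories_near(stories, lat=50.9, lon=-1.4, radius_km=25))
```

Swap the radius for a bigger number and the Yorkshire story turns up as well, just further down the list.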

Matt Mullen is an Industry Consultant at Nstein Technologies [http://www.nstein.com].






Livin’IT – Bing There, Done That

30 05 2009

15 years ago I bought an LP.

It was one that I had been really looking forward to being released. The two singles that had preceded it had whetted my appetite and having devoured the full-length release just once, I immediately phoned everybody I knew to demand that they too go out and buy it. A few of us travelled the length of the country a few months later to hear it live and for the first and only time in my life, I cried at a gig due to its quality.

Of course what happened in the year or so after my purchase was that gradually it was everywhere (every shop, every TV show), closely followed by a small army of similar closely modelled products, each facsimile filtering out and diluting many of the elements that made the original artifact what it was. And indeed, still is.

This morning I dug out the LP for its annual single airing, and whilst it spun on the Soundburger, I sat down to catch up on the week’s industry news. Primary amongst this was of course Microsoft’s bi-annual attempt to do mass-market search, ‘Bing’.

Now aside from the curious naming decision, there is plenty in this that many of us in the industry will find strikingly familiar. Indeed, Paul Miller’s blog post sums it all up very neatly; this is technology which is already out there and well proven, and I suppose that explains my initial ‘meh’ reaction to the announcement. But in that lies the explanation; ‘…us in the industry…’.

Just because we are familiar with faceted-search, result clustering, semantic keyword analysis et al, doesn’t mean that anyone outside our little cosseted gang is. ‘Bing’ is supposed to be a mass-market tool, in a way that something like Clusty will never be and Newssift was never intended to be. And as for Wolfram Alpha… that was the answer to a question that nobody actually asked.

So, it’s not about the technology per se in this respect, but the packaging. A good case in point would be Apple.

Now, I’ve been accused in the past of being somewhat anti-Apple. For the record, it’s not the case, I guess I just don’t buy into the Applecult. Whilst I do not see Apple as a technology company at the bleeding edge, I do see them as a hugely admirable and successful product company. What do I mean by this ?

What are Apple held in high regard for ? Macs, iPod & iPhone. None of these were conceptual developments that came from within Apple. I mean they did not invent the idea of the personal computer with a GUI (Xerox would be a better bet – check this at 4:20 for what I’m referring to). They did not invent the MP3 player and they did not invent the mobile telephone.

Now the last one is more complicated. iPhone is not just a mobile ‘phone, but a smartphone. I’ve had an increasingly improving experience as a Windows Mobile user for the last 5 years and Symbian has been around a tad longer than that.

Again, iPhone is not just a smartphone, but an application platform. But that again was not their concept, Nokia got there first by a fairly long chalk. Ok, they screwed up their market advantage with a series of baffling ideas (segmenting the OS to device releases is just one of the boneheaded decisions) and are now dropping marketshare every quarter.

What Apple have done is to package technology better than everyone else. They’ve made products that the mass-market can use out-of-the-box and where there have been shortcomings in the products (it’ll be the 3rd iteration of the OS before iPhone can do what my HTC TyTn II could do 2 years ago), they’ve been almost of background importance to the functioning of the product in the eyes of the consumer.

They work, they look good and they are easy to use. How could I fail to appreciate that ?

With ‘Bing’ the key will not be whether the technology is unique. Not whether you can argue that it has been done elsewhere before. What will be important is whether the best practice demonstrated by a range of small-scale and niche players can be packaged right for the mass-market, but not over-diluted. Whether the execution is perfect.

In short, to succeed ‘Bing’ must not feel like using technology. It must instead feel like using a product.

Matt Mullen is an Industry Consultant at Nstein Technologies [http://www.nstein.com].





Linking – Is All Similarity The Same?

20 05 2009

Today I was lucky enough to be able to speak on the second day of the 2009 ePublishing Innovation Forum in London, presenting; ‘Is All Similarity The Same? How Context Drives Revenue and Brand Loyalty’.

Now whilst you can cover a fair bit in a 15 minute talk, there are some areas where naturally you have to gloss over a fair bit of the potential detail. And indeed, there might be those of you out there who would have liked to have heard the presentation but for reasons of geography/time/money/lethargy (delete as applicable) were not able to.

So, in order to enlarge on the talk itself and to open the discussion to anyone who is interested, here is a slight redux. A ‘Director’s Cut’ if you like (but not like the ‘Director’s Cut’ of ‘Cinema Paradiso’ where the ending got totally screwed, or a George Lucas one where it’s sort of the same but with extra CGI and Greedo shooting first).

It is probably worth mentioning at this point that as a primer, I have recently written a 3 part series on user search modes. In that series I touch on a number of areas which are complementary to this piece and it goes without saying that I’d recommend you read that too if you have time.

Right now, there is a huge ongoing debate about what sort of charging model publishers should use to try and derive cash from their content. Thus far, the overriding model has been a free / advertising supported model, where the building of a mass audience has outweighed any real thought of creating mass revenues.

The collapse of the advertising market, both in terms of quantity of potential clients and the rates at which they are charged, has seen a rapid shift towards re-addressing charging models for this content. Some – perhaps those with more specialist data – have well-established models for carrying this out, but now it’s the mainstream news publishers who are looking seriously at following suit.

Ultimately however, regardless of the charging models themselves, the same challenges exist. I might be accused of over-simplifying here, but to me these broadly are;

- Reader Acquisition
- Reader Retention

In fact they are no different for news publications when we look at their paper-versions, and online they don’t really differ whether you are applying a charging model for readers or not (just replace ‘Reader’ with ‘Subscriber’). As we’re talking about online here, I’m just going to refer to ‘Visitors’ as a generic term. So ;

We want new visitors. And we’d like them to come back.

Traditionally, the area of visitor acquisition has been the domain of Search Engine Optimisation (SEO). This being the method of luring users in through the ‘side door’ (from places like Google) directly into pages that match their searches. Doing this well has a proven success rate and as a result there is a myriad of sources out there to read up on and learn best practice.

What I am going to focus on is one of the most important ways of building visitor retention. That of automatically providing content similar and relevant to that which they are already reading.

Now, this sort of functionality is not exactly new. Look at an ‘article page’ on any news site and you’ll see this in the accompanying side bar, usually called ‘Related Stories’. Some of these are generated by our old friends the search engines, some from specialist tools, but many are hand-cranked by editorial staff during the creation-cycle.

However they are created, they are important to our goal of retention, because they help us to show the depth of our knowledge, the gravitas of our brand and crucially, that we understand the requirements of the visitor. For the free charging model, they help as additional clicks to the visit (helping build our advertising charging model), with a charging model they also assist in demonstrating the ‘Fitness For Fee’. The more a visitor consumes, the more likely they are to see value in continuing to subscribe.

In the search series, it was suggested that the key to meeting the various modes discussed was real understanding of the ‘Aboutness’ of content. You’ll not be surprised that it is also key here. However, whilst before we really only touched on basic tagging of ‘Entities’ (People, Places & Organisations) and ‘Concepts’ (single and multi word descriptive text) we now have to add to that, ‘Relevance’.

Knowing that say ‘Girls Aloud’ appears as an entity in our text is one thing. What now becomes more important is how relevant they are to the overall subject matter of the article itself. The more relevant they are, the more likely they are to match another piece of content on the same subject.

With Nstein’s Text Mining Engine (TME), mathematical scores are added to each automatically generated entity/concept, suggesting how relevant they are to the overall content item (e.g. ‘Girls Aloud’ are 83% relevant to this article).
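As a rough illustration of what relevance-scored tagging might look like as data, here is a minimal sketch. The `Tag` structure, the `strongest_tags` helper and every score other than the 83% above are made up for the example; this is not TME's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    kind: str         # 'entity' or 'concept'
    text: str         # the tagged term
    relevance: float  # 0.0 (incidental mention) to 1.0 (the article's subject)


# Hypothetical tags for one article; only the 0.83 echoes the example above.
article_tags = [
    Tag("entity", "Girls Aloud", 0.83),
    Tag("entity", "London", 0.41),
    Tag("concept", "pop music", 0.67),
    Tag("concept", "tour", 0.22),
]


def strongest_tags(tags, threshold=0.5):
    """Return the tags at or above the threshold, most relevant first."""
    kept = [t for t in tags if t.relevance >= threshold]
    return sorted(kept, key=lambda t: t.relevance, reverse=True)


for tag in strongest_tags(article_tags):
    print(f"{tag.text} ({tag.kind}): {tag.relevance:.0%}")
```

The point of the score is simply that a passing mention and the main subject of an article no longer look the same to anything that consumes the tags.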

Of course, this is something that you could consider performing manually. Again, with small collections of content, where the same manual tagger scores all items, this is even achievable. However, it is vital to have a consistent scoring mechanism – the same baseline methodology for all the mathematical results – and this is almost impossible for a human being to do alone.

A single article might have entity/concept lists that run into high double figures for each. Scale that out across the tens of thousands of articles that a modest content heavy organisation produces annually and the size of the task becomes clear.

When we apply this sort of tagging methodology to collections of content – for example all articles inside a news content database – we’re actually creating something that we can refer to as a ‘Knowledgebase’. This being a repository not only of content, but also knowledge about what that content is about, the ‘Aboutness’ being described in that tagging ‘metadata’ for each document.

This Knowledgebase is also potentially highly interconnected. The relative ‘Aboutness’ of each document can be calculated against each and every other document by the use of this metadata. At the heart of the relative strengths of these connections lies their ‘Base Similarity’. Now these connections are pretty complex things, if we look at the fact each item of content might have 50 separate elements within its metadata, each with a different level of relevance, the connections between each are individually multi-faceted.

The Knowledgebase itself is of course not static, but rather a living and breathing organism. New content is likely to be added on a constant drip basis, with each new item creating a disturbance to the existing connection calculations. It shouldn’t be any surprise why we employ really bright mathematicians to bring order to these conditions within our products.

Now, ‘Base Similarity’ is a wonderful thing. We can take our repository of content and using the tagging that we can apply to each, create a ‘Knowledgebase’ rich with multi-dimensional relevancy links between each item. Better still, as we can use a constant automated method to do this, there would be no subjectivity in how these linkages would be created or maintained (or arguably, a consistent level of very low-level subjectivity in the calculations). Best of all, we can provide highly accurate ‘Similar Stories’ result sets to accompany articles.
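To make the idea a little more concrete, here is a minimal sketch of one way a similarity between two tagged items could be computed. It assumes each item is reduced to a dictionary of tag to relevance score, and it uses cosine similarity as a stand-in; I am not suggesting this is the calculation our products actually perform, and the document names and scores are invented.

```python
import math


def base_similarity(a, b):
    """Cosine similarity between two {tag: relevance} dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an untagged item is similar to nothing
    return dot / (norm_a * norm_b)


def similar_stories(doc_id, knowledgebase, limit=3):
    """Rank the other items by similarity to doc_id, dropping zero scores."""
    target = knowledgebase[doc_id]
    scored = [
        (other, base_similarity(target, tags))
        for other, tags in knowledgebase.items()
        if other != doc_id
    ]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical knowledgebase: document id -> {tag: relevance}.
knowledgebase = {
    "tour-review": {"Girls Aloud": 0.83, "pop music": 0.67, "London": 0.41},
    "album-news": {"Girls Aloud": 0.90, "pop music": 0.50},
    "city-guide": {"London": 0.95, "travel": 0.80},
    "match-report": {"football": 0.90},
}

for other, score in similar_stories("tour-review", knowledgebase):
    print(other, round(score, 2))
```

Because every item is scored by the same function, the ranking is consistent in a way that a roomful of manual taggers would struggle to match.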

I’m lucky in my job that I get to hang out with bright people. Not only the aforementioned Nstein mathematicians (who have to explain things to me very slowly so I can keep up), but also within our customers in the publishing industry. Spending time with these people gives the slower-learners like me the chance to absorb their interesting ideas, re-phrase them and pretend that they were actually my idea in the first place, stealing the credit and learn from their best practices.

Talking to a few of these people late last year gave me an interesting insight into how ‘Similar Items’ could be better modeled for the online publishing world, especially for newspapers. It’s not that the above described methodology is wrong, far from it, but that it is a baseline. A starting point on which you can then build something far more interesting.

As we have discussed, the similarity calculations that we have are very low in subjectivity. Trouble is that our visitors are not. They bring their own context to their judgement of similarity and it is important to reflect this in the experience we provide them with.

For example, I’m viewing an article in the ‘Travel’ section of a site. The place I’m reading about has been visited by a few celebrities over the years. Now it may be that the ‘Base Similarity’ calculation for similar items at this point would correctly produce articles that reference these celebrities, but is this what the visitor would expect ? After all, they are reading the ‘Travel’ section. We know this. Shouldn’t we favour similarity calculations upon ‘Places’ rather than ‘People’ ? In essence, the context in both what and where a visitor is reading content should carry weight in the similarity calculation.

‘Base Similarity’ also treats all content as being equal. It is part of that important objectivity that we want from the creation of the metadata and the linkages. However, for news organisations who are attempting to acquire and retain visitor numbers, all content is most certainly not of the same value.

Today’s football scores for example, appear in a variety of sources, you can find them almost everywhere and as such they are really commodity content. Today’s exclusive new column from the well-respected and popular football writer is a high value unique property. We want to tip visitors towards our highest value content where we can, and that is another weighting that should be able to be applied to the similarity calculation in certain circumstances.
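Both of these weightings, section context and content value, can be sketched as a layer sitting on top of the objective tag scores. Everything named below (the house-rule tables, the `rank_related` function, the articles and their scores) is hypothetical and only there to show the shape of the idea.

```python
# Hypothetical house rules, editable by editorial staff without touching
# the underlying tagging or scoring.
SECTION_RULES = {"Travel": {"place": 2.0, "person": 0.5}}
VALUE_BOOST = {"exclusive": 1.5}  # anything unlisted counts as 1.0


def weighted_overlap(a, b, kind_weights):
    """Sum relevance products over shared tags, scaled per tag kind."""
    total = 0.0
    for key in set(a) & set(b):
        kind, _text = key
        total += kind_weights.get(kind, 1.0) * a[key] * b[key]
    return total


def rank_related(current, candidates, section, rules, value_boost):
    """Order candidate ids for the current article under a section's rules."""
    kind_weights = rules.get(section, {})
    scored = []
    for cand in candidates:
        score = weighted_overlap(current["tags"], cand["tags"], kind_weights)
        score *= value_boost.get(cand["value"], 1.0)
        if score > 0:
            scored.append((cand["id"], score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored]


# Tags are keyed by (kind, text); all scores are invented.
current = {
    "id": "resort-feature",
    "tags": {("place", "Portofino"): 0.9, ("person", "Famous Actor"): 0.6},
}
candidates = [
    {"id": "celeb-profile", "value": "standard",
     "tags": {("person", "Famous Actor"): 0.9, ("place", "Portofino"): 0.2}},
    {"id": "coast-guide", "value": "standard",
     "tags": {("place", "Portofino"): 0.7, ("place", "Genoa"): 0.6}},
]

print(rank_related(current, candidates, "News", SECTION_RULES, VALUE_BOOST))
print(rank_related(current, candidates, "Travel", SECTION_RULES, VALUE_BOOST))
```

With no rule for the section, the celebrity profile ranks first; under the ‘Travel’ rule the coastal guide overtakes it, even though none of the underlying tag scores have changed.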

The conclusion to this might be judged as a bit backwards. I started by discussing that similarity needs to be based around detailed objective scoring and then seemingly have contradicted myself by saying that judging similarity purely objectively might not be enough.

In fact turning your content repository into a knowledgebase is the critical starting point for any automated similarity solution. To be able to apply any editorial imperatives to these calculations – for example adding subject category context as I described with ‘Travel’ – still relies on the objective scoring as a starting point. The weighting of the results is a secondary action – a filter – that can be applied dynamically dependent on the house-rules for an organisation. Critically, these filter rules can then be rapidly shaped and tuned by the online editorial staff on-the-fly, as they are not part of the core methodology of how the complex linkages are calculated in the knowledgebase.

In short, they don’t need to be mathematicians. Because we already have them.

Matt Mullen is an Industry Consultant at Nstein Technologies [http://www.nstein.com].





Livin’IT – Where There’s A Will, There’s A Weg

15 05 2009

I have an almost-namesake and you’re reading this blog via his baby. When we’re old men, I suspect our lives will be pretty different. I’ll be the bloke in the park, feeding the ducks and waving his walking stick as the kids roar past on their hover boards shouting….

‘I fought in the browser wars so you could ride those damn things’

…whilst my almost-namesake will probably be enjoying his retirement by traversing the globe in his luxury Enourmoyacht. Having invented the damn hover board no doubt.

Back in the heady days of the late 90’s, I was working for an internet technology ‘incubator’ unit based at my local university. I’d spent the previous few years working for one of the biggest computing companies on the planet and now going to work in jeans on a university campus was a welcome change. My colleague and I had persuaded our new boss to introduce an unofficial 20% rule – inspired by a similar system at 3M – so we could develop our own ideas on top of the work that the unit was supposed to be doing.

A few months into the job, I called my colleague into my little office to show him something that I’d hacked together over a few weeks of my 20% time. I’d been trying to keep a little diary and was getting bored mashing this manually into HTML, so I’d built a rudimentary GUI, which allowed me to write entries in text and then saved it to a database and served it out properly rendered.

There was no inline linking or styling (you could add links into a ‘related links’ section separately, which when served appeared alongside in the page). It was therefore pretty basic stuff.

‘What is it ?’ asked my colleague.
‘It’s a Web Diary system. You can type in your diary entry and then publish it to the web’ I replied.
‘Who the hell wants to publish a diary ?’
‘Erm…. not sure. I’m sure someone will’.

That was the last we heard from that project.

A year or two later, I had started to develop small-scale systems which we would now recognise as ‘Web Content Management’, mainly at that stage for intranets and extranets. The incubator unit had been disbanded and I’d been taken on by a small consultancy company, to further develop these systems and add additional functionality for the increasing number of customers.

I remembered the diary code, blew the dust off it and integrated it into the pre-release version of the new intranet software. The idea was that the staff could maintain their own pages within the organisational structure about themselves. I’d added the ability to upload pictures, extended the authoring a bit to allow inline linking so they could add links to each other’s pages etc… it looked half-decent.

‘Explain it to me’ said the new boss.
‘Well, it’s sort of a way for staff to share their interests with colleagues’
‘Why would they want to do that ?’
‘I s’pose so they can find other people in the company who like the same things as they do….’
‘Nobody is going to do that’.

Again, that was the end of that idea.

Between these two events – before the end of the unit – my colleague and I had invented an addictive new game.

We’d been sent a small plastic football from one of our customers and we used to spend our lunch hours in the server room, trying to chip it into a waste paper basket (or as we say here, ‘bin’). So addictive was this game, that we’d get to work early and leave late just to try and perfect the techniques required to win at what we’d now dubbed ‘BinBall’.

An unfortunate incident involving one of us knocking out one of the DNS servers for over an hour with a misdirected shot led to it voluntarily being banned, which was a shame as we’d only just finished the 15 page ‘Official Guide to BinBall – Authorised by the National BinBall Association’.

And we’d been too slow to get the ‘nba.com’ domain.

Matt Mullen is an Industry Consultant at Nstein Technologies [http://www.nstein.com]. He also now accepts that server rooms are no place to develop dangerous ballgames.





Search – Furnishing Discovery

15 05 2009

I’m very easily distracted when I’m writing.

Right now, I’ve got the radio on (listening to a football commentary on Radio 5) tapping this article into Notepad (still my primary choice for writing after all these years) whilst simultaneously trawling through eBay for house stuff. It’s the latter which seems to be dragging more attention out of me than it really should.

For the last few months, I’ve been trying to track down various bits to furnish my house. When my friends pop in, they are gradually becoming less happy that there is nowhere for everyone to sit (other than a tiny 2 seater sofa and the vast expanses of bare floor). They can’t understand why virtually all my clothes live in bin bags on the floor of my bedroom. A room that is slightly poorly monikered, since I don’t own a bed. And if they come to eat, there is no table or chairs, so it is a careful plate-on-lap balancing exercise, which is far from genteel.

‘You moved in a year ago Matt. How can you live like this ?’

For me, the answer is a few hundred yards down the road. There lies the local council recycling centre, where I go every few weeks to take the stuff that cannot be collected from the house by the bin men, but can still be recycled. It always amazes me the things I see people throwing away, primarily almost new furniture. With the decline in prices for that sort of stuff over the last decade or so, people treat it like disposable fashion. Buy it, assemble it, break it, bin it. It’s so cheap, we’ll buy another.

Of course by the time it reaches its final destination, it’s not really possible to save very much of it. The twisted bits of melamine-faced chipboard have generally passed the point of no return, anything salvageable is quickly picked out by the local 2nd hand dealers, who in return for a few quid in the council coffers, drag it back to their waiting vans to be sold on at a small profit at the car boot sale. Which, I suppose, is real recycling at its best.

I decided that I didn’t want to buy anything disposable anymore. If I was going to buy new furniture, I was going to try to buy 1) quality and where possible 2) vintage. I wanted items that would last me forever, things made by proper craftsmen from proper materials, not punched out by a machine. Items that may well outlive me and provide someone else with a lifetime of use.

The continued empty state of the house stands as testament to the fact that this is proving to be very much a ‘work in progress’.

I’m a big fan of 20th Century British design, primarily that from the post-war period and indeed I’m sat writing this post on my small sofa, which was designed by Robin Day. I really would like some more pieces of his work but I really don’t have the time to trawl the antiques shops and markets in the vain hope of finding work by him that I like.

Hence the eBay trawl.

If I type ‘Robin Day’ into the search box, I get a mixed set of results. You see – as is often the case, there is more than one famous person of that name. The other was a political journalist and TV presenter here in the UK, so the result set is full of his books and other ephemera. It’s not that difficult to understand why these appear as high scoring results, it’s just a matter of me manually disambiguating these two different people, which using the ‘Categories’ system is fairly easy.

However, using their ‘Best Match’ search option to sort the results, the top scoring suggestion is a set of car mats for a Peugeot 307. Not really relevant at all. Why are they there ? Because in the item title is ‘Peugeot 307 Carpet Mats Robins &Day Manchester’.

In fact, there is a lot of ‘noise’ in this set; results that include items primarily about ‘Robin Williams’, the song ‘Rockin Robin’ and even more bizarrely ‘Robin Hood’. And that’s just in the first 20 results.

Now, I don’t want to hammer eBay for their search, for it does many things well. For brand related searches – for example ‘Paul Smith’ – it does a good job, because during the item listing process, it attempts to force sellers to select the correct brand ‘tag’ from a drop down listing. Similarly, the aforementioned categorisation system made my Robin Day disambiguation fairly easy (even though I can’t find an easy way to search two sets of categories simultaneously, as the furniture results are found in both ‘Home & Garden’ and ‘Antiques’). Indeed, as my eBay login reaches its 10th anniversary this very week, they’ve consistently made improvements to the search process, which whilst not as good as I’d ideally like, is way beyond that provided by most online retailers.

This sort of searching is illustrative of the search mode that I refer to as ‘Discovery’. I’m searching for information that I’m not sure exists and I’m trying to describe in the best way I can, what it is that I’m trying to find. The actual search I’m trying to perform is actually far more complicated – multi-faceted even – but I know that the way in which I will get to where I want to go, is to gradually refine my search in an iterative way – filtering out the ‘noise’ of unrelated information along the way.

So, what’s the real difference between ‘Discover’ and ‘Recover’ ? The answer is the relative value of the information itself.

Our ‘Recover’ searches are for content that you could largely refer to as being a commodity. It can be found in many places simultaneously, the battle is to make sure that you can furnish the visitor with it as painlessly as possible.

With ‘Discover’, the depth of the user need can be considered greater and the number of possible content items that match it is considerably less. The user isn’t even sure that the content even really exists, it is a requirement that is based on research. The nuances of the content and what makes them appropriate to the requirements is well beyond that of simple ‘entity’ matching.

During the discussion on ‘Recover’ we touched on the issue of ‘Aboutness’, that being the understanding at a document level of what was important about an individual piece of content. In that mode we were really concerned with making sure that the important ‘entities’ – People, Places and Organisations – were available as metadata via appropriate tagging. Now we’re thinking about ‘Discover’, a much more subtle and complex process, this becomes even more appropriate.

Now we’ve got to think not just about the entities, but also about the conceptual data within the document. Conceptual data ? Ok, let me explain.

‘Robin Day was one of the most respected British political commentators of the 20th Century’

‘Robin Day was one of the most respected British furniture designers of the 20th Century’

Both these sentences would match my basic ‘Robin Day’ search. One is relevant to my ‘Discover’, one is not. How do we disambiguate the relevant from the irrelevant ? Well, we can’t do that via the search term itself because it lacks context, the intention of what it is that I want.

In my earlier eBay example, I used their categorisation system to disambiguate the results. eBay’s system primarily relies on the seller manually selecting an appropriate category from a multi-layered system (with some automated help available along the way). For the most part, this more-or-less works. The sellers realise that the more accurately they list the item, the more chance they have of matching their thingamabob with the keen thingamabob collector.

Content classification in this way (‘Categorisation’) is one of the two primary tenets of how to meet the challenges of the ‘Discover’ search mode.

What in essence is happening, is that content is being adhered to a Taxonomy – a hierarchical scheme containing various parent-child relationships, bringing some order to large collections of content. This can be performed on a one-to-one or one-to-many basis, for example my Robin Day sofa could both appear in ‘<root>/Antiques/20th Century/Sofa’ and ‘<root>/Home & Garden/Furniture/Living Room/Sofa’, with both being correct.
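As a minimal sketch of that one-to-many filing, assuming nothing more than category paths written as slash-separated strings (the `Taxonomy` class and the second item are invented for the example, and this is not how eBay implements its categories):

```python
from collections import defaultdict


class Taxonomy:
    """Files items under one or more hierarchical category paths."""

    def __init__(self):
        self._items_by_path = defaultdict(set)

    def classify(self, item, *paths):
        """Attach an item to every path given (one-to-many)."""
        for path in paths:
            parts = tuple(path.strip("/").split("/"))
            self._items_by_path[parts].add(item)

    def items_under(self, prefix):
        """Return every item filed at or below the given category."""
        prefix_parts = tuple(prefix.strip("/").split("/"))
        found = set()
        for path, items in self._items_by_path.items():
            if path[:len(prefix_parts)] == prefix_parts:
                found |= items
        return found


taxonomy = Taxonomy()
taxonomy.classify(
    "Robin Day sofa",
    "Antiques/20th Century/Sofa",
    "Home & Garden/Furniture/Living Room/Sofa",
)
taxonomy.classify("Broadcaster memoir", "Books/Biography")

print(taxonomy.items_under("Antiques"))
print(taxonomy.items_under("Home & Garden/Furniture"))
```

The sofa turns up whichever branch the visitor starts browsing from, which is exactly the behaviour I was missing when I had to search the two category sets separately.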

By adding in this categorisation, we’ve added immediate meaning to the content. We know how to file it as being broadly similar to other content similarly classified. Whilst this is a start to creating the ‘Aboutness’ of the content, it doesn’t tell us very much about the detail.

If we return to our discussion on ‘Recover’ we saw how adding entity metadata in the form of tags greatly assisted the findability of the content. This will also be true for us in ‘Discover’ mode, but entity tags alone will not cut the mustard for true content understanding. In the case of our Robin Day sentences above, we’ll be able to add in the entity tag ‘Robin Day’, but that won’t be distinct enough alone to filter out the unwanted content.

It is in fact the conceptual data within the sentence which is the most critical in helping distinguish the differences between them. What do we mean by conceptual data in this case ? Well, at the most basic level there are a number of ‘simple’ (single word) concepts in each.

1. ‘British’ ‘political’ ‘commentator’
2. ‘British’ ‘furniture’ ‘designer’

We can throw away the rest of each sentence and there is enough data in the above simple concepts to be able to distinguish between the two. Whilst that might be enough in this case, consider much larger results sets and these simple concepts whilst valuable might be too generic.

Looking at 2. is the concept ‘furniture’ really descriptive of Robin Day ? He’s not furniture. He is a ‘designer’ for sure, but in isolation ‘designer’ is something that can be accurately applied to many people in many fields of work. However, ‘furniture designer’ is a much more accurate description. This multi-word conceptual data is described as being a ‘complex concept’.

Concept extraction is the other key technology that powers ‘Discover’ mode search solutions.

Let’s go back for the last time to our simplistic search example for ‘Robin Day’. With half a mind to the eBay experience, our search would, to start off with, filter out virtually all of the noise by returning only the documents that contain the entity tag ‘Robin Day’, so no ‘Rockin Robin’ & ‘Robin Williams’ this time.

We would then have a result set that would be made up of documents about both Robin Days. However, using the conceptual data that we have been able to extract – both simple and complex – it would have been possible to have categorised each correctly and most critically automatically. One is a ‘furniture designer’ and the other ‘political commentator’.

You can of course use this in reverse. A search for ‘furniture designer’ would contain Robin Day within the result set. If we add ‘British’ (simple) and ‘20th Century’ (complex) to the mix, then filtering him in or out of the results as required would be simple.
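Put together, the entity filter and the concept filter can be sketched in a few lines. This assumes the tagging has already been done and stored against each document; the document ids, the tag sets and the `search` function are all hypothetical.

```python
# Hypothetical pre-tagged documents: entity tags plus simple/complex concepts.
documents = [
    {"id": "doc-designer", "entities": {"Robin Day"},
     "concepts": {"British", "furniture designer", "20th Century"}},
    {"id": "doc-broadcaster", "entities": {"Robin Day"},
     "concepts": {"British", "political commentator", "20th Century"}},
    {"id": "doc-actor", "entities": {"Robin Williams"},
     "concepts": {"actor"}},
]


def search(docs, entity=None, require=(), exclude=()):
    """Filter by entity tag, then by concepts that must or must not appear."""
    results = []
    for doc in docs:
        if entity is not None and entity not in doc["entities"]:
            continue  # noise: a different 'Robin' altogether
        if not set(require) <= doc["concepts"]:
            continue  # missing one of the required concepts
        if set(exclude) & doc["concepts"]:
            continue  # carries a concept we want filtered out
        results.append(doc["id"])
    return results


print(search(documents, entity="Robin Day"))
print(search(documents, entity="Robin Day", require=["furniture designer"]))
print(search(documents, require=["British", "20th Century"],
             exclude=["political commentator"]))
```

The first search still returns both Robin Days; it is the complex concept that separates them, and the last search shows the same tags working in reverse without naming him at all.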

‘Discover’ might be a different search mode, but the key to being able to meet it is again a good understanding of your content. Its essence. Its ‘Aboutness’.

In the ‘Recover’ discussion, we touched a little upon the difficulties of manual tagging, in that case just for entity data. To repeat the three reasons we established again;

1. Manual tagging is very time consuming.
2. Humans are sporadically brilliant, but hugely inconsistent.
3. Humans superimpose the current context to content.

If they were true for entity data in isolation, add in the overheads for categorisation and concept extraction, then for anyone publishing content in anything other than minute quantities, automated tagging – Text Mining – becomes an essential consideration.

For example, using Nstein’s Text Mining Engine (TME) – which contains modules for entity extraction, simple/complex concept extraction and classification against taxonomies – allows our publishing customers all over the world to extract meaning from the billions of content items that they own, helping them match what they produce to the ever demanding requirements of users.

Quality content, just like quality furniture, will always be in demand.

Matt Mullen is an Industry Consultant at Nstein Technologies [http://www.nstein.com].





Search – Aiding The Recovery

12 05 2009

In my last post, whilst wittering on about steam trains, wartime decryption machines and long since departed ecommerce sites, I posed myself a fairly big question with reference to search.

Are we at the end of the web’s own ‘Steam Age’ ?

To enlarge that slightly, is it time to reconsider the way in which we use the great ‘industrial machines’ of the early internet age – search engines – to better meet the way in which people want to interact with content ?

I also suggested that there are two broad modes of search – ‘Recover’ and ‘Discover’ – and that the current mass-market search tools (both search engines and embedded site searches) served neither of these modes especially well.

It was one of my periodic trawls through Google Trends that first got me thinking about the ‘Recover’ search mode. As I mentioned in the last post, if you look at the statistics right now, there are three things that stand out for me.

- The vast majority of searches are short and are obvious attempts to find existing websites (or information about/within them).

It does tend to give you the impression that there are vast numbers of people globally repeatedly typing ‘facebook’ in to Google every few seconds in a vain attempt to be understood (whilst weeping gently into their keyboards).

Remember though, what Google is stating here is search terms not search phrases. Drill down a bit more and you can see the co-occurrence with these domain searches (e.g. facebook login uk) and the picture becomes a bit clearer. That picture being….

- Users don’t trust site searches

As soon as people worked out that using the string ‘site:[domain] [query]’ filtered Google results against that site only, use of common (and generally, poorly performing) site searches could be ignored. Which certainly transformed trying to find things in Microsoft’s support database for one thing. Now most users of Google don’t know or care for that specific syntax, but I’ll bet that well-known domain names trigger a similar behaviour within the Google ranking of results when someone performs that ‘facebook login uk’ search.

Filter the results to ‘News’ and the results are full of big news content brands, especially newspapers.

- The vast majority of searches are for entities.

By entities, I mean they are for definable People, Places or Organisations. Even the aforementioned website searches are in themselves entity-based searches (Facebook is an Organisation entity, as would be The Times). If we use the Google Insight toolset to filter the results down using ‘Entertainment / Rising Searches’, we’ll see mainly people (Kelly Brook, Cheryl Cole) and organisations (Girls Aloud).

Ok, referring to ‘Girls Aloud’ as an Organisation does make them seem a little more relevant than they probably are – I mean they’re not exactly The United Nations – but again, they are definable, with definable attributes (e.g. members, singles, albums etc).

These are of course not all the straightforward observations you can make from trawling through Google Trends / Insight (and I suggest that if you ever have a few spare minutes, then it’s a good reality check for those of us increasingly divorced from the user community ‘on the ground’).

However, using these three as a starting point and with our ‘Recover’ mode in mind, then it does present an opportunity to produce something that would seem to meet the search mode that many out there are trying to use Google to achieve.

Extrapolating these three factors in my own peculiar way, I would say that this is inalienably true;

Users don’t want to search. They want to find.

Now, I bet a few of you snorted there for a moment. Find. Pah ! There I was reading something potentially useful and now I’ve woken up in a marketing meeting. So, sorry, let me clarify a bit.

Users don’t want to search. They want us to pre-find things of interest for them (because we’re supposed to be the experts on our own content after all).

The bit in brackets was meant to sting a bit. Did it ? It should. But the solution itself should be simple. Users are telling us what they are predominantly looking for – data about entities – and we have tons of content that contains those selfsame entities. So why do we make it so damn hard for people to find it ?

To answer that we have to face a cold hard fact. We have content. We know what is generically in that content. However, specifically, we do not understand at a document level what our own content is really about.

It is that lack of knowledge of that ‘aboutness’ that separates us from being able to match our users to our content. Instead, we rely on the ability of traditional search technologies to be able to do this for us and by the looks of things, our users have voted with their feet and are instead trying to massage Google into doing the job for us. And Google doesn’t care where it sends them. To you or to your competitors.

Now some of you will be thinking ‘it’s ok, we tag our content with appropriate terms’. That’s great. It’s an important start. If you’re a blogger or a small-scale publisher and you’re doing comprehensive manual tagging, you can go and make a cup of tea or something. However, if you’re a large publisher, pull your chair a little closer for a few moments. This is important.

The trouble with manual tagging is that, for all its good intentions, it is flawed in three major ways when you try to scale it out, especially for big publishers who might post hundreds of articles, 24 hours a day.

1. Manual tagging is very time consuming.

This means enslaving content creators (e.g. journalists) to manually typing terms or paging through endless drop-down menus to select appropriate categories.

2. Humans are sporadically brilliant, but hugely inconsistent.

I’m great at 8am when I’m full of caffeine. I’m less good when it’s 7pm and I’m starting to flag. Give me the same job at both times and I guarantee I’ll give you a different quality of result.

3. Humans superimpose the current context onto content.

What seems relevant to the story when we tag – because we are vested in what it means today as news/new content – will always take priority when we manually apply our opinion of what is appropriate. However, this content in 12 months’ time might be valuable for a totally different reason; something buried in paragraph 5 when it is first published might be the real long term value. And the chances of us realising that on day 1 are ?

Now, the whole area of generating these tags automatically – often referred to as ‘Text Mining’ – is something which I’ll cover in much more detail in the next article, but bear those tagging factors in mind as we plough onwards.

So we have content, we have some form of appropriate tagging which in some way describes what that content is about and we have throngs of web users trying to find it. This is potentially a great position to be in, there is demand and we have the supply to satisfy it.

However, as we’ve established, our over-reliance on traditional textbox site search as the ‘cure all’ has tended to corral users back towards Google and off to wherever they decide to send them. Indeed, this is not helped by the prevalence of the ‘Google Toolbar’ and the embedded Google searchbox in browsers like Firefox; site design trends tend to place internal site search precariously on the top right hand side of the page, mere millimetres from the boys from Mountain View.

Instead, there is an increasing trend towards the adoption of what we could call ‘Topic Pages’. There’s nothing especially new about the principle at work here; a predictable and preferably SEO friendly and bookmarkable base URL which when called contains all the content about a certain subject. It’s how many sites have always historically organised their navigation in any case.

That model for producing ‘Topic Pages’ relies on the topic itself actually existing within the navigation system, which if you’re trying to fulfil the demand for a huge range of entity results – in a world where previously unknown people can generate huge traffic within hours of their first appearance (like the recent Susan Boyle phenomenon) – quickly becomes an unworkable task.

This is where our tagging starts to bear its fruit.

If we’ve done a good job with ensuring that all the entities contained within the content are present as distinct tags, we can use this information in a couple of ways. First of all, we can create ‘inline links’ within article text which match up with these tags. This can of course be done manually if you’re only dealing with small amounts of content, but in all likelihood this is something that you’ll want to do via your web content management system and a bit of templating/regex magic. This is the real shortcut towards a ‘Wikipedia-esque’ array of richness of content cross-linking. They have 75,000 volunteers doing their manual linking for them. We’ll have to be a little smarter to make up for our lack of manpower.
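To make that ‘regex magic’ a little less hand-wavy, here is a minimal sketch in Python of what the templating step might do. Everything in it is an assumption for illustration: the function names, the ‘mysite.com’ base URL, the hyphenated URL convention and the made-up sample sentence. It also assumes the article body is plain text with no existing markup, and it only links the first mention of each entity.

```python
import re


def slugify(entity):
    # Assumed URL convention: lower case, runs of anything else become a hyphen.
    return re.sub(r"[^a-z0-9]+", "-", entity.lower()).strip("-")


def link_entities(text, tags, base_url="https://mysite.com/"):
    """Link the first mention of each tagged entity to its topic page URL."""
    if not tags:
        return text
    # Longest tags first, so 'Girls Aloud' is preferred over a shorter tag like 'Girls'.
    ordered = sorted(tags, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")
    linked = set()

    def replace(match):
        entity = match.group(0)
        if entity in linked:
            return entity  # only link the first mention
        linked.add(entity)
        return '<a href="{}{}">{}</a>'.format(base_url, slugify(entity), entity)

    return pattern.sub(replace, text)


# Made-up sample sentence, purely to show the effect.
article = "Cheryl Cole sings with Girls Aloud. Girls Aloud have released several albums."
print(link_entities(article, ["Girls Aloud", "Cheryl Cole"]))
```

A real content management system would do this at publish time inside its own templating layer, and would need to cope with markup, aliases and misspellings. The point is only that once the tags exist, the linking itself is cheap.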

So what do our links contain ? Well, we’ve already worked out that if we’re going to make this work that we’re not going to be maintaining the multiplicity of actual pre-formatted pages for every possible entity. What we’re going to take advantage of is our previously much maligned site search. How successful we’re going to be in achieving our aim of a ‘SEO friendly and bookmarkable base URL’ will depend on how flexible (or clever) we can be with the search itself.

First up, it is important that we can point our search engine to read the tags only when performing a search on the content. These are the ‘controlled terms’ that we want it to use, if it tries to use anything else to deduce the result, then our tagging work will be largely in vain.
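As a rough illustration of why this matters, the Python sketch below compares a search that reads only the tags with one that reads the body text. The two articles, their tags and both function names are invented for the example; a real search engine would expose this as a field restriction in its own query syntax.

```python
# Invented sample data: two articles with controlled tags and free body text.
ARTICLES = [
    {"title": "Saints sign new striker",
     "tags": {"Southampton Football Club"},
     "body": "The Saints completed the signing this morning."},
    {"title": "City council backs youth football clubs",
     "tags": {"Southampton City Council"},
     "body": "Football clubs across Southampton will share the funding."},
]


def search_by_tag(articles, entity):
    """Return only the articles explicitly tagged with the entity."""
    wanted = entity.casefold()
    return [a for a in articles if wanted in {t.casefold() for t in a["tags"]}]


def search_full_text(articles, query):
    """The traditional approach, for contrast: match any query word in the body."""
    words = query.casefold().split()
    return [a for a in articles if any(w in a["body"].casefold() for w in words)]


print([a["title"] for a in search_by_tag(ARTICLES, "Southampton Football Club")])
print([a["title"] for a in search_full_text(ARTICLES, "Southampton Football Club")])
```

With this sample data the tag search returns the story that is actually about the club, while the word-matching search returns the council story and misses the club story altogether, because its body never uses the words in the query.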

Now, we could make our link of the format ‘mysite.com/search/q=[entity]’ (where q=[entity] is whatever format your internal search engine requires to perform a search). This is neat in one sense – the link goes right to the search results and it is easy to auto-generate in a page – but it is hardly especially intuitive or SEO friendly.

Our inline tag could be something like ‘mysite.com/[entity]’, which can be just as easily auto-generated when we publish a page. There will of course need to be some internal standards for URL encoding with things like spaces and special characters, but let’s leave that for the proper tecky folks to work out. Now, of course, this page does not formally exist but using something like ‘URL Rewrite’ in Apache (or subverting the 404 – page not found – system in IIS) we can do something nifty.
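For the tecky folks, one possible internal standard might look like the Python sketch below. The convention itself (lower case, accents stripped, everything else collapsed to hyphens) and both function names are assumptions, not a recommendation of the one true format. Because a slug like this throws information away, the sketch also keeps a lookup from slug back to the original tag and complains if two tags collide.

```python
import re
import unicodedata


def entity_to_slug(entity):
    """Turn a tag such as 'Susan Boyle' into a URL-safe path segment."""
    # Decompose accented letters, then drop anything that is not plain ASCII.
    plain = unicodedata.normalize("NFKD", entity).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", plain.lower()).strip("-")


def build_slug_index(tags):
    """Map each slug back to its tag, since the slug itself loses information."""
    index = {}
    for tag in tags:
        slug = entity_to_slug(tag)
        if slug in index and index[slug] != tag:
            raise ValueError(
                "tags {!r} and {!r} share the slug {!r}".format(index[slug], tag, slug)
            )
        index[slug] = tag
    return index


print(entity_to_slug("Susan Boyle"))
print(build_slug_index(["Kelly Brook", "Cheryl Cole", "The Times"]))
```

Whatever convention you choose, the important part is that the publishing side and the URL-handling side use exactly the same one.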

By deconstructing the requested URL from the HTTP header – and remember we know the format that these non-existent pages are going to take, ‘/[entity]’ – we can place a search results page from our internal search engine on that partial URL. So, as far as the web visitor is concerned, it is a page called ‘mysite.com/[entity]’ which contains only content (sorted by recency – most recent first) on that specific Person, Place or Organisation.
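Stripped of the web server plumbing, the logic behind that rewritten URL is small. The Python sketch below is one way it could look, assuming a hyphenated URL convention and an in-memory list standing in for the internal search engine. The function names, sample articles and dates are all invented for the example.

```python
import re
from datetime import date

# Invented sample data standing in for the internal search index.
ARTICLES = [
    {"title": "Audition clip goes viral", "published": date(2009, 4, 14), "tags": ["Susan Boyle"]},
    {"title": "Final result announced", "published": date(2009, 5, 30), "tags": ["Susan Boyle"]},
    {"title": "Transfer round-up", "published": date(2009, 5, 1), "tags": ["Southampton Football Club"]},
]


def slugify(name):
    # Assumed URL convention: lower case, runs of anything else become a hyphen.
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def topic_page(path, articles):
    """Resolve '/[entity]' to its tagged articles, most recent first.

    Returns None when nothing carries that tag, so the caller can fall
    back to a genuine 'page not found'.
    """
    slug = path.strip("/")
    if not slug or "/" in slug:
        return None  # not a URL of the '/[entity]' form
    matches = [a for a in articles if slug in {slugify(t) for t in a["tags"]}]
    if not matches:
        return None
    return sorted(matches, key=lambda a: a["published"], reverse=True)


page = topic_page("/susan-boyle", ARTICLES)
print([a["title"] for a in page])
```

In a real deployment the rewrite rule (or the hijacked 404 handler) would pass the requested path to something like `topic_page`, and the results would be rendered through the normal section template rather than printed.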

Selecting a page within this result set will of course display a page similarly rich in inline tags which all follow the same methodology as above. When a user selects a result, what they are saying to us is that ‘there is something about this content that matches my interests’. By adding the link data, we’re making an attempt to predict some of the related interest that that user might have and give them the click path to find them. Not search, but find.

There are some design considerations that need to be made. Whilst we know this is actually a set of search results, we want this to look as if it is a normal section page. In my day job at Nstein, we implemented a solution along these lines in the demo system for our own WCM product last year (with auto tagging via TME) and used the section template to display the results. Comparing a ‘pre-canned’ navigation section page to one produced dynamically as a ‘Topic Page’ and only knowledge of the formal navigation structure gives away which is which.

Taking a step back, you can see some really interesting use cases, especially when you factor in the possibility of automated tagging. Breaking news story on a Sunday afternoon ? If you have the content, you have the ‘Topic Page’ ready for visitors.

So, the summary of this is that the ‘Topic Page’ is really a search abstraction. Technologically, it is not particularly difficult to experiment with or fully implement. However it is entirely reliant on you having a full understanding and associated tagging of all your content.

Which neatly takes us to our next discussion. ‘Discover’.

Matt Mullen is an Industry Consultant at Nstein Technologies [http://www.nstein.com].





Search – The End Of The Steam Age ?

8 05 2009

There is a trainline that runs just behind my house. It’s been largely redundant since the 1950s, but a few times a year above the hum of the city traffic you can hear the unmistakable sound of a steam whistle.

The slightly overgrown cuttings disguise what was at one time a hugely important arterial rail route from London to the Ocean Liners that served the major ports of the world. A few hundred yards from my front door lies what was once home to the most famous ships ever constructed and it was the owner of that whistle that brought the great and the good to the port to head to the US, Canada, Australia, South Africa, India and beyond.

These days, the whistle doesn’t herald world travelers, rather rail enthusiasts reliving the golden age of the steam engine, helping to preserve for future generations what was the predominant technology of its age. The technology that once drove Britain, the ‘Great Drive West’ in North America and brought Europe closer together.

Steam engines today are hugely impressive feats of engineering even to me, someone born into the transistor age and now working in the digital age. The heavy-duty precision of the external running gear, the brightly painted boiler housings safely containing unbelievable internal pressures and the tremendous volume of noise as they roar past. I can well understand why childhood encounters with these machines can evoke such memories for my parents’ generation.

So in the thrall of these machines were people, that they saw them as the ultimate means to an end. Passenger numbers increase ? More carriages. The extra weight ? More powerful engines. Need to decrease journey time ? More powerful engines. So much was emotionally invested in these machines, their peerless engineering and undoubted brutal, raw beauty, that their end seemed to come about in a sudden shock.

Looking back with the benefit of hindsight, it’s easy to deconstruct the situation. The technology was failing to address the problem. Steam, with all its romance, history and undoubted engineering brilliance, was failing to perform its basic function; getting people to where they wanted to go, on time and on budget. Britain’s network in particular was beset with over-redundancy and duplication of routes, lack of forethought in its innovation and rock bottom customer satisfaction (of course, long before such a term was first coined).

What happened next is still the source of a certain amount of controversy here in the UK. There’s plenty to read on the subject, but what is unarguable is that within a few short years, steam engines were consigned to the scrap heap, both metaphorically and literally.

This year saw a rebirth. In fact a couple of rebirths. In parallel, whilst one group of talented enthusiasts put the finishing touches to the first steam engined train built since the second world war, another commissioned an equally outdated but hugely historically significant machine, a working replica of Turing’s Bombe.

It got me to thinking, in 50 years time, what from my time in the online industry to date will enthusiastic teams of geeks be building in historic celebration ?

- A fully working replica of ‘boo.com’ (therefore doing something that proved beyond the original development team) ?

- A historic recreation of online retailer Blackstar’s site the day that the original remastered VHS box set of Star Wars Trilogy was released ?

- Millions of self-generating Geocities pages about pets littered with blinking marquee text ?

My guess is a site search. Probably one from a content heavy site like a newspaper. One, like so many out there, that huffs and puffs but doesn’t get you where you want to go especially quickly, if at all.

Those born into this later age will wonder at this re-created elderly technology….

‘I typed something simple into this textbox and got like a million results, but none of them seem to be relevant….’

‘When I searched for Southampton Football Club, I got loads of results about Southampton and football and clubs, but virtually none actually about The Saints…’

‘If I try something like football, how do I drill into those results to find more information ? I can’t ? Oh….’

…..in short those solutions that are widely deployed today are very far from something really suitable for finding information. Whether you know what you are looking for or not.

This brings me on to an important distinction. That between ‘Discovery’ and ‘Recovery’.

Back in the day, all web searches were ‘Discovery’. You shouted into the ether and if you were lucky, you got something back. Quite quickly you learned to use basic query construction and built Boolean expressions to filter out the ‘noise’ from your result set. Of course the level of noise in those days was relatively low; you could work your way through the top results by rote and pick up the knowledge you required. For almost every search, the results were uncertain. You never really knew if it was going to be out there.

Look at Google Trends today – as I do from time to time – and you’ll quickly see that in pretty much every top result, the queries being sent are based upon ‘Recovery’. By this I mean that people are looking for shortcuts via Google to information that they know is out there, but don’t have to hand. The vast majority of these are actually looking for distinct people, places or organisations/groupings; elements that we could loosely collect together and call entities. These entities are distinct and definable.

People are not searching for them because they wonder whether they exist or not. They are searching for them to actually take the next step, learning something new about them. In that respect, the act of typing them into Google is somewhat redundant. And more critically, the next step is evidently not defined by any loyalty to any specific information source. The information itself and whoever provides it is a commodity. No wonder so much is invested in the apparent silver bullet that is SEO.

If Google Trends is to be believed, then ‘Discovery’ is dead. It is an outmoded method of engaging with information. Do we really believe that ? Or is it that Google has become so unsuitable as a method of discovering information that people have switched mode ?

Indeed, if ‘Recovery’ is the future mode of search, then is the traditional ‘call & response’ method of Google et al (with the aforementioned superfluous click path) similarly a technology solution far past meeting user requirements ?

Are we at the end of the web’s own ‘Steam Age’ ?

Matt Mullen is an Industry Consultant at Nstein Technologies [http://www.nstein.com].