If we forget most of it immediately, it's not wiretapping either

Phorm launches data pimping fight back | The Register:

And then from the information returned by the website, the profiler looks at the content. The first thing it does is it ignores several classes of information that could potentially be sensitive. So there’s no form fields, no numbers, no email addresses (that is something containing an “@”) and anything containing a title like Mr or Mrs.

Aren’t you collecting the first three characters?

MB: Because of a peculiarity of the tokenisation, numbers three digits or shorter aren’t collected anyway, they’re too short so there’s no numbers at all. If you have a mixture of letters and numbers – a compound – that would be potentially collected.

Say, for example, the start of postcode?

MB: Yes…

KE: But as you’ll see it’s irrelevant anyway.

MB: So we do this basic cleaning process and then we take a look at the key words that have come from the page and we eliminate “noise words” that have a low intrinsic meaning. So what we’re left with is a clean version of the key words in the page which we then basically do a chart of the ten most commonly occurring words. This process has the effect of largely eliminating personally identifiable information [PII] from the web page because it would have to contain PII that didn’t match any of our criteria and also appeared repeatedly in the page. The profiler takes this “data digest” and it passes it through the box we call the anonymiser and into the box we call the channel server. The channel server has got a database of advertising categories that we call channels – things like sport, health and beauty, travel, luxury cars, etc. The channels are global to the whole system [across ISP networks]. Via the Open Internet Exchange advertisers are able to specify the channels they want to target. The channels are controlled in the content they can have. We don’t have adult advertising, no medical channel, no tobacco, no gambling. The channels are also designed so they always match a minimum number of unique users – 5,000. A channel has to be sufficiently broad so that it doesn’t just reduce to one or two users. As soon as that match has been made the data digest, which has only ever been in memory, is immediately deleted. It never goes to disk.
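
To make the described cleaning process concrete, here is a rough Python sketch of how I read it. This is my guess at the steps, not Phorm's code: the noise-word list, the four-character minimum and the name data_digest are all assumptions of mine, and form-field stripping is left out because it would need the page's HTML.

```python
import re
from collections import Counter

# Hypothetical noise-word list; the interview does not say which words are dropped.
NOISE_WORDS = {"the", "and", "for", "that", "with", "this", "from", "have"}
# The interview names only "Mr or Mrs" as example titles.
TITLES = {"mr", "mrs"}
# Assumption: tokens of three characters or fewer are "too short" to be collected.
MIN_TOKEN_LENGTH = 4


def data_digest(page_text, top_n=10):
    """Return up to top_n most frequent words left after the cleaning steps."""
    kept = []
    for chunk in page_text.split():
        if "@" in chunk:
            continue  # "something containing an @" is treated as an email address
        for token in re.findall(r"[a-z0-9]+", chunk.lower()):
            if token.isdigit():
                continue  # pure numbers are ignored
            if len(token) < MIN_TOKEN_LENGTH:
                continue  # short tokens fall out of the tokenisation
            if token in TITLES or token in NOISE_WORDS:
                continue
            kept.append(token)  # letter/number compounds survive to this point
    return [word for word, _ in Counter(kept).most_common(top_n)]


sample = (
    "Mr Smith of SW1A 2AA likes cycling. Email smith@example.com or call "
    "02079460000. Cycling gear, cycling routes, Smith cycling club SW1A."
)
print(data_digest(sample))
```

On this made-up sample the email address and phone number are dropped, but the surname and the start of the postcode both survive into the digest, simply because they appear more than once.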

What’s scary is that I almost believe this guy’s claim that he is not trying to run a massive privacy invasion. But he really doesn’t seem to understand what personally identifiable information is: even if I clean out your email address, phone number and postal code, there will be piles of information left over from which you can be identified. And even if a “channel” has 5,000 people passing through it over time, I don’t believe the claim that the instantaneous population won’t get down to single digits every now and then.

So it’s down to “trust us” and “trust no one will hack our incredibly-high-value-target machines”.


One Response to "If we forget most of it immediately, it's not wiretapping either"

  1. Phorm UK tech team Says:

    Hi there
    I work for Phorm’s UK team and there is a full transcript of last night’s live interview with its CEO at http://www.webwise.com/chat

