

I’ve been thinking a lot about data recently. I don’t mean only in the IBM Smarter Planet sense, and I don’t mean only volume. Instead, I’ve been thinking about how one works with data, given that the number of signals for any given project seems to be increasing exponentially.
My thesis here is not simply that “too much data is bad.” Nor am I saying “data is the only real currency we have today.” I believe that intelligent signal-gathering, coupled with a robust data set, produces streamlined and actionable results. I intend to illustrate this with three examples: foursquare’s recent work to battle check-in fraud; a Google advertising product, Conversion Optimizer; and the metadata in a Tweet.
About two weeks ago, foursquare announced that they would be kicking up their cheating detection. They said:
What we’d like to do is award points, mayorships and badges only when you’re at the place you say you’re at. Last week we started using a few different tricks using your phone’s GPS to try to verify this. (and if your phone doesn’t use GPS, we use a few different tricks)
But just a day later, they had to issue a second blog post, clarifying their new policy and the changes they were making. People didn’t understand how fraud would be determined (and they wanted to know more). Based on the second post, it seems clear that people wanted to know specifically: in the absence of precise GPS data, how would the foursquare team determine what are legitimate check-ins and what are not? Foursquare’s response was that they’re constantly tweaking the system: when one checks in via a means sans GPS — say, via SMS — the team would use a variety of signals to determine whether you were being honest or not; the phrase “history, frequency, etc.” was the most commonly cited technique. But here, with a dearth of data, foursquare found itself looking for extra information, for more layered data about the check-in (and the user) than is currently available. They acknowledge the system isn’t perfect, but that’s the best they can do.
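To make the “history, frequency, etc.” idea concrete, here is a minimal sketch of how a check-in might be judged with and without GPS. To be clear, this is my own illustration and not foursquare’s actual method: the `CheckIn` fields, the `looks_legitimate` function and both thresholds are assumptions I made up for the example.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import Optional, Tuple

LatLon = Tuple[float, float]


@dataclass
class CheckIn:
    # All fields are hypothetical; they are not foursquare's data model.
    venue_location: LatLon
    device_location: Optional[LatLon]  # None when there is no GPS (e.g. SMS)
    prior_visits_to_venue: int         # the "history" signal
    checkins_last_hour: int            # the "frequency" signal


def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def looks_legitimate(c: CheckIn, max_km: float = 0.5, max_per_hour: int = 5) -> bool:
    """Guess whether a check-in is honest. Thresholds are arbitrary assumptions."""
    if c.device_location is not None:
        # Rich signal: compare where the phone is with where the venue is.
        return haversine_km(c.venue_location, c.device_location) <= max_km
    # Dearth of data: fall back on the weaker history and frequency signals.
    return c.prior_visits_to_venue > 0 and c.checkins_last_hour <= max_per_hour
```

The point of the sketch is the fallback branch: without GPS, the decision rests on much thinner evidence, which is why the system can only ever be “the best they can do.”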
On the flip side, I’ve recently been hearing about colleagues who have campaigns opted in to Google’s Conversion Optimizer. I’m told the system is good, but it’s too good: it is rules-based, and it does everything it can to stick to the conversion metric it has been given, often quite strictly. The tool, as smart as it is, does what humans can’t: it stands firm on the borderline cases. But the strictness that is its strength is also the rigidity that is its downfall. Indeed, Conversion Optimizer has a virtually unlimited number of signals to process: there are so many moving parts, so many potential levers to pull, that even the system must, at some point, start to eliminate otherwise valuable information. It is in many ways a corollary to the Paradox of Choice: fewer choices and less information would make the algorithm’s job easier.*
Finally, ReadWriteWeb recently posted a fascinating image, via Raffi Krikorian, a dev on Twitter’s API/Platform team. It breaks out the anatomy of a Tweet: it shows the full metadata embedded right along with those 140 characters you input on your keyboard. I’ve read at least one opinion stating that Annotations could be trouble from a metadata perspective (for many of the standards-based reasons scholars raise in objecting to Google Books). But this outline gives a glimpse into perhaps the best way to approach this wonderful problem space of data volume: how best to structure data for organizing, sorting and extracting meaning or trends. It is important that, at least at the beginning, Twitter is bundling this as structured data.
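As a rough illustration of why structured metadata matters, here is a small sketch that pulls a few signals out of a Tweet-like JSON payload. The payload is a hand-written, heavily abbreviated stand-in rather than real API output: the field names follow the ones I understand Twitter’s JSON to use, all the values are invented, and the `summarize` helper is my own.

```python
import json

# Invented, abbreviated sample; a real Tweet carries far more metadata.
SAMPLE_TWEET = """
{
  "id": 1,
  "text": "Thinking about data.",
  "created_at": "Mon Apr 19 12:00:00 +0000 2010",
  "in_reply_to_status_id": null,
  "coordinates": null,
  "user": {"screen_name": "example_user", "followers_count": 42}
}
"""


def summarize(raw: str) -> dict:
    """Extract a handful of signals from a Tweet-like JSON string."""
    tweet = json.loads(raw)
    user = tweet.get("user", {})
    return {
        "author": user.get("screen_name"),
        "length": len(tweet.get("text", "")),
        "is_reply": tweet.get("in_reply_to_status_id") is not None,
        "has_location": tweet.get("coordinates") is not None,
    }


summary = summarize(SAMPLE_TWEET)
print(summary)
```

Because the data arrives already structured, sorting or filtering by author, reply status or location takes a dictionary lookup rather than guesswork over free text.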
Ultimately, I see this scenario as the Goldilocks reality: too little, too much, and just right. The “quality” (i.e. the source) of data is about as important as the volume. We have the ability to solicit and accept more than just a flood of information: today, we can receive a discrete number of particular signals, each with intervals and frequencies of its own. This is how data (and those who examine and tease out meaning from data) will succeed in the years to come.
*As usual, I want to state up-front that my comments about Google, as well as its products, employees and business partners, have neither been sanctioned nor screened by Google. I also want to make clear that I am writing anecdotally here: I am not even aware of the clients in the examples discussed herein.
Tags: 4sq, APIs, cheating, check-in, conversion optimizer, Foursquare, goldilocks, GPS, IBM, metadata, paradox of choice, SMS, tweet, Twitter





