Thoughts on privacy
Yesterday a scientist from Lüneburg published details about a web crawler he used to collect about 1.6 million records from the social network SchuelerVZ, including each user's name, the name of the school, the school ID and the URL of the profile picture. So, how did he do this? He created over 800 accounts, which he used to bypass the request limits. He and the media called this a “Datenleck” (data leak).
You have to know that all the information he crawled is visible to anyone who has a user account. He didn't bypass any data protection mechanism except the request limit. Is this really a “Datenleck”? I don’t think so. I think it’s a problem of our society rather than of SchuelerVZ. On the one hand we’re all trying to hide our private lives from our neighbours by building fences and so on, but on the other hand we’re publishing everything, really everything, to social networks like Twitter or the aforementioned SchuelerVZ. So why are people complaining that someone copied the data they voluntarily published to the web?
In my opinion you have to accept that someone can copy the data you decide to publish. If a human can read the data, a crawler can do so, too. There’s no effective way of preventing data from being accessed by non-humans. Of course you can use captchas and other techniques, but I think they’re more annoying than helpful. Another method is request limits like those SchuelerVZ already uses. But as you can see, they are easy to bypass. You could also limit the requests per IP, but in the best case that just makes the crawler slower, and in the worst case the crawler spreads over multiple hosts with different IPs.
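To make the point about request limits concrete, here is a minimal sketch in Python. It is not how SchuelerVZ implements its limits (I don't know that); the class PerKeyRateLimiter, the helper served_requests and the figure of 100 requests per hour are all made up for illustration. Only the 800 accounts come from the story above. The "key" can be an account name or an IP address, so the same arithmetic applies to both kinds of limit.

```python
from collections import defaultdict


class PerKeyRateLimiter:
    """Hypothetical fixed-window limiter: at most `limit` requests
    per key (account name or IP address) per time window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        # key -> (window index, requests counted in that window)
        self._counters = defaultdict(lambda: (None, 0))

    def allow(self, key, now):
        """Return True if a request by `key` at time `now` (seconds) is served."""
        window = int(now // self.window_seconds)
        seen_window, count = self._counters[key]
        if seen_window != window:
            count = 0  # a new window started, so the counter resets
        if count >= self.limit:
            return False
        self._counters[key] = (window, count + 1)
        return True


def served_requests(limiter, keys, attempts_per_key, now=0.0):
    """Count the requests that get through when every key sends
    `attempts_per_key` requests inside the same window."""
    return sum(
        limiter.allow(key, now)
        for key in keys
        for _ in range(attempts_per_key)
    )


if __name__ == "__main__":
    # Assumed limit: 100 requests per account per hour (invented number).
    single = served_requests(PerKeyRateLimiter(100, 3600), ["account-0"], 1000)
    accounts = [f"account-{i}" for i in range(800)]
    many = served_requests(PerKeyRateLimiter(100, 3600), accounts, 1000)
    print(single, many)
```

The limiter works exactly as intended for each single key, yet the total number of served requests grows linearly with the number of keys. Under the assumed limit, one account gets 100 pages per hour while 800 accounts get 80,000, which is why a per-account or per-IP limit only slows a determined crawler down.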
Instead of thinking about how data can be hidden from robots, we should think about what data we are publishing to the web. Even if this means extra effort spent on education, the result will be much more satisfying than any existing or upcoming Turing test.




















