miscoranda: by Sean B. Palmer

User-Agent Abuse

According to RFC 2616, the User-Agent header is a statistical datapoint and a capability hint, allowing the receiving site to serve pages based on what the client is known to be able to receive: "This [header] is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations." So if the limitations of your user agent change, it is reasonable to change the User-Agent field that you send to match.

With this in mind, I often set my User-Agent header to "Mozilla/5.0 (Something)" when I'm using wget, curl, or urllib in Python, but I'm often told that this is a bad thing, even an abuse of the header. That's absurd; the abuse is usually on the server side, not the client side. I fake the User-Agent because many sites don't allow downloads via curl or wget; two that spring immediately to mind are google.com and f2o.org. These sites have a legitimate practical reason to do so: presumably a high percentage of the hits they receive from these user agents come from crawlers and bots. With Google especially, this is going to cost them a lot of money, so blocking is prudent.
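
For the urllib case, setting the header looks roughly like the sketch below. It is only an illustration: the helper name build_request and the example.org URL are placeholders of mine, and the header value is the generic one quoted above. Nothing is sent over the network here; the request is only constructed.

```python
import urllib.request

# The generic value quoted in the text above, not a real browser's string.
USER_AGENT = "Mozilla/5.0 (Something)"


def build_request(url, user_agent=USER_AGENT):
    """Build a Request carrying the given User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# example.org is a placeholder URL; no request is made at this point.
req = build_request("http://example.org/")

# urllib stores header names with only the first letter capitalised,
# so the lookup key is "User-agent" rather than "User-Agent".
print(req.get_header("User-agent"))

# To actually fetch the page, pass the request to urlopen:
#     urllib.request.urlopen(req).read()
```

The command-line equivalents are curl's -A option and wget's -U option, each of which takes the User-Agent string as its argument.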

But bots should adhere to robots.txt, and I'll bet that a significant portion of the requests that sites banning curl and wget receive from those clients are legitimate. Their filtering is, therefore, a technical solution to a societal problem. It's a bit like banning Firefox on a framed site because Firefox can display the content unframed. So whilst I realise that banning these clients on the server side is something that pragmatically just has to be done, a hack to save a lot of money and bandwidth, it's still an abuse of the User-Agent header, and it's taking place on the server. Getting around that by faking the User-Agent header on the client side is an abuse by neither morals nor specification, as long as the client is being used legitimately.
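
To make "adhering to robots.txt" concrete, here is a minimal sketch of the check a well-behaved client could run before fetching, using Python's standard urllib.robotparser. The robots.txt rules, the MyBot/1.0 name, the example.org URLs, and the helper name allowed are all invented for the example; a real client would download the site's own robots.txt instead.

```python
import urllib.robotparser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# MyBot/1.0 and the URLs are placeholders.
print(allowed("MyBot/1.0", "http://example.org/public/page"))
print(allowed("MyBot/1.0", "http://example.org/private/page"))
```

Under these made-up rules the first URL is permitted and the second is not. Note that the check happens entirely on the client, which is the point: robots.txt only works when the bot chooses to honour it.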

(Tip of the hat to John Cowan.)

by Sean B. Palmer, at 2006-02-20 20:53:25.
