Wednesday, November 11th, 2009
How to cut your site's bandwidth consumption in half
A website is, like many things in life, simple to create but difficult to maintain. To avoid unpleasant surprises and make sure that everything runs reasonably well, it is advisable to regularly monitor some basic parameters, such as pages served, bandwidth consumed, the activity of web spiders (e.g. Google's) and so on.
A tool I have found reliable enough for keeping track of some of these parameters is Awstats. It is a script that analyzes our website's server log files and generates a series of pages with summary tables and graphs for a large number of parameters. You can see it in operation in this demo. To gild the lily, you can also install Jawstats, a frontend that takes the data collected by Awstats and presents it in a much more dynamic and pleasant web interface. There is also a Jawstats demo available.

Jawstats attractively displays information about our website
What I want to tell you today is how, starting from Awstats data, I managed to reduce the bandwidth consumed by Tungsten PDA by a staggering 66%. Reviewing the "spiders" tab, I found that one of them was consuming hundreds of times more bandwidth than the others, with a daily consumption of nearly 2 GB. That means 60 GB per month, which can ruin more than one webmaster whose site is hosted with a provider that charges according to the bandwidth consumed.
Furthermore, the spider is listed as "no_user_agent". Checking Awstats this time, I get a somewhat clearer description of this spider: "Unknown robot (identified by empty user agent string)". It is clear that the spider identifies itself with an empty string. Reviewing the log files on my hosting (specifically the Apache access.log), I see that it corresponds to entries like this:
XXXX - - [01/Nov/2009:04:53:02 -0800] "GET /wp-content/imagenes/bluetooth-carwhisperer.jpg HTTP/1.0" 200 28567 "-" "-"
By contrast, an entry for a normal spider, one that does identify itself in the user agent, looks like this:
XXXX - - [01/Nov/2009:01:33:54 -0700] "GET /de/2006/05/13/pagina-interesante-acerca-de-la-palm-tx/ HTTP/1.1" 200 16060 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
Searching the net, I saw that this "anonymous" spider was creating problems for many other webmasters as well, and that in principle it does not provide any service, at least not a legitimate one. So what I had to do was reject requests of this kind. If we use Apache as our web server, all we need to do is edit the .htaccess file in the root folder of our website and add the following lines:
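# Unknown robot (identified by empty user agent string)
# mod_rewrite must be switched on; harmless if this file already does so
RewriteEngine On
# Let HEAD requests, robots.txt and favicon.ico through
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{REQUEST_URI} !^.*robots\.txt$
RewriteCond %{REQUEST_URI} !/favicon\.ico$
# Match requests that send neither a referer nor a user agent
RewriteCond %{HTTP_REFERER} ^$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^$ [NC]
RewriteCond %{HTTP_REFERER} ^-?$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^-?$ [NC]
# Answer every matching request with 403 Forbidden
RewriteRule .* - [F]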
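The "RewriteCond" lines define the conditions under which the rule applies, and the "RewriteRule" line is the one that actually denies access when all of them are met. The first three conditions exempt some legitimate requests (HEAD requests, robots.txt and favicon.ico), and the last four identify requests made with an empty referer and an empty user agent. Apache writes a missing header as a hyphen in the log, which is why those entries show "-"; since every condition must match at once, the rule fires only when both headers are actually empty.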
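To check which requests the rule would reject before touching a live site, the conditions can be mirrored in a few lines of Python. This is only a sketch of my reading of the rule, not something Apache provides: the function name `would_be_blocked` is made up for the example, and a missing header is assumed to arrive as an empty string.

```python
import re

def would_be_blocked(method, uri, referer, user_agent):
    """Mirror the .htaccess conditions; all of them must hold for a 403.

    Hypothetical helper for illustration. A header that the client did
    not send is passed in as an empty string.
    """
    conditions = [
        re.search(r"^HEAD$", method) is None,        # not a HEAD request
        re.search(r"^.*robots\.txt$", uri) is None,  # not robots.txt
        re.search(r"/favicon\.ico$", uri) is None,   # not favicon.ico
        re.search(r"^$", referer) is not None,       # referer is empty
        re.search(r"^$", user_agent) is not None,    # user agent is empty
        re.search(r"^-?$", referer) is not None,     # referer empty or "-"
        re.search(r"^-?$", user_agent) is not None,  # user agent empty or "-"
    ]
    return all(conditions)

# The anonymous spider: no referer, no user agent.
print(would_be_blocked("GET", "/wp-content/imagenes/bluetooth-carwhisperer.jpg", "", ""))
# A spider that identifies itself.
print(would_be_blocked("GET", "/", "", "Baiduspider+(+http://www.baidu.com/search/spider.htm)"))
# An anonymous request for robots.txt is still allowed.
print(would_be_blocked("GET", "/robots.txt", "", ""))
```

The first call prints True and the other two print False. Because the conditions are combined with AND, a client that sends a literal hyphen as its user agent is not caught by this sketch either.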
The result, after several days of testing, is that this unwanted spider has completely stopped accessing the site, and the daily bandwidth has fallen dramatically, both in the Jawstats figures and in the data provided by my hosting company.
So if your site has been showing unusual bandwidth usage for some months, review the activity of the spiders.
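If Awstats is not at hand, a rough version of the same check can be run directly against the access log. The sketch below assumes the Apache "combined" log format (the one the entries quoted in this post appear to use) and adds up the bytes served per user agent. The names `LOG_PATTERN` and `bytes_by_agent` are my own, and the two sample lines are the ones shown earlier.

```python
import re
from collections import Counter

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def bytes_by_agent(lines):
    """Sum the response sizes per user agent string."""
    totals = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that are not in combined format
        size = match.group("size")
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals

sample = [
    'XXXX - - [01/Nov/2009:04:53:02 -0800] "GET /wp-content/imagenes/bluetooth-carwhisperer.jpg HTTP/1.0" 200 28567 "-" "-"',
    'XXXX - - [01/Nov/2009:01:33:54 -0700] "GET /de/2006/05/13/pagina-interesante-acerca-de-la-palm-tx/ HTTP/1.1" 200 16060 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"',
]

totals = bytes_by_agent(sample)
for agent, total in totals.most_common():
    print(f"{agent!r}: {total} bytes")
```

On a real log, replace `sample` with the lines of the file, for instance `open("access.log")`. An agent shown as '-' at the top of the list is the sign to look for.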
By: Mark Gonzalez Troyes in General

Although what you are doing seems a good idea in principle, in a number of cases it is really not very useful. This is because there are many (badly built) tools and websites that do not send a user agent when accessing your site, and this rule therefore blocks them completely. So you may want to check exactly who you are blocking, and grant access to those you feel need it. The other option is to identify the IPs from which the site was being attacked, and block them directly ...
Indeed. Although this method is effective as a first containment step, now comes the hard part: the analysis, trying to locate the main IPs behind this abuse in order to create a more precise rule, so that the righteous do not pay for the sinners.
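As a starting point for that analysis, here is a minimal sketch that lists the hosts responsible for most of the traffic sent without a user agent. It again assumes the Apache "combined" log format. The function name `top_offenders` is hypothetical, and the sample lines are invented for the example (the addresses come from the range reserved for documentation), not taken from my logs.

```python
import re
from collections import Counter

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def top_offenders(lines, limit=10):
    """Return (host, requests, bytes) for requests without a user agent,
    sorted by bytes served, largest first."""
    requests = Counter()
    traffic = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # not a combined-format line
        if match.group("agent") not in ("", "-"):
            continue  # the client identified itself
        host = match.group("host")
        size = match.group("size")
        requests[host] += 1
        traffic[host] += 0 if size == "-" else int(size)
    return [(host, requests[host], total)
            for host, total in traffic.most_common(limit)]

# Invented sample data, for illustration only.
sample = [
    '192.0.2.10 - - [01/Nov/2009:04:53:02 -0800] "GET /a.jpg HTTP/1.0" 200 28567 "-" "-"',
    '192.0.2.10 - - [01/Nov/2009:04:53:05 -0800] "GET /b.jpg HTTP/1.0" 200 1000 "-" "-"',
    '192.0.2.20 - - [01/Nov/2009:04:54:00 -0800] "GET /c.jpg HTTP/1.0" 200 500 "-" "-"',
    '192.0.2.30 - - [01/Nov/2009:04:55:00 -0800] "GET / HTTP/1.1" 200 16060 "-" "ExampleBot/1.0"',
]

offenders = top_offenders(sample)
for host, count, total in offenders:
    print(f"{host}: {count} requests, {total} bytes")
```

The hosts at the top of the list are the candidates for a narrower, IP-based rule. Whether blocking them is enough depends on how many different addresses the spider uses.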