[Sussex] Apache throttling/fair use

John Crowhurst fyremoon at fyremoon.net
Sun Jan 23 22:31:48 UTC 2005


> Guys
>
> I'm building a LAMP (Linux, Apache, MySQL and PHP) server system for a
> client, all okay so far (if not as fast as I would like).   One of the
> things we would like the server to do is stop a user of the site from
> using wget (or something like it) to grab each page.  This would be
> a violation of the content copyright.

You can filter out wget by User-Agent, but that doesn't stop the user from
changing its User-Agent. You can stop wget with a robots.txt, unless the
user has recompiled it without robots.txt support. The Referer field is
another way of making sure the pages don't get downloaded, but the user
can work around that too by setting the Referer to your own site.
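If the pages are served through PHP anyway, the same checks can be done in
the script itself. A rough sketch only; "www.example.com" stands in for
the real hostname, and the patterns you block are up to you:

<?php
// Rough sketch: reject requests whose User-Agent looks like wget, or
// whose Referer points somewhere other than our own site.
$ua      = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

if (preg_match('/wget/i', $ua)) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}

// The first visit legitimately has no Referer, so only reject requests
// that carry a Referer from somewhere else.
if ($referer != '' && strpos($referer, 'http://www.example.com/') !== 0) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
?>

As noted above, none of this stops a determined user who fakes both
headers; it just raises the bar.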

You could write an index.php that serves all the pages, so that it is the
only script the user ever hits, and have it blacklist the client's IP if
pages are requested too quickly (as they would be during a wget run, for
instance).
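A very rough sketch of that rate-limit idea, assuming everything is routed
through index.php; the /tmp state file, the 10-requests-per-minute figure
and the validate_page() helper are all invented for the example:

<?php
// Count requests per client IP inside a short window and refuse the
// client once it goes over the limit. This throttles rather than
// permanently blacklists; writing the IP out to a deny list would be
// the obvious next step.
$ip     = $_SERVER['REMOTE_ADDR'];
$file   = '/tmp/hits-' . md5($ip);   // one small state file per IP
$limit  = 10;                        // max requests ...
$window = 60;                        // ... per 60 seconds

$now  = time();
$hits = array();
if (is_readable($file)) {
    $hits = unserialize(file_get_contents($file));
    if (!is_array($hits)) {
        $hits = array();
    }
}

// Keep only the timestamps that still fall inside the window.
$recent = array();
foreach ($hits as $t) {
    if ($now - $t < $window) {
        $recent[] = $t;
    }
}
$recent[] = $now;
file_put_contents($file, serialize($recent));

if (count($recent) > $limit) {
    header('HTTP/1.0 403 Forbidden');
    exit('Too many requests - slow down.');
}

// Otherwise serve the requested page as normal, e.g.
// include validate_page($_GET['page']);   // validate_page() is hypothetical
?>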

Set DirectoryIndex to index.php and Apache will look for index.php as the
default page. You can also lock down the directory with chmod 711, so that
if the index.php is ever lost for any reason the directory returns 403
Forbidden instead of a listing.

Perhaps you could also use a cookie to confirm that someone is actually
browsing the site rather than just grabbing it with wget.
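For instance, a minimal PHP sketch of the cookie check; the cookie name is
made up, and a browser with cookies switched off would loop here, so treat
it only as a starting point:

<?php
// Hand out a cookie on the first visit and insist on seeing it on every
// later request. Plain wget won't send it back unless the user goes out
// of their way to manage cookies. "seen" is an invented cookie name.
if (!isset($_COOKIE['seen'])) {
    setcookie('seen', '1');
    // Bounce the client back to the same URL so the next request has
    // to present the cookie.
    header('Location: http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI']);
    exit;
}

// Cookie present - carry on serving the page.
?>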

> Do any of you know of an Apache module (or two) that I could use to do
> the following (or part thereof):
>
> 1). Limit the number of pages downloaded per time period,

You could look at mod_throttle or mod_bandwidth.

> 2). Stop web crawlers (like wget) from just grabbing the site's
>     content.

A robots.txt with:

User-agent: *
Disallow: /

Or by User-agent:

SetEnvIfNoCase User-Agent "wget" bad_bot

<Directory />
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>

> Ta
> Steve

-- 
John



