Talks (was: Re: [YLUG] A basic question)
Steve Kemp
steve at steve.org.uk
Fri Dec 28 21:26:53 GMT 2007
On Fri Dec 28, 2007 at 21:05:18 +0000, Alex Howells wrote:
> > 1. Validate the 'Date' header on mails, if present.
>
> Yeah, this should be an obvious one since spammers often set the date
> to something way in the past/future as an attempt to make your mail
> client display it at the top - did you have any success filtering mail
> if it was +/- a few days out?
The intention was exactly that, if the mail was in the future or
significantly in the past (you have to allow for retries, etc) to
either tag it, or drop it.
Unfortunately I discovered that the date-field is essentially
free-form in practice. So the automatic parsing failed as often
as it succeeded, and wasn't reliable enough to be used.
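For illustration, the check described above might be sketched in Python like this (this is not the original code; the module choices and the skew thresholds are my own assumptions, and the try/except is exactly where free-form dates defeat the idea):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical thresholds: a little forward skew for bad clocks,
# several days backwards to allow for queueing and retries.
MAX_FUTURE = timedelta(hours=6)
MAX_PAST = timedelta(days=5)

def date_header_plausible(header_value):
    """Return True if the Date: header parses and is roughly current."""
    try:
        sent = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return False  # free-form junk: the failure mode described above
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)
    now = datetime.now(timezone.utc)
    return (now - MAX_PAST) <= sent <= (now + MAX_FUTURE)
```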
> I've noticed Google is capable of showing you only search results in a
> chosen language, and do think that'd be rather cool for e-mail as
> well; provided it doesn't try to interpret an attached binary as
> language, of course.
Agreed.
I worked on this for a *long* time (at least a day!). The idea
was to take the content-type from anything which had a mime-type
of text/plain, text/html, etc.
So something like this:
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Unfortunately I discovered that character sets and encodings are
tricky to handle reliably. A lot of the time the character set
would be set to something bogus, and so trying to use that information
to decode the body was a dead end.
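A rough sketch of coping with that in Python (my own illustration, not the original code; the fallback list is an assumption) is to distrust the declared charset and fall back through a couple of common encodings:

```python
# Decode a text/* MIME part, treating the declared charset as a hint.
# Bogus labels raise LookupError; wrong ones raise UnicodeDecodeError.
def decode_part(raw_bytes, declared_charset):
    candidates = [declared_charset, "utf-8", "iso-8859-1"]
    for charset in candidates:
        if not charset:
            continue
        try:
            return raw_bytes.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue  # bogus or wrong label; try the next candidate
    # iso-8859-1 accepts any byte sequence, so this is a last resort
    return raw_bytes.decode("latin-1", errors="replace")
```

This recovers *some* text, but of course it cannot tell you whether the result is what the sender intended, which is why the approach wasn't reliable enough.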
> Frankly 100% of the e-mail I care about arrives in English, I don't
> speak any other languages barring a passing familiarity with Welsh and
> perhaps French. Being able to say "Not English? GET IN THE
> QUARANTINE!" would rock.
It would :)
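As a toy illustration of that quarantine idea (emphatically not a real language detector; production systems use n-gram models, and this tiny stopword list is my own), you could score text by how many common English words it contains:

```python
# Crude heuristic: fraction of words found in a small English
# stopword list. Good enough to show the shape of the idea only.
ENGLISH_STOPWORDS = {
    "the", "and", "of", "to", "a", "in", "is", "it", "that", "for",
    "on", "with", "as", "was", "at", "by", "this", "have", "not", "you",
}

def looks_english(text, threshold=0.15):
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold
```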
> I know I've seen your system ;) Was curious about other stuff out there too!
There are other systems, but the problem is that most of them are
either big and expensive (messagelabs, etc) or small and hard to
judge (antespam.com, etc).
I guess now that Google allows you to point your MX records at
their servers there is another option. But that is slightly different
from the SMTP-proxy idea; since they'd be hosting your mail, not just
filtering it en route.
> Amazing isn't it? There *must* be some big sites hosted at work but
> we rarely see anything "below" the O/S ;)
Yes :)
> Is memcached something that'll apply to any dynamic content, so stuff
> generated by mod_perl, PHP, Ruby and all that?
Providing that the code was modified to work with it, yes.
> What're the drawbacks to the technology?
The code must be modified to work with it..
(I've not used Ruby's fragment caching, but I had the impression
that that was essentially free).
> Does it require much modification to site code to define what's cached?
You have to define what is cached, and handle it yourself. Basically
rather than having your routine look like this:
sub getMenuBar
{
    my $date = .......

    # return either state for menu bar setting for the current user,
    # or the rendered HTML, as you see fit.
    #
    # (e.g. number of "new messages", etc, which might vary between
    # page fetches)
    #
    return $data;
}
You change that to read:
sub getMenuBar
{
    my $cached = memcached->get( "user_menu_$user" );
    return $cached if ( $cached );

    # not in cache .. generate the slow way...

    # update the cache
    memcached->set( "user_menu_$user", $data );
    return( $data );
}
If there are only a few places that are critical *and* the code
is structured so that you can cleanly wrap a cache-test around them
then it is very simple to get started.
You don't need to store everything in the memory cache, but if you
do it'll be a lot faster rather than fast in places and slow in others.
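The same get-or-compute pattern, sketched in Python for comparison (a plain dict stands in for the real memcached client, and the function names are mine):

```python
# In-process stand-in for memcached, to show the cache-wrap shape.
cache = {}

def get_menu_bar(user, build_menu):
    key = f"user_menu_{user}"
    cached = cache.get(key)
    if cached is not None:
        return cached          # cache hit: skip the slow path entirely
    data = build_menu(user)    # not in cache: generate the slow way
    cache[key] = data          # update the cache for next time
    return data
```

The point is that the wrapping is mechanical: if the slow routine is already a clean function, the cache test is three extra lines.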
As an example I cache lots of things on the debian-administration.org
website. In the past 21 hours the cache hit + miss numbers are:
hits : 6243674
misses: 130865
That means that 6243674 "items" have been fetched from the memory
cache and not fetched from the (comparatively) slow MySQL database.
130865 items were not found in the cache, and were fetched from the
database.
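Those counters work out to a hit rate of roughly 98%:

```python
hits, misses = 6243674, 130865
total = hits + misses
hit_rate = hits / total
print(f"{hit_rate:.2%}")  # prints 97.95%: nearly every lookup avoided the database
```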
Some items are tiny, such as the mapping between usernames and
user IDs. Other items were the complete text of an article, or the
tree representing all the comments posted upon an article.
Unfortunately I can't graph the size of the objects!
> That "seems" ridiculous but I can think of at least 15-20 projects
> which do it this way. So a large percentage. Would you care to comment
> on why they don't adopt a 'single tool' approach, given it's clearly
> superior?
Lots of people don't think it is important enough. Primarily I think
that is because in the real world there is often a large gap between
best practice and actual practice.
I've worked in places that don't use source-control and thought they
didn't have a problem. Other places stick to the dark ages: at
$work[-1] we developed in COBOL with RCS! So they get the point for
using revision control, but lose many for using something so basic.
I guess when you start talking about projects the size of Debian
the release is so large, and has so many people working on different
parts, that there isn't a simple way to freeze and release - too many
people have their fingers in the pie. Similar problems of size and
"ownership" probably account for a lot of the manual steps.
(It is interesting that Mozilla, which sets great store by its
auto-builders [tinderbox], doesn't have an automated release system. I
assume that is because different people and teams are in charge of
different parts of the browser, such as the rendering engine, etc.)
Still, once you have made the change to automated releases I think
few people would ever go back.
> Sounds like you'd be a good candidate for giving one of these proposed
> talks! :-P
I'm shy... and remote ;)
Steve