Talks (was: Re: [YLUG] A basic question)

Steve Kemp steve at steve.org.uk
Fri Dec 28 21:26:53 GMT 2007


On Fri Dec 28, 2007 at 21:05:18 +0000, Alex Howells wrote:

> >    1.  Validate the 'Date' header on mails, if present.
> 
> Yeah, this should be an obvious one since spammers often set the date
> to something way in the past/future as an attempt to make your mail
> client display it at the top - did you have any success filtering mail
> if it was +/- a few days out?

  The intention was exactly that: if the mail was dated in the
 future, or significantly in the past (you have to allow for
 retries, etc.), either tag it or drop it.

  Unfortunately I discovered that the date-field is essentially
 free-form in practice.  So the automatic parsing failed as often
 as it succeeded, and wasn't reliable enough to be used.
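
  A minimal sketch of that check, in Python (the slack windows here
 are illustrative guesses, not the exact values the filter used):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Illustrative tolerances, not the values from the original filter.
PAST_SLACK = timedelta(days=3)     # allow for queueing and retries
FUTURE_SLACK = timedelta(hours=6)  # allow for clock skew

def date_header_plausible(header, now=None):
    """True if the Date: header parses and falls inside the window."""
    now = now or datetime.now(timezone.utc)
    try:
        when = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        # Free-form junk that won't parse at all: the common failure mode.
        return False
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return (now - PAST_SLACK) <= when <= (now + FUTURE_SLACK)
```

  The hard part, as noted above, is that the parse step fails so
 often that the window test rarely gets a chance to run.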

> I've noticed Google is capable of showing you only search results in a
> chosen language, and do think that'd be rather cool for e-mail as
> well; provided it doesn't try to interpret an attached binary as
> language, of course.

  Agreed.

  I worked on this for a *long* time (at least a day!).  The idea
 was to take the character set from anything which had a mime-type
 of text/plain, text/html, etc.

  So something like this:

  Content-Type: text/html;
          charset="iso-8859-1"
  Content-Transfer-Encoding: quoted-printable

  Unfortunately I discovered that character sets and encodings are
 tricky to handle reliably.  A lot of the time the character set
 would be set to something bogus, and so trying to use that
 information failed more often than it worked.
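  The defensive decoding you end up needing looks roughly like this
 (a Python sketch; iso-8859-1 is used as the fallback simply because
 it can decode any byte sequence):

```python
import codecs

def decode_text_part(payload, declared_charset):
    """Decode a text/* MIME body, tolerating bogus charset labels."""
    for charset in (declared_charset, "iso-8859-1"):
        if not charset:
            continue
        try:
            codecs.lookup(charset)        # is the label even known?
            return payload.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue                      # bogus label, or lying label
    # iso-8859-1 accepts any byte, so this is only reached with no labels
    return payload.decode("iso-8859-1", errors="replace")
```
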

> Frankly 100% of the e-mail I care about arrives in English, I don't
> speak any other languages barring a passing familiarity with Welsh and
> perhaps French.  Being able to say "Not English? GET IN THE
> QUARANTINE!" would rock.

  It would :)
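
  A deliberately naive sketch of that quarantine test, in Python; a
 real system would want a proper language classifier, but even
 stop-word counting catches a lot:

```python
# Hypothetical heuristic, not a real classifier: count common
# English function words and quarantine anything below a threshold.
ENGLISH_HINTS = {"the", "and", "is", "to", "of", "you", "that", "for"}

def looks_english(body, threshold=0.05):
    words = body.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w.strip(".,!?;:") in ENGLISH_HINTS)
    return hits / len(words) >= threshold
```
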

> I know I've seen your system ;)  Was curious about other stuff out there too!

  There are other systems, but the problem is that most of them are
 either big and expensive (messagelabs, etc) or small and hard to
 judge (antespam.com, etc).

  I guess now that Google allows you to point your MX records at
 their servers there is another option.  But that is slightly different
 from the SMTP-proxy idea; since they'd be hosting your mail, not just
 filtering it en route.

> Amazing isn't it?  There *must* be some big sites hosted at work but
> we rarely see anything "below" the O/S ;)

  Yes :)

> Is memcached something that'll apply to any dynamic content, so stuff
> generated by mod_perl, PHP, Ruby and all that?

  Providing that the code was modified to work with it, yes.

> What're the drawbacks to the technology?

  The code must be modified to work with it.

  (I've not used Ruby's fragment caching, but I had the impression
 that that was essentially free).

> Does it require much modification to site code to define what's cached?

  You have to define what is cached, and handle it yourself.  Basically
 rather than having your routine look like this:

 sub getMenuBar
 {
    my $date = .......

    # return either state for menu bar setting for current user
    # or the rendered html as you see fit.
    #
    # (e.g. number of "new messages", etc, which might vary between
    #  page fetches)
    #
    return $data;
 }


 You change that to read:

 sub getMenuBar
 {
    my $cached =  memcached->get("user_menu_$user" );
    return $cached if ( $cached );

    # not in cache .. generate the slow way...

    # update the cache
    memcached->set( "user_menu_$user", $data );
    return( $data );
 }

  If there are only a few places that are critical *and* the code
 is structured so that you can cleanly wrap a cache-test around them
 then it is very simple to get started.

  You don't need to store everything in the memory cache, but if you
 do the whole site will be uniformly fast, rather than fast in places
 and slow in others.

  As an example I cache lots of things on the debian-administration.org
 website.  In the past 21 hours the cache hit + miss numbers are:

    hits  : 6243674
    misses:  130865

  That means that 6243674 "items" have been fetched from the memory
 cache and not fetched from the (comparatively) slow MySQL database.
 130865 items were not found in the cache, and were fetched from the
 database.
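
  As a quick sanity check on those counters:

```python
# Cache-hit ratio from the figures above.
hits, misses = 6243674, 130865
hit_rate = hits / (hits + misses)
print(f"hit rate: {hit_rate:.1%}")  # -> hit rate: 97.9%
```

  So only about one lookup in fifty actually touches the database.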

  Some items are tiny, such as the mapping between usernames and
 user IDs.  Other items are the complete text of an article, or the
 tree representing all the comments posted upon an article.
 Unfortunately I can't graph the size of the objects!

> That "seems" ridiculous but I can think of at least 15-20 projects
> which do it this way. So a large percentage. Would you care to comment
> on why they don't adopt a 'single tool' approach, given its clearly
> superior?

  Lots of people don't think it is important enough.  Primarily I think
 that is because in the real world there is often a large gap between
 best practice and actual practice.

  I've worked in places that don't use source-control and thought they
 didn't have a problem.  Other places stick to the dark ages: at
 $work[-1] we developed in Cobol with RCS!  So they get the point for
 using revision control, but lose many for using something so basic.

  I guess when you start talking about projects the size of Debian
 the release is so large, and has so many people working on different
 parts, that there isn't a simple way to freeze and release - too many
 people have their fingers in the pie.  Similar problems of size and
 "ownership" probably account for a lot of the manual steps.

  (It is interesting that Mozilla, which sets great store by its
 auto-builders [tinderbox], doesn't have an automated release system.
 I assume that is because different people and teams are in charge of
 different parts of the browser, such as the rendering engine, etc.)

  Still, once you have made the change to automated releases I think
 few people would ever go back.

> Sounds like you'd be a good candidate for giving one of these proposed
> talks! :-P

  I'm shy... and remote ;)

Steve
-- 


