[Gllug] VACANCY: Data parsing webmonkey to work in Python, Perl, Erlang, Haskell, or Lisp

- Tethys tethys at gmail.com
Wed Sep 27 14:37:56 UTC 2006


Summary
=======

Company: Gambit Research
Location: Chiswick, W4
Salary: 25-35K
Contact: jobs at gambitresearch.com

The details
===========

We're a company which tries to predict the future. If that sounds
terribly pretentious, or worryingly like a corporate mission
statement, allow us to explain: We act as the technology provider to
one of the largest syndicates of professional gamblers in the UK.

We operate much like a hedge fund, minus the suits and the egos. We're
very hacker friendly, and prefer to use clever tools and expressive,
powerful languages wherever possible. It's probably one of the few
places where you'll see functional programming used in the real world.

An important part of our job involves scraping data from
websites. We're interested in what odds the bookmakers are offering on
a particular outcome. If the odds look attractive, we have the
facility to place the bet automatically via a web scraper.

The web contains a wealth of information about historic sporting
events. Which players were in the starting XI in a particular football
match? All this data is available on the web in human-readable form.
We're interested in parsing it and extracting data we can work with to
make predictions about who's going to win the next game.

The vacancy is for someone to write parsers for semi-structured data
feeds from the web.

The role calls for a spectrum of skills. The everyday business
involves writing software to scrape prices from bookmakers' websites
and to automatically place the bets. Sometimes it's easy (write a
little Perl or Python script to do it all with a regular
expression). Sometimes it's more difficult and requires a degree of
cunning. Perhaps your particular web site doesn't like being
scraped?
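
For the easy cases, a minimal sketch of the kind of regex-driven
scrape described above might look like this (the HTML snippet, the
class names, and the fractional-odds format are all hypothetical; a
real script would first fetch the page over HTTP, e.g. with urllib):

```python
import re

# Hypothetical fragment of a bookmaker's page. In practice this string
# would come from an HTTP fetch rather than a literal.
html = ('<td class="runner">Arsenal</td><td class="odds">7/4</td>'
        '<td class="runner">Chelsea</td><td class="odds">2/1</td>')

# A single regular expression pulls out (runner, fractional odds) pairs.
pattern = re.compile(
    r'<td class="runner">([^<]+)</td><td class="odds">(\d+/\d+)</td>')

prices = pattern.findall(html)
# prices is now [('Arsenal', '7/4'), ('Chelsea', '2/1')]
```

When the markup is this regular, a few lines of Perl or Python really
do cover it; the interesting work starts when it isn't.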

There are bigger, more open problems too. How can you write tools to
make your life easier? Rather than working on the raw HTML text it may
be simpler to parse the HTML into a DOM tree and work with that. Is
it possible to exploit regularity in this structure to determine where
the useful information lies? If the HTML is being generated by some
unknown template on the server, is it possible to automatically infer
a likely structure for that template by examining multiple instances
of pages generated by that template? There are some attempts to do
this in the academic literature, but they don't seem to have made it
as far as the real world. Yet.
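
One way to start exploiting that regularity, sketched here with
Python's standard-library HTML parser (the page fragment is invented
for illustration): record the tag path leading to each piece of text.
Data stamped out by a template tends to recur at the same structural
path, unlike one-off page furniture.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Map each tag path to the text found there; repeated paths
    point at the regular, template-generated parts of a page."""
    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to here
        self.paths = {}   # path tuple -> list of text values

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag; real-world HTML is
            # often sloppy about closing things in order.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.setdefault(tuple(self.stack), []).append(text)

p = PathCollector()
p.feed('<html><body><h1>Today</h1>'
       '<table><tr><td>7/4</td></tr><tr><td>2/1</td></tr></table>'
       '</body></html>')
# Both odds share the path ('html', 'body', 'table', 'tr', 'td'),
# while the heading sits alone at ('html', 'body', 'h1').
```

Clustering text by path like this is a crude first step towards
inferring the unknown template from multiple example pages.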

Our hiring criteria are that the candidate must: a) be bright; and b)
get things done. Perhaps you'll have in-depth knowledge of some of the
areas above, but it's more important that you understand the
underlying issues and are bright enough to pick things up. It's also
important that you can write good code in any language, using the
most appropriate tool for the job.

You'll need to have a well-rounded understanding of computer science
to be able to do all of that. You'll be working on your own project,
and you'll need to put all the pieces together to make it work. While
you'll have lots of expertise available to help, what you won't find
is a separate department to do all the things some programmers
wouldn't normally do (setting up Linux servers, administering a
database, etc).

Any of the following skills would be desirable:

Python, Perl, Erlang, C, Shell scripting, Assembly, Haskell, Lisp,
Unix, SQL, Regular Expressions, Core Internet protocols (HTTP, etc).

A pathological distrust of the following would also be of great
interest:

Windows, .NET, Java, Visual Basic, "Certified Engineer" status.
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug
