[Menai-LUG] Automatic translation and Google's Summer of Code

Kevin Donnelly kevin at dotmon.com
Tue May 12 14:25:37 UTC 2009


Press Release
12 May 2009


Automatic translation from Welsh gets a boost from France!


High-quality Welsh-English machine translation will come a step closer when a 
new initiative gets underway this month.

The multinational Apertium team, which released their Welsh-English translator 
(http://www.cymraeg.org.uk [1]) in August 2008, has been accepted into the 
fifth Google Summer of Code™ [2], and one of the projects to be funded will 
be an improvement to that translator.

Apertium (http://www.apertium.org) is a Free Software [3] machine translation 
platform.  It was first developed to handle translation between related 
languages in Spain, but over the last few years it has been extended to deal 
with other languages.  To date, translators for 17 language pairs have been 
released, covering languages spoken by 1.1bn people, from English (est. 500m 
speakers) to Aranese (est. 4,000 speakers).  A similar number of other 
language pairs are in development - these include Indian languages like Hindi 
and Bengali, and Scandinavian languages like Norwegian and Sami.

Google Summer of Code offers student developers stipends to write code for 
open-source projects, advised by mentors already working on the projects, and 
has helped create millions of lines of code for dozens of projects.  This was 
the first year that Apertium applied for the program, and 9 Apertium projects 
are being supported.

The Apertium Welsh-English translator works by applying grammatical rules to a 
Welsh sentence to turn it into an English sentence.  An alternative approach 
(adopted by software like Moses [4]) is to use a large body of text to work 
out what the likely translation of a given phrase is.  

The Summer of Code student, Gabriel Synnaeve from Grenoble, France [5], will 
be working on combining these two approaches, using techniques developed at 
Carnegie Mellon University in the USA [6].  The aim is to improve the quality 
of the translation - in effect, the Apertium and Moses translations will be 
compared, and the best bits of each will be used in the final translation.

For instance, take the Welsh sentence:
"Mae Heddlu'r De yn ymchwilio i farwolaeth dyn 41 oed o Abertawe." 
(South Wales Police are investigating the death of a 41-year-old man from 
Swansea.)

Apertium currently produces:
"South Wales Police is investigating death man 41 years old from Swansea."

Moses currently produces:
"the south wales police investigation into the death of a man 41 years
of age of abertawe."

The aim is to combine the best chunks from each program, so that we get 
something like:
*[South Wales Police] *[is investigating] +[the death of a man] *[41
years old] *[from Swansea]
Here, the chunks marked * come from Apertium, and the one marked + from 
Moses, and combining both improves the quality of the translation.
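The chunk-by-chunk combination above can be sketched in a few lines of Python.  This is only an illustration, not the method the project will actually use: the way each engine's output is split into five aligned chunks is an assumption, and the hand-picked set PREFER_MOSES stands in for whatever scoring the real multi-engine system applies.  All names in the sketch are hypothetical.

```python
from typing import List, Set, Tuple

# Engine outputs from the example above, split by hand into aligned chunks.
# ASSUMPTION: this segmentation and alignment is illustrative only.
APERTIUM = ["South Wales Police", "is investigating", "death man",
            "41 years old", "from Swansea"]
MOSES = ["the south wales police", "investigation into", "the death of a man",
         "41 years of age", "of abertawe"]

# ASSUMPTION: chunk positions where the Moses candidate is judged better.
# A real system would derive this from a scoring model, not a fixed set.
PREFER_MOSES: Set[int] = {2}


def combine(apertium: List[str], moses: List[str],
            prefer_moses: Set[int]) -> List[Tuple[str, str]]:
    """Pick one candidate per position, tagging it '*' (Apertium) or '+' (Moses)."""
    if len(apertium) != len(moses):
        raise ValueError("chunk lists must be aligned one-to-one")
    combined = []
    for i, (a_chunk, m_chunk) in enumerate(zip(apertium, moses)):
        if i in prefer_moses:
            combined.append(("+", m_chunk))
        else:
            combined.append(("*", a_chunk))
    return combined


def render(chunks: List[Tuple[str, str]]) -> str:
    """Show the chunks in the marked notation used in the example above."""
    return " ".join(f"{mark}[{text}]" for mark, text in chunks)


def plain(chunks: List[Tuple[str, str]]) -> str:
    """Join the chosen chunks into the final translation."""
    return " ".join(text for _, text in chunks)


result = combine(APERTIUM, MOSES, PREFER_MOSES)
print(render(result))
print(plain(result))
```

The hard part, which this sketch leaves out entirely, is deciding automatically which chunk is better at each position.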

This is cutting-edge stuff, and has rarely been tried before.  Prof Harold 
Somers, in a 2004 report for the Welsh Language Board [7], suggested that a 
medium-term goal for machine translation in Welsh would be “to integrate ... 
different [machine translation] engines into a single system”.  Nothing has 
been done on that to date, and Gabriel's work will be the first attempt to 
bring this vision of "multi-engine machine translation" for Welsh closer to 
reality.

Francis Tyers [8], who will be mentoring Gabriel, said, "I was quite surprised 
that we didn't get any Welsh students applying, but this is a fantastic 
opportunity to improve Welsh language technology.  I have no doubt we'll see 
some real gains in the translation quality."

Gabriel has already started work.  "At the minute I'm fine-tuning the Moses 
Welsh-English translator to make it as efficient as possible.  The Apertium 
community is very friendly, and I wanted to participate in a big open
source project, so I'm glad I went for it."

Kevin Donnelly [9], who co-developed the Apertium Welsh-English translator 
with Francis, noted that this was a big step forward for Welsh.  “It is 
wonderful that so many talented people are working on Apertium, and that they 
are giving Welsh such a high priority.  What we need now is for bodies 
promoting Welsh here in Wales to step up to the plate and give whatever 
encouragement and other support they can.”


Notes

[1] http://ufal.mff.cuni.cz/pbml-91-100.html.  Francis Tyers and Kevin 
Donnelly (2009): “apertium-cy - a collaboratively-developed free RBMT system 
for Welsh to English”, Prague Bulletin of Mathematical Linguistics, 91.

[2] http://code.google.com/soc

[3] http://www.fsf.org/about/what-is-free-software.  The Free Software 
Foundation's definition of “Free Software” is software that the user is free 
to use, copy, change, and distribute.

[4] http://www.statmt.org/moses.  Moses is an open-source statistical machine 
translation system.

[5] Gabriel Synnaeve is a student at the École Nationale Supérieure 
d'Informatique et de Mathématiques (http://ensimag.grenoble-inp.fr), a 
leading informatics and mathematics centre.  He will graduate in September 
2009 and will then begin work on a doctorate on Bayesian machine learning.

[6] Alon Lavie (http://www.cs.cmu.edu/~alavie) is leading this work.  See 
also:  http://www.cs.cmu.edu/~alavie/papers/EAMT-2005-MEMT.pdf.  S. Jayaraman 
and A. Lavie (2005): "Multi-Engine Machine Translation Guided by Explicit 
Word Matching", Proceedings of EAMT-2005.

[7] http://www.byig-wlb.org.uk/english/publications/publications/2302.doc.  
Harold Somers (2004): “Machine translation and Welsh: the way forward”, 
Report for the WLB.

[8] Francis Tyers studied computer science at Aberystwyth, and is now a 
language engineer for Prompsit Language Engineering, S.L. and a PhD student 
at the Universitat d'Alacant.  He is a key Apertium developer, with a special 
interest in extending it to handle the Celtic languages.

[9] Kevin Donnelly has been working on Free Software in Welsh since 2003, and 
developed the online Welsh dictionary Eurfa (http://www.eurfa.org.uk).  


Contact:
Kevin Donnelly, 01248-715925, kevin at dotmon.com


-- 
Pob hwyl / Best wishes

Kevin Donnelly

www.cymraeg.org.uk - Welsh-English autotranslator
www.klebran.org.uk - Free grammar checker for Welsh
www.eurfa.org.uk - Free dictionary for Welsh



