Living deliverables are the online drafts of the project's deliverable documents. Please use the cover page from the template deliverable and enter the administrative data for the deliverable. Once you have finished editing and are ready to produce the final version, generate the PDF from the top cover page using the print icon, and create a Biblio item to archive the version of the document to be delivered. Both the Biblio item and the link to the corresponding living deliverable should be entered in the administrative view for the project's deliverables.
Once this is done, the system fills in the table of deliverables automatically.
| Contract No.: | FP7-ICT-247914 and FP7-ICT-288317 | 
|---|---|
| Project full title: | MOLTO - EEU Multilingual Online Translation | 
| Deliverable: | D1.7A Advisory report | 
| Security (distribution level): | Consortium | 
| Contractual date of delivery: | M39 | 
| Actual date of delivery: | 27 May 2013 | 
| Type: | Report | 
| Status & version: | Final | 
| Author(s): | Stephen Pulman (University of Oxford), Keith Hall (Google Research) | 
| Task responsible: | UHEL | 
| Other contributors: | |
This annex to Deliverable D1.7 is the final report written by the MOLTO Advisory Panel after attending the last project meeting in Barcelona on 23 May 2013.
The original aims of the MOLTO project were to use the GF approach to provide high-quality translations of texts in limited domains in real time, enabling an author to simultaneously produce versions of a text in multiple languages. These aims also included expanding coverage to new languages, enhancing lexical and other resources, and developing frameworks and training to simplify the tasks of grammar development and make them more productive. Other aims were to investigate the role of controlled languages in interaction with ontologies and other types of reasoning and knowledge representation systems, and to explore hybrid approaches to machine translation that combine the precision of GF-based translation with the coverage and robustness of statistical machine translation methods. The project also hoped, via its industrial partners, to demonstrate that GF applications could be of commercial value. Three case studies were envisaged: mathematical exercises (using the Sage framework) in 15 languages, patent data in at least 3 languages, and museum object descriptions in 15 languages.
Over the course of the MOLTO project much progress has been made in achieving the above goals. One of the challenges to building and maintaining hand-built grammar systems is managing the grammar development. The MOLTO project delivers a new set of development tools ranging from the cloud-based grammar editor to an integrated development environment plugin (Eclipse). The Grammatical Framework summer school continues to train new members of the community in developing the grammars using these tools. The development of a new translation system takes a matter of days; adding a new language to a system takes hours (once the resource grammar for the language exists). The most labor intensive part of the work is developing a resource grammar, which takes on the order of 3 to 9 months; once completed it is easily exploited through the MOLTO tools and existing systems based on MOLTO technology can utilize the new language (for translation, generation, or information access).
To ease the development of multilingual grammars (and domain-specific resource grammars), MOLTO delivers tools and techniques to expand the lexicons. One approach is to use translations of WordNet that maintain links across the lexical entries. This allows the grammar developer to write sense-disambiguated translation lexicons.
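The idea of a sense-disambiguated lexicon can be illustrated with a minimal sketch. It assumes that each sense (synset) has an identifier shared across languages; the identifiers, the table `SYNSETS`, and the helper functions below are hypothetical and are not taken from the MOLTO tools or from WordNet itself.

```python
from typing import Dict, List

# Hypothetical synset identifiers linking entries across languages.
# Real WordNet identifiers look different; these are for illustration only.
SYNSETS: Dict[str, Dict[str, str]] = {
    "bank.financial": {"Eng": "bank", "Ita": "banca", "Fre": "banque"},
    "bank.river": {"Eng": "bank", "Ita": "riva", "Fre": "rive"},
}


def senses(word: str, lang: str) -> List[str]:
    """Return every synset in which `word` is the entry for `lang`."""
    return sorted(s for s, forms in SYNSETS.items() if forms.get(lang) == word)


def translate_sense(synset: str, target: str) -> str:
    """Look up the target-language word for one specific sense."""
    return SYNSETS[synset][target]
```

Here the English word "bank" belongs to two synsets, so the grammar developer chooses the sense once and obtains the matching word in every linked language.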
In the final year of the project, a robust parser for the Grammatical Framework grammars was developed. This statistical parser bridges the gap between brittle hand-coded grammars and data-driven statistical parsers. The parser is capable of generating parse fragments even when a complete analysis is not available under the defined grammar. Performance is competitive with widely used systems like the Stanford parser.
MOLTO explored a variety of applications of the rich grammar formalism of GF along with the development tools. One focus was the application to multilingual information access: the Ontotext ontologies, the Cultural Heritage retrieval and verbalization, and ACEWiki inference. Another focus was to expand the translation capabilities of GF by exploring the integration of GF and statistical machine translation techniques. Leveraging the strong syntactic typing from GF, the GF/SMT translation system was able to perform state-of-the-art translation for patents.
One of the industrial partners, Be Informed, has successfully deployed systems based on MOLTO technologies to model their information and general customer-facing documents in multiple languages. The other industrial partner, OntoText, has detailed plans to include MOLTO technologies in its product line.
We were particularly impressed by the effort devoted to evaluation, especially of translation but also of other MOLTO applications such as business logic modelling. For translation, the original promise of the MOLTO project was to provide high-quality, precise translations within limited domains. The various tests carried out with human judges largely seem to confirm that this goal has been achieved: whereas existing commercial systems provide wider coverage than the MOLTO tools, the quality of their results is not as high. Similarly, the comparison carried out by Be Informed of the MOLTO tools against their existing solution (Velocity) seems to confirm the superiority of the MOLTO tools.
Overall we believe the project team are to be congratulated on what they have achieved over the course of the project: we regard the project as having successfully accomplished all of the goals it originally set for itself.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | DX.2 Annual public report | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M24 | 
| Actual date of delivery: | 15 March 2010 | 
| Type: | Report | 
| Status & version: | Final | 
| Author(s): | O. Caprotti et al. | 
| Task responsible: | UGOT | 
| Other contributors: | |
Annual report on activities carried out in the framework of the MOLTO EU project. This report is designed for Web publishing, for a broad public outside the consortium. It documents the main results obtained by the MOLTO project during the first two years of activity and promotes the objectives of the project.
MOLTO’s goal is to develop a suite of tools for translating texts between multiple languages in real time with high quality. MOLTO uses domain specific semantic grammars and ontology-based interlinguas implemented in GF (Grammatical Framework), a grammar formalism where multiple languages are related by a common abstract syntax. Until now GF has been applied in several small-to-medium size domains, typically targeting up to ten languages, but during MOLTO we will scale this up in terms of productivity and applicability by increasing the size of domains and the number of languages.
MOLTO aims to make its technology accessible to domain experts who lack GF expertise so that building a multilingual application will amount to just extending a lexicon and writing a set of example sentences. The most research-intensive parts of MOLTO are the two-way interoperability between ontology standards (such as OWL and RDF) and GF grammars and the extension of rule-based translation by statistical methods. The OWL-GF interoperability enables multilingual natural language based interaction with machine-readable knowledge while the statistical methods add robustness to the system when desired. MOLTO technology is released as open-source libraries for third-party translation tools and web pages and thereby fits into standard workflows.
The EU project MOLTO - Multilingual Online Translation started on March 1, 2010 and will run until June 2013. The Consortium, comprising the universities of Gothenburg, Helsinki and Polytechnical Barcelona together with the Bulgarian industrial partner OntoText, has been enlarged by the addition of the University of Zurich and of the Dutch company Be Informed.
MOLTO's multilingual translation tools use multilingual grammars based on semantic interlinguas and statistical machine translation to simplify the production of multilingual documents without sacrificing quality. The interlinguas are designed to model domain semantics and are equipped with reversible generation functions: translation is obtained by composing parsing of the source language with generation of the target language.
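The composition of parsing and generation can be sketched in a few lines. This is a toy illustration under stated assumptions, not GF itself: the abstract tree type `Pred`, the word tables, and the sentence frames are all hypothetical, and parsing is done by brute-force search over the small set of possible trees rather than by a real parser.

```python
from typing import Dict, NamedTuple


class Pred(NamedTuple):
    """Language-independent abstract tree for 'this <kind> is <quality>'."""
    kind: str
    quality: str


KINDS = ("Fish", "Wine")
QUALITIES = ("Delicious", "Expensive")

# Toy concrete syntaxes (hypothetical, not from the GF resource library).
WORDS: Dict[str, Dict[str, str]] = {
    "Eng": {"Fish": "fish", "Wine": "wine",
            "Delicious": "delicious", "Expensive": "expensive"},
    "Ita": {"Fish": "pesce", "Wine": "vino",
            "Delicious": "delizioso", "Expensive": "caro"},
}
FRAME: Dict[str, str] = {
    "Eng": "this {kind} is {quality}",
    "Ita": "questo {kind} è {quality}",
}


def linearize(tree: Pred, lang: str) -> str:
    """Generate the sentence for an abstract tree in one language."""
    words = WORDS[lang]
    return FRAME[lang].format(kind=words[tree.kind], quality=words[tree.quality])


def parse(sentence: str, lang: str) -> Pred:
    """Invert linearize by searching for the tree that yields the sentence."""
    for kind in KINDS:
        for quality in QUALITIES:
            tree = Pred(kind, quality)
            if linearize(tree, lang) == sentence:
                return tree
    raise ValueError(f"no parse for {sentence!r} in {lang}")


def translate(sentence: str, source: str, target: str) -> str:
    """Translation is parsing into the interlingua followed by generation."""
    return linearize(parse(sentence, source), target)
```

Because both languages share the same abstract trees, adding a language to this sketch only requires a new word table and frame; `translate` itself does not change.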
An implementation of this technology is already available in the Grammatical Framework, GF. As a result of the MOLTO project work, GF technologies are complemented by the use of ontologies, viewed as formalisms employed by the semantic web for capturing structural relations, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from linguistic data.
MOLTO is committed to dealing with 15 languages, which include 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is constant ongoing work on creating new resource grammars, in particular for Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, and Swahili. The coverage and accuracy of the GF resource grammar library vary among the different languages and are documented on the GF web site.
When comparing MOLTO to popular translation tools like Systran (Babelfish) and Google Translate, the main difference is the intended user of the tools: these tools target end-users of information whereas MOLTO targets producers of information.
For producers of information, MOLTO handles well those scenarios in which the language is constrained. Examples include e-commerce sites, where products are often described with recurring linguistic expressions, as well as Wikipedia articles, contracts, business letters, user manuals, and software localization. Even social networks often display common phrases ("Happy birthday!", "I like it", "The hotel is located ....", "Your reservation is confirmed"). Ideally, MOLTO tools will enable publishers of websites to add multilinguality with little effort and, most importantly, with the assurance that the meaning of the message stays unaltered across languages. MOLTO is also working on a multilingual semantic wiki .............
There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well-understood codified language, either because it is of technical nature or because of common everyday usage.
The domains considered during the MOLTO project show a range of features of constrained natural languages: mathematical exercises and biomedical patents employ a technical and sophisticated jargon, whereas museum object descriptions use a language accessible to anybody. 
The expected final product of MOLTO is an open-source software suite of tools comprising a grammar development environment, an application programming interface and environment to assist the translators' workflow, and sample application grammar libraries for the domains of mathematical word problems, biomedical patents, and cultural artefacts.
Translation systems in MOLTO rely on multilingual grammars written in the GF programming language. Until now, the development environment available to GF grammarians consisted of a generic text editor, such as Emacs, used in combination with the GF interactive command shell and the online GF documentation. This is a simple and effective environment for the experienced grammar developer. To better support less experienced grammar developers, one of the goals of the MOLTO project is to create an Integrated Development Environment for grammar development. The GF Simple Editor (by Thomas Hallgren), an initial prototype of a web-based grammar development environment that offers the same core functionality as the traditional environment, is now available at http://www.grammaticalframework.org/demos/gfse. Its main features include grammar editing, grammar compilation, error detection, testing and visualization. Moreover, it enables the creation of web-based translation systems without installing any software, since it uses web services to carry out compilation and interpretation tasks, and thus gives novice and occasional users quick access to GF. The intended scenario for this editor is fast testing and prototyping of example grammars in tutorial settings, for instance when teaching and demonstrating GF.
A different, more sophisticated high-level integrated development environment is based on the Eclipse platform and is specifically tailored to GF grammar writing. The GF Eclipse plugin (by John Camilleri) currently features real-time syntax checking, automatic code formatting, import-aware auto-complete suggestions, cross-reference resolution, inline contextual documentation, "New Module" wizards, external library browsing, launch shortcuts to the GF shell, and a visual tool for running treebank test suites. These new, powerful, time-saving development tools are aimed at new users and GF veterans alike. The plugin is available online at http://www.grammaticalframework.org/eclipse/ and at http://www.molto-project.eu/wiki/gf-eclipse-plugin.
Controlled natural languages are controlled subsets of natural languages, which are normally used in technical domains. The purpose of these languages is to reduce the complexity involved in natural languages, and to eliminate the ambiguity. The users of these languages are experts within their domain, and are trained to use these languages.
The MOLTO Phrasebook (by Aarne Ranta et al.) is one such controlled natural language, whose domain is that of tourist phrases. It covers greetings and travel phrases such as "this fish is delicious" and "how far is the airport from the hotel" in 17 languages. The translations show the kind of quality that can be hoped for when using a GF grammar that can handle disambiguation in conveying gender and politeness, for instance from English to Italian. It is available both on the web from http://www.grammaticalframework.org/demos/phrasebook/ and as a stand-alone, offline Android application, the PhraseDroid, from http://tinyurl.com/7tyzvfd. Screenshots of the mobile application are shown in the image on the side.
A different kind of controlled natural language is one that is used to command an interactive software system, for instance a computational engine such as Sage. The GFSage software application (by Jordi Saludes) is a command-line tool able to take commands in natural language, have them executed by Sage, and have the answers rendered in natural language too. The image on the side shows the web interface of Sage augmented by the MOLTO natural language command module. Note that this application demonstrates how a MOLTO library can add multimodality to a system originally developed with keyboard input/output as its user interface. In fact, by piping the results to a speech engine, one can have the results read aloud, thus increasing the accessibility of computational systems to the visually impaired. The natural language interface relies on the Mathematical Grammar Library, which can be tested at http://www.grammaticalframework.org/demos/minibar/mathbar.html, and documentation on the GFSage module is available as deliverable http://tinyurl.com/78bh4ap from the MOLTO wiki http://www.molto-project.eu/wiki/d62-prototype-comanding-cas.
To demonstrate the MOLTO Knowledge Reasoning Infrastructure, the Patent retrieval prototype (by Milen Chechev from Ontotext in collaboration with the UPC and the UGOT teams), at http://molto-patents.ontotext.com, shows examples of natural language queries over a set of patents in the pharmaceutical domain. Users can ask questions in French and English such as 'what are the active ingredients of "AMPICILLIN"' and 'que sont les formes posologiques de "AMPICILLIN"'. The system is still under development: at present the online interface allows users to browse the retrieved patents and returns the semantic annotations that explain why any particular patent has matched the user's criteria. Similar technology for knowledge retrieval is also being applied in the case of cultural heritage, namely to descriptions of artefacts from the museum of Gothenburg, in order to allow multilingual query and retrieval. For this task, an ad-hoc ontology has been created, and its preliminary GF application grammar can be tested by selecting "Painting.pgf" at http://www.grammaticalframework.org/demos/minibar/minibar.html.
The MOLTO translation environment is being developed (by UHEL with contributions of UGOT) as a customization of the GlobalSight translation system (www.globalsight.com). The aim is to be able to embed MOLTO translation tools in a third-party translation platform. MOLTO tools are designed with a focus only on translation. GlobalSight is an open source translation management platform, which provides the infrastructure needed in a professional translation workflow. More specifically, a MOLTO translation editor will be available alongside conventional editors and will be characterized by the possibility of fetching terms from the FactForge ontology via the TermFactory database, allowing terms to be imported into and exported from TermFactory. Terminology work is also supported by OntoR, an ontology extraction system (by Seppo Nyrkkö) implemented as a semi-supervised machine learning process, in which new term dictionary candidates are found in a given text by finding "closest matches" in previously known _ontologies_ (i.e. hierarchical vocabulary, term structure, usually industry or domain specific). A corpus-harvested new term can be _aligned_ with its closest matches in a pre-existing term ontology. The new term's functional and semantic environment is analyzed, and the extracted feature variables are compared to the values of previously known terms. The user is given supervisory control to decide the best alignment match and thus refine the ontology incrementally. These tools are not yet ready for distribution, but a preview can be seen during the project meetings' open days.
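The "closest matches" step can be pictured with a minimal sketch. The source does not say which features or similarity measure OntoR uses, so everything here is an assumption made for illustration: terms are represented as sparse feature dictionaries, similarity is plain cosine similarity, and the names `cosine` and `closest_matches` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Assumed representation: a term is a sparse vector of named feature weights.
Features = Dict[str, float]


def cosine(a: Features, b: Features) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(value * b.get(name, 0.0) for name, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_matches(candidate: Features,
                    ontology: Dict[str, Features],
                    k: int = 3) -> List[Tuple[str, float]]:
    """Rank known ontology terms by similarity to a corpus-harvested candidate.

    The ranked list is only a proposal; a human reviewer picks the alignment.
    """
    scored = [(term, cosine(candidate, feats)) for term, feats in ontology.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In a semi-supervised loop of this kind, the user would confirm or reject the top-ranked proposals, and each accepted alignment would add the new term to the ontology before the next candidate is scored.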
The main dissemination venues for the results of MOLTO are the MOLTO website and the project meetings. The website at www.molto-project.eu makes available all the project’s results and advertises news, deliverables, and events organized by the partners. It also archives all MOLTO publications, both those delivered at international meetings and those delivered at internal workshops. The MOLTO news updates are posted as an RSS feed suitable for aggregation by interested portals and are also distributed via the MOLTO Twitter feed and the MOLTO LinkedIn group.
MOLTO sponsored the GF Summer School 2011, Frontiers of Multilingual Technologies, held on August 15-26, 2011 and hosted by UPC in Barcelona, Spain. The two-week program included lectures ranging from "Getting started with GF" to "GF application development" and "Resource grammar development", and was attended by around 20 participants from around the world. The use case studies of MOLTO were amply presented by members of the Consortium. On August 1, 2011 Aarne Ranta was invited to give a tutorial on GF during CADE, the 23rd International Conference on Automated Deduction, in Wroclaw, Poland. The lecturing material "Grammatical Framework: A Hands-On Introduction" is online. At the same meeting, Jordi Saludes presented the Mathematical Grammar Library during the affiliated workshop THedu'11, Computer Theorem Proving Components for Educational Software. "A Framework for Improved Access to Museum Databases in the Semantic Web" was presented during the meeting Recent Advances in Natural Language Processing (RANLP 2011), in September 2011, at Hissar, Bulgaria. Similar work, "Reason-able View of Linked Data for Cultural Heritage", was presented during The Third International Conference on SOFTWARE, SERVICES & SEMANTIC TECHNOLOGIES, also in September 2011, in Bourgas, Bulgaria. The MOLTO project was presented by Olga Caprotti at Tsukuba University and during the meeting "Digitization and E-Inclusion in Mathematics and Science 2012" (DEIMS2012) in February 2012 in Tokyo, Japan.
Two demonstrations of MOLTO prototypes on query and retrieval in the cultural heritage and in the patent domains have been accepted for presentation at the European Track of the World Wide Web 2012 conference. A paper on GF, "Smart Paradigms and the Predictability and Complexity of Inflectional Morphology", will also be presented at the conference of the European Association for Computational Linguistics in April 2012.
The list of conference papers funded by MOLTO can be retrieved under Publication from the website.
Project meetings of MOLTO always include an open day with a program of presentations aimed at a general audience. The most recent MOLTO open days took place in Gothenburg on March 9, 2011 during the second project meeting, in Helsinki on September 2, 2011 during the third project meeting, and in Gothenburg on January 12, 2012 for the MOLTO-EEU kick-off meeting.
The project is looking forward to the final development phase, especially with the addition of the new case studies, which will bring feedback to existing tools and ongoing work. In terms of events sponsored by MOLTO, the Third International Workshop on Free/Open-source Rule-based Machine Translation will take place in Gothenburg, Sweden, on 13-15 June 2012. The meeting is chaired by the MOLTO coordinator, A. Ranta, and local organization is managed by the MOLTO project manager. The fifth MOLTO project meeting will take place in September 2012 in The Netherlands in cooperation with the MONNET project. Stay tuned by subscribing to the MOLTO RSS feed or follow us on Twitter.
| Contract No.: | FP7-ICT- | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | DX.3 | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M41 | 
| Actual date of delivery: | |
| Type: | Report | 
| Status & version: | Draft | 
| Author(s): | Aarne Ranta | 
| Task responsible: | UGOT | 
| Other contributors: | All partners | 
This final report is meant to summarize the work carried out and the results obtained under the grant agreement FP7-ICT-247914 and its enlargement 288317. It is also intended as a means to assess the output of the MOLTO project by the public.
This section must be of suitable quality to enable direct publication by the Commission and should preferably not exceed 40 pages. This report should address a wide audience, including the general public.
The publishable summary has to include 5 distinct parts described below:
Furthermore, project logo, diagrams or photographs illustrating and promoting the work of the project (including videos, etc…), as well as the list of all beneficiaries with the corresponding contact names can be submitted without any restriction.
A plan for use and dissemination of foreground (including socio-economic impact and target groups for the results of the research) shall be established at the end of the project. It should, where appropriate, be an update of the initial plan in Annex I for use and dissemination of foreground and be consistent with the report on societal implications on the use and dissemination of foreground (section 4.3 – H). The plan should consist of:
This section should describe the dissemination measures, including any scientific publications relating to foreground. Its content will be made available in the public domain thus demonstrating the added-value and positive impact of the project on the European Union.
This section should specify the exploitable foreground and provide the plans for exploitation. All these data can be public or confidential; the report must clearly mark non-publishable (confidential) parts that will be treated as such by the Commission. Information under Section B that is not marked as confidential will be made available in the public domain thus demonstrating the added-value and positive impact of the project on the European Union.
This section includes two templates:
Template A1: List of all scientific (peer reviewed) publications relating to the foreground of the project.
Template A2: List of all dissemination activities (publications, conferences, workshops, web sites/applications, press releases, flyers, articles published in the popular press, videos, media briefings, presentations, exhibitions, thesis, interviews, films, TV clips, posters).
These tables are cumulative, which means that they should always show all publications and activities from the beginning until after the end of the project. Updates are possible at any time.
The applications for patents, trademarks, registered designs, etc. shall be listed according to the template B1 provided hereafter.
The list should specify at least one unique identifier, e.g. a European Patent application reference. For patent applications, only if applicable, contributions to standards should be specified. This table is cumulative, which means that it should always show all applications from the beginning until after the end of the project.
Please complete the table hereafter:
| Type of Exploitable Foreground | Description of exploitable foreground | Confidential | Foreseen embargo date | Exploitable product(s) or measure(s) | Sector(s) of application | Timetable, commercial or any other use | Patents or other IPR exploitation (licences) | Owner and Other Beneficiary(s) involved |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| General advancement of knowledge, Commercial exploitation of R&D results, Exploitation of R&D results via standards, exploitation of results through EU policies, exploitation of results through (social) innovation | Ex: New superconductive Nb-Ti alloy | YES/NO | dd/mm/yyyy | MRI equipment | 1. Medical 2. Industrial inspection (the type sector (NACE nomenclature) : http://ec.europa.eu/competition/mergers/cases/index/nace_all.html) | 2008- 2010 | A materials patent is planned for 2006 | Beneficiary X (owner) Beneficiary Y, Beneficiary Z, Poss. licensing to equipment manuf. ABC |
In addition to the table, please provide a text to explain the exploitable foreground, in particular:
Replies to the following questions will assist the Commission to obtain statistics and indicators on societal and socio-economic issues addressed by projects. The questions are arranged in a number of key themes. As well as producing certain statistics, the replies will also help identify those projects that have shown a real engagement with wider societal issues, and thereby identify interesting approaches to these issues and best practices. The replies for individual projects will not be made public.
| Contract No.: | FP7-ICT- | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | |
| Security (distribution level): | |
| Contractual date of delivery: | |
| Actual date of delivery: | |
| Type: | |
| Status & version: | |
| Author(s): | |
| Task responsible: | |
| Other contributors: | |
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D1.1. Work plan for MOLTO | 
| Security (distribution level): | Confidential | 
| Contractual date of delivery: | M1 | 
| Actual date of delivery: | 1 April 2010 | 
| Type: | Report | 
| Status & version: | Final (evolving document) | 
| Author(s): | A. Ranta et al. | 
| Task responsible: | UGOT | 
| Other contributors: | |
Detailed work plan for internal use of the consortium.
This is an evolving description of the work plan of MOLTO, divided in work packages and in tasks. The document is meant to track what the MOLTO Consortium is planning to do, what it has completed so far and the status of the ongoing research. It is the responsibility of the work package leader to enter tasks and to keep them up to date so as to reflect the work done by the group.
Detailed workplan for WP1
A number of management tasks are assigned to the coordinator, e.g.:
- collecting information from partners,
- reviewing and submitting information on the progress of the project as well as reports and other deliverables to EC;
- preparation of meetings, 
- proposing the decisions and preparing the agenda of the SG, 
- chairing of the SG meetings and monitoring the implementation of decisions taken at the meetings; 
- presenting the results of the consortium and serving as the secretary in the meetings;
- administering the EC financial contribution and fulfilling other financial tasks, 
- maintaining the project's website etc.
According to the Grant Agreement, Annex II, management of the consortium activities includes:
In order to get an overview of the workpackage:
- add a view of the associated tasks
- add a view of the deliverables

Create an admin type "deliverable" to collect the info on due deliverables so that they can be tracked on the calendar and in the workpackage's description.
The commission requested the following:
| Session | Submitted on | Verified on | 
|---|---|---|
| 5.2 | Apr 26, 2011 2:40:36 PM | May 3, 2011 2:33:02 PM | 
You are kindly requested to clarify the issues raised in this letter and submit a revised periodic report and Forms C through NEF at the latest on 16th of May. Should you require more time, please contact us. However, should we not have heard from you by the deadline we will proceed with the information at hand. Please note that in such case, this may lead to all or part of the costs being rejected.
Please note that according to Article II.5 of the grant agreement, the period for the execution of the payment has been suspended pending receipt of the additional information and the revised Periodic Report through NEF.
Please clarify the following points in the revised periodic report, and revise the Forms C if necessary:
| Attachment | Size | 
|---|---|
| MOLTO (247914) _ Periodic Report and Cost Claim submission in NEF.pdf | 77.56 KB | 
| Cost Claim overview MOLTO (247914) 2011-06-22.pdf | 74.59 KB | 
| D1 3R (amended).docx | 232.74 KB | 
Timetable for negotiation:
The negotiating Project Officer (PO) is Mr. BROCHARD Michel. The full contact references are detailed in the "Negotiation Mandate".
Please note that the negotiation must be successfully concluded by 31/08/2011.
In case this deadline is not met, the Commission reserves the right to cancel negotiations and any subsequent offer for a project grant agreement. We also would like to draw your attention to the fact that negotiations may be terminated, or the negotiation mandate modified, if so required following the results of the consultation with other departments within the Commission.
Please note that in accordance with the legislation in force, the coordinator is obliged to deposit any pre-financing received from the Commission on an interest-bearing bank account. If you do not comply with this obligation, your participation as coordinator may not be accepted.
The negotiation process is supported by an on-line tool called NEF which you will need to use to submit data that is necessary for the grant agreement.
NEF will also provide access to the Legal & Financial Validation form (LFV lite). The LFV lite provides an overview of the status concerning the legal and financial data of the partners of your project, and indicates those partners for whom legal and/or financial data is missing. If the legal and/or financial data of one or more partners is flagged as needed in the LFV lite or would be incorrect, new legal and/or financial documents must be submitted for the partner(s) concerned. Additionally the Commission can also request documents and information in regard to the operational capacity of the consortium and beneficiaries to achieve the objectives and expected results of the project.
The detailed explanations for accessing NEF will be sent shortly in a separate e-mail. Further guidance is available on-line at the following address: http://ec.europa.eu/research/negotiation/
You should have already received the Evaluation Summary Report (ESR) in the info letter email. If not, please contact the negotiating Project Officer.
The negotiation guidance notes and the most recent templates for the Description of Work (Annex I to the Grant Agreement) are available at: Nef Annex 1 - Concept. Other useful information on Framework Programme 7 is available at http://cordis.europa.eu/fp7/find-doc_en.html and includes:
This letter should not be regarded under any circumstances as a formal commitment by the Commission to give financial support as this depends, in particular, on the satisfactory conclusion of negotiations and the completion of the formal selection process. Should you have any queries about the above, please do not hesitate to contact the negotiating Project Officer.
The main issue to solve is the budget cut, which of course is the usual thing to happen. We will get 600k instead of the 712k we applied for. My suggestion is that we cut all WPs and sites in proportion, so we don't need to change the work description too much.
The realistic goal is that the work will begin on 1 September. Even this needs some effort from us:
| Attachment | Size | 
|---|---|
| comments_MOLTO_Ext.pdf | 90.52 KB | 
Please address the reviewers' remarks by the end of September 2011!
Soon it will be time to report on period 2 (01/03/2011 – 29/02/2012) of the MOLTO project.
You have to send me:
This year you can complete the Use of resources directly in the NEF when completing the Form C. You will have to write short explanations of the costs: the number of person months, travel costs (who travelled where and for which purpose/meeting), consumables etc. All the costs must be related to a Work package.
The deadline for submitting your financial statement in the Participant Portal as well as sending me the Use of Resources by e-mail is 1st of April 2011.
The signed Financial Statement and the CFS (if applicable) have to be submitted to me in paper copies. Please send the originals by courier to address below.
To access the project via the Participant Portal, click on the following link: http://ec.europa.eu/research/participants/portal/
To log into the Participant Portal you need to have an account. If you don't have an account yet follow the 'register' link and instructions on the Participant Portal main page.
Once logged in with the account associated with your email address, the list of the projects you are involved in will appear under the 'My Projects' tab. The project MOLTO (247914) will appear under tab “Active”. By selecting “FR” on that line you will gain access to the Form C.
Do not hesitate to contact me if you have any questions. Kristina
Kristina Orbán Meunier
UNIVERSITY OF GOTHENBURG Research and Innovation Services
Erik Dahlbergsgatan 11B Box 100, 405 30 Göteborg, Sweden Tel +46 31 786 6466
mobile +46 766 229466
The grammar developer's tools are divided into two kinds of tasks:
- the GF grammar compiler API
- actual tools implemented by using the API
The workplan for the first six months concerns mostly the API, with the main actual tool being the GF Shell, which is a line-based grammar development tool. It is a powerful tool since it enables scripting, but it is not integrated with other working environments. The most important other environment will be web-based access to the grammar compiler.
Note that most discussions on GF are public at http://code.google.com/p/grammatical-framework/.
Here follows the work plan, with tasks assigned to sites and approximate months.
Documentation of GF is hosted on Google Code at http://code.google.com/p/grammatical-framework/
There is a wiki cover page for the Resource Grammar Library API and an online version at http://www.grammaticalframework.org/compiler-api/.
The GF API design will take into account the following requirements:
The documentation is being hosted at the GF website.
What we mean by example based grammar writing.
Current status is proof of concept: it is possible to load an example-based grammar and to compile it.
Need to do: - ....
The runtime is the part of the GF system that implements parsing and linearization of texts based on a PGF grammar that has been produced by the GF compiler.
The standard GF runtime is written in Haskell like the rest of the system. Unfortunately this results in a large memory footprint and possibly also portability problems, which preclude its use in certain applications.
The goal of the current task is to reimplement the GF runtime as a pure C library. This C library can then hopefully be used in some situations where the Haskell-based runtime would be unwieldy.
Preview versions of the implementation, libpgf, are available from the project home page. This is also where up-to-date documentation can be found.
The compiler API must be used by the morphology server.
To develop a Python plugin for GF (based on the planned C plugin) and connect it to relevant parts of the Natural Language Toolkit (http://www.nltk.org/).
2.8.1 Develop Python bindings to GF.
2.8.2 NLTK integration.
This is how to use some of the functionalities of the GF shell inside Python.
Due to some ghc glitch, it only builds on Linux.
You'll need the source distribution of GF, ghc and the Python development files (see footnote 1). Then go to the Python bindings folder and build it:
 cd GF/contrib/py-bindings
 make
It will build a shared library (gf.so) that you can import and use in Python as shown below.
To test if it works correctly, type:
 python -m doctest example.rst
First you must import the library:
% import gf
then load a PGF file, like this tiny example:
% pgf = gf.read_pgf("Query.pgf")
We could ask for the supported languages:
% pgf.languages()
[QueryEng, QuerySpa]
The start category of the PGF module is:
% pgf.startcat()
Question
Let us save the languages for later:
% eng,spa = pgf.languages()
These are opaque objects, not strings:
% type(eng) 
(type 'gf.lang')
and must be used when parsing:
% pgf.parse(eng, "is 42 prime") 
[Prime (Number 42)]
Yes, I know it should have a '?' at the end, but there is no support for other lexers at this time.
Notice that parsing returns a list of gf trees. Let's save it and linearize it in Spanish:
% t = pgf.parse(eng, "is 42 prime")
% pgf.linearize(spa, t[0])
'42 es primo'
(which is not, but there is a '?' lacking at the end, remember?)
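The parse-then-linearize steps above can be wrapped in a small helper. The following is a minimal sketch, written for Python 3, that assumes only the calls shown in this walkthrough (`languages`, `parse`, `linearize`). The names `translate` and `StubPGF` are hypothetical: the stub replaces the object returned by `gf.read_pgf("Query.pgf")` so that the snippet runs without the bindings, and its behaviour on unknown input is an assumption, not something the bindings are known to do.

```python
def translate(pgf, source, target, sentence):
    """Translate by parsing in `source` and linearizing each tree in `target`."""
    trees = pgf.parse(source, sentence)
    return [pgf.linearize(target, tree) for tree in trees]


class StubPGF:
    """Hypothetical stand-in for the object returned by gf.read_pgf().

    It hard-codes the single example from the walkthrough and is not part
    of the bindings.
    """

    def languages(self):
        return ["QueryEng", "QuerySpa"]

    def parse(self, lang, sentence):
        if lang == "QueryEng" and sentence == "is 42 prime":
            return ["Prime (Number 42)"]
        # Assumption: unparsable input yields an empty list.
        return []

    def linearize(self, lang, tree):
        if lang == "QuerySpa" and tree == "Prime (Number 42)":
            return "42 es primo"
        raise ValueError("stub cannot linearize %r in %s" % (tree, lang))


pgf = StubPGF()
eng, spa = pgf.languages()
result = translate(pgf, eng, spa, "is 42 prime")
print(result)
```

With the real bindings, the same `translate` function should work unchanged on the `pgf`, `eng` and `spa` objects created earlier, since it relies only on `parse` returning a list of trees.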
One of the good things about the GF shell is that it suggests which tokens can continue the line you are composing.
We have this in the bindings as well. Suppose we have no idea how to start:
% pgf.complete(eng, "")
['is']
So there is only one sensible thing to put in. Let's continue:
% pgf.complete(eng, "is ")
[]
It is important to note the blank space at the end; otherwise we get the same suggestion again:
% pgf.complete(eng, "is")
['is']
But why is nothing suggested after "is "? At this point a literal integer is expected, so GF would have to present an infinite list of alternatives. I cannot blame it for refusing to do so.
% pgf.complete(eng, "is 42 ")
['even', 'odd', 'prime']
Good. I will go for 'even', just to be on the safe side:
% pgf.complete(eng, "is 42 even ")
[]
Nothing again, but this time the phrase is complete. Let us check it by parsing:
% pgf.parse(eng, "is 42 even")
[Even (Number 42)]
We store the last result and ask for its type:
% t = pgf.parse(eng, "is 42 even")[0]
% type(t)
(type 'gf.tree')
What's inside this tree? We use unapply for that:
% t.unapply()
[Even, Number 42]
This method returns a list with the head of the fun judgement and its arguments:
% map(type, _)
[(type 'gf.cid'), (type 'gf.expr')]
Notice the argument is again a tree (gf.tree or gf.expr, it is all the same here.)
% t.unapply()[1]
Number 42
We will repeat the trick with it now:
% t.unapply()[1].unapply()
[Number, 42]
and again, the same structure shows up:
% map(type, _)
[(type 'gf.cid'), (type 'gf.expr')]
One more time, just to get to the bottom of it:
% t.unapply()[1].unapply()[1].unapply()
42
but now it is an actual number:
% type(_)
(type 'int')
We ended with a fully decomposed fun judgement.
1. In Ubuntu I got it by installing the package python-all-dev.
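The repeated `unapply` calls above can be folded into one recursive function. This is a sketch for Python 3 under the assumption, taken from the session above, that `unapply()` returns a list `[head, arg, ...]` for an application and a plain value (such as the int 42) for a literal. `decompose` and `StubTree` are hypothetical names; the stub only imitates `gf.tree` so that the example runs without the bindings.

```python
def decompose(node):
    """Recursively unapply a tree into nested Python lists."""
    if not hasattr(node, "unapply"):
        return node  # already a plain Python value
    parts = node.unapply()
    if not isinstance(parts, list):
        return parts  # a literal, such as the int 42 above
    head, args = parts[0], parts[1:]
    return [head] + [decompose(arg) for arg in args]


class StubTree:
    """Hypothetical stand-in for gf.tree / gf.expr, for illustration only."""

    def __init__(self, head, *args):
        self.head = head
        self.args = args

    def unapply(self):
        # Literals unapply to their value; applications to [head, arg, ...].
        if not self.args:
            return self.head
        return [self.head] + list(self.args)


t = StubTree("Even", StubTree("Number", StubTree(42)))
decomposed = decompose(t)
print(decomposed)
```

In the stub the heads are plain strings, whereas the bindings return `gf.cid` objects; the function leaves the head untouched, so either works.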
A slightly better description goes here, with possibly relevant links to software, documentation, etc.
Major features:
New languages:
Web-based tools for grammarians: http://www.grammaticalframework.org/demos/gfse/
Ongoing work at http://cloud.grammaticalframework.org.
Look into online IDE platforms, like Kodingen and CodeRun.
There is work for Ajax-based code editors, eg Ymacs, which could be useful since there is a GF mode for emacs already (where?).
The emacs mode can now be found in http://www.grammaticalframework.org/src/tools/gf.el (note by Aarne)
There is also a Mozilla project, Bespin, to build a web-based editor extensible by javascript.
Also - check Orc, yet another online IDE for a new language, using CodeMirror as editor.
Design and integrate probabilistic features into GF and PGF.
Extend planning here.
Final phase of the work planned in this workpackage. Exact scheduling to be defined.
Adding the possibility to dynamically add new words to lexicons "linked" in compiled grammars.
To be entered for M7 - M30.
Add child pages to the living deliverable following instructions given in the abstract.
http://www.molto-project.eu/wiki/living-deliverables/d43a-appendix-gramm...
See deliverable
According to the plan http://www.molto-project.eu/node/858 the Knowledge Engineering Infrastructure has been released. It is accessible here. We have imported an example initial data set containing information about various persons, organizations and locations.
To execute a SPARQL query against the data set, click "SPARQL Query" and, for example, try the following query without the backslashes (\):
prefix rdf:<\http://www.w3.org/1999/02/22-rdf-syntax-ns#> prefix rdfs:<\http://www.w3.org/2000/01/rdf-schema#> prefix prt:<\http://proton.semanticweb.org/2006/05/protont#> select distinct ?l where { ?s rdf:type prt:Organization ; rdfs:label ?l . }
It should return the names of all organizations stored in the data set.
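If the query is copied from this page by a script rather than by hand, the backslashes can be stripped programmatically. The sketch below (Python 3) does only that; the line breaks inside the string are added for readability, `clean_query` is a made-up helper name, and no endpoint is contacted because its address is not given here.

```python
# The query exactly as printed above, including the backslashes.
RAW_QUERY = r"""
prefix rdf:<\http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs:<\http://www.w3.org/2000/01/rdf-schema#>
prefix prt:<\http://proton.semanticweb.org/2006/05/protont#>
select distinct ?l
where { ?s rdf:type prt:Organization ; rdfs:label ?l . }
"""


def clean_query(raw):
    """Drop the backslashes and blank lines from the printed query."""
    lines = [line.strip() for line in raw.replace("\\", "").splitlines()]
    return "\n".join(line for line in lines if line)


query = clean_query(RAW_QUERY)
print(query)
```

The resulting string can be pasted into the "SPARQL Query" form as it is.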
The Knowledge Engineering Infrastructure can be extended with new data sets as they become available, see http://www.molto-project.eu/node/858, http://www.molto-project.eu/node/896 and http://www.molto-project.eu/node/948.
here a better task description
Mathematical grammars developed using GF for the WebALT project (eContent 22253) allow us to generate multilingual simple drills for high school students and university freshmen. These grammars will be the starting point aiming at extending coverage to word problems, the ones that require the student to first model a situation and then to manipulate the mathematical model to obtain a solution.
The UPC team, having been a main actor in the past development of GF mathematical grammars and having ample experience in mathematics teaching, will be in charge of the tasks in this work package, with help from UGOT on technical aspects of GF and possibly from Ontotext on ontology representation and handling.
It will be necessary to reason about equations and statements proposed by the student, so we will need to review to what extent an automatic reasoner could deal with student input of this sort and how the system behavior could be designed to degrade gracefully in order to keep the student interaction going.
In the framework of the WebALT project a GF grammar library was developed for generating simple mathematical drills in a variety of languages. The legal status of this library has recently changed to LGPL, making it suitable as the starting point for the language services demanded by this work package. To achieve a better degree of interchangeability, the existing code needs to be organized into modules, redundancies removed, and the modules laid out in a way that allows easy lexicon enhancement with the grammar developer's tools of work package 2 (WP2).
Writing a GF grammar for commanding a generic computer algebra system (CAS) by natural language imperative sentences. Concrete grammars adapted to the CAS at hand. Depends on work package 2 (WP2).
Integrate the commanding library into a component to transform the issued commands to the CAS.
A GF grammar library able to generate natural language sentences corresponding to objects and relations of the word problem. It must be able to parse simple questions related to the word problem domain into predicates. Depends on work package 2 and probably work package 4.
Automated reasoning is needed to assess the soundness of the model proposed by the student and to answer his/her questions. This requires adding small ontologies describing the word problem, including:
Add State of the Art study here.
Some time ago I managed to build a theory supporting the Farm problem in Isabelle/HOL (attached below).
I wasn't expecting such toil, but the lack of detailed documentation and a wicked simplifier made my life miserable for a whole week.
It is based on 3 sets:
and a function: is_leg_of : leg → animal.
As axioms, we have:
That is, facts that are implicitly known but that you need to state for Isabelle with the Main theory to work:
Let R be the number of rabbits in the farm and D the number of ducks in the farm. With the preceding axioms, we were able to produce Isabelle-certified proofs that
R + D = 100
and
2*D + 4*R = 260
and then deduce that R=30 and D=70.
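The last deduction can be checked numerically. The sketch below (Python 3) solves the two equations by Cramer's rule with exact rational arithmetic. It only verifies the arithmetic and is no substitute for the Isabelle proof; `solve_2x2` is a made-up helper name.

```python
from fractions import Fraction


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the system has no unique solution")
    x = Fraction(c1 * b2 - c2 * b1, det)
    y = Fraction(a1 * c2 - a2 * c1, det)
    return x, y


# Unknowns in the order (R, D):
#   R + D     = 100   (total number of animals)
#   4*R + 2*D = 260   (total number of legs)
rabbits, ducks = solve_2x2(1, 1, 100, 4, 2, 260)
print(rabbits, ducks)
```

Using `Fraction` rather than floats means a non-integer solution would show up as such instead of being hidden by rounding.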
| Attachment | Size | 
|---|---|
| Farm.thy | 5.67 KB | 
In particular, objects will be annotated by natural language noun phrases and equations by sentences. These annotations will be parsed into the GF interlingua and will be used whenever language generation related to the problem is needed.
The work will start with the provision of user requirements (WP9) and the preparation of a parallel patent corpus (EPO) to fuel the training of statistical MT (UPC). In parallel UGOT will work on grammars covering the domain and subsequently, together with UPC, apply the hybrid (WP2, WP5) MT on abstracts and claims. Ontotext will provide semantic infrastructure with loaded existing structured data sets (WP4) from the patent domain (IPC, patent ontology, bio-medical and pharmaceutical knowledge bases, e.g. LLD). Based on the use case requirements, Ontotext will build a prototype (D7.1, D7.2) exposing multiple cross-lingual retrieval paradigms and MT of patent sections. The accuracy will be regularly evaluated through both automatic (e.g. BLEU scoring) and human based (e.g. TAUS) means (WP9).
The work package is split into 9 major tasks as follows:
The patents case study comprises two basic scenarios: the online patent retrieval and the patent translation. In this prototype we tackle these two scenarios separately, as shown in Figure 1, even though they can be viewed as a unique multilingual patent retrieval paradigm. In the future, we plan to study how to automate the reciprocal inputs between the two processes, i.e. the annotation of translations and the translation of semantically annotated documents.
From a general perspective, two user roles may be defined in this case study: end-users looking for information related to the patents and editors adding new patent documents to a hypothetical repository.
Details are given in D71.
Determining and gathering of bilingual and monolingual corpora for the patent case study.
There are two subtasks here:
Developing an ontology capturing the structure of patent documents; and indexing the patents documents according to the semantic knowledge.
Contact @UPC: Lluis and Cristina
DEPENDENCIES:
Participants:
Contact point @Ontotext: Borislav Popov
DEADLINES: Beta = M21; Final = M27
Contact @UPC: Lluis and Cristina
DEPENDENCIES:
Patent abstracts and claims are translated using the baseline of the hybrid system.
DEPENDENCIES:
Participants:
Contact point @Ontotext: Borislav Popov
DEADLINES: Beta = M21; Final = M27
DEPENDENCIES:
Note: Deadlines have been delayed 3 months due to the WP delay.
DEADLINE: M31 (to allow for final report)
 
The work is started by a study of the existing categorizations and metadata schemas adopted by the museum, as well as a corpus of texts in the current documentation which describe these objects (D8.1, UGOT and Ontotext). We will transform the CIDOC-CRM model into an ontology aligning it with the upper-level one in the base knowledge set (WP4) and modeling the museum object metadata as a domain specific knowledge base. Through the interoperability engine from WP4 and the IDE from WP2, we will semi-automatically create the translation grammar and further extend it (D8.2, UGOT, UHEL, UPC, Ontotext). The final result will be an online system enabling museum (virtual) visitors to use their language of preference to search for artefacts through semantic (structured) and natural language queries and examine information about them. We will also automatically generate a set of articles in the Wikipedia format describing museum artefacts in the 5 languages with extensive grammar coverage (D8.3, UGOT, Ontotext).
Links to Swedish museum databases that use the Carlotta system, which is built upon the CIDOC-CRM model:
The work will start with collecting user requirements for the grammar development IDE (WP2), translation tools (WP3), and the use cases (WP6-8).
We will define the evaluation criteria and schedule in synchrony with the WP plans (D9.1). We will define and collect corpora including diagnostic and evaluation sets, the former, to improve translation quality on the way, and the latter to evaluate final results.
The translator's new role (parallel to WP3: Translator's tools) will be designed and described in deliverable D9.1. Most current translator's workbench software treats the original text as a read-only source. The tools to be developed within WP3 (and WP2) will lead towards a more mutable role for the source text. The translation process will resemble structured document editing or multilingual authoring more than transformation from a fixed source to a number of target languages.
We will only provide a basic infrastructure API for external translation workbenches and keep an eye on the "new multilingual translator's workflow".
For each work package, the liaison contact information and work progress will be kept up-to-date on the MOLTO web site. Our liaison person Mirka Hyvärinen will be in contact with other project members.
Access to UHEL's internal working wiki "MOLTO kitwiki" will also be granted to other project members upon request.
Evaluation aims at both quality and usability aspects. UHEL will develop usability tests for the end-user human translator. The MOLTO-based translation workflow may differ from the traditional translator's workflow. This will be discussed in the D9.1 evaluation plan.
To measure the quality of MOLTO translations, we compare them to (i) statistical and symbolic machine translation (Google, SYSTRAN); and (ii) human professional translation. We will use both automatic metrics (IQmt and BLEU; see section 1.2.8 for details) and TAUS quality criteria (Translation Automation Users Society). As MOLTO is focused on information-faithful grammatically correct translation in special domains, TAUS results will probably be more important.
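To make the automatic side of this comparison concrete, here is a simplified sketch of a BLEU-style score in Python 3: the geometric mean of clipped n-gram precisions times a brevity penalty. It is an illustration only, not the project's evaluation setup. It scores one sentence against a single reference, splits on whitespace, applies no smoothing (so any n-gram order without a match gives 0), and the function names are made up; real evaluations are normally done at corpus level with established tools.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed, single-reference, sentence-level BLEU on whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipping: an n-gram is credited at most as often as it occurs
        # in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: candidates shorter than the reference are penalised.
    if len(cand) > len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


identical = bleu("is 42 a prime number", "is 42 a prime number")
partial = bleu("a b c d e", "a b c d f")
unrelated = bleu("x y z w", "a b c d")
print(identical, partial, unrelated)
```

Because n-gram overlap rewards surface similarity rather than faithfulness of meaning, a score of this kind complements rather than replaces the human TAUS-style assessment mentioned above.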
Given MOLTO's symbolic, grammar-based interlingual approach, scalability, portability and usability are important quality criteria for the translation results. For the translator's tools, user-friendliness will be a major aspect of the evaluation. These criteria are quantified in (D9.1) and reported in the final evaluation (D9.2).
In addition to the WP deliverables, there will be continuous evaluation and monitoring with internal status reports according to the schedule defined in D9.1.
Define workplan here
Factorize the grammar currently used for the fridge demo into modules that isolate the different kinds of phrases, e.g. Comments, Greetings, Questions, etc. Check whether there are ontologies that describe these.
The factorization can be seen in the phrasebook example under /example/phrasebook.
The MOLTO Phrasebook is a web application for the traveler; eventually it will also be a phone application (for Android). It consists of frequently used phrases that a foreigner might want to use when abroad.
demo preview: http://tournesol.cs.chalmers.se/~aarne/phrasebook/phrasebook.html
The current GF Grammar Compiler API provides translation services that can be called on the fly. The goal of this task is to find out how to integrate them into an existing API where there is a need for internationalization, for example Facebook's: https://developers.facebook.com/docs/internationalization.
The image  shows how translations are entered manually in the current version. My guess is that we could improve on that.
Another example is the case of commonly used sentences such as "Happy birthday": we have it in our Travel Phrasebook, but we do not have Portuguese. We could friends-source it :) but how? Give them a FB app?
Love to see some comments on this.
BTW, I am not partial to FB, you can check any social network of your liking that provides an Internationalization API. This is a test of concept also looking for CNLs in the wild :)
The core of WP11 is an existing wiki system AceWiki which is going to be developed into a multilingual controlled natural language wiki system within the MOLTO project.
The AceWiki homepage (http://attempto.ifi.uzh.ch/acewiki/) contains:
AceWiki development is hosted on GitHub (https://github.com/AceWiki/AceWiki)
AceWiki side:
GF side:
Release notes: https://raw.github.com/AceWiki/AceWiki/master/CHANGES.txt
See also https://github.com/yuchangyuan/AceWiki
See also the thread starting with: https://lists.ifi.uzh.ch/pipermail/attempto/2011-December/000818.html
General refactoring and clean-up of the AceWiki code.
Make the AceWiki design multilingual and implement a small AceWiki engine for multilingual GF grammars.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D1.2. Progress Report | 
| Security (distribution level): | Confidential | 
| Contractual date of delivery: | M7 | 
| Actual date of delivery: | 1 Oct 2010 | 
| Type: | Report | 
| Status & version: | Draft (evolving document) | 
| Author(s): | A. Ranta et al. | 
| Task responsible: | UGOT | 
| Other contributors: | All | 
Progress report for the first semester of the MOLTO project lifetime, 1 Mar - 30 Sep 2010.
This section must be of suitable quality to enable direct publication by the Commission and should preferably not exceed four pages.
In line with this, diagrams or photographs illustrating and promoting the work of the project, as well as relevant contact details or list of partners can be provided without restriction.
The project MOLTO - Multilingual Online Translation started on March 1, 2010 and will run for 36 months. It promises to develop a set of tools for translating texts between multiple languages in real time with high quality. MOLTO will use multilingual grammars based on semantic interlinguas and statistical machine translation to simplify the production of multilingual documents without sacrificing the quality. The interlinguas are based on domain semantics and are equipped with reversible generation functions: translation is obtained as a composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF, Grammatical Framework. GF technologies in MOLTO are complemented by the use of ontologies, such as those used in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data.
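The idea of translation as parsing composed with generation can be illustrated with a minimal Python sketch. This is not GF code: the toy abstract syntax, the two concrete syntaxes and all function names below are hypothetical, and parsing is done by brute-force enumeration, which only works because the toy grammar is finite.

```python
from itertools import product

# Hypothetical toy abstract syntax (the "interlingua"):
# category -> list of (function name, argument categories).
ABSTRACT = {
    "Comment": [("Is", ("Item", "Quality"))],
    "Item": [("This", ("Kind",))],
    "Kind": [("Wine", ()), ("Cheese", ())],
    "Quality": [("Good", ()), ("Fresh", ())],
}

# One concrete syntax per language: a linearization template per function.
CONCRETE = {
    "Eng": {"Is": "{0} is {1}", "This": "this {0}", "Wine": "wine",
            "Cheese": "cheese", "Good": "good", "Fresh": "fresh"},
    "Fre": {"Is": "{0} est {1}", "This": "ce {0}", "Wine": "vin",
            "Cheese": "fromage", "Good": "bon", "Fresh": "frais"},
}

def trees(cat):
    """Enumerate every abstract tree of a category (finite: no recursion here)."""
    for fun, arg_cats in ABSTRACT[cat]:
        for args in product(*(list(trees(c)) for c in arg_cats)):
            yield (fun, *args)

def linearize(tree, lang):
    """Generate the string of an abstract tree in the given language."""
    fun, *args = tree
    return CONCRETE[lang][fun].format(*(linearize(a, lang) for a in args))

def parse(sentence, lang, cat="Comment"):
    """Invert linearization: return all trees whose string equals the sentence."""
    return [t for t in trees(cat) if linearize(t, lang) == sentence]

def translate(sentence, source, target):
    """Translation = parse in the source language, then linearize in the target."""
    return [linearize(t, target) for t in parse(sentence, source)]

print(translate("this wine is good", "Eng", "Fre"))
```

Because both directions go through the same abstract tree, adding a language means adding one more entry to `CONCRETE`, and a sentence outside the grammar yields no translation rather than an approximate one.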
MOLTO is committed to dealing with 15 languages, which includes 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is on-going work on at least Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, Swahili, Tswana, and Turkish.

Tools like Systran (Babelfish) and Google Translate are designed for consumers of information, but MOLTO will mainly target the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating its web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in the browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecisions. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals 10 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it.
There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well-understood language. Three such domains will be considered during the MOLTO project: mathematical exercises, biomedical patents, and museum object descriptions. The MOLTO tools, however, will be applicable to other domains as well. Examples of such domains could be e-commerce sites, Wikipedia articles, contracts, business letters, user manuals, and software localization.
A few results have already been achieved during the first semester of the project's lifetime. Two applications of the MOLTO translation web services are online on the project web pages:

On the more technical level, MOLTO released:
The expected final product of MOLTO is a software toolkit made available via the MOLTO website. It will consist of a family of open-source software products:
These tools will be portable to different platforms as well as generally portable to new domains and languages. By the end of the project, MOLTO expects to have grammar resource libraries for 18 languages, whereas MOLTO use cases will target between 3 and 15 languages.
The main societal impact of MOLTO will be its contribution to a new perception of the possibilities of machine translation, moving away from the idea that domain-specific high-quality translation is expensive and cumbersome. MOLTO tools will change this view by radically lowering the effort needed to provide high-quality scoped translation for applications where the content has enough semantic structure.
The MOLTO website at http://www.molto-project.eu publishes the results, the news and all information related to the project. In addition, a Twitter feed is also available at http://twitter.com/moltoproject.
The project objectives for the first semester focus on establishing the grounds for cooperation among the partners; hence, three deliverables contribute to refining the goals of the project:
The first version of the MOLTO web services, due at Month 3, is the major concrete target for the period and demonstrates the technologies underlying the ideas of the project.
Please provide a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.
For each work package, except project management, which will be reported in section 2.3, please provide the following information:
The Grammarian's Tools include tools for using the GF grammar compiler and the Resource Grammar Library. In the first 6 months of MOLTO, we have worked on consolidating the compiler and the Library API, and on experimenting with the example-based grammar writing technique.
Clearly significant results include:
There were no deviations from Annex I, and the use of resources was as planned.
na
During the first period we managed to clarify the needs for knowledge representation infrastructure of the case studies and software tools in MOLTO. We have also circulated a questionnaire describing the structured data sets which are expected to be of benefit for the project. Based on this information, we proceeded with deploying the knowledge representation infrastructure, which is now in place and accessible to the partners. It will be further described in D4.1 Knowledge Representation Infrastructure.
The second major direction during this period was the challenging task of grammar-to-ontology interoperability. For this we have chosen a quasi-exhaustive knowledge base of important named entities in the world and some relations between them. It is encoded according to PROTON, a basic upper-level ontology with about 300 classes of named entities. The first goal set for this interoperability was a transformation of questions expressed in natural language into a formal query language, SPARQL. For this purpose, and on the basis of the ontology and the entities in the knowledge base, we have manually created a corpus of 500 sentences. This corpus is being used for the development of the GF grammars handling the natural language questions and also for evaluating the coverage of the grammars over this language space. After an initial grammar handling questions to the knowledge base had been developed for a subset of the English language, we created a transformation function rendering GF sentence trees as SPARQL queries. In order to show these initial results, we have developed a natural-language search interface over the knowledge base, with automatic suggestion of possible continuations of the questions, which is featured on the MOLTO website. The results of these questions are one- or two-dimensional tables of entities, where each row is an individual “answer”.
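A transformation from question trees to SPARQL of the kind described above might look roughly like the following Python sketch. The tree shape, the `QAll` constructor, the `ex:` prefix and the class and property names are all assumptions made for illustration; they are not the actual MOLTO grammar or PROTON identifiers.

```python
# Hypothetical namespace; the real knowledge base uses its own ontology URIs.
PREFIX = "PREFIX ex: <http://example.org/ontology#>"

def to_sparql(tree):
    """Render a question tree as a SPARQL SELECT query.

    The assumed tree shape is
        ("QAll", class_name, (relation, entity), (relation, entity), ...)
    standing for questions such as "all organizations located in Bulgaria".
    """
    head, class_name, *conditions = tree
    if head != "QAll":
        raise ValueError(f"unsupported question type: {head}")
    # One triple pattern for the class, then one per restricting relation.
    patterns = [f"?x a ex:{class_name} ."]
    for relation, entity in conditions:
        patterns.append(f"?x ex:{relation} ex:{entity} .")
    body = "\n".join("  " + p for p in patterns)
    return f"{PREFIX}\nSELECT DISTINCT ?x WHERE {{\n{body}\n}}"

query = to_sparql(("QAll", "Organization", ("locatedIn", "Bulgaria")))
print(query)
```

A query with a single selected variable, as here, returns a one-dimensional table of entities; selecting a second variable would give the two-dimensional case mentioned in the text.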
Effort spent by Ontotext in WP4 – 7.5 PMs; Other participants UGOT: Aarne will talk directly to Olga for this.
| Attachment | Size | 
|---|---|
| MOLTO.WP4_.M6.doc | 94 KB | 
WP5 is planned to span from Month 7 to Month 30, but it is conditioned by the delay in the Patents data. Nevertheless, there is already some ongoing work, which we detail in the following.
Most of the objectives of the package depend on the compilation of the Patents corpus. Even the languages of study depend on the data that the new partner provides. In order to compensate for the resulting delay, both in WP5 and mainly in WP7, we started working here on hybrid approaches. The methodology now is to develop hybrid methods in a way independent of the domain and data sets used, so that they can be later adapted to patents.
At the moment, we are able to obtain phrases and alignments from a GF-generated synthetic corpus. This is a first step for the hard integration of both paradigms, and also for the soft integration methods led by SMT. We are currently going deeper into the latter, as it is a domain independent study.
Bilingual corpora are needed to create the necessary resources for training/adapting statistical MT systems and to extend the grammar-based paradigm with statistical information. We will compile and annotate general-purpose large bilingual and monolingual corpora for training basic SMT systems. At the moment, we have compiled and annotated the European Parliament corpus for English and Spanish. The final languages will probably be English, German, and Spanish or French, so as soon as this is confirmed the final general-purpose corpus can be easily compiled. The depth of the annotation will depend on the concrete languages and the available linguistic processors.
On the other hand, domain-specific corpora will be needed to adapt the general-purpose SMT system to the concrete domain of application in this project (Patents case study, WP7). We cannot build the final corpus yet, but some of the MOLTO members have joined the IRF so that a set of Patents data is available for individual research purposes. This has allowed us to compile a preliminary parallel corpus on which we can shortly start to build a domain GF grammar and to develop a first pure SMT domain-adapted translator.
| Attachment | Size | 
|---|---|
| ProgressReport_WP5.odt | 34.6 KB | 
Working towards deliverable D6.1:
Clearly significant results include:
WP6 was moved ahead to start on Month 5 (instead of 7) to buy time for WP5 which will be delayed due to lack of data.
| Attachment | Size | 
|---|---|
| MOLTO.WP6_.M6.doc | 23 KB | 
WP7 was scheduled to start in Month 4. But the WP leader site, Matrixware, left the MOLTO Consortium during Month 3. We have had negotiations with replacing partners, and expect them to be concluded before November 2010 (Month 9 of MOLTO). Then we expect to start WP7 no later than January 2011 (Month 11).
While the delay is of several months, it need not imply great changes in the actual work. The original reason to start in Month 4 was to give the Matrixware site something to work on, since they were not highly involved in the other WPs. The new partner is expected to get started immediately, and the WP will also profit from the fact that some other MOLTO tools have become available (grammarian's tools from WP2 and grammar-statistics combination from WP5).
The actual work plan for WP7 may change in accordance with the preferences of the new partner. This will happen within the limits of the budget originally allocated to this WP.
WP8 will start in Month 12, so no work can be reported yet.
lauri, here goes your report
The stated objectives of this workpackage are to:
The first task has been to set up the website for MOLTO, with information about MOLTO’s technology and potential (D10.2, UGOT and Ontotext) targeted to research, industry and users. Bibliographic information on GF, on SMT and on knowledge retrieval is kept up-to-date and includes tutorial presentations delivered during the MOLTO workshops. The web site includes a News section with frequent informal posts on internal progress and plans, encouraging community contributions in the form of comments. Lighter newsflash items are published using the MOLTO Twitter feed. A specific section is devoted to Frequently Asked Questions and can be collaboratively maintained by the MOLTO partners.
This workpackage was responsible for two deliverables during the first semester:
The dissemination plan can be accessed on the consortium-restricted pages at http://www.molto-project.eu/wiki/d10.1 and will be amended during the project's lifetime if needed. The project has been presented in a few meetings and international events, most notably at LREC2010, EAMT2010, and ACL2010.
The first version of the MOLTO web service consists of an online demonstration of a multilingual travel phrasebook, described online in Deliverable D10.2 at http://www.molto-project.eu/wiki/d10.2.
Management tasks carried out during the first semester of MOLTO finalized the administrative and organizational setup of the project. The website for the project is online at http://www.molto-project.eu. The Consortium Agreement had been signed before the Grant Agreement in December 2009. The workplan for MOLTO (Deliverable D1.1) is hosted on the wiki pages on the website.
The Steering Group of MOLTO, elected during the Kick-Off meeting, presently consists of voting members A. Ranta (UGOT, Chair), J. Saludes (UPC), B. Popov (Onto), and L. Carlson (UH). The Steering group held monthly calls to discuss the project's progress and recorded the minutes on the website. The MOLTO Advisory Board has been established, with members Prof. Stephen Pulman (Computing Laboratory Oxford) and Keith Hall (Google Research Zurich).
The project had to face a major challenge with the dissolution of the Consortium partner company Matrixware. Upon learning of this, the Coordinator informed the Commission and proceeded to formalize the dismissal of Matrixware, which left the Consortium at the end of Month 2, on April 23, 2010. In order to be able to carry out the tasks set forward in the MOLTO DoW with minor disruption, MOLTO started negotiations with EPO, the European Patent Office, to incorporate it as a new member of the MOLTO Consortium. This process has taken a long time, about 3 months, and we expect to learn their final decision at the end of October. In case of a positive outcome, EPO will step in and we expect few changes to the original workplan. In case of a negative outcome, MOLTO will discuss changing the workplan for Workpackage 7, the Patent Case Study, possibly to a different domain. MOLTO partners have been approached by several interested parties with use case study domains that could be suitable testbeds for the tools developed during the project; these will be approached first.
The original workplan has been slightly modified to cope with changes in the Consortium, mainly by shifting the start of two workpackages. The loss of Matrixware affected the MOLTO activities scheduled for Workpackage 7: Case Study Patents (led by MXW) from Month 4 to Month 30. The major task that has been put on hold is the preparation of a parallel patent corpus (Mxw) to fuel the training of statistical MT (UPC). The work on Workpackage 7 will start as soon as the Consortium situation clarifies. UPC, the most directly affected partner (whose tasks depended on the work of Mxw), has begun the work on Workpackage 6: Case Study Mathematics in Month 5 instead of Month 7.
Two project meetings have been organized, in Barcelona, 8-10 March 2010, and in Varna 10-12 September, 2010. A bilateral meeting, between UH and UGOT, has been organized in Helsinki on 5-6 May 2010.
List of deliverables accessible to you.
The admin page is the administrative data related to this deliverable as was planned in the description of work. Use this page, as work package leader, to keep track of changes in the content, scope, or date of the deliverable.
The wiki page is the collaborative editing platform for the deliverable, when a report, or for the cover document, when a prototype. Please cut and paste the front matter as in sample deliverables when creating a new one.
The publication is the actual frozen deliverable: it can easily be produced from the wiki page using the print icon and save as pdf, directly from the browser. Unless a publication is linked to the administrative record of the deliverable, it will not appear in the quick listing http://www.molto-project.eu/view/biblio/deliverables.
| ID | As planned (admin page) | Due date   | Dissemination level | Nature | Publication | Wiki | 
|---|---|---|---|---|---|---|
| D1.1 | Workplan for MOLTO | 1 April, 2010 | Consortium | Report | D1.1 Work plan for MOLTO | |
| D10.1 | Dissemination plan, with monitoring and assessment | 1 June, 2010 | Consortium | Report | Dissemination Plan with Monitoring and Assessment | D10.1 Dissemination Plan with Monitoring and Assessment | 
| D10.2 | MOLTO web service, first version | 1 June, 2010 | Public | Prototype | MOLTO web service, first version | D10.2 MOLTO web service, first version | 
| D9.1 | MOLTO test criteria, methods and schedule | 1 September, 2010 | Public | Report | MOLTO test criteria, methods and schedule | D9.1 MOLTO test criteria, methods and schedule | 
| ID | Title | Due date | 
|---|---|---|
| MS1 | 15 Languages in RGL | 1 September, 2010 | 
| MS2 | Knowledge Representation Infrastructure | 1 September, 2010 | 
Not available for midterm reporting.
Not available for midterm reporting.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D1.3. Progress Report T12 | 
| Security (distribution level): | Confidential | 
| Contractual date of delivery: | M12 | 
| Actual date of delivery: | 1 Apr 2010 | 
| Type: | Report | 
| Status & version: | Draft (evolving document) | 
| Author(s): | A. Ranta et al. | 
| Task responsible: | UGOT | 
| Other contributors: | All | 
Progress report for the first year of the MOLTO project lifetime, 1 Mar 2010 - 28 Feb 2011.
This section must be of suitable quality to enable direct publication by the Commission and should preferably not exceed four pages.
The publishable summary has to include all the distinct parts described below:
In line with this, diagrams or photographs illustrating and promoting the work of the project, as well as relevant contact details or list of partners can be provided without restriction.
The publishable summary should be updated for each periodic report.
Please provide an overview of the project objectives for the reporting period in question, as included in Annex I to the Grant Agreement. These objectives are required so that this report is a stand-alone document.
Please include a summary of the recommendations from the previous reviews (if any) and indicate how these have been taken into account.
Please provide a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.
For each work package, except project management, which will be reported in section 3.2.3, please provide the following information:
Please use this section to summarise management of the consortium activities during the period. Management tasks are indicated in Articles II.2.3 and II.16.5 of the Grant Agreement.
Amongst others, this section should include the following:
The section should also provide short comments and information on co-ordination activities during the period in question, such as communication between beneficiaries, possible co-operation with other projects/programmes etc.
The deliverables due in this reporting period, as indicated in Annex I to the Grant Agreement have to be uploaded by the responsible participants (as indicated in Annex I), and then approved and submitted by the Coordinator. Deliverables are of a nature other than periodic or final reports (ex: "prototypes", "demonstrators" or "others"). If the deliverables are not well explained in the periodic and/or final reports, then, a short descriptive report should be submitted, so that the Commission has a record of their existence.
If a deliverable has been cancelled or regrouped with another one, please indicate this in the column "Comments". If a new deliverable is proposed, please indicate this in the column "Comments".
This table is cumulative, that is, it should always show all deliverables from the beginning of the project.
Please complete this table if milestones are specified in Annex I to the Grant Agreement. Milestones will be assessed against the specific criteria and performance indicators as defined in Annex I.
This table is cumulative, which means that it should always show all milestones from the beginning of the project.
Please provide an explanation of personnel costs, subcontracting and any major costs incurred by each beneficiary, such as the purchase of important equipment, travel costs, large consumable items, etc., linking them to work packages.
There is no standard definition of "major cost items". Beneficiaries may specify these, according to the relative importance of the item compared to the total budget of the beneficiary, or as regards the individual value of the item.
These can be listed in the following tables (one table by participant):
TABLE 3.1 PERSONNEL, SUBCONTRACTING AND OTHER MAJOR COST ITEMS FOR BENEFICIARY 1 FOR THE PERIOD
| Work Package | Item description | Amount in € with 2 decimals | Explanations | 
|---|---|---|---|
| Ex: 2,5, 8, 11, 17 | Personnel direct costs | 235000.00 € | Salaries of 2 postdoctoral students and one lab technician for 18 months each | 
| 5 | Subcontracting | 11000.02 € | Maintenance of the web site and printing of brochure | 
| 8, 17 | Major cost item 'X' | 75000.23 € | NMR spectrometer | 
| 11 | Major cost item 'Y' | 27000.50 € | Expensive chemicals xyz for experiment abc | 
| | Remaining direct costs | 15000.10 € | | 
| | Indirect costs | | | 
| | TOTAL COSTS | 363000.85 € | | 
Please submit a separate financial statement from each beneficiary (if Special Clause 10 applies to your Grant Agreement, please include a separate financial statement from each third party as well) together with a summary financial report which consolidates the claimed Community contribution of all the beneficiaries in an aggregate form, based on the information provided in Form C (Annex VI) by each beneficiary.
When applicable, certificates on financial statements shall be submitted by the concerned beneficiaries according to Article II.4.4 of the Grant Agreement.
Besides the electronic submission, Forms C as well as certificates (if applicable), have to be signed and sent in parallel by post.
A Web-based online tool for completing and submitting forms C is accessible via the Participant Portal: http://ec.europa.eu/research/participants/portal.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D1.4 Periodic Management Report T18 | 
| Security (distribution level): | Confidential | 
| Contractual date of delivery: | M18 | 
| Actual date of delivery: | 1 Oct 2011 (expected) | 
| Type: | Report | 
| Status & version: | Final | 
| Author(s): | O. Caprotti et al. | 
| Task responsible: | UGOT | 
| Other contributors: | All | 
Progress report for the third semester of the MOLTO project lifetime, 1 Mar 2011 - 30 Sep 2011.
The project MOLTO - Multilingual Online Translation, started on March 1, 2010 and will run for 36 months. It promises to develop a set of tools for translating texts between multiple languages in real time with high quality. MOLTO will use multilingual grammars based on semantic interlinguas and statistical machine translation to simplify the production of multilingual documents without sacrificing the quality. The interlinguas are based on domain semantics and are equipped with reversible generation functions: namely translation is obtained as a composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF [2], Grammatical Framework. GF technologies in MOLTO are complemented by the use of ontologies, such as those used in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data.
MOLTO is committed to dealing with 15 languages, which includes 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is on-going work on at least Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, Swahili, Tswana, and Turkish.
While tools like Systran (Babelfish) and Google Translate are designed for consumers of information, MOLTO will mainly target the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating its web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in the browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals 10 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it. There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well-understood language.
Three such domains will be considered during the MOLTO project: mathematical exercises, biomedical patents, and museum object descriptions. The MOLTO tools however will be applicable to other domains as well. Examples of such domains could be e-commerce sites, Wikipedia articles, contracts, business letters, user manuals, and software localization.
The results achieved during the first 18 months of the project are:
The expected final product of MOLTO is a software toolkit made available via the MOLTO website. It will consist of a family of open-source software products:
These tools will be portable to different platforms as well as generally portable to new domains and languages. By the end of the project, MOLTO expects to have grammar resource libraries for 18 languages, whereas MOLTO use cases will target between 3 and 15 languages.
The main societal impact of MOLTO will be its contribution to a new perception of the possibilities of machine translation, moving away from the idea that domain-specific high-quality translation is expensive and cumbersome. MOLTO tools will change this view by radically lowering the effort needed to provide high-quality scoped translation for applications where the content has enough semantic structure.
This semester marks the half-lifetime of the project, a point at which all work-packages are under development and the first integrations ought to take place. In particular, the initial prototypes are being delivered. They include the APIs for WP2 and WP3, the GF grammar IDE and the grammar-ontology interoperability allowing natural language generation from an ontology, translation of natural language queries to SPARQL, the GF grammar for simple mathematical exercises, and information extraction. The main integrations are taking place between the GF grammar tools developed in WP2 and the translator's workbench developed in WP3, but also between the museum ontology created by WP8 and the interoperability described above and carried out in WP4.
This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.
For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:
The period M13-M18 has been very active in the development of grammar development tools and also in their documentation and dissemination. Thus we can report the following new software:
We are working on further development of the two IDEs for GF and the example-based grammar writing method. We are also gathering material for Deliverable D2.3, Best Practices, due M24.
During the reporting period, work has progressed on the following fronts.
A new tab for editing term equivalents, embedding a treegrid editor implemented with the ExtJS JavaScript library, has been added to the editor. One version is able to query Ontotext FactForge for term candidates. It remains to implement a full search and edit back end.
A development environment for the open source translation management system GlobalSight has been installed for adapting parts of the system for the MOLTO translation tools.
It remains to develop the glue to connect the existing parts together. Some extensions to the grammar development API have been listed in a requirements section in the deliverable.
A C language runtime for parsing and generating with PGF files has been written by Lauri Alanko. The first release will be out in Oct 2011.
A first version of an ontology/terminology acquisition toolkit for lexical resources management by Seppo Nyrkkö was demonstrated at the Helsinki project meeting.
WP4's main task is to research the possibilities for interoperability between grammars, written in GF, and ontologies and to build a prototype demonstrating it.
During the period M12-M18 the work concentrated on refactoring and bug fixing of the prototype built earlier, on experimenting with bigger data sets, and on extending the functionality of the prototype. The main points can be summarized as follows:
M18 is the date by which Milestone S5 (First prototypes of the baseline combination models) should be achieved. The baseline systems for this workpackage, as described in Task 5.4, include a statistical machine translation (SMT) system trained with patents data, and the GF multilingual translation with a specific grammar for patents.
The SMT system was mainly developed in the previous six months and was already reported in the First Year Report. In the following section we explain the most significant results which have been accomplished with respect to the GF system.
For Task 5.5, we have started the work towards the hybrid system. Parts of the GF system, such as the lexicon building, already make use of statistical components. In addition, the methodology to combine SMT and GF alignments is established and waiting to be applied to the patents domain.
The work done for these tasks has been recently published in the "MT Summit XIII 4th Workshop on Patent Translation" with the title "Patent translation within the MOLTO project".
At the time of writing this report, Deliverable D5.1, Description of the final collection of corpora, corresponding to Tasks 5.1 and 5.2, has been submitted as a regular publication. It is a public document accessible from the MOLTO web page.
A first implementation of the English-to-French patent translator with GF is available. The translation process can be divided according to the action of three modules: a generic pre-processing, the on-line lexicon building, and the patents grammar.
The generic pre-processing consists of a purpose-built tokeniser that deals with compound nouns, phrases separated by hyphens, chemical compounds, etc. The Stanford POS-tagger is used for named-entity recognition, and a recogniser of numbers has been developed. Once tagged, chemical compounds can be translated independently by the compounds grammar. This grammar is at an early stage of development within this workpackage.
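The kind of tokenisation described above can be sketched with a single regular expression. The rules below are an assumption made for illustration, not the actual MOLTO tokeniser: hyphen-joined units (including chemical names with locants such as `1,3`) are kept as one token, numbers are recognised separately, and everything else falls back to words and punctuation.

```python
import re

# Alternatives are tried left to right, so the most specific pattern comes first.
TOKEN = re.compile(r"""
    (?P<HYPH>\w+(?:,\d+)*(?:-\w+(?:,\d+)*)+)   # water-soluble, 2-methyl-1,3-butadiene
  | (?P<NUM>\d+(?:[.,]\d+)*\b)                 # 10, 2.5, 1,000
  | (?P<WORD>\w+)                              # ordinary words
  | (?P<PUNCT>[^\w\s])                         # any single punctuation mark
""", re.VERBOSE)

def tokenise(text):
    """Split text into (token, tag) pairs; whitespace is discarded."""
    return [(m.group(), m.lastgroup) for m in TOKEN.finditer(text)]

for token, tag in tokenise("A water-soluble salt of 2-methyl-1,3-butadiene, 2.5 mg."):
    print(f"{tag:6} {token}")
```

Tokens tagged `HYPH` or `NUM` could then be routed to dedicated components (for example a compounds grammar) while the remaining tokens go to the main grammar.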
The second module is devoted to lexicon building. To do this, the GF library multilingual lexicon is extended with nouns, adjectives, verbs and adverbs. The abstract syntax for these parts of speech is created from the claims in English, and words are lemmatised and manually cleaned of noise and ambiguities. The appropriate inflection is generated using the implemented GF paradigms and the English dictionary of the GF library for English, which is the starting language. Base forms are then translated into French and the inflection is generated in the same way. This process will be extended to other languages later in the project.
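To give a feel for what generating inflection from a base form means, here is a minimal Python sketch of a regular English noun paradigm with an optional override for irregular plurals. It only mimics the idea of a paradigm function; `mk_noun` and its rules are hypothetical and are not the GF Resource Grammar Library API.

```python
VOWELS = "aeiou"

def mk_noun(singular, plural=None):
    """Return the inflection table of an English noun.

    If no plural is given, it is guessed from the ending of the base form;
    irregular nouns pass the plural explicitly.
    """
    if plural is None:
        if singular.endswith(("s", "x", "z", "ch", "sh")):
            plural = singular + "es"                  # process -> processes
        elif singular.endswith("y") and singular[-2:-1] not in VOWELS:
            plural = singular[:-1] + "ies"            # antibody -> antibodies
        else:
            plural = singular + "s"                   # salt -> salts, assay -> assays
    return {"Sg": singular, "Pl": plural}

# A small lexicon built from lemmatised base forms.
lexicon = {
    "salt_N": mk_noun("salt"),
    "antibody_N": mk_noun("antibody"),
    "matrix_N": mk_noun("matrix", "matrices"),  # irregular: plural given by hand
}
print(lexicon["antibody_N"]["Pl"])
```

The manual correction step mentioned in the text corresponds to the cases where the guessed form is wrong and the explicit form has to be supplied.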
Finally, the core of the translator is the patents grammar. The GF resource grammar has been extended with functions that implement constructions that occur in patent claims. The grammar is also in its first stages: it currently has a huge number of ambiguities, and its coverage is around 15% on complete sentences. This figure can increase up to 60% when dealing with chunks instead of full sentences.
This workpackage is closely related to WP7. The delay in the patents corpus from WP7 has required a reordering of some tasks within WP5. This explains why work on Task 5.5 has been done in place of parts of Task 5.4, which will be finished in the next months. Also because of the delay in the approval of the data, Deliverable 5.1 may be updated soon.
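The gap between sentence-level and chunk-level coverage can be made concrete with a small Python sketch. Coverage is taken here to mean the fraction of inputs that receive at least one parse; the toy parser, the chunking rule (splitting at ", wherein") and the example claims are all assumptions made for illustration and do not reproduce the evaluation behind the figures above.

```python
def coverage(inputs, parses):
    """Fraction of inputs for which the parser predicate returns True."""
    inputs = list(inputs)
    if not inputs:
        return 0.0
    return sum(1 for text in inputs if parses(text)) / len(inputs)

# Stand-in for a grammar: the set of strings it can parse.
KNOWN = {"the composition comprises a salt", "the salt is water-soluble"}

def toy_parses(text):
    return text in KNOWN

def chunks(sentence):
    """Hypothetical chunker: split a claim at ', wherein'."""
    return [part.strip() for part in sentence.split(", wherein")]

sentences = [
    "the composition comprises a salt, wherein the salt is water-soluble",
    "the salt is water-soluble",
    "the method of claim 1, wherein the salt is water-soluble",
]

sentence_cov = coverage(sentences, toy_parses)
chunk_cov = coverage((c for s in sentences for c in chunks(s)), toy_parses)
print(f"sentences: {sentence_cov:.0%}, chunks: {chunk_cov:.0%}")
```

A single unparsable construction makes a whole sentence fail, whereas at chunk level only the chunk containing it fails, which is why chunk coverage is typically the higher number.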
Deliverable D6.1 has been released as a tagged SVN repository available at svn://molto-project.eu/tags/D6.1, although bug fixing may continue in the head branch.
With respect to the T6 progress report, we increased the number of compiled languages from 7 to 13 and checked for correctness and fluency in 3 languages.
Dissemination activities at CADE'23 and satellite conference THedu'11.
Refactoring from the WebALT code to the modular form compatible with GF 3.2 complete.
The library compiles for the following languages: Bulgarian, Catalan, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, Swedish and Urdu plus an artificial language (LaTeX).
A demo based on the minibar demo, but including mathematical output for LaTeX is up and running at http://www.grammaticalframework.org/demos/minibar/mathbar.html
Testing for correctness and fluency has been done for English, German and Spanish. This amounted to:
The results of Deliverable D6.1 have been presented at THedu'11.
WP7 is due to provide a beta prototype in the forthcoming M21: D7.1 Patent MT and Retrieval Prototype Beta. Although the WP started with some delay, the objectives for each task have been accomplished, particularly in the last months.
The initial tasks include the definition of the architecture for the prototype, Task 7.7, and of the use case scenarios, Task 7.1. We mainly consider two scenarios: the multilingual retrieval of biomedical patents and the online translation of patent claims and abstracts.
In relation to Task 7.4 and Task 7.5, we started the work towards the selection of ontologies to deal with the biomedical domain and the extraction of FDA terms, drugs and measurement related models. Besides, we also have created an ontology to capture the structure of patent documents. In order to query the system, we have defined the structure and topics of the queries available to the user, detailed below. Finally, we have implemented a first version of the grammar that recognizes such queries.
Regarding Task 7.2, we recently obtained the official license for using the EPO corpus. Earlier, we had been working with provisional data in order to create the first version of the GF grammars (Task 7.3) and to test the SMT system with this data (Task 7.6).
The architecture of the multilingual patents retrieval system is based on Exopatent, a working KRI platform from OntoText. The platform allows several search options, including NL queries. We have defined 131 query examples across 21 query topics related to the biomedical domain. The grammar developed to process the queries covers about 600 queries in English and 500 in French.
Patents in the retrieval engine are being annotated following the two main ontologies selected for the domain.
The patents translation system is tightly related to WP5. The recent work in this task includes the development of the patents grammar, and the extension of the GF Resource Grammar with the functions implementing constructions that occur in patent claims. Generally speaking, the coverage of the grammar is unsatisfactory, which reinforces the efforts in the use of statistical techniques.
This workpackage has suffered a delay due to the lack of a proper license for the data corpus. Nonetheless, we are achieving the objectives related to WP7 tasks. Similarly, tasks in WP5 are being rescheduled. As we recently received the approval for the use of the data, we expect to speed up some tasks that were waiting for the data, such as the baseline system (Task 5.4) and the annotation of the patents and the generation of the database (Task 7.5).
The use of resources has been as planned, with no notable deviations, for the tasks described above.
Dana Dannels from the Linguistics Department of UGOT and Mariana from Ontotext have had a Skype meeting every month to summarize what has been accomplished by WP8 and WP4 and to coordinate the work on tasks. Dana mainly worked with Ramona Enache, a GF expert, to share ideas and discuss the work in progress.
We have developed a prototype for generating natural language descriptions using discourse patterns in English and Swedish, and written two scientific papers about our work; both have been peer reviewed. One was presented at a WWW conference and one at a CH workshop.
The work done during months 12-18 is aimed at the preliminary work for D9.2, the final evaluation.
We have gained much input from Maarit Koponen's review on post-editing analysis and quality measurement in MT evaluation, also presented during the Open Day at the Third Project Meeting in Helsinki.
For designing the evaluation of translators' tools, we have studied different translation management systems that are common in the translation industry. We have selected GlobalSight, an open-source platform, for a closer study.
We have also set up a MOLTO Content Factory server, which provides collaborative term voting and term validation. These features will be used in the evaluation of terminology work. The MOLTO server already has a URL, but UHEL security measures currently make it accessible over HTTP only via UHEL's VPN. UHEL is working to find a way of opening up the server to the MOLTO Consortium that is compatible with the university security policies. The server base URL is "http://tfs.cc/" and the mediawiki content is hosted at "http://tfs.cc/wiki", as demonstrated during the MOLTO Open Day.
There is an ongoing discussion about collaboration with local entrepreneurs who are researching pre-editing and pre-validation of machine translatable documents. The research is focused on MT quality and its evaluation metrics.
This work package is closely related to other work packages, including the dissemination work package. Due to the changes in the Patents Case Study (WP7), we are reviewing the related material for evaluation purposes. We will announce updates to the earlier evaluation plan (D9.1) as needed.
The use of resources follows the earlier plans.
The objectives of this WP are to:
To address (i), we have interfaced the RSS feed, which publishes updates from the MOLTO website, to the Twitter feed http://www.molto-project.eu/moltoproject. This will further distribute the MOLTO news feed to mobile devices, alongside the project's presence on LinkedIn.
To address (ii), we have organized a number of events geared at publicizing the core technologies employed in the project. MOLTO partners from UPC and UGOT organized a GF Summer School in Barcelona on 15-26 August 2011. The program comprised a tutorial week and an advanced week with specific topics, including work being carried out as part of the MOLTO workplan. In particular, J. Saludes presented the evaluation of WP6, T. Hallgren introduced web application programming for GF, R. Enache showed the work on the GF-ontology inter-relation and on WP8, and C. Espana presented the results of WP5. The web site of the school archives the presentations, the discussions (in particular the suggestions for future work that resulted from the panel discussion) and the calendar of the lectures. Furthermore, A. Ranta delivered a GF tutorial during CADE23, Grammatical Framework: A Hands-On Introduction, and J. Saludes presented The GF Mathematics Library (joint work with S. Xambó) during the CADE23 satellite workshop "CTP Components for Educational Software".
UHEL arranged the 3rd MOLTO Project meeting in Helsinki Aug.31-Sept 2, 2011.
To address (iii), we will now enlarge the MOLTO Consortium by two new partners, one of which, Be Informed, is a commercial partner interested in exploiting the MOLTO tools in its products.
The list of publications can be obtained from the MOLTO website, ordered by year (most recent first), http://www.molto-project.eu/biblio?sort=year&order=desc.
The GF book appeared in April 2011 and is expected to help new developers to get started with MOLTO tools: Aarne Ranta, Grammatical Framework: Programming with Multilingual Grammars, CSLI Publications, Stanford, 2011, 340 pp. http://www.grammaticalframework.org/gf-book/
Aarne Ranta, Translating between Language and Logic: What is Easy and What is Difficult. In N. Bjørner and V. Sofronie-Stokkermans (eds), Automated Deduction - CADE-23 Proceedings, LNCS/LNAI 6803, Springer, Heidelberg, 2011, pp. 5-25 (invited talk mentions work done in WP6).
Computational Morphology. A course in European Master's Programme in Language and Communication Technologies 2011, University of Malta, 22-30 March 2011. http://www.cse.chalmers.se/~aarne/computationalmorphology/
Computational Syntax. A course in the Masters Programme in Language Technology, University of Gothenburg, 11 April - 31 May, 2011. http://www.cse.chalmers.se/~aarne/computationalsyntax/
Grammatical Framework: A Hands-On Introduction. Tutorial at CADE-23, Wroclaw, 1 August 2011. http://www.grammaticalframework.org/gf-tutorial-cade-2011/
Second GF Summer School: Frontiers of Multilingual Technology. Barcelona, 15-26 August, 2011. http://school.grammaticalframework.org/
The main task of the management workpackage for the period has been to finalize the enlargement of the consortium, which was proposed last January. The new partners, University of Zurich (UZH) and the company Be Informed (BI), will lead two new workpackages directly applying the MOLTO tools to their existing core technologies. During the negotiation phase, a new workplan was submitted and the budget was recalculated.
Payment for Period 1 arrived on the last day of August and was distributed to the consortium early in September after redistribution of the Matrixware budget.
Monthly meetings of the Steering Committee were held regularly on conference calls and recorded on the wiki pages, http://www.molto-project.eu/wiki/minutes.
Deliverables of MOLTO are listed and linked for download on the web page http://www.molto-project.eu/workplan/deliverables.
Below is a summary of deliverables due in the third semester.
The only milestone due for the period is MS3, the web-based translation tool, which has been interpreted as an online editor interface to a GF application grammar and made available at http://www.molto-project.eu/node/1063.
Tables on the usage of resources are not available for midterm reporting; however, we have a rough estimate of person-months for each node.
| Node | Professor | PhD | PhD Student | Research Engineer | 
|---|---|---|---|---|
| UGOT | 1 | 3 | 9 | 0 | 
| UPC | 10 | 12 | 0 | 0 | 
| UHEL | 2 | 0 | 4 | 9 | 
| OntoText | 0 | 0 | 0 | 23.75 | 
Not available for midterm reporting.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D1.5 Periodic Management Report T24 | 
| Security (distribution level): | Confidential | 
| Contractual date of delivery: | M24 | 
| Actual date of delivery: | 5 Apr 2012 | 
| Type: | Report | 
| Status & version: | Final | 
| Author(s): | O. Caprotti et al. | 
| Task responsible: | UGOT | 
| Other contributors: | All | 
Progress report for the fourth semester of the MOLTO project lifetime, 1 Sep 2011 - 29 Feb 2012.
The project MOLTO - Multilingual Online Translation, started on March 1, 2010 and will run for 39 months. It promises to develop a set of tools for translating texts between multiple languages in real time with high quality. MOLTO will use multilingual grammars based on semantic interlinguas and statistical machine translation to simplify the production of multilingual documents without sacrificing the quality. The interlinguas are based on domain semantics and are equipped with reversible generation functions: namely translation is obtained as a composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF, Grammatical Framework. GF technologies in MOLTO are complemented by the use of ontologies, such as those used in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data. GF has been applied in several small-to-medium size domains, typically targeting up to ten languages but MOLTO will scale this up in terms of productivity and applicability.
A part of the scale-up is to increase the size of domains and the number of languages. A more substantial part is to make the technology accessible to domain experts without GF expertise and minimize the effort needed for building a translator. Ideally, with the tools produced by MOLTO, this can be done by just extending a lexicon and writing a set of example sentences.
MOLTO is committed to dealing with 15 languages, which includes 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is on-going work on at least Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, Swahili, Tswana, and Turkish.
While tools like Systran (Babelfish) and Google Translate are designed for consumers of information, MOLTO will mainly target the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating its web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in the browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals 10 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it. There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well-understood language.
The MOLTO Enlarged EU proposal adds two countries (Switzerland and The Netherlands) and two work packages. The Semantic Wiki work package builds a system that integrates the functionalities of MOLTO tools with a collaborative environment, where users can create content in different languages, and all edits become immediately visible in all languages, via automatic semantic-based translation. The Interactive Knowledge-Based System work package puts MOLTO technology to use in an enterprise environment, for the localization of end-user oriented systems to new languages and the generation of high-quality explanations in natural language. Noteworthy in this work package is the fact that translation grammars are constructed in house by Be Informed's non-expert staff without the intervention of grammar specialists.
MOLTO technology will be released as open-source libraries, which can be plugged into standard translation tools and web pages and thereby fit into standard workflows. It will be demonstrated in web-based demos and applied in three case studies: mathematical exercises in 15 languages, patent data in at least 3 languages, and museum object descriptions in 15 languages.
The results achieved during the first 24 months of the project were demonstrated during the 4th Project Meeting. They include:
A detailed list with short abstracts is available at http://www.molto-project.eu/content/molto-4th-project-meeting-demos.
In the past semester we reported:
The expected final product of MOLTO is a software toolkit made available via the MOLTO website. It will consist of a family of open-source software flagship products:
These tools will be portable to different platforms as well as generally portable to new domains and languages. By the end of the project, MOLTO expects to have grammar resource libraries for 18 languages, whereas MOLTO use cases will target between 3 and 15 languages.
The main societal impact of MOLTO will be on contributing to a new perception for the possibilities of machine translation, moving away from the idea that domain-specific high-quality translation is expensive and cumbersome. MOLTO tools will change this view by radically lowering the effort needed to provide high-quality scoped translation for applications where the content has enough semantic structure.
This section describes the progress of each workpackage and discusses changes to the workplan, if necessary.
The work during the fourth semester has proceeded in parallel in all workpackages leading to a number of demonstrative prototypes.
In WP2, the work concentrated on finishing the Cloud-based Editor and the Eclipse Plugin for GF, and in WP3, the design of an integrated architecture for the translation tools led to the adoption of the third-party platform GlobalSight as the translators workflow management framework in which to integrate the MOLTO tools.
WP4 has come to its conclusion, delivering a prototype, hosted on Ontotext's website, that shows GF-OWL interoperability.
The first hybrid models for the statistical and robust translation promised in WP5 have been implemented and evaluated on a specific test set. The results are available in a deliverable and will form the basis for the next developments. The MGL developed in the mathematics use case study has been equipped with a command line interface to the Sage suite of Computer Algebra Systems, thus providing a natural language dialog with a computation system.
The Patent MT and Retrieval Beta Prototype was first released in M21, and the final version of D7.1 has been delivered.
In the cultural heritage museum study, the ad-hoc ontology facts, stored in the knowledge representation infrastructure delivered by WP4, can be queried in natural language in 5 languages.
The milestones for the period have all been achieved as described later in the report.
This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.
For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:
Moreover, if applicable:
This WP has delivered the GF grammar development infrastructure as anticipated, resulting in two IDEs (one cloud-based and one Eclipse-based) and a faster grammar compiler. The WP and its last deliverable have been extended in time to allow for interaction with the MOLTO-Enlarged EU, which was delayed. The skeleton of the final deliverable was discussed during the 4th Project Meeting.
GF 3.3.3: faster compilation of grammars, permitting on-the-fly changes of running translation systems.
GF Cloud-Based IDE: an IDE for beginners, as well as for on-the-fly changes of running translation systems. New features in this year:
GF-Eclipse plugin: an IDE for power users, with features such as
GF Resource Grammar Library has 7 new languages since March 2012: Hindi, Latvian, Nepali, Persian, Punjabi, Sindhi, and Thai. Some MOLTO applications (e.g. the Phrasebook and the Math library) are ported to some of these languages.
RGL support of lexicon building was evaluated in the article by Détrez and Ranta, Smart Paradigms and the Predictability and Complexity of Inflectional Morphology, to appear in EACL 2012.
As a tutorial and reference for GF, a book has been published: Ranta, Grammatical Framework - Programming with Multilingual Grammars, CSLI, Stanford, 2011.
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 0 | 0 | 5 (R. Enache) | 6 (J. Camilleri) | 
| UPC | 1 (J. Saludes) | 1 (C. España) | 0 | 0 | 
| UHEL | 0 | 0 | 0 | 5 (L. Alanko) | 
| OntoText | 0 | 0 | 0 | 4,2 (M. Chechev, M. Damova, K. Krustev) | 
We moved Deliverable 2.3, "User Manual and Best Practices", to Month 27 (due 20 June 2012). The reason is that we want to include the experience from the new kind of users from Be Informed, and the start of the MOLTO enlargement was delayed.
The work done during the last year is related to the promises of WP3: to combine MOLTO tools with traditional CAT tools. As described in the appendix D9.1A, MOLTO tools would be used to translate, in real time and into multiple languages, some formulaic parts of a more complex document type, such as descriptions of chemical formulas in a patent. The rest of the document would be translated with more traditional methods. We have chosen the translation management system GlobalSight to combine the workflows.
We have been modifying the editor released in MS3, adding term management and user authentication. We have also been developing a term search; currently it is a separate component, but we are planning to attach it to the editor. The search can be tested at http://tfs.cc/molto_term_editor./editor_sparql.html.
The term management platform TermFactory (TF), a related project run by Lauri Carlson, is under development. The plan is to connect TF to the editor in order to allow on-the-fly user extensions of the lexicon of the grammar. The work done in WP2 by UGOT is in synergy with our WP: they have been developing ways to change the GF grammar without full recompilation, and thus significantly faster.
As for publications, a master's thesis called ''Ontology-based lexicon management in a multilingual translation system'', written within the project, will be finished during Spring 2012.
As a part of MS8 (due September 2012), GlobalSight is now running on our server at http://tfs.cc/globalsight/.
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 0 | 0 | 0 | 4 (A. Slaski, S. Virk, N. Frolov) | 
| UPC | 0,25 (L. Màrquez) | 1,75 (M. Gonzàlez) | 0 | 0 | 
| UHEL | 2 (L. Carlson) | 0 | 0 | 5 (I. Listenmaa), 6 (J. Shen), 6 (C. Li) | 
| OntoText | 0 | 0 | 0 | 4,1 (M. Damova, M. Chechev, S. Enev) | 
This WP has delivered two-way interoperability between natural language and ontologies. The prototype was built and made publicly available at http://molto.ontotext.com. The prototype integrates the infrastructure for knowledge modeling, semantic indexing and retrieval with tools for NL queries to the semantic repository and verbalization of the results.
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 0 | 0 | 0 | 0 | 
| UPC | 0 | 0 | 0 | 0 | 
| UHEL | 0 | 0 | 0 | 0 | 
| OntoText | 4 (B. Popov) | 0 | 0 | 10,1 (M. Chechev, P. Mitankin, M. Nozchev, F. Alexiev, A. Ilchev, I. Kabakov, K. Krustev, V. Zhikov) | 
The milestone MS7 has been achieved in M24 (First prototypes of hybrid combination: The methods are implemented and evaluated on a specific test set).
The work of the fourth semester corresponds to the last three tasks of the WP (T5.4 Baseline systems, T5.5 Hybrid Models and T5.6 Systems evaluation, see http://www.molto-project.eu/workplan/statistical-and-robust-translation). The baseline systems have been improved by extending the GF translator. The translator is now able to deal with chunks, so that the coverage has been widened (Task 5.4).
For Task 5.5 we have implemented two kinds of hybrid models, which we call Soft and Hard integrations. The following section outlines their main characteristics.
Finally, for Task 5.6 both the baselines and the hybrid systems have been evaluated using a variety of lexical metrics and compared with generic publicly available translators such as Google and Bing. A manual evaluation has also been carried out in order to compare the pure SMT translator with the hybrid system that was most promising according to the automatic evaluation.
The work done for these tasks has been submitted to the 16th Annual Conference of the European Association for Machine Translation (EAMT 2012), and the submitted paper, titled "A Hybrid System for Patent Translation", can be found on the MOLTO web page.
At the time of writing this report, the deliverable D5.2, Description and evaluation of the combination prototypes, is being submitted as a regular publication. It is a public document accessible from the MOLTO web page.
Two kinds of hybrid translators for patents have been developed. The final systems are not only a combination of two different engines but the subsystems also mix different components. We have developed a GF translator for the specific domain that uses an in-domain SMT system to build the lexicon; an SMT system is on top of it to translate those phrases not covered by the grammar.
In the previous report we showed that the GF grammar-based system alone could not parse most patent sentences. Consequently, the current translation system aims at using GF for translating patent chunks and assembling the results in a later phase. As explained in D5.2, this implies several modifications to the GF baseline itself.
To gain robustness in the final system, the output of the GF translator is used as a priori information for a higher level SMT system. The SMT baseline is fed with phrases that are integrated in two different ways. In the first, which we call "Hard Integration", phrases that have a GF translation are forced to be translated that way. The system can reorder the chunks and translate the chunks that GF left untranslated, but there is no interaction between GF and pure SMT phrases. In the second, the "Soft Integration" system, phrases that have a GF translation are included in the translation table with a certain probability, so that the phrases coming from the two systems interact.
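The difference between the two integrations can be sketched on a toy phrase table. This is an illustration under strong assumptions, not the system described in D5.2: the table is a plain dictionary with one probability per candidate, the mixing weight `gf_prob` is hypothetical, and the example phrases are invented. A real SMT decoder works with several feature scores per phrase pair.

```python
from typing import Dict, List, Tuple

# Toy phrase table: source phrase -> list of (target phrase, probability).
PhraseTable = Dict[str, List[Tuple[str, float]]]

def hard_integration(smt: PhraseTable, gf: Dict[str, str]) -> PhraseTable:
    """Force every phrase that GF translated to use only the GF output."""
    table = {src: list(cands) for src, cands in smt.items()}
    for src, tgt in gf.items():
        table[src] = [(tgt, 1.0)]  # SMT candidates for this phrase are dropped
    return table

def soft_integration(smt: PhraseTable, gf: Dict[str, str],
                     gf_prob: float = 0.5) -> PhraseTable:
    """Add the GF output as one more candidate that competes with SMT ones."""
    table = {src: list(cands) for src, cands in smt.items()}
    for src, tgt in gf.items():
        merged: Dict[str, float] = {}
        for cand, p in table.get(src, []):
            merged[cand] = merged.get(cand, 0.0) + (1.0 - gf_prob) * p
        # If SMT has no candidate at all, GF receives the whole mass.
        weight = gf_prob if merged else 1.0
        merged[tgt] = merged.get(tgt, 0.0) + weight
        table[src] = sorted(merged.items(), key=lambda kv: -kv[1])
    return table

# Invented English-to-French example.
smt_table = {
    "the device": [("le dispositif", 0.6), ("l'appareil", 0.4)],
    "said device": [("dit dispositif", 1.0)],
}
gf_output = {"said device": "ledit dispositif"}

hard = hard_integration(smt_table, gf_output)
soft = soft_integration(smt_table, gf_output, gf_prob=0.5)
print(hard["said device"])
print(soft["said device"])
```

In the hard variant the decoder has no choice for a GF-covered phrase, while in the soft variant the GF candidate can still lose to an SMT candidate when the rest of the model prefers it.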
The hybrids exploit the high coverage of statistical translators and the high precision of GF to deal with specific issues of the language. At present the grammar tackles agreement in gender and number, agreement between chunks, and reordering within the chunks. Although the cases where these problems apply are not extremely numerous, both manual and automatic evaluations consistently prefer the hybrid system over the two individual translators. In the near future we plan to widen the range of issues handled by the grammar. Modifications of the GF translator with SMT components and new ways of combining phrases will also be introduced.
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 0 | 0 | 1 (R. Enache) | 0 | 
| UPC | 7.75 (L. Màrquez, L. Padró, H. Rodríguez, D. Farwell) | 11.25 (C. España, M. Gonzalez, X. Carreras) | 0 | 0 | 
| UHEL | 0 | 0 | 0 | 0 | 
| OntoText | 0 | 0 | 0 | 0 | 
The final hybrid translators have been developed for the French-English language pair. We also aim at including German, so in the following months the concrete syntax for German will be completed. We plan to complete the task in May and it does not affect any other tasks of the project. The systems in the three languages will be available for the final evaluation.
Deliverable D6.2 has been released as tagged SVN content publicly available at svn://molto-project.eu/tags/D6.2. Bug fixing and some more features may continue to be developed in the head branch.
With respect to M18, we added an upper layer to the MGL library to support commands issued to a Computer Algebra System (CAS) and to render the answers in the natural languages as text or speech using actual concrete syntaxes for 3 languages: English, Spanish and German.
We developed software components to interact with a CAS (Sage), either externally using the HTTP protocol or inside the Sage shell and notebook interfaces. Furthermore, we developed a testing procedure to assist in regression tests for the tool.
We developed a prototype to engage Sage in a dialog using natural language that runs on Linux and Mac OS X. The system assists command composition by providing autocompletion and gives spoken output on demand.
We developed a Sage interface to issue commands to a Sage process from the native Sage shell or notebook. On Linux it provides autocompletion using the native shell mechanism.
The dialog prototype has been demonstrated at DEIMS12.
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 0 | 0 | 0 | 0 | 
| UPC | 6 (J. Saludes) | 0 | 0 | 6 (A. Ribó Mor) | 
| UHEL | 0 | 0 | 0 | 0 | 
| OntoText | 0 | 0 | 0 | 0 | 
During this period, WP7 has taken a step forward in the development of the prototype: the Patent MT and Retrieval Beta Prototype was first released in M21, and the final version of D7.1 has been delivered.
Due to the incremental development of the prototype, most of the tasks run until M27, when the final prototype must be delivered. The following paragraphs describe the progress of each task:
In relation to Task 7.2, the EPO provided a parallel corpus of patents, of which only 66 patents belong to the biomedical domain. We downloaded an alternative, publicly available corpus of 7,705 documents directly from their website. The content of these documents can be summarized as follows: 4,274 of the 7,705 documents have claims (6M lines), and 2,058 of those are trilingual (3M lines). 2,116 documents have claims written only in English, 66 have claims only in German (260K lines), and 34 only in French (88K lines). There are no files with any other combination of languages.
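The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the document counts reported in this paragraph; the variable and function names are illustrative.

```python
# Document counts reported for the downloaded corpus.
total_documents = 7705
with_claims = 4274

claims_by_language = {
    "trilingual (English, French, German)": 2058,
    "English only": 2116,
    "German only": 66,
    "French only": 34,
}

def unaccounted(total, parts):
    """Return how many documents in `total` are not covered by `parts`."""
    return total - sum(parts.values())

# Documents with claims that fall outside the four listed language groups.
remaining = unaccounted(with_claims, claims_by_language)
# Documents in the corpus that have no claims at all.
without_claims = total_documents - with_claims
print(remaining, without_claims)
```

The four language groups add up exactly to the 4,274 documents with claims, which agrees with the statement that no other language combination occurs.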
Regarding Task 7.4 and Task 7.5, the ontologies, indexes, databases and retrieval engines have been set up for the specific domain and using the patent documents described above. The semantic annotation process is carried out by a GATE pipeline on the English texts. We are working to export the annotations during the translation process in order to be able to show the annotations also in the French and German texts.
As for Task 7.3 and Task 7.6, grammar development and the adaptation of SMT to the domain are being carried out jointly with WP5 tasks. The grammars have been developed for English and French, and will next be developed for German as well.
Finally, regarding Task 7.7, the interface allows accessing the system in three different ways: the controlled language, SPARQL and terms in the index. In the future we will include free text and a combination of it with the controlled language.
Since M21 there has been a fully functional version of the prototype at http://molto-patents.ontotext.com/. The demo allows querying the system in English and French. The patents in the database have original text in English, French and German.
The retrieval system can be queried in three different ways. The NL-based interface allows the user to query the system in English and French using written natural language. The SPARQL interface, more suitable for advanced users, allows accurate browsing of the repository using SPARQL queries. The keyword-based visual browsing interface uses the RelFinder tool, in which the user can search for keywords using the autocomplete functionality. The results from the RelFinder search are visualised as graphs.
The visualization of the results displays the list of classes from the ontologies that match the query and the list of patent documents indexed under the matching criteria. The interface provides also a link to access the semantically annotated documents and the original patents. The interface that shows the annotated documents highlights on the text the words that are related to any semantic item. Colors are given according to the semantic annotations type. The right side of the page gives the list of semantic types and colors that are present in the text.
A paper about the Patent retrieval system was accepted at WWW2012 Conference, to be held in April.
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 0 | 0 | 3 (R. Enache) | 3 (A. Slaski) | 
| UPC | 0 | 7,5 (M. Gonzales, C. España) | 0 | 0 | 
| UHEL | 0 | 0 | 0 | |
| OntoText | 0 | 0 | 0 | 8,8 (M.Chechev, M.Damova, V.Zhikov, I.Kabakov) | 
Broadly speaking, we are achieving the objectives of the WP7 tasks within the timeframe. However, due to several issues related to the gathering of the corpora, the databases of the retrieval system do not yet include automatic translations of the patent documents, only real translations. The issue directly affects the annotation process of Task 7.5, but it does not imply a delay for the whole prototype. We estimate that the automatic translations and annotations will be included in the final prototype.
The work package started with data collection, proceeded to developing the ontology interface, and has lately focused on the baseline translator. The translator covers only five languages so far, but will be extended soon. The ontology interface will permit multilingual queries about museum objects, exploiting the MOLTO Knowledge Representation Infrastructure. It also makes this case study an example of multilingual ontology verbalization.
Ontology and corpus study (D8.1).
Grammars for translation and multilingual NLG for painting descriptions in five languages: English, Finnish, French, Italian, Swedish. This was built in a modular way that is easy to extend to new languages, which we will do soon.
Ontology verbalization in a generic way. The same languages will be usable as in translation, but aren't yet.
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 0 | 0 | 1 (R. Enache) | 0 | 
| UPC | 0 | 0 | 0 | 0 | 
| UHEL | 0 | 0 | 0 | 0 | 
| OntoText | 0 | 0 | 0 | 5,9 (M.Chechev, M.Damova, K.Krustev, V.Zhikov) | 
We have not been able to proceed at the planned pace, and would like to extend this WP and its last deliverable, D8.3, until Month 36. We have not been able to use the person months as planned (as can be seen from the Use of resources). One of the planned key persons, Dana Dannélls at UGOT, will be able to join MOLTO full-time a few months later than originally planned, probably in October/November 2012.
This WP is working on collecting evaluation plans from each site.
An extended D9.1E Evaluation plan has been written.
Progress evaluation has mainly been carried out by each site during development. It would be a good idea to collect these results more systematically.
For the SMT/hybrid patent case, automatic measures (BLEU and others; see the UPC slides for examples) will probably be the main ones used.
In developing the GF grammars, informants (native speakers of the relevant languages) have been used during the grammar writing process to check and correct output. The informants have been given output to read and have informed the developer if sentences are correct or if not, how they should be corrected.
Moving forward, the final evaluations will need to cover the usability of the tools as well as the quality of the output. The WP9 review slides give some examples of user communities that might be mobilized for usability evaluation and of the platforms that could be used. One open question regarding the mobilization of evaluators is how to motivate them to use the tools.
For output quality, final evaluations will likely involve both automatic and manual methods. For automatic methods, UPC's Asiya evaluation kit offers some syntactically and semantically oriented metrics in addition to the purely lexical ones like BLEU, but only for a couple of languages. As all automatic metrics rely on comparison to gold standard human translations, these need to be obtained for the test sets, if they are to be used.
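To illustrate why lexical metrics such as BLEU depend on gold standard references, the following minimal sketch computes a clipped n-gram precision, the building block of BLEU, for one candidate translation against one reference. It is not the Asiya implementation; the function names and example sentences are hypothetical, and full BLEU additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference ("clipping"), so repeating a correct
    word does not inflate the score.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical example: system output compared with a human reference.
print(clipped_precision("the the the cat", "the cat sat", n=1))
```

Without a reference translation there is nothing to match against, which is why human translations of the test sets are a prerequisite for this family of metrics.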
Manual evaluation methods, on the other hand, require humans to do the evaluations. For the patent case, evaluators need to have sufficient understanding of the material to be able to assess whether the translations are correct or not, particularly since we expect one of the strengths of the GF hybrid to be in correctly handling long formulae. Therefore plans have been made to hire professional patent translators of the languages in question to do the evaluation, expected in June. Since Google is now also providing patent translations, these will be used as a point of comparison. The TAUS scale, fluency etc. could be used in this case.
For the museum case, one manual evaluation approach was to produce museum descriptions in various languages that combine the simpler rules, e.g. "Painter painted Painting in City in Year on Canvas", and then have native speakers check the individual relations involved (Who painted? What did they paint? Where? When? etc.) and combine these into a measure of overall fidelity. For this, evaluators do not necessarily need to be museum experts; any native speaker of the language in question should do. An interesting description of such an approach is given in http://www.cs.ust.hk/~dekai/library/WU_Dekai/LoWu_Acl2011.pdf. Other measures, such as fluency or the TAUS fitness scale, could also be used.
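One simple way to combine the per-relation checks into a single fidelity figure is to take the fraction of relations judged correct. The sketch below assumes this unweighted average; the function name, the question labels and the judgements are hypothetical, and the actual evaluation may weight relations differently.

```python
def fidelity(judgements):
    """Share of relations that a native speaker judged as preserved.

    `judgements` maps a relation question (e.g. "Who painted?") to
    True if the generated description conveys it correctly.
    """
    if not judgements:
        return 0.0
    correct = sum(1 for ok in judgements.values() if ok)
    return correct / len(judgements)


# Hypothetical judgements for one generated painting description.
example = {
    "Who painted?": True,
    "What did they paint?": True,
    "Where?": False,
    "When?": True,
}
print(fidelity(example))
```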
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 0 | 0 | 0 | 0 | 
| UPC | 2,16 (L. Màrquez, LluísP, D. Farwell) | 0,5 (C. España) | 0 | 0 | 
| UHEL | 0.5 (L. Carlson) | 0 | 9 (S. Nyrkkö) | 0 | 
| OntoText | 0 | 0 | 0 | 2(M.Chechev, K.Krustev) | 
| UZH | 0 | 0 | 0 | 0 | 
| BI | 0 | 0 | 0 | 0 | 
A few events have been organized by MOLTO and some are in the making. The researchers in the Consortium have published a number of papers at international venues on initial results of the project. Project meetings have taken place in Helsinki and in Zürich, with the extra MOLTO-EEU kickoff meeting in Gothenburg. GF tutorials and tutorials on MOLTO work have been delivered on various occasions, most prominently during the GF Summer School: Frontiers of Multilingual Technologies in August 2011.
At UGOT, R. Enache and S. Virk have passed their licentiate, a step towards their PhD, by publishing work in connection with MOLTO. Moreover, D. Dannélls has also defended her PhD seminar, discussing work on natural language analysis in the cultural heritage domain. Finally, K. Angelov obtained his PhD at Chalmers with a thesis on the inner workings of GF, much of which benefits WP2 and WP3.
The list of publications, archived on the MOLTO website (http://www.molto-project.eu/biblio?sort=year&order=asc), follows here below.
Controlled Language for Everyday Use: the MOLTO Phrasebook, Ranta, Aarne, Enache Ramona, and Détrez Grégoire, Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, (2011)
The GF mathematics library, Saludes, Jordi, and Xambó Sebastian, THedu'11, (2011)
Grammatical Framework: Programming with Multilingual Grammars, Ranta, Aarne, CSLI Studies in Computational Linguistics, Stanford, p.350, (2011)
MOLTO Enlarged EU Annex I - Description of Work, Consortium, MOLTO , (2011)
MOLTO poster presented at EAMT Conference (European Association for Machine Translation) 2011, Leuven, Ranta, Aarne, and Enache Ramona, (2011) - also presented at META-FORUM by Listenmaa, Inari in Budapest, 2011.
Typeful Ontologies with Direct Multilingual Verbalization, Angelov, Krasimir A., and Enache Ramona, Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, (2011)
Typeful Ontologies with Direct Multilingual Verbalization poster, presented at the Google Anita Borg retreat, June 2011, Zurich, Enache, Ramona, (2011)
The GF Mathematics Library, Saludes, Jordi, and Xambó Sebastian, Proceedings First Workshop on CTP Components for Educational Software (THedu'11), 02/2012, Volume Electronic Proceedings in Theoretical Computer Science, Number 79, Wrocław, Poland, p.102–110, (2011)
D1.3A Advisory Board Report, Hall, Keith, and Pulman Stephen, 03/2011, Number D1.3A, Gothenburg, (2011)
MOLTO - Multilingual On-line Translation - Annual Report 2010-2011, Caprotti, Olga, España-Bonet Cristina, and Alanko Lauri, 03/2011, Gothenburg, (2011) - Published on cordis.eu.
A Framework for Improved Access to Museum Databases in the Semantic Web, Dannélls, Dana, Damova Mariana, Enache Ramona, and Chechev Milen , RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING, 09/2011, Hissar, Bulgaria, (2011)
Hybrid Machine Translation Guided by a Rule–Based System, España-Bonet, Cristina, Labaka Gorka, Díaz De Ilarraza Arantza, Màrquez Lluís, and Sarasola Kepa , Machine Translation Summit, 09/2011, Xiamen, China, p.554-561, (2011)
The painting ontology, Dannélls, Dana, CIDOC 2011 conference, 09/2011, (2011)
Patent translation within the MOLTO project, España-Bonet, Cristina, Enache Ramona, Slaski Adam, Ranta Aarne, Màrquez Lluís, and Gonzàlez Meritxell, Workshop on Patent Translation, MT Summit XIII, 09/2011, p.70-78, (2011)
Reason-able View of Linked Data for Cultural Heritage, Damova, Mariana, and Dannélls Dana, The Third International Conference on SOFTWARE, SERVICES & SEMANTIC TECHNOLOGIES (S3T), 09/2011, Bourgas, Bulgaria, (2011)
Deep evaluation of hybrid architectures: simple metrics correlated with human judgments, Labaka, Gorka, Díaz De Ilarraza Arantza, España-Bonet Cristina, Sarasola Kepa, and Màrquez Lluís, International Workshop on Using Linguistic Information for Hybrid Machine Translation, 11/2011, Barcelona, Spain, p.50-57, (2011)
The Patents Retrieval Prototype in the MOLTO project, Chechev, Milen, Gonzàlez Meritxell, Màrquez Lluís, and España-Bonet Cristina, WWW2012 Conference, Lyon, France, (2012)
MOLTO - Multilingual On-Line Translation, Ranta, Aarne, Talk given at Xerox Research Centre Europe, Grenoble, 19 January 2012, 01/2012, (2012)
Using GF in multimodal assistants for mathematics, Archambault, Dominique, Caprotti Olga, Ranta Aarne, and Jordi Saludes, 02/2012, Digitization and E-Inclusion in Mathematics and Science 2012, (2012)
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 4 (O. Caprotti) | 0 | 0 | 0 | 
| UPC | 1 (S. Xambo) | 0 | 0 | 0 | 
| UHEL | 0 | 0 | 0 | 0 | 
| OntoText | 1 (B.Popov) | 0 | 0 | 2,6 (M.Chechev, M.Damova) | 
| UZH | 0 | 0 | 0 | 0 | 
| BI | 0 | 0 | 0 | 0 | 
None to report.
During the first 3 months of our participation in the MOLTO project we completed an initial integration of the GF-provided services (mainly translation and look-ahead editing) into AceWiki.
We implemented a new Java front-end to the GF Webservice, and use it to connect to the GF services from AceWiki. The existing AceWiki user interface was extended to allow easy switching between different languages and to present, with each sentence, its GF-provided analysis (translations into other languages, word alignment diagrams, GF syntax trees, etc.). The AceWiki storage format was changed to one based on GF abstract trees (which are language-neutral).
The other main part of our work dealt with the implementation of the ACE grammar in GF. We tested the recall and precision of an existing implementation (Angelov and Ranta, 2009), which targets an earlier version of ACE, and found that some changes need to be introduced to make it compatible with the latest version of ACE. More importantly, we decided to focus on, and started work on, a grammar for the subset of ACE that is used in the current AceWiki.
We also experimented with taking the content of an existing AceWiki demo wiki (domain Geography) and using it to pre-populate the new GF-based AceWiki.
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 0 | 0 | 0 | 0 | 
| UHEL | 0 | 0 | 0 | 0 | 
| UZH | 0 | 2 (K. Kaljurand) | 0 | 0 | 
None to report, besides delayed start as explained later in this document.
In the first 2 months we started with the Adoption phase as described in the DoW for WP12. We have focused our efforts on the requirements for the verbalization component (D12.1). We distinguish 4 categories of relevant requirements.
We presented a requirements draft to our partners in March 2012.
At the kickoff hosted by UGOT we did a first round with the UGOT team to draw up the specification for migrating Be Informed's current explanation prototype to GF.
Next Steps
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 0 | 0 | 0 | 0 | 
| BI | 0,25 (J. van Grondelle, J. van Aart ) | 0 | 0 | 0,25 (H. ter Horst) | 
WP12 runs longer than is indicated in the Gantt chart appearing in the Annex; however, the duration is correctly listed under 3.3.3. We planned for a duration of 15 months; D12.2, for instance, is projected for March 2013.
Deliverables of MOLTO are listed and linked for download on the web page http://www.molto-project.eu/workplan/deliverables.
Below is a summary of deliverables due until the fourth semester.
The milestones so far have been achieved to differing degrees of completion, either by a deliverable or by some online prototype. In particular, MS3 is the translation editor available at http://www.grammaticalframework.org:41296/editor/#translate, which is being integrated in the Translator's Tools due in September 2012. The grammar-ontology interoperability (MS4) has been documented in D4.3; however, it has been requested that more details be made explicit. Concerning MS7, the methods implemented by the hybrid combination prototypes have been evaluated on a specific test set as reported in D5.4.
| ID | Title | Due date | 
|---|---|---|
| MS1 | 15 Languages in RGL | 1 September, 2010 | 
| MS2 | Knowledge Representation Infrastructure | 1 September, 2010 | 
| MS3 | Web-based translation tool | 1 March, 2011 | 
| MS4 | Grammar-ontology interoperability | 1 October, 2011 | 
| MS5 | First prototypes of the cascade-based combination models | 1 October, 2011 | 
| MS6 | Grammar tool complete | 1 March, 2012 | 
| MS7 | First prototypes of hybrid combination models | 1 March, 2012 | 
The third semester of the project saw the enlargement of the Consortium by two new partners, University of Zürich and Be Informed. While the start of the MOLTO-EEU project enlargement was originally scheduled for September 2011, and accounted for synchronicity of the deliverables with ongoing workpackages, the actual kickoff only happened in January 2012. Consequently, the end date of the project is now shifted to 31 May 2013, and the main deliverable of Workpackage 2 has been shifted by 3 months to take into account the feedback from the new use cases added by the enlargement.
The following inconsistencies have been noticed in the revised Annex:
However, due to the delay in start, both WP11 and WP12 will now be ongoing in the period M22-M37, as in the chart below. Notice also the changes affecting WP2 and WP8.

The following actions were taken as a result of the review report quoted here below.
Some observations, comments and remarks, raised and discussed at the review meeting, follow. These should be addressed in the respective deliverable(s) as well as in the planning for the next period.
Rule extraction (from lexical databases, ontologies, text examples) needs to be specified in detail and a concrete schedule should be included in the updated workplan (D1.1).
The workplan is maintained online using a dynamically generated list of tasks entered by the workpackage leaders. It is available, if logged in, under http://www.molto-project.eu/workplan, with the tasks under http://www.molto-project.eu/workplan/tasks. It is the responsibility of the workpackage leader to actively use this tool to document ongoing work.
The topic of rule extraction will be detailed in the last, main deliverable of WP2.
Concerning the integration of the TermFactory (TF) and Knowledge Representation Infrastructure (KRI), it seems that there are overlaps between these tools. The partners must clarify which functions of these tools will be used in the case studies in order to exploit complementarities of the tools and avoid overlaps.
TermFactory is not only a component in MOLTO but also stand-alone software, which requires some functionalities of its own for technical purposes. Any excessive development of overlapping functions will be avoided by co-operation and planning with the KRI developers. It is indeed relevant for evaluation and reporting that the case studies describe the tools that provide each functionality.
Critical issues with respect to the semi-automatic creation of abstract grammars from ontologies, as well as deriving ontologies from grammars, are still to be clarified. Concrete steps to handle these issues need to be specified in detail and a schedule should be included in the updated workplan (D1.1).
As part of the prototype for D4.3, an abstract grammar and a concrete English grammar, both built automatically from an ontology, have been integrated. They are used to verbalize the results from the semantic repository. Experiments and discussions on using a similar approach to automatically build a query grammar from the semantic repository were carried out, but the GF query grammar provided by UGOT was selected as the better tool because of its expressive power and its ability to generate better natural language. The query grammar has different types of question templates and can easily be ported to a new domain with minor modifications to the abstract and concrete grammars. Mapping rules were selected as the best semi-automated approach for connecting the abstract grammar to SPARQL. The mapping rules make it possible to write general transformation rules, but also to fine-tune for specific cases. The rules currently in use are general enough to be used in new domains with a ported GF query grammar, and this will be demonstrated in the WP7 and WP8 prototypes.
Current description of work in WP6 lacks details on the prototype multilingual dialog system to be developed. Including an example dialog and specifications of this prototype in a new version of deliverable D9.1 is recommended.
WP7 work should focus on the major issues examined in MOLTO, especially in relation to the grammar – ontology interoperability rather than chemical compound splitting. Specific scenarios are needed for the exploitation of MOLTO tools in this case study. It is recommended to include such scenarios in a new version of deliverable D9.1.
Specific scenarios are needed for the exploitation of MOLTO tools in the case study on cultural heritage (WP8) which just started. It is recommended to include such scenarios in a new version of deliverable D9.1.
Use cases are listed in http://www.molto-project.eu/workplan/usecases and they include two scenarios for WP8 and two for WP7. The specific use case scenarios for WP7 were described in: UC-71 and UC-72. Details about them were given in Section 2 of D.7.1.
UC-71 focuses on grammar-ontology interoperability. User queries, written in CNL (controlled natural language) are used to query the information retrieval system.
UC-72 focuses on high-quality machine translation of patent documents. It uses an SMT baseline system to translate a big dataset and fill up the retrieval databases. In order to study the impact of hybrid systems in translation quality, a smaller dataset will be translated using the hybrid system developed in WP5.
The way the project’s web site is structured, although it contains the necessary content, affects its readability in some cases.
We have added a direct navigation link to Sites and People, and a quick link to the public deliverables list. Publications can be tagged by workpackage or event, thus making the selection of publications by tag easier.
The deliverables on the workplan (D1.1) and the dissemination plan (D10.1) should be regularly updated (at the beginning of 2nd and 3rd year).
We have kept an updated list of deliverables with administrator's view at http://www.molto-project.eu/workplan/deliverables and quick links at http://www.molto-project.eu/view/biblio/deliverables. The dissemination plan is kept up to date on the wiki page, http://www.molto-project.eu/wiki/living-deliverables/d101-dissemination-.... We have now added a section summarizing exploitation plans.
Taking into account the numerous endeavors undertaken in the translation domain, both research and commercial, the market segment addressed by MOLTO should be identified with maximum precision. The specific case studies should also be taken into account in this effort. It is suggested that careful planning is initiated as early as possible and not later than the next reporting period.
The addition of the new partner BI will open extra markets for the tools of MOLTO. We have also started to look into usage of constrained natural languages in software localization, in social networks and in specific mathematical domains.
Official tables on the usage of resources are available for yearly reporting in Forms C.
Here we give a rough estimate of the person months contributed by each node. Note that the figures listed previously do not include management months, hence totals may differ.
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UGOT | 9 (O. Caprotti, A. Ranta) | 0 | 10 (R. Enache) | 12 (J. Camilleri, A. Slaski, S. Virk) | 
| UPC | 19.16 (J. Saludes, L. Màrquez, L. Padró, H. Rodríguez, D. Farwell) | 22 (C. España, X. Carreras, M. Gonzalez) | 0 | 6 (A. Ribó Mor) | 
| UHEL | 2,5 (L. Carlson) | 0 | 7 (S. Nyrkkö) | 5 (L. Alanko), 5 (I. Listenmaa), 12 (J. Shen, C. Li) | 
| Ontotext | 6 (B.Popov, S.Karagova) | 0 | 0 | 36 (P.Mitankin, M.Nozchev, F.Alexiev, A.Ilchev, I.Kabakov, K.Krustev, M.Damova, M.Chechev, V.Zhikov, S.Enev) | 
| UZH | 0 | 2 (K. Kaljurand) | 0 | 0 | 
| BI | 0,25 (J.van Aart, J. van Grondelle) | 0 | 0 | 0,25 (H. ter Horst) | 
We found a typo in the table for WP7 in the new Annex I for MOLTO EEU. The person months must be the same as in the previous DoW (Version number: 3 Revision 1 (21/01/2011)), namely, per the WP7 description on p. 31: UGOT 12, UPC 15, and Ontotext 15 (and not Ontotext 0).
I, as scientific representative of the coordinator of this project and in line with the obligations as stated in Article II.2.3 of the Grant Agreement declare that:
1. The attached periodic report represents an accurate description of the work carried out in this project for this reporting period;
2. The project (tick as appropriate):
3. The public website, if applicable:
4. To my best knowledge, the financial statements which are being submitted as part of this report are in line with the actual work carried out and are consistent with the report on the resources used for the project (section 3.4) and if applicable with the certificate on financial statement.
5. All beneficiaries, in particular non-profit public bodies, secondary and higher education establishments, research organisations and SMEs, have declared to have verified their legal status. Any changes have been reported under section 3.2.3 (Project Management) in accordance with Article II.3.f of the Grant Agreement.
Name of scientific representative of the Coordinator:
Aarne Ranta
....................................................................
Date: 26/4/2012
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D1.6 Periodic Management Report T30 | 
| Security (distribution level): | Confidential | 
| Contractual date of delivery: | M30 | 
| Actual date of delivery: | 7 Nov. 2012 | 
| Type: | Report | 
| Status & version: | Final | 
| Author(s): | O. Caprotti et al. | 
| Task responsible: | UGOT | 
| Other contributors: | All | 
Progress report for the fifth semester of the MOLTO project lifetime, 1 Mar 2012 - 31 Aug 2012.
The project MOLTO - Multilingual Online Translation started on March 1, 2010 and will run until 31 May 2013, with the task of developing tools for translating texts between multiple languages in real time with high quality. MOLTO's grounding technology is multilingual grammars based on semantic interlinguas and statistical machine translation, which simplify the production of multilingual documents without sacrificing quality. The specific interlinguas are based on domain semantics and are equipped with reversible generation functions: translation is obtained as a composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF, Grammatical Framework, which in MOLTO is furthermore complemented by the use of ontologies, as in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data. GF has been applied in several small-to-medium size domains, typically targeting up to ten languages, but MOLTO will scale this up in terms of productivity and applicability.
A part of the scale-up is to increase the size of domains and the number of languages. A more substantial part is to make the technology accessible to domain experts without GF expertise and minimize the effort needed for building a translator. Ideally, the MOLTO tools will reduce the overall task to just extending a lexicon and writing a set of example sentences.
MOLTO is committed to dealing with 15 languages, which includes 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is on-going work on at least Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, Swahili, Tswana, and Turkish.
While tools like Systran (Babelfish) and Google Translate are designed for consumers of information, MOLTO will mainly target the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating their web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in the browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals 10 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it. There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well-understood language.
MOLTO technology will be released as open-source libraries, accompanied by cloud services, to be used for developing plug and play components for translation platforms and web pages, and thereby designed to fit into third-party workflows. The project will showcase its results in web-based flagship demos applied in three case studies: mathematical exercises in 15 languages, patent data in at least 3 languages, and museum object descriptions in 15 languages. The MOLTO Enlarged EU scenarios will apply MOLTO tools to a collaborative semantic wiki and to an interactive knowledge-based system used in a business enterprise environment.
This section describes the progress of each workpackage and discusses changes to the workplan, if necessary.
The main objective of this 5th semester has been to consolidate the project tools and technologies towards the production of the final deliverables. In order to focus the developments on clear goals, the Consortium has agreed to identify 9 "MOLTO flagships" that highlight the achievements of the project and combine what has been produced across work packages:
This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.
For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:
Moreover, if applicable:
The GF Eclipse plugin (http://www.grammaticalframework.org/eclipse/index.html) had reached version 1.5.1 by June. It has been adopted by Be Informed and Ontotext. Camilleri and Ranta gave a GF crash course at Be Informed using Eclipse. There are moreover two publications: one at EAMT (a poster) and one at FreeRBMT (full paper).
The Resource Grammar Library has been enhanced by two languages as external contributions: Japanese and Latvian. The MOLTO Phrasebook has been extended to Latvian. Work on Chinese and Maltese is going on.
With the release of D2.3, Grammar Tools and Best Practices, the work in this WP finished on 30 June. Further work is planned, however, in the form of dissemination and exploitation.
A preview version of libpgf, a C-based reimplementation of the GF runtime, has been available since July. When finished, it should make GF technology accessible to applications that cannot make use of the current Haskell- and Java-based runtimes, either due to resource constraints or interoperability concerns. In particular, libpgf should be easier to access from non-JVM-based programming languages. Bindings for Python have been available since September.
GF Eclipse plugin http://www.grammaticalframework.org/eclipse/index.html
D2.3 http://www.molto-project.eu/biblio/deliverable/grammar-tools-and-best-pr...
The delivery date of D2.3 was postponed from M24 to M27 to be able to profit from the initial experiences with the new partners' scenarios.
After drawing up the specifications in D3.1, we published the prototype as Deliverable 3.2.
A first prototype of the translation editor, built using the Google Web Toolkit, was tested for integration into the GlobalSight translation manager. This first prototype turned out to be too fragile and has now been replaced by a simpler version, the Simple Translation Editor, available at http://cloud.grammaticalframework.org/translator.
Work has progressed integrating the translation tools prototype with the TermFactory web based term ontology editor (TF) that is to be used by ontologists and terminologists to create and maintain MOLTO domain dependent vocabulary.
The milestone MS8 (Translation tool complete) is achieved in parallel with WP5 (Statistical and robust translation) in the sense that the MOLTO GlobalSight translation project management system is set up and available for utilisation in other work packages, including the use cases and their evaluation in WP9. Also, MS8 can be considered achieved in the sense that TermFactory is integrated with the OWLIM ontology repository, completing the ontology backend of the tool.
During the reporting period, a login and access control component (GateService) has been added to TF. This component is integrated with and maintained through the GlobalSight user manager.
The OWLIM ontology repository has been integrated with TF, using a common interface based on the Jena assembler library. Besides OWLIM, Jena triple databases (TDB, SDB) and WebDAV ontology documents can be edited directly from TermFactory.
UHEL has unspent money, and has recruited an evaluation project manager to work with WP3 and WP9 until M36+3, beginning in November 2012 (start of M32).
We are asking to move Deliverable D3.3 (Translation tools / workflow manual) to M33. The D3.3 manual is expected to include documentation of the workflow, with help from other work packages and input from the new recruit, all of which have been subject to a three-month shift.
During this reporting period, a new version of D4.3 Grammar Ontology Interoperability was submitted, elaborating on the automation of grammar creation for controlled language (CL) to RDF interoperability. It suggested a new way of building the transformations between CL and SPARQL: treating SPARQL as another concrete grammar alongside the language-specific ones.
We have designed and implemented the Query Grammar Helper Builder tool, which connects to a SPARQL endpoint and helps an inexperienced user create GF grammars (Abstract, English and SPARQL). As part of the work in WP2 we have developed an Eclipse plugin wrapper for this tool.
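The idea of treating SPARQL as one more concrete grammar can be illustrated with a minimal Python sketch. It is not the D4.3 implementation and does not use GF: the `Query` tree, the English pattern, and the `:locatedIn` property are hypothetical names chosen only for illustration. A single abstract tree is linearized either to English or to SPARQL, so a controlled-language query is mapped to SPARQL by parsing and re-linearizing.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical abstract syntax: all instances of a class located in a place."""
    kind: str
    place: str


def parse_english(text):
    """Parse the (toy) English concrete syntax into an abstract tree."""
    m = re.fullmatch(r"show all (\w+)s located in (\w+)", text)
    if m is None:
        raise ValueError(f"not covered by the toy grammar: {text!r}")
    return Query(kind=m.group(1), place=m.group(2))


def lin_english(q):
    """Linearize the abstract tree to English."""
    return f"show all {q.kind}s located in {q.place}"


def lin_sparql(q):
    """Linearize the same tree to SPARQL, treated as just another 'language'."""
    return (f"SELECT ?x WHERE {{ ?x a :{q.kind.capitalize()} ; "
            f":locatedIn :{q.place} . }}")


# One abstract syntax, several concrete syntaxes.
CONCRETE = {"Eng": lin_english, "SPARQL": lin_sparql}


def translate(text, target):
    """Translation = parsing the source followed by linearizing to the target."""
    return CONCRETE[target](parse_english(text))


print(translate("show all museums located in Gothenburg", "SPARQL"))
```

Adding a further natural language in this sketch would mean adding one more entry to `CONCRETE`, without touching the SPARQL side.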
We have also started designing the needed improvements to the interoperability approach in order to productize it as an outcome of the project and in alignment with the exploitation objectives.
We have also introduced several new members to the Ontotext team - Laura Tolosi, Maria Mateva, Ilia Trendafilov and Georgi Georgiev.
None
The milestone MS8 (Translation tool complete) was achieved in M30, which for WP5 meant having a complete system integrating the grammar and SMT. Although the system is already available on Gothenburg's server, we are still working on improvements.
The work done during the fifth period has been focused on 4 of the 6 tasks of the workpackage:
5.3 Robust Parsing: First efforts are under way to include the robust parsing work done in the previous semesters in the hybrid systems. The work is in progress, and the final idea is to be able to use GF's robust parsing to deal with the chunks instead of relying on Genia.
5.4 Baseline systems: The French GF grammar has been refined in order to improve performance. The German grammar has been written from scratch and is now comparable to the French one.
5.5 Hybrid Models: The new grammars have been integrated in the final hybrid system. Different versions of the previous hybrids are now available. In particular, a new system assigns different probabilities to the GF translations according to the confidence in obtaining them. This information can also be used in the development step of the statistical system. A one-click system has been developed with the most promising hybrid system. It will be updated with new hybrids whenever we obtain better translation performance.
5.6 Systems evaluation: A wider evaluation of the baseline systems has been carried out by including syntactic and semantic metrics. Also, the comparison with external translation systems such as Google and Bing has been redone in order to reflect the improvements of these systems during the last year. A comparison with Pluto has also been made. However, we realized that, since we share some data, our test sets may be in their training data. We plan to use confidence estimation measures in order to be able to test on different patents for which none of us have translations, such as American patents.
A GF grammar for patents has been developed for German and improved for French.
A hybrid system with the new grammars has been evaluated, and a new one that takes into account probabilities for the GF translations has been built.
The work on robust parsing has resulted in two submissions to the COLING 2012 conference.
A one-click system for the hybrid translator has been built and is now available as a shell command on the server in Gothenburg. Partners wishing to test the system should contact UGOT to obtain access to the server.
There are no deviations from Annex I and at M30 the workpackage has produced a hybrid system for patent translation for English-to-French and English-to-German. For the opposite directions we use SMT as fallback.
However, we plan to continue the work on hybrid systems by improving the current German translator and the integration with robust parsing. D5.3 will therefore be postponed until January, when the new hybrid systems will also be finished and ready to be evaluated within WP9.
Created a Prolog-based reasoner to deal with elementary problems in arithmetic, implementing partition by subclasses and decomposition.
Created GF grammars to express some word problems in English and Prolog
Started integration with UZH AceWiki/gfservice.
We are running out of time to fully develop the prototype. The promised system would work in two modes: Author mode for entering a problem and Student mode for attempting to solve it. The first mode is more or less working, but the second will take a lot of time. The scheduled Deliverable D6.3, Assistant for solving word problems, due by 1 September 2012, has been postponed to 1 December 2012. Some resources of UPC originally planned for WP8 have been moved to this workpackage in order to accomplish the promised tasks.
Due to the incremental development of the prototype, most of the tasks have extended until M30, when the final prototype is being completed.
The next lines describe the progress of the following tasks:
In relation to Task 7.2, the patents downloaded from the EPO website have been automatically translated and semantically annotated. The complete collection of files is available in the MOLTO repository, and it consists of 1) the original patent documents, 2) the English version of the patent documents having the semantic annotations, and 3) the automatic translations of claims, abstracts and descriptions. These documents constitute the main content of the retrieval databases.
As for Task 7.4 and Task 7.5, the ontologies, indexes and databases have been updated with the new dataset of documents.
Regarding Task 7.6, we designed a process for patent translation that builds a translated document with the same XML structure as the original patent. As a result, the interface of the prototype can show the translated patents using the same user-friendly view as for the original ones. The translation of the documents consists of a pipeline of five steps. First, the patent files are preprocessed in order to extract the text contained in the sections in a structured manner (step 1). Then, the formatting marks inline with the text are replaced by placeholders (step 2). The resulting text is segmented and tokenized as required by the translation system (step 3). Next, the raw text is translated using the SMT system (step 4). Finally, the translated text is post-processed in order to recover the original structure of the document (step 5), including original formatting, claims enumeration and images.
Regarding Task 7.3, the query grammars have been refactored using the set of primitives defined in the Query Library work conducted in WP4. Consequently, the English and French versions of the patents query grammar were adapted to the new structure, and the German version has been developed from scratch. The new grammar is equivalent to the old one; the difference is that it relies on the primitive query-building functions defined in the Query Library. Developing a grammar with the Query Library requires less linguistic knowledge: it is mostly a matter of selecting the right set of primitives for the task. Because it is built on top of the Query Library, the grammar now has fewer constructions than the previous patent query grammar. As a consequence, the constructions are also more natural and the number of malformed constructions has decreased considerably. The current grammar consists of 31 patterns and is able to parse/generate 359 query constructions in English, 111 in French and 147 in German.
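Steps 2 to 5 of this pipeline can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the WP7 code: the `@@n@@` placeholder format, the function names, and the sentence-splitting rule are all hypothetical, step 1 (extracting sections from the patent XML) is omitted, and the translation engine is any callable from string to string (here `str.upper` stands in for the SMT system).

```python
import re

# Matches inline markup such as <b>, </b>, <i>, <sub> ...
TAG = re.compile(r"</?[A-Za-z][^>]*>")
PLACEHOLDER = re.compile(r"@@(\d+)@@")


def mask_markup(text):
    """Step 2: replace inline formatting marks by numbered placeholders."""
    tags = []

    def repl(match):
        tags.append(match.group(0))
        return f"@@{len(tags) - 1}@@"

    return TAG.sub(repl, text), tags


def segment(text):
    """Step 3 (simplified): split the text into sentence-like segments."""
    return [s for s in re.split(r"(?<=[.;])\s+", text) if s]


def restore_markup(text, tags):
    """Step 5: put the original formatting marks back in place."""
    return PLACEHOLDER.sub(lambda m: tags[int(m.group(1))], text)


def translate_section(section_text, engine):
    """Run steps 2-5 on the text of one section; `engine` plays the role of step 4."""
    masked, tags = mask_markup(section_text)
    translated = " ".join(engine(segment_text) for segment_text in segment(masked))
    return restore_markup(translated, tags)


# Stand-in engine: upper-casing makes it easy to see what was "translated".
print(translate_section("A <b>method</b> for mixing. The <i>device</i> of claim 1.", str.upper))
```

Because the placeholders contain no letters, the stand-in engine leaves them untouched, so the markup comes back exactly as it was in the source. A real engine would need to be configured to pass such tokens through unchanged.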
Finally, regarding Task 7.7, the interface has been updated with the German version of the query grammars. Also, some basic tests have been carried out at two levels in order to assess the prototype functionalities. First, some deficiencies have been corrected regarding the usability of the interface, i.e., examples of the main page, the language selection and the visualization of the results in French and German. In addition, we studied the inherent logic of the queries and the expected results, so that the system returns results that can now be considered more appropriate or accurate.
The Deliverable 7.2 gives a detailed description of the modules and their functionalities.
In general, we are achieving the objectives related to WP7. However, Deliverable 7.2, planned for M27, has been delayed to M30 due to several issues related to the gathering of the corpora, the pre- and post-processing of the documents, and the integration of the new query library. Also, we carried out several basic tests to assess the behavior of the prototype in terms of query results and user interaction; these revealed several deficiencies, which have been corrected. Since D7.2 has been postponed, D7.3 is delayed accordingly from M33 to M36.
The data collection (D8.1) and first prototype of grammars (D8.2) were delivered on time. The grammar prototype has six languages, but is being extended to 15. It implements the generation of descriptive texts from facts in the database. The final system (delivered as a part of D8.3) will also allow natural language queries about museum objects, applying the technology developed in WP4.
In addition to Gothenburg City Museum, there has been interest in this WP from the Europeana project, http://www.europeana.eu/portal/. A plan for later dissemination of the work includes the generalization of the results by making them available for Europeana. The Monnet project, http://www.monnet-project.eu/, also shares an interest in this WP in the area of ontology localization and verbalization.
The work in this workpackage addresses multilingual text planning in particular; it was exploited in D8.2 and has resulted in two publications in 2012 (see WP10).
The schedule for D8.3 is postponed from M30 to M36 due to a delay in the PhD defense of Dana Dannélls, one of the key persons of this WP; as a bonus, MOLTO can then fully profit from her thesis work funded from other sources.
To improve the communication between work packages, we have set up a bug tracking system at http://tfs.cc/trac, and assigned the flagship leaders to be in charge of their components. Everyone using the tools is encouraged to leave their comments and requests in trac.
Maarit Koponen's work in evaluating semantic aspects of machine translation quality is progressing. We have started recruiting people for translation quality evaluation, and have received a response from the University of Pisa, with Italian-English and Italian-German as possible language pairs.
Targeting towards D9.2, we have started gathering evaluations from the grammar writers. Individual use cases can be measured in terms of translation quality, but good grammar design principles will make any grammar easier to write and maintain. We evaluate the grammars in terms of D2.3, a best practices document.
Maarit Koponen's article "Comparing human perceptions of post-editing effort with post-editing operations" was accepted at the Seventh Workshop on Statistical Machine Translation (Montreal) and published in the proceedings.
No major deviations reported. Minor actions include the addition of grammar quality evaluation.
New on the website is the publishing of news items from the RSS feeds of MOLTO Consortium partners and from the GF source code repository in the footer, and of news items from MOLTO in the header, alongside the publications and the demos. A new collective demo of the GF application grammars together with the novel GF cloud services is prominently featured on the website.
We are also collecting the use of resources in an overall table (http://www.molto-project.eu/workplan/resources) that summarizes the data provided by the partners. Personal views (e.g. http://www.molto-project.eu/workplan/resources/olga.caprotti) and workpackage views will be available soon.
Two major events have been organized with the sponsorship of MOLTO: FreeRBMT 2012 and CNL 2012. Free Rule-Based Machine Translation, FreeRBMT 2012, took place in Gothenburg on 13-15 June 2012 and was organized by UGOT (see http://www.chalmers.se/hosted/freerbmt12-en). A tutorial on the Apertium system followed as an additional satellite event. It was attended by MOLTO partners from UPC and UGOT and resulted in the adoption of some of the Apertium lexicons in GF. Papers by J. Camilleri and by Cristina España-Bonet et al. presenting MOLTO results will appear in the online proceedings. Additionally, the program included a series of presentations on MOLTO's current work on GF resources and tools for machine translation (see http://www.molto-project.eu/freerbmt-program.html). Many of the MOLTO talks were streamed live on the moltoproject YouTube channel, http://www.youtube.com/moltoproject, where they can still be watched.
The Third Workshop on Controlled Natural Language, CNL 2012, took place on 29–31 August 2012 in Zurich, Switzerland. UZH has been organizing this meeting over the past years, and this time as a MOLTO activity. A few papers were presented by the MOLTO Consortium, listed below, but we also note a contribution by external researchers, Normunds Grūzītis, Pēteris Paikens and Guntis Bārzdiņš, FrameNet Resource Grammar Library for GF, using the MOLTO Phrasebook as case study in their work.
On 14 August, Aarne Ranta visited Lingsoft Inc in Helsinki, invited by the CEO Juhani Reiman and the Senior Advisor Simo Vihjanen to give a presentation of MOLTO's tools and discuss possible collaborations. Lingsoft is "a full-service language management company", producing for instance the proofing tools for the Nordic languages and German in Microsoft Office products. Lingsoft is one of the most successful language technology companies, founded in 1986 and working with numerous partners and products. Recent products range from spell checking to language education tools, speech recognition, and translation. MOLTO and Lingsoft share the belief in precise linguistic knowledge as a key to successful language processing. Lingsoft has now set up a group to explore the possibilities offered by MOLTO and GF. The focus is on machine-assisted translation for specific domains.
YouTube videos of MOLTO related talks, http://www.youtube.com/moltoproject
Toward multilingual mechanized mathematics assistants, Saludes, Jordi, and Xambó Sebastian, EACA 2012 (Proceedings), 06/2012, p.163–166, (2012)
The Patents Retrieval Prototype in the MOLTO project, Chechev, Milen, Gonzàlez Meritxell, Màrquez Lluís, and España-Bonet Cristina, WWW2012 Conference, Lyon, France, (2012)
Multilingual Online Generation from Semantic Web Ontologies, Dannélls, Dana, Enache Ramona, Damova Mariana, and Chechev Milen, WWW2012, 04/2012, Lyon, France, (2012)
MOLTO Enlarged EU - Multilingual Online Translation, Caprotti, Olga, and Ranta Aarne, 16th Annual Conference of the European Association for Machine Translation, 05/2012, Trento, Italy, (2012)
The GF Mathematical Grammar Library, Caprotti, Olga, and Saludes Jordi, Conference on Intelligent Computer Mathematics /OpenMath Workshop, 07/2012, (2012)
Multilingual Verbalisation of Modular Ontologies using GF and lemon, Davis, Brian, Enache Ramona, van Grondelle Jeroen, and Pretorius Laurette, Third Workshop on Controlled Natural Language (CNL 2012), Volume 7427 LNCS, (2012)
General Architecture of a Controlled Natural Language Based Multilingual Semantic Wiki, Kaljurand, Kaarel, Third Workshop on Controlled Natural Language (CNL 2012), 09/2012, Volume 7427 LNCS, p.110–120, (2012)
Probabilistic Robust Parsing with Parallel Multiple Context-Free Grammars, Angelov, Krasimir A., COLING 2012, (Submitted)
How Much do Grammars Leak?, Angelov, Krasimir A., COLING 2012, (Submitted)
The GF Eclipse Plugin: An IDE for grammar development in GF, Camilleri, John, and Angelov Krasimir, 16th Annual Conference of the European Association for Machine Translation, 05/2012, Trento, Italy, (2012)
An IDE for the Grammatical Framework, Camilleri, John, Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012), 06/2012, (2012)
Deep evaluation of hybrid architectures: Use of different metrics in MERT weight optimization, Cristina España-Bonet, Gorka Labaka, Arantza Diaz De Ilarraza, Lluis Marquez and Kepa Sarasola, Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012), 06/2012, (2012)
Comparing human perceptions of post-editing effort with post-editing operations, Koponen, Maarit, Proceedings of the Seventh Workshop on Statistical Machine Translation, June, Montréal, Canada, p.181–190, (2012)
Future activities:
Three lines of work were followed: developing a multilingual ACE grammar (ACE-in-GF), extending the AceWiki system based on the GF technology (currently referred to as AceWiki-GF) and extending the Attempto reasoner RACE.
In collaboration with UGOT (John J. Camilleri) a GF-based multilingual grammar for ACE was developed. This grammar has the following properties:
This resource is fully presented in Deliverable D11.1.
AceWiki-GF was further developed by adding preliminary support for multiple grammars, multiple articles, ambiguity management, and grammar editing. A large number of AceWiki-GF demo wikis have been made publicly readable/editable on the Attempto website. Most of these wikis are based on grammars developed in MOLTO and in previous GF-related projects. Some of the simpler grammars can also be edited. The current work on AceWiki-GF and its underlying ideas were published at CNL 2012 and presented as both a talk and a demo.
The Attempto reasoner RACE is currently being extended to handle arithmetic, linear equations and text problems. This work, although not part of the actual MOLTO tasks, aims at providing AceWiki-GF with an alternative reasoning capability that covers the complete first-order subset of ACE. The current, still incomplete, version of RACE was demonstrated at CNL 2012.
Organization of meetings and conferences:
A small deviation is expected for the deliverables of this workpackage to allow time for contributions from Tobias Kuhn, a leading developer of AceWiki who is currently on a research visit abroad. The revised schedule is as follows:
During this period two major topics were addressed from the adoption phase as described in the DoW.
A GF bootcamp was held at Be Informed on 4-6 June in cooperation with UGOT. During this bootcamp the Be Informed team first received an in-depth introduction to grammar building using the Grammatical Framework. Building on that knowledge, several workshop sessions were held to discuss the theory and practice of (semi-)automatically converting Be Informed business modeling "language" (Be Informed meta models) and "speech" (Be Informed models) to Grammatical Framework constructs. Furthermore, discussions were held on a number of technical issues concerning the integration of Grammatical Framework technology into the Be Informed Business Process Platform.
Also in this period, we tried to capture requirements from a large number of perspectives. Some requirements apply to the verbalization component to be developed in WP12, but many also apply to the functionality that can be based on this component. Requirements were derived from business usage scenarios:
A full overview of these requirements is presented in Deliverable D12.1.
none
The only milestone due in this period is that of WP5 and WP3, Translation tool complete, which has been met by its due date, 1 September 2012. The next milestone MS9, Case studies complete, involves the work-packages on mathematics, patents retrieval, and cultural heritage. We are delaying the work on cultural heritage and therefore we will have to shift part of this milestone too.
Project management during the period consisted mainly of maintaining routine communication with the partners, by holding a monthly Skype call, and of distributing the second installment of the funding.
UGOT received the 2nd interim payment from the EU and distributed it to the partners on 15 August 2012. Each partner also received the financial assessment from the EU and an overview of the payment that has been sent. The Consortium has now received 85% of the MOLTO total budget, which is the maximum amount possible before the approval of the final reporting.
Follow-up actions after the annual review included discussions within the Consortium on how to organize a better showcase for the final results of the project and on how to address the reviewers' remarks and suggestions (see Task 1.8). Updated versions of some deliverables were produced and made available on the website.
In terms of infrastructure, the svn repository is currently being used by a larger number of members of the Consortium and in addition there is a new bug-tracking system installed and running at UHEL.
Tables on the usage of resources are not available for midterm reporting; however, we have a rough initial estimate of person-months from almost all nodes. Ontotext has not been able to provide the data.

| Contract No.: | FP7-ICT-247914 and FP7-ICT-7-288317 | 
|---|---|
| Project full title: | MOLTO - EEU - Multilingual Online Translation | 
| Deliverable: | D1.7 Final Management Report | 
| Security (distribution level): | Confidential | 
| Contractual date of delivery: | M39 | 
| Actual date of delivery: | Version 2: 25 October 2013. Version 1: 25 July 2013 | 
| Type: | Report | 
| Status & version: | Version 2 | 
| Author(s): | O. Caprotti, A. Ranta et al. | 
| Task responsible: | UGOT | 
| Other contributors: | All | 
Progress report for Period 3 of the MOLTO project lifetime, 1 Mar 2012 - 31 May 2013.
The project MOLTO (Multilingual Online Translation) started on 1 March 2010 and ran until 31 May 2013. Its goal was to develop tools for translating texts between multiple languages in real time with high quality. MOLTO's grounding technology is multilingual grammars based on semantic interlinguas and grammar-based translation. It also explores ways to use statistical machine translation without sacrificing quality.
MOLTO uses specific interlinguas that are based on domain semantics and are equipped with reversible generation functions. Thus translation is obtained as a composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF, Grammatical Framework, which in MOLTO is furthermore complemented by the use of ontologies, as in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data. GF has been applied in several small-to-medium size domains, typically targeting several parallel languages. During its lifetime, MOLTO has scaled up this technology in terms of productivity, domain size, and the number of languages.
The size of domains has been increased to involve up to thousands of concepts, and the number of languages to twenty parallel ones. A special focus has been to make the technology accessible to domain experts without GF expertise and to minimize the effort needed for building a translator. Ideally, the MOLTO tools will reduce the overall task to just extending a lexicon and writing a set of example sentences.
MOLTO was initially committed to dealing with 15 languages, which included 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. The additional languages also addressed in MOLTO are Chinese, Hebrew, Hindi, Latvian, Persian, and Urdu.
While tools like Systran (Babelfish) and Google Translate are designed for consumers of information, MOLTO's main target is the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating its web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in the browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals 11 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it. There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well-understood language.
MOLTO technology is continuously released as open-source software and linguistic modules, accompanied by cloud services, to be used for developing plug-and-play components for translation platforms and web pages, and is thereby designed to fit into third-party workflows. The project showcases its results in web-based flagship demos applied in three case studies: mathematical exercises in 15 languages, patent translations and queries in 3 languages, and museum object descriptions and queries in 15 languages. The MOLTO Enlarged EU scenarios add to this an application of MOLTO tools to a collaborative semantic wiki and to an interactive knowledge-based system used in a business enterprise environment.
This section describes the progress of each workpackage and discusses changes to the workplan, if necessary.
The last period of the MOLTO project and of its enlargement MOLTO-EEU has been a very intensive period of work for the Consortium. The major deliverables have been delayed to this period and had to be completed. They included:
To demonstrate the usage of the MOLTO tools and technologies, the partners worked towards joint prototypes for the various case studies listed in the workplan. The coordination work involved agreement on platforms, on formats, and on the overall architecture of each demonstrator.
The final case studies include:
- a proof-of-concept dialog system and reasoner for word problems (WP6)
- patent translation by the robust hybrid approach and multilingual query interface (WP3, WP4, WP5, WP7)
- museum artifacts multilingual query and descriptions (WP4, WP8)
- multilingual semantic wiki AceWiki (WP2, WP11)
- multilingual business modelling by GF (WP2, WP3, WP12).
This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.
For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:
Moreover, if applicable:
We have developed the translator's tools further, with lexicon extraction as an essential core technology and the integration of GF translation into the Pootle platform as a concrete example. In preparation for the final review, UHEL has led the lexicon extraction flagship and prepared a presentation. UGOT has contributed to the lexicon work with Shafqat Virk's PhD research on resources for Indo-Aryan languages. We have also collaborated with Ontotext on D4.3A to include a contrast and comparison of TermFactory and KRI, as suggested by the reviewers.
A part of the work on the web-based translation tool originally scheduled for UHEL was carried out by UGOT. This means a shift of workload from UHEL to UGOT of 3 person months.
In the final period of MOLTO, we have finalized the model for SPARQL generation and RDF facts verbalization. The D4.3A annex deliverable was published as a follow-up to the reviewers' recommendations and to summarize the progress of work in the field of grammar-ontology interoperability in the descendants of the KRI prototype.
The main goal of this WP has been to develop a hybrid system between GF and SMT specialised in patent translation and has implied the construction of new resources on the domain and conceiving techniques to integrate both technologies. The WP has also tasks devoted to widen GF and have been focused on building general purpose lexicons. Besides, a more robust GF has been achieved by the use of the robust statistical parser that will allow to translate free text or, at least, the parts covered by the grammars without being affected by unknown elements.
Regarding the development of different types of general lexicons it has been used GF’s core idea of common abstract syntax and multiple concrete syntaxes to produce multilingual morphological lexicons. The abstract syntax is based on data from the Princeton WordNet and the Oxford Advanced Learner’s dictionary. The concrete syntaxes are produced using data from already existing lexical resources (i.e. Bilingual dictionaries and Universal WordNet), and GF’s morphological smart paradigms. Because words can have multiple senses, and it is often very hard to find one-to-one word mappings between languages, two different types of multi-lingual lexicons have been developed: Uni-Sense and Multi-Sense. In a uni-sense lexicon each source word is restricted to represent one particular sense of the word, and hence it becomes easier to map it to one particular word in the target language. These type of lexicons are useful for building domain specific NLP applications. A multi-sense lexicon, on the other hand, is a more comprehensive lexicon and contains multiple senses of words and their translations to other languages. This type of lexicons can be used for open-domain tasks such as arbitrary text translation. These lexicons cover a number of language including English, German, Finnish, Bulgarian, Hindi, Urdu and their size ranges from 10 to 50 thousand lemmas.
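The difference between the two lexicon types can be shown with a small Python data-structure sketch. The sense identifiers and the entries below are made up for illustration and are not taken from the MOLTO lexicons. A multi-sense lexicon keeps every sense of a source word, while a uni-sense lexicon is obtained by fixing one sense per word, which makes lookup deterministic.

```python
# Illustrative multi-sense entry: one English word, two senses,
# each with its own translations (hypothetical sense identifiers).
MULTI_SENSE = {
    "bank": {
        "bank_financial_N": {"Ger": "Bank", "Fin": "pankki"},
        "bank_river_N": {"Ger": "Ufer", "Fin": "ranta"},
    },
}


def to_uni_sense(multi, preferred):
    """Derive a uni-sense lexicon by keeping exactly one sense per source word."""
    return {word: senses[preferred[word]] for word, senses in multi.items()}


def translate_uni(lexicon, word, lang):
    """Uni-sense lookup: one word, one translation."""
    return lexicon[word][lang]


def translate_multi(lexicon, word, lang):
    """Multi-sense lookup: every candidate translation, to be disambiguated later."""
    return sorted(sense[lang] for sense in lexicon[word].values())


# A domain-specific application (say, finance) fixes the relevant sense once.
FINANCE = to_uni_sense(MULTI_SENSE, {"bank": "bank_financial_N"})

print(translate_uni(FINANCE, "bank", "Ger"))
print(translate_multi(MULTI_SENSE, "bank", "Ger"))
```

In this sketch the multi-sense lookup returns a list, which is why open-domain use needs an additional disambiguation step that the uni-sense lexicon avoids.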
In WP5 we also experimented with open-domain robust translation based solely on GF. This is a huge step since the traditional application domain of GF is in controlled languages where the domain is small and well defined, while in the task of translating running text the source language is not clearly defined anymore. As a simple numerical measure for the leap, we can say that the typical GF applications deal with grammars containing hundreds of lemmas while in this experiment our grammars contain more than 50,000 lemmas. We developed an entirely new runtime system for GF in C which has the advantage to be more portable and more efficient. The efficiency was the first requirements that we had to satisfy since otherwise interpreting these huge grammars would be intractable. Furthermore, we turned the original non-probabilistic algorithms for parsing and reasoning into probabilistic ones. The introduction of probabilistic models is crucial for the disambiguation of the grammars which are by necessity highly ambiguous. The third major contribution to the project is that we also made the GF parser robust, i.e. when faced with sentences which are not parseable, it returns a sequence of recognized chunks rather than an error. We evaluated our implementation with state-of-the-art statistical parsers for related grammatical formalisms, and we found that for sentences longer than 25 tokens, our implementation is at least two orders of magnitude faster. We also tried to use our new architecture in machine translation but here the results are not conclusive yet. We found that the two main limitations are in the quality of the translation dictionaries which we built and the still limited coverage of the grammars. Furthermore, we need to better address the word sense disambiguation and the proper translation for multiword expressions.
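The robust behaviour described here, returning recognized chunks instead of an error, can be illustrated with a deliberately simple Python sketch. It is not the GF parsing algorithm and has no probabilistic component: the "grammar" is just a hypothetical set of known phrases, and chunking is a greedy longest match. Only the interface is the point: a full parse when possible, otherwise a sequence of chunks with unknown tokens marked.

```python
def robust_parse(tokens, phrases):
    """Return a list of (label, tokens) pairs.

    'S'     : the whole sentence is covered by the toy grammar
    'chunk' : a recognized fragment
    '?'     : a token the toy grammar does not cover
    """
    sentence = tuple(tokens)
    if sentence in phrases:
        return [("S", sentence)]

    result = []
    i = 0
    while i < len(sentence):
        # Try the longest recognized phrase starting at position i.
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in phrases:
                result.append(("chunk", sentence[i:j]))
                i = j
                break
        else:
            # Nothing matched: keep the token as an unknown element and move on.
            result.append(("?", sentence[i:i + 1]))
            i += 1
    return result


# Hypothetical toy grammar: a set of phrases it can recognize.
PHRASES = {
    ("the", "device"),
    ("comprises",),
    ("a", "mixing", "chamber"),
    ("the", "device", "comprises", "a", "mixing", "chamber"),
}

print(robust_parse("the device comprises a mixing chamber".split(), PHRASES))
print(robust_parse("the device allegedly comprises a mixing chamber".split(), PHRASES))
```

The second call contains a word outside the toy grammar, so the result is a chunk sequence in which only that word is marked as unknown, and the covered parts remain available for translation.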
The translation of patents using this robust parsing is still in an embryonic state, but we have developed a complete translation system that combines GF and SMT to overcome the input controlled language assumption. This hybrid system implies the construction of in-domain dictionaries and grammars that make use of probabilistic components, and the integration with an SMT engine that is able to complement GF translations. Regarding these resources for patent translation, we emphasise the generation of static lexicons obtained from SMT translation tables, and the on-line generation of lexicons with unseen vocabulary but available in the monolingual dictionaries. For German, also a dictionary of compounds has been built. A grammar for dealing with patents in English, French and German has been built on top of the resource grammar with several additions devoted to deal with chunks instead of sentences. Particular constructions appearing in patents are also covered by this new in-domain grammar. As a demand of the selected domain, we have also developed a detector and tokeniser of chemical compounds. A full translation system uses this tokeniser and prepares the patent to be translated. This involves chunking and parsing the source sentences which are first translated by GF and afterwards sent to an SMT decoder which is fed with this information. An SMT engine trained on the domain is also used by the top decoder. The final hybrid system is available for download and has several options that take into account which method to build the lexicon has to be used and which kind of integration is to be applied.
The novelties since the last report are, on the one hand, the improvement of the previous hybrid MT systems, their porting to German, and the development of new hybrid systems. On the other hand, we highlight the generation of lexical resources from WordNet, Apertium dictionaries and SMT translation tables, and the development of a robust statistical parser which is two orders of magnitude faster than comparable state-of-the-art probabilistic parsers. The latter developments extend the coverage of GF and are useful for general translation or translation in any domain. The former, by contrast, start from translation in a concrete domain and try to extend it beyond the coverage of the grammar.
Although some specific tasks have evolved over the life of the project, the three main lines have been accomplished:
i) GF grammar for the patents domain
ii) SMT system for patents
iii) Combination GF-SMT translators
By the evolution of the tasks through the project we mean, for example, that more time than estimated has been devoted to improving the GF patents grammar and to working on the soft-integration hybrid system that depends on it. The hybrid system that relies on GF tree fragment pairs is in a less mature state. The dependence on the performance of the robust parser proved to be crucial, and most efforts have been devoted in this direction.
In the first part of the project we developed a GF Mathematical Grammar Library (mgl) based on several OpenMath content dictionaries. This constitutes the OpenMath layer of the mgl. In the next part we developed, on top of it, the module Commands, which allows the use of human language to command a Computer Algebra System (cas) to compute the objects described in the OpenMath layer and to deliver the answers in natural language too.
For the final part we undertook the creation of a prototype for assisting students in modeling and solving word problems. The statements of these problems relate to notions of ordinary life, and the goal of proposing them to students is for them to learn how to express the relevant relations in the statement mathematically as equations (modeling) and then how to solve these to get numeric solutions interpreted in terms of the original statement (solving).
For the kind of reasoning needed here, the description logic used by WP4 (OWL reasoners) was found wanting in its arithmetical capabilities. We also needed a dialog system rather than a query/answer system. This led us to create a new reasoner based on Prolog, along the lines of WP11, able to cope with basic arithmetic settings. That means being able to decide automatically whether a problem statement is free from contradictions and whether it contains enough information to deduce the solution. On the other hand, since we want the system to guide the student towards the proper equations, we need to keep track of the state of the modeling process, storing new facts discovered by the student and automatically providing next-step hints to him/her. All this took much more time than originally planned and forced us to concentrate on the novel challenge (modeling) and set aside the solving part.
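The two checks mentioned above (is the statement free from contradictions, and does it determine the solution?) can be sketched for the simplest case. The sketch below is not the Prolog reasoner: it assumes, for illustration only, that each sentence of a word problem has already been turned into a linear equation over named quantities, and it uses exact Gaussian elimination. The function name `classify` and the example data are hypothetical.

```python
from fractions import Fraction


def classify(equations, variables):
    """Classify a set of linear facts about named quantities.

    Each equation is a pair (coeffs, constant) meaning
    sum(coeffs[v] * v) == constant. Returns "contradictory",
    "incomplete" (some quantity is not determined) or "solvable".
    """
    rows = [
        [Fraction(coeffs.get(v, 0)) for v in variables] + [Fraction(constant)]
        for coeffs, constant in equations
    ]
    rank = 0
    for col in range(len(variables)):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        lead = rows[rank][col]
        rows[rank] = [x / lead for x in rows[rank]]
        # Eliminate this column from every other row.
        for r in range(len(rows)):
            if r != rank and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    # A row reading "0 = c" with c != 0 signals a contradiction.
    if any(all(x == 0 for x in row[:-1]) and row[-1] != 0 for row in rows):
        return "contradictory"
    return "solvable" if rank == len(variables) else "incomplete"


# "John has 3 apples. Mary has twice as many as John. Together they have `total`."
VARIABLES = ["john", "mary", "total"]
FACTS = [
    ({"john": 1}, 3),
    ({"mary": 1, "john": -2}, 0),
    ({"total": 1, "john": -1, "mary": -1}, 0),
]
print(classify(FACTS, VARIABLES))
```

Dropping the last fact leaves `total` undetermined, and adding a fact such as "Mary has 5 apples" conflicts with the first two; a dialog system could use exactly these outcomes to decide whether to ask for more information or to flag an inconsistency.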
We developed a tool that runs on a Scala shell for constructing simple word problems, sentence by sentence, using one of the four languages supported: Catalan, English, Spanish and Swedish. It checks that the sentences written so far are consistent and complete to make a problem and saves it as Prolog code with comments in GF.
We developed an assistant that runs on a text terminal and engages the student in a dialog in one of the aforementioned languages. The dialog starts with the statement of the problem and then proceeds by providing hints on what to do next or answering questions about the information that has been discovered. The process ends when the student provides an equation that captures the relevant information to solve the problem. Then, the system delivers the solution in natural language.
We could not use the grammars of WP4 as stated, since the reasoning and the language are different. In the Query Technologies work package, questions are about objects in classes having properties, while in our case the questions are about cardinals of sets of objects. In addition, we departed from the query/answer form and moved to a dialog. All this required new grammars.
Time constraints, as mentioned above, forced us to leave the integration of the solving step out of the prototype. Apart from this, a vital component that mediated the communication between the GF side and the cas side (the Sage simple server) was deemed obsolete by the Sage community, so it was not advisable to pursue it further until a clear standard for communicating with Sage arises. At the time of writing, one candidate (sagecell) seems to dominate, but it is still not distributed among the standard packages of Sage and fails to install on some platforms with the latest version of Sage (5.9).
The aim of "WP7:Patents Case Study" was to create a prototype for automatic translation and multilingual retrieval of patents. The online prototype is publicly available at: http://molto-patents.ontotext.com/.
This patents case study has laid the groundwork for bringing together several technologies into a useful platform for multilingual patent retrieval. The main challenges addressed in the prototype are a) to translate semantically enriched patent documents, including the original mark-up, b) to design the mechanisms that enable multilingual indexing and retrieval of the patents, c) to define and develop a query language and query grammar that enable user-friendly interaction with the system, and d) to set up an on-line application for retrieval of patent documents that serves as a testbed for our work.
The patents prototype combines semantic annotations, retrieval techniques and two different approaches to machine translation. The integration of different translation methodologies into the system has been crucial for increasing its capabilities and making possible extended features and functionalities with respect to the preliminary version of the system.
For the massive translation of text, a statistical system has been trained and adapted to translate the text and transfer the semantic annotations into the target languages. One of the challenges in this task was to come up with a mechanism to transfer the semantics of the source texts to the target files. As a result, the patent documents are semantically enriched and translated using the statistical system. Then, the multilingual documents are used to feed the databases and indexes of the retrieval system. What remains as a future challenge is the use of these annotations to further increase either the accuracy of the annotations or the quality of the translations.
On the other hand, a rule-based system was built to translate from (controlled) natural language to the semantic query language (SPARQL) in the interface. GF has proved to be an efficient way of generating the SPARQL queries, as if SPARQL were “Yet Another Query Language”. In other words, it allows a natural language query to be translated from the user’s language into SPARQL, which makes the system accessible to a broader community rather than just skilled users. This automation also facilitates the interoperability between the query grammar and the ontologies, and speeds up the development and maintenance of the querying subsystem.
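The idea of treating SPARQL as just one more output language can be sketched as follows. This is not GF code and not the project's query grammar: it is a minimal illustration in which one abstract query has three hypothetical "concrete syntaxes" (English, French and SPARQL), and translation is parsing in one followed by linearisation in another. The sentence patterns and the `ex:mentions` property are invented for the example.

```python
PLACEHOLDER = "$X"

# One abstract query ("patents mentioning X") with three concrete forms.
# All patterns and the ex:mentions property are hypothetical.
CONCRETES = {
    "Eng": "show patents that mention $X",
    "Fre": "montre les brevets qui mentionnent $X",
    "Sparql": 'SELECT ?patent WHERE { ?patent ex:mentions "$X" . }',
}


def parse(lang, sentence):
    """Recover the abstract argument X from a sentence in one concrete syntax."""
    prefix, suffix = CONCRETES[lang].split(PLACEHOLDER)
    if not (sentence.startswith(prefix) and sentence.endswith(suffix)):
        raise ValueError(f"not a {lang} query: {sentence!r}")
    return sentence[len(prefix):len(sentence) - len(suffix)]


def linearize(lang, argument):
    """Render the abstract query with argument X in one concrete syntax."""
    return CONCRETES[lang].replace(PLACEHOLDER, argument)


def translate(source, target, sentence):
    """Translate by going through the shared abstract representation."""
    return linearize(target, parse(source, sentence))


print(translate("Eng", "Sparql", "show patents that mention aspirin"))
print(translate("Fre", "Eng", "montre les brevets qui mentionnent aspirine"))
```

Because every concrete form is tied to the same abstract query, adding a user language only requires one more entry, and the SPARQL side does not change.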
Finally, the patent prototype is not comparable with the interfaces exposed by the European Patent Office, mainly because they were conceived for different purposes. Nonetheless, the MOLTO patents prototype demonstrates that a patent retrieval system that addresses multilingualism by means of automatic translation techniques is commercially viable.
The preliminary version of the prototype, described in Deliverable 7.1, had only original patent documents in the databases, and the system was only available in English and French.
A complete version of the prototype, described in Deliverable 7.2, included resources also for German, and patent documents translated using the Statistical Machine Translation (SMT) system trained on the domain, and described in Deliverable 5.2.
The new features introduced with respect to previous versions of the prototype are:
1. A new process for statistical translation of patents that transfers the semantic annotations and the original mark-up of the source documents to the target language.
2. The development of the patent translator API, which allows the translation system to be integrated into remote applications, such as online patent translation in the GF cloud.
3. Updates to the retrieval architecture that improve the response time, such as snippeting.
4. A new querying approach for SPARQL generation based on the automation of grammar-ontology interoperability, driven by the Grammatical Framework.
5. A new query grammar for the biomedical patents domain, improved in terms of coverage and compliance with the patent domain ontology behind the information retrieval system.
6. New functionalities integrated in the user interface to improve the usability of the application, such as free-text search as a back-off mechanism for the query language.
7. Some updates to the on-line user interface that address usability aspects and further functionalities.
The main objectives of the work package have been fulfilled:
i) create a commercially viable prototype of a system for MT and retrieval of patents in the bio-medical and pharmaceutical domains,
ii) allowing translation of patent abstracts and claims in at least 3 languages
iii) exposing several cross-language retrieval paradigms on top of them.
This work package started with a six-month delay because the WP leader, Matrixware, left the MOLTO Consortium during Month 3.
After the re-scheduling, the tasks related to this work package were kept up to date according to the calendar.
The final version of the prototype was agreed to be delayed until M36 due to multiple dependencies on other work packages.
The new calendar allowed us to incorporate the latest developments (grammar and ontology interoperability in WP4 and hybrid translation from WP5) in the final demonstrated applications.
The multilingual Semantic Web system covers semantic data from the Gothenburg City Museum database and DBpedia. The grammar enables the automatic generation of coherent descriptions of paintings, and of answers to queries over them, in 15 languages for baseline functionality and in 5 languages with extended semantic coverage. The system contains an automatic process for translating museum names from Wikipedia. The process can easily be extended to translate names of painters, places, etc.
The system provides a public SPARQL endpoint against which the user can explore the knowledge base with manually written natural language queries.
We created the Museum Reason-able View where several ontologies were linked, including: the CIDOC-CRM, the Painting ontology and the Museum Artifacts Ontology (MAO).
We built an ontology-based system for communication of museum content on the Semantic Web and made it accessible in 15 languages. The multilingual system automatically generates coherent Wikipedia-like articles. It has been made available online for cross-language retrieval and representation using Semantic Web technology.
We were able to reuse the query technology that has been developed in WP4 and adapt it successfully to our needs.
We extended the semantic coverage of the grammar to five languages and demonstrated the benefits of exploiting a modular approach in the context of multilingual Semantic Web.
Due to the progress of other work packages, the actual evaluation work started in spring 2013. Some of the evaluations were made within work packages: for instance, the patent cases (WP7) were evaluated with automatic evaluation metrics, and the semantic multilingual wiki (WP11) was evaluated internally for usability. WP9's contribution to the project is translation quality evaluation with native or near-native speakers.
In the evaluations, human evaluators were presented with translations produced by MOLTO tools and reference translations from other MT systems (Google, Bing, Systran), and they chose the most adequate one, either for post-editing or to accept as such. From these results we calculated error rates and, in addition, the percentage of cases in which the evaluators preferred MOLTO translations over those of other systems. The results vary between languages and use cases, but in general both the automatic evaluation metrics and the percentage of the evaluators' preferred translations suggest that the MOLTO method fares better in the chosen domains.
During the evaluations, some errors were detected and the grammars in question were sent to be corrected. The time and effort needed to fix the languages that get the poorest results is another factor which is favorable to MOLTO tools: a systematic fix in the grammars corrects all instances of an erroneous construction.
Some methodological issues about the qualitative evaluation were raised during the project, especially concerning the evaluation of Phrasebook. MOLTO's goal has been to produce publishable quality automatically, but the evaluation results have been less than perfect. This does not mean that the results are incorrect, but simply that there are many ways to say the same thing, and an evaluation method that measures edit distance to a reference does not capture the whole picture. This discrepancy between human perceptions of quality and post-editing operations is discussed in the project deliverable and has been the topic of two conference papers by Maarit Koponen: one, within the M31-M39 period, at the AMTA 2012 Workshop on Post-editing Technology and Practice, and one presentation at the XI Symposium on Translation and Interpreting: Technology and Translation in Turku, Finland.
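The limitation of reference-based edit distance can be made concrete with a small sketch. It computes a plain word-level Levenshtein distance, which is an assumption made for illustration and not necessarily the exact metric used in the evaluation; the phrasebook-style sentences are invented examples.

```python
def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j] holds the distance between the hyp prefix seen so far and ref[:j].
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            substitution = previous[j - 1] + (h != r)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


reference = "where is the nearest restaurant"
close_paraphrase = "where is the closest restaurant"
free_paraphrase = "how do I get to the closest restaurant"

print(word_edit_distance(close_paraphrase, reference))
print(word_edit_distance(free_paraphrase, reference))
```

Both hypotheses would be acceptable to a human reader asking for directions, yet the second one needs six word edits against a five-word reference while the first needs only one, so the metric penalises a legitimate alternative phrasing much more heavily.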
N/A
The major work has been to produce the final deliverables for this work-package, a report on dissemination and exploitation and the final version of the MOLTO web services. In order to produce these, we have tweaked the website and added a number of ways to generate and view the publication activity of the Consortium. Part of the work has also included the delivery of an archival version of the software prototypes as bibliographical items, with describing metadata, on the project's publication list and on a devoted page:
http://www.molto-project.eu/view/biblio/type/Software.
We have checked the Open Access policy of the partners and requested the publication on OAI-PMH compatible repositories. The listing of such archives is documented in Deliverable D10.4.
The presence and dissemination of MOLTO via social sites has been constant throughout the lifetime of MOLTO, and in the last period we have started to plan how to sustain the MOLTO Community after the project's end. We have been testing various platforms, most recently a Google+ Community, where we also streamed the talks from the final Open Day and archived them on YouTube.
The final demonstrations are reachable from the website and are accompanied by videos, so that documentation remains available in the far future, when the technologies are obsolete and no longer available.
Proper documentation and archiving of all these resources is underway. The resources produced by the project are very different in nature and present a challenge in terms of sustainability and future accessibility. They include software (often depending on third-party libraries), technical reports and publications in digital and/or printed form, and multimedia material. We intend to store all of these on archival media; however, it is not clear how persistent they will remain.
We continued working on our two main projects: (1) developing ACE-in-GF (multilingual grammar of ACE) and (2) developing AceWiki-GF (multilingual CNL-based semantic wiki).
ACE-in-GF was extended to almost all the languages supported in the GF resource grammar library (~20 languages), although only the languages reported in D11.1 are fully implemented and tested.
The main work on AceWiki-GF was completed and reported in D11.2 and an ESWC 2013 conference paper. Smaller extensions and improvements continue.
In the last 5 months of the project we focused on the evaluation of both ACE-in-GF and AceWiki-GF. The design and results of both of these evaluations are reported in D11.3.
| Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern | 
|---|---|---|---|---|
| UZH | 3 (N. E. Fuchs) | 11.5 (K. Kaljurand, T. Kuhn) | 0 | 2.5 (L. Canedo, V. Ungureanu) | 
In this period we continued work on the further adoption of GF in the Business Process Platform of Be Informed. One of our goals in promoting the adoption of GF was to create a framework in which models can be verbalized automatically. Domain experts usually do not have a background in modeling, so checking whether a rule or law is modeled correctly usually proves difficult for them. Be Informed wants to remove these barriers by creating verbalizations of its models. These verbalizations, however, should not be only a single textual representation of the models: it should be possible to create verbalizations of the same models for a set of distinct tasks.
In order to do this, Be Informed created the 3D framework together with the University of Bielefeld. An article on this work will be published as Jeroen van Grondelle, Christina Unger: "A 3-Dimensional Paradigm for Conceptually-scoped Language Technology", in Towards the Multilingual Semantic Web, Paul Buitelaar and Philip Cimiano, eds., Springer, Heidelberg, Germany, 2013. The framework's orthogonal modularization supports specification of the conceptualization and lexical information per dimension, i.e. specifying domains independently of tasks and vice versa. The dimensions can then be freely combined by choosing the particular domains, tasks and languages supported for a specific application.
While the task grammars are written once by hand, each of the domain grammars is created automatically from a Be Informed or OWL ontology and plugged right into the grammars already created for the framework. In order to create these domain grammars automatically, Be Informed created three verbalizers, each with its own heuristics for creating verbalizations.
In an evaluation, the likelihood of the verbalizations created with grammars from these verbalizers was compared to that of the verbalizations created by the velocity templates, the verbalizer currently implemented in the Be Informed product suite. The results show that the GF-based verbalizers are better than the velocity verbalizer.
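The separation of dimensions can be sketched in a few lines. This is only an analogy in Python, not the GF grammars of the framework: task templates stand in for the hand-written task grammars, a lexicon keyed by ontology concept stands in for the automatically generated domain grammars, and all names, templates and concepts are hypothetical.

```python
# Task dimension: written once by hand, one template per (task, language).
TASK_TEMPLATES = {
    ("state_rule", "en"): "Every {subject} must have a {object}.",
    ("ask_check", "en"): "Does the {subject} have a {object}?",
    ("state_rule", "nl"): "Elke {subject} moet een {object} hebben.",
}

# Domain dimension: words per concept, as they might be derived from
# the labels of an ontology (hypothetical concepts and labels).
DOMAIN_LEXICON = {
    "en": {"Applicant": "applicant", "Permit": "permit"},
    "nl": {"Applicant": "aanvrager", "Permit": "vergunning"},
}


def verbalize(task, language, subject, obj):
    """Combine one task, one language and two domain concepts into a sentence."""
    template = TASK_TEMPLATES[(task, language)]
    words = DOMAIN_LEXICON[language]
    return template.format(subject=words[subject], object=words[obj])


print(verbalize("state_rule", "en", "Applicant", "Permit"))
print(verbalize("ask_check", "en", "Applicant", "Permit"))
print(verbalize("state_rule", "nl", "Applicant", "Permit"))
```

Swapping in a different domain only replaces the lexicon, and adding a task only adds templates, which mirrors the free combination of domains, tasks and languages described above.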
Work on the adoption and evaluation has been finished and reported in D12.2 (http://www.molto-project.eu/biblio/deliverable/d122-user-studies-bis-exp...).
| Node | RTD | MGNT | OTHER | 
|---|---|---|---|
| BI WP1 | 0 | 0.4 | 0 | 
| BI WP9 | 1.7 | 0 | 0 | 
| BI WP10 | 1.3 | 0 | 0 | 
| BI WP12 | 15.5 | 0 | 0 | 
The list of project deliverables can be obtained from the web pages at http://www.molto-project.eu/view/biblio/deliverables and in annotated form at http://www.molto-project.eu/view/biblio/type/Deliverable. The administrative view provides the links to the final PDF but also to the online version of the documents at http://www.molto-project.eu/workplan/deliverables. Below is a summary for the final period.
Project management during the last period aimed at strengthening the cooperation work of the partners for finalizing the tools and technologies delivered. This coordination was mainly achieved through the creation of a Google group for the MOLTO project: all members of the Consortium have been subscribed to it.
A major issue has been the relocation of the project manager Olga Caprotti to the partner node UPC for the period 1 March 2013 - 11 June 2013. This was made necessary by national Swedish regulations that prevent hiring on temporary positions for longer than 2 years (she had already used a 1-year long non-renewable temporary post as a visiting researcher). A part of the funding has been transferred to UPC to cover the costs of hiring Dr. Caprotti until the end of the project.
The use of resources has undergone some internal shifts among work packages in order to cover extra work that had become necessary due to initial delays and personnel turnover. Where appropriate, these shifts are documented case by case in the work-package reports.
As reported in Deliverable D1.6, follow-up actions after the annual review included discussions within the Consortium on how to organize a better showcase for the final results of the project and how to address the reviewers' remarks and suggestions. Updated versions of some deliverables were produced and made available on the website.
Here below we address each comment in the review.
Technical coordination should be strengthened. Continuous and strict monitoring should be applied. Reviewers made several recommendations in the 1st review but most of them have not been implemented or it was unclear what was done with respect to them. As it is shown in the remarks per WP, the adoption of most of these recommendations would support monitoring of the work progress towards the project’s objectives.
The greatest effort undertaken to strengthen the coordination of the partners was to define a number of "flagships" aimed at demonstrating the integration of the MOLTO technologies. These showcase demonstrators have been developed during the final months of the project by tight cooperation of the partners, each flagship adopting and reviewing some tool or technology from a different partner.
The recommendation from the 1st review “How grammar rules are extracted (from lexical databases, ontologies, text examples) needs to be specified in detail and a concrete schedule should be included in the updated workplan (D1.1)” has not been included in D1.1. It should be included in D2.3 “Grammar tool manual and best practices”, due in M27. This is a crucial deliverable since the best practices with respect to the other work packages should be included here.
The recommendation from the 1st review “Details on the integration steps (the integration of the vocabulary editor with the translation editor, the integration of the vocabulary editor with TermFactory (TF), and the integration of TF with the Knowledge Representation Infrastructure (KRI) of WP4) need to be provided in the updated workplan (D1.1). Concerning the integration of TF and KRI, it seems that there are overlaps between these tools. The partners must clarify which functions of these tools will be used in the case studies in order to exploit complementarities of the tools and avoid overlaps.” has not been addressed properly and is presented as still “less understood” by the WP leader. This is a major issue of concern. The problems of the integration of WP3 tools remain. These should be discussed in an updated D1.1.
Follow-up: D4.3A makes a comparison between KRI and TF and suggests steps to be taken to integrate TF and KRI. The integration requires a mirror of a KRI site whose semantic repository is open to editing with TermFactory. The resulting knowledge base in the mirror site will be grammatically enriched, so that the new information is presented to the user. Moreover, the integration can facilitate lexicon extraction for the GF grammars and the query language of KRI.
The translator's tools that should be developed in WP3 should not be given up. Although the WP's leader's impression is that the MT quality is too low for the tools to ever be used, the developed tools can be useful for those subdomains/language pairs where MT quality is better.
Follow-up: The development of the translator's tool has been continued, but on another platform. Deliverables 3.1 and 3.2 use GlobalSight, a translation management system, and an external editor that supports GF. However, we found that GlobalSight was not maintained, and changed to Pootle, a modern and lightweight translation platform with an active user base. D3.3 describes the integration of the GF translation into Pootle. A demo video can be found on MOLTO's YouTube channel.
The recommendation from the 1st review “Critical issues with respect to the semi-automatic creation of abstract grammars from ontologies, as well as deriving ontologies from grammars, are still to be clarified. Concrete steps to handle these issues need to be specified in detail and a schedule should be included in the updated work plan (D1.1). In addition, as noted with respect to WP3, complementarities between KRI and TF should be exploited avoiding possible overlaps. Terminology should be added and abbreviations explained in Deliverable D4.1 in order to facilitate reading by non-experts in the field” should still be addressed. The issue of the two-way interoperability between ontologies and GF grammars still remains unclear, although as noted in the DoW this represents one of the two most research-intensive parts of MOLTO. This should be solved in the new versions of D4.2 and D4.3. The current versions of deliverables D4.2 “Data Models, Alignment Methodology, Tools and Documentation” and D4.3 “Grammar-Ontology Interoperability” are not approved. D4.2 is too general. For instance, a lot is said about LOD and the museum case and not on the alignment methodology. D4.3, on the other hand, does not give a clear picture of the interoperability issues and the degree of automation that can be expected. What is required for porting this to a new application? Concrete steps should be provided making clear what can be automated and what cannot with the provided infrastructure.
Follow-up:
The current description of work in WP6 lacks details on the prototype multilingual dialogue system to be developed. As recommended in the 1st review, an example dialogue and specifications of this prototype should be provided. These can be included in D9.1E.
An example dialog and description are available in the D6.3 cover document (http://www.molto-project.eu/sites/default/files/D6.3.pdf) (Sections 1, 3 and 4).
WP7 work should focus on the major issues examined in MOLTO, especially in relation to the grammar – ontology interoperability automation. Specific scenarios are needed for the exploitation of MOLTO tools in this case study. It is recommended to include such scenarios in deliverable D9.1E.
Follow-up:
In response, two use case scenarios were described: UC-71 and UC-72.
UC-71 focuses on grammar-ontology interoperability. User queries, written in CNL, are used to query the patent retrieval system. We defined a query language and a new query grammar in order to a) decrease the number of ambiguities in the queries and b) increase the coverage of the ontology. As a result, we came up with a more reusable grammar (YAQL) that is easier to maintain and reduces the labour of building query grammars for the application domain and languages. NL queries are translated into SPARQL using this approach. Additional details are given in D7.3 and D4.3A.
UC-72 focuses on high-quality machine translation of patent documents, and the ultimate goal is to endow the retrieval system with the information required to enable multilinguality. We used an SMT baseline system to translate a big dataset of patents and feed the retrieval databases. The automatic translation included the semantic annotation, available only in English documents. This mechanism allowed us to extract multilingual lexicons for the domain ontology, which were also used to build the query grammars. More details are also given in D7.3.
Finally, the exploitation plans for the technologies developed within this WP, which are further discussed in D10.4, are focused on multilingual text processing and cross-lingual translation of various domain data within search and retrieval techniques.
The recommendation from the 1st review “Preparation of a new version of D9.1 is recommended including prototype specifications and scenarios for the three case studies (WP6, WP7, WP8)” should still be addressed. A concrete evaluation methodology is needed focusing on MOLTO's major goals: How will the consortium prove that its objectives were fully/partially met? We expect to see this in D9.1E “Addendum to the MOLTO test criteria, methods and schedule” hoping that the recommendations suggested above as well as in the 1st review, in relation to D9.1, will be included there.
Follow-up: D9.1A “Appendix to MOLTO test criteria, methods and schedule” addresses these issues.
The way the project’s web site is structured, although it contains the necessary content, affects its readability in some cases. It should contain a structure according to the work packages, including all documentation related to a specific work package.
The content published on the web site can be navigated according to the way the producer has tagged it. If the author has tagged an item as belonging to a work package, then this content is displayed when the corresponding tag is selected: e.g. http://www.molto-project.eu/category/dow/potential-impact/dissemination or, for publications, http://www.molto-project.eu/biblio/keyword/88, which selects the WP7-related bibliography. However, to the casual reader of the website, the distinction in work packages is not very informative and the results are best viewed independently of the contingent organization in the work plan. Following this principle, we have created a navigation menu that distinguishes the internal, work-plan related items from the public, more general publications.
The deliverables on the work plan (D1.1) and the dissemination plan (D10.1) should be updated at the beginning of the 3rd year.
We have adopted the methodology to continuously use online publication tools on the internal section of our web pages in order to maintain the work plan, the dissemination plan and their updates. Partners that are undergoing new activities use the news feed to inform the Consortium. Work package leaders have been given the option to create tasks, allocate and manage them. Some of the work planning has been coordinated by the partners using third party specific tools such as Trello (trello.com) and Symphonical (https://www.symphonical.com).
 
 
The figures in the attached table come from the participants' time sheets. They are therefore a preliminary estimate: the final figures will be available in the NEF, Form C, when every participant has finalized their reporting there.

I, as scientific representative of the coordinator of this project and in line with the obligations as stated in Article II.2.3 of the Grant Agreement declare that:
The attached periodic report represents an accurate description of the work carried out in this project for this reporting period;
The project (tick as appropriate):
x has fully achieved its objectives and technical goals for the period;
☐ has achieved most of its objectives and technical goals for the period with relatively minor deviations.
☐ has failed to achieve critical objectives and/or is not at all on schedule.
The public website, if applicable:
x is up to date
☐ is not up to date
To my best knowledge, the financial statements which are being submitted as part of this report are in line with the actual work carried out and are consistent with the report on the resources used for the project (section 4) and if applicable with the certificate on financial statement.
All beneficiaries, in particular non-profit public bodies, secondary and higher education establishments, research organisations and SMEs, have declared to have verified their legal status. Any changes have been reported under section 3 (Project Management) in accordance with Article II.3.f of the Grant Agreement.
Name of scientific representative of the Coordinator:
....................................................................
Date: 30/7/2013
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D2.1. GF Grammar Compiler API | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M13 | 
| Actual date of delivery: | March 2010 | 
| Type: | Prototype | 
| Status & version: | Draft (evolving document) | 
| Author(s): | A. Ranta, T. Hallgren, et al. | 
| Task responsible: | UGOT | 
| Other contributors: | 
Abstract
The present paper is the cover of deliverable D2.1 as of M13.
GF, Grammatical Framework, is a programming language for multilingual grammars. GF is used in the MOLTO project to build translation systems. How to write GF grammars is specified in the numerous tutorials and manuals available via http://grammaticalframework.org. The compiler API is a document that explains aspects of the compiler of the GF language:
The GF API can be downloaded from the MOLTO svn server at svn://molto-project.eu/compiler-api and compiled by running make. For compilation to succeed and produce an HTML file readable in a browser, the txt2tags software (from http://txt2tags.org) and the graphviz software (http://www.graphviz.org) must be installed.
A version maintained by the GF developers is also available online at http://www.grammaticalframework.org/compiler-api.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D2.2 Grammar IDE | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M18 | 
| Actual date of delivery: | September 2011 | 
| Type: | Prototype | 
| Status & version: | Final | 
| Author(s): | A. Ranta, T. Hallgren, et al. | 
| Task responsible: | UGOT | 
| Other contributors: | John Camilleri, Ramona Enache | 
Abstract
Deliverable D2.2 describes the functionalities of an Integrated Development Environment (IDE) for GF (Grammatical Framework). The main question it addresses is how such a system should help the programmers who write multilingual grammars. Two IDE's are presented: a web-based IDE enabling a quick start for GF programming in the cloud, and an Eclipse plug-in, targeted at expert users working with large projects, which may involve the integration of GF with other components. Example-based grammar writing is also described at the end.
An IDE, Integrated Development Environment, is a software system targeted for programmers. It helps the programmer to write code and test and maintain it. The tasks where IDE helps can include
(cf. http://en.wikipedia.org/wiki/Integrated_development_environment). While interactive command interpreters (such as Unix shells) were historically the first systems recognized as IDE's, the contemporary notion of "IDE's proper" assumes a set of visual tools. The most widely used IDE's are probably Eclipse (http://www.eclipse.org/), Microsoft Visual Studio (http://www.microsoft.com/visualstudio), and Apple's XCode (http://developer.apple.com/xcode/). Each of these is a desktop program of substantial size. But recent times have also seen Web IDE's (WIDE), where the user can write programs without installing any software locally; an example is CodeRun (http://www.coderun.com).
The purpose of this document is to introduce an IDE for GF, Grammatical Framework (http://www.grammaticalframework.org/). GF is a programming language designed for writing multilingual grammars and their applications (Ranta 2011). Typical applications are translation systems (with many simultaneous languages) and the localization of natural language processing systems such as question answering (with many alternative languages).
This paper will introduce two IDE's for GF:
The Web IDE is intended to be a quick way to use GF, since it doesn't require any software installation, and also has some helpful functionalities to guide novice users. But it is less adapted for large GF programs consisting of large numbers of modules, such as GF grammar libraries. The Web IDE is a mature program tested by many users, but new developments are still expected.
The Eclipse plug-in is meant for power users of GF, who have to maintain perhaps hundreds of GF modules simultaneously and to link them with other software. But it is less quick to get started with, since it requires the installation of the GF compiler, the Eclipse platform, and the GF Eclipse plug-in. The Eclipse plug-in is still at an early stage of its development.
Both these tools are new, and have been built during 2011 within the MOLTO project. The traditional "IDE" for GF is one familiar from the Unix environment:
The interactive shell is a Read-Eval-Print loop similar to those of LISP and, more recently, Haskell (GHCI). While it has more IDE functionalities than the shells of many programming languages provide, we are not calling it an IDE, but reserve that name for the graphical Web IDE and Eclipse systems. Actually, the GF shell can be seen as an API (Application Programmer's Interface) to the GF compiler. It provides a set of commands that can be used for compiling, diagnosing, and testing GF grammars. More sophisticated IDE's can be built by using the shell command language to communicate with the compiler. The document The GF Grammar Compiler API (MOLTO Deliverable 2.1) gives more information on the available functionalities.
A GF program is a multilingual grammar, which for n languages consists of 1+n modules: one abstract syntax defining the semantic content in a language-independent way, and for each language a concrete syntax showing how this content is expressed in that language. Here is a "hello world" example for English, Finnish, and Italian:
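  abstract Hello = {
    flags startcat = Greeting ;  -- category parsed and generated by default
    cat Greeting ; Recipient ;
    fun 
      Hello : Recipient -> Greeting ;
      World, Mum, Friends : Recipient ;
  }
  
  concrete HelloEng of Hello = {
    lincat Greeting, Recipient = {s : Str} ;  -- both categories are plain strings
    lin 
      Hello rec = {s = "hello" ++ rec.s} ;
      World = {s = "world"} ;
      Mum = {s = "mum"} ;
      Friends = {s = "friends"} ;
  }
  
  concrete HelloFin of Hello = {
    lincat Greeting, Recipient = {s : Str} ;
    lin 
      Hello rec = {s = "terve" ++ rec.s} ;
      World = {s = "maailma"} ;
      Mum = {s = "äiti"} ;
      Friends = {s = "ystävät"} ;
  }
  
  concrete HelloIta of Hello = {
    lincat Greeting, Recipient = {s : Str} ;
    lin 
      Hello rec = {s = "ciao" ++ rec.s} ;
      World = {s = "mondo"} ;
      Mum = {s = "mamma"} ;
      Friends = {s = "amici"} ;
  }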
From this code, the GF compiler produces a system that can parse phrases like hello world and ciao mamma, and also generate them in each language, thus enabling translation between any pair of languages.
The Hello grammar is of course extremely simple, on purpose. But it shows the essential structure of multilingual grammars, and it is easy to see how the grammar could be extended by adding new functions (i.e. combination rules like Hello and words like Mum). The GF compiler checks that the abstract and concrete syntaxes are in synchrony. For instance, it checks that each abstract syntax function (fun) actually has a linearization (lin) in each concrete syntax. An IDE is expected to go one step further: it reminds the programmer, prior to running the compiler, of those linearizations that are missing. And when a new language (i.e. a new concrete syntax) is added to the system, the IDE initializes its code with a template for all required linearization rules.
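The check an IDE is expected to perform before compilation can be illustrated with a minimal Python sketch. It is not the code of either IDE; the function names `missing_linearizations` and `linearization_template`, the `?` placeholder, and the deliberately incomplete HelloIta entry are assumptions made for illustration only.

```python
def missing_linearizations(abstract_funs, concretes):
    """Map each concrete syntax name to the abstract functions it lacks."""
    return {
        name: sorted(set(abstract_funs) - set(lins))
        for name, lins in concretes.items()
    }


def linearization_template(abstract_funs):
    """Produce stub lin rules for a newly added concrete syntax.

    The "?" is only a placeholder for the programmer to fill in.
    """
    return "\n".join(f"    {fun} = ? ;" for fun in abstract_funs)


# Function names come from the Hello grammar; HelloIta is made incomplete
# here (hypothetically) to show what would be reported.
hello_funs = ["Hello", "World", "Mum", "Friends"]
hello_concretes = {
    "HelloEng": {"Hello", "World", "Mum", "Friends"},
    "HelloIta": {"Hello", "World"},
}
report = missing_linearizations(hello_funs, hello_concretes)
```

With this input, `report` lists nothing for HelloEng and the two functions Friends and Mum for HelloIta, which is the kind of reminder the programmer would see before running the compiler.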
Multilinguality is one aspect of GF's module system: each language, as well as the abstract syntax, has its own module. Larger GF applications have an additional complexity created by the inheritance and opening of modules; a large grammar can easily have 20 modules involved for each language, and this is multiplied by the number of languages plus one for the abstract syntax. While the opening and inheritance correspond to the module dependencies found in most other programming languages (such as inheritance and the use of libraries), the multilinguality aspect is an extra dimension, which makes GF programs more complex than usual programs.
A GF project with 15 languages, as targeted in the MOLTO project, involves hundreds of modules in scope at the same time. These are roughly divided into two groups,
The total resource grammar code in September 2011 comprises 755 modules, addressing 20 natural languages. This code is normally distributed as binaries (although the source is also available) and never read or written by the application programmer. But the programmer of course needs to inspect the code: to see, for instance, what functions are available to construct objects of a given type such as noun or sentence. Inspecting the library code is one of the most important things that should be supported by the IDE.
Traditionally, GF grammars are created in a text editor and tested in the GF shell. Text editors know very little (if anything) about the syntax of GF grammars, and thus provide little guidance for novice GF users. Also, the grammar author has to download and install the GF software on his/her own computer.
In contrast, the GF editor for simple multilingual grammars is available online, making it easier to get started. All that is needed is a reasonably modern web browser. Even Android and iOS devices can be used.
The editor also guides the grammar author by showing a skeleton grammar file and hinting how the parts should be filled in. When a new part is added to the grammar, it is immediately checked for errors.
Editing operations are accessed by clicking on editing symbols embedded in the grammar display: + = Add an item, × = Delete an item, % = Edit an item. These are revealed when hovering over items. On touch devices, hovering is in some cases simulated by tapping, but there is also a button at the bottom of the display to "Enable editing on touch devices" that reveals all editing symbols.
In spite of its name, the editor runs entirely in the web browser, so once you have opened the web page, you can continue editing grammars even while you are offline.
 
 
 

At the moment, the editor supports only a small subset of the GF grammar notation. Proper error checking is done for abstract syntax, but not (yet) for concrete syntax.
The grammars created with this editor always consist of one file for the abstract syntax, and one file for each concrete syntax.
The supported abstract syntax corresponds to context-free grammars (no dependent types). The definition of an abstract syntax is limited to
Available editing operations:
Error checks:
At the moment, the concrete syntax for a language L is limited to
SyntaxL and ParadigmsL,
  LexiconL and ExtraL,
  Available editing operations:
Also,
Error checks:
When the Compile button is pressed, the grammar is compiled with GF, and any errors not detected by the editor are reported. If the grammar is free from errors, the user can then test it by clicking on links to the online GF shell, the Minibar or the Translation Quiz.
 

While the editor normally stores grammars locally in the browser, it is also possible to store grammars in the cloud. Grammars can be stored in the cloud just for backup, or to make them accessible from multiple devices.
There is no automatic synchronization between local grammars and the cloud. Instead, the user should press the upload button to upload the grammars to the cloud, and press the download button to download grammars from the cloud. In both cases, complete grammars are copied and older versions at the destination will be overwritten. When a grammar is deleted, both the local copy and the copy in the cloud are deleted.
Each device is initially assigned its own unique cloud. Each device can thus have its own set of grammars that are not available on other devices. It is also possible to merge clouds and share a common set of grammars between multiple devices: when uploading grammars to the cloud, a link to this grammar cloud appears. Accessing this link from another device will cause the clouds of the two devices to be merged. After this, grammars uploaded from one of the devices can be downloaded on the other devices. Any number of devices can join the same grammar cloud in this way.
Note that while it is possible to copy grammars between multiple devices, there is no way to merge concurrent edits from multiple devices. If the same grammar is uploaded to the cloud from multiple devices, the last upload wins. Thus the current implementation is suitable for a single user switching between different devices, but not recommended for sharing grammars between multiple users.
Also note that each grammar is assigned a unique identity when it is first created. Renaming a grammar does not change its identity. This means that name changes are propagated between devices like other changes.
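The storage semantics described above (whole-grammar copies, last upload wins, identity independent of the name) can be sketched in a few lines of Python. This is an illustrative model only, not the implementation of the grammar cloud service; the class names `Grammar` and `GrammarCloud` and the use of a random UUID as the identity are assumptions.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Grammar:
    name: str
    files: dict
    # Identity is fixed when the grammar is created and survives renaming.
    ident: str = field(default_factory=lambda: uuid.uuid4().hex)


class GrammarCloud:
    """Whole-grammar storage keyed by identity; concurrent edits are not merged."""

    def __init__(self):
        self.store = {}

    def upload(self, grammar):
        # A complete copy replaces any older version with the same identity.
        self.store[grammar.ident] = Grammar(
            grammar.name, dict(grammar.files), grammar.ident
        )

    def download(self):
        return {
            ident: Grammar(g.name, dict(g.files), g.ident)
            for ident, g in self.store.items()
        }


cloud = GrammarCloud()
on_laptop = Grammar("Hello", {"Hello.gf": "abstract Hello = { }"})
cloud.upload(on_laptop)

# A second device downloads the grammar, renames it and uploads it again.
on_phone = cloud.download()[on_laptop.ident]
on_phone.name = "Greetings"
cloud.upload(on_phone)

# The laptop uploads afterwards: its copy replaces the phone's, rename included.
cloud.upload(on_laptop)
```

Because the store is keyed by identity rather than by name, the rename never creates a second grammar; and because each upload replaces the whole entry, the phone's change is lost once the laptop uploads its older copy.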
This prototype gives an idea of how a web-based GF grammar editor could work. While this editor is implemented in JavaScript and runs in the web browser, we do not expect to create a full implementation of GF that runs in the web browser, but rather to let the editor communicate with a server running GF.
By developing a GF server with an appropriate API, it should be possible to extend the editor to support a larger fragment of GF, to do proper error checking and make more of the existing GF shell functionality accessible directly from the editor.
The current grammar cloud service is very primitive. In particular, it is not suitable for multiple users developing a grammar in collaboration.
The aim behind developing a desktop IDE for GF is to provide more powerful tools than may be possible and/or practical in a web-based setting. In particular, the ability to resolve cross-references between source files and libraries instantaneously during development time is one of the primary goals and motivations for the project.
The choice was made to develop this desktop IDE as a plugin for the Eclipse Platform, as it seemed to be the most popular choice among the GF developer community. Support for the platform is vast and many tools for adapting Eclipse to domain-specific languages already exist. Unlike the zero-click WIDE approach, using the GF Eclipse plugin (GFEP) will require some manual installation and configuration on the development machine. Thus the GFEP is aimed more at seasoned developers than at the merely curious.
Implemented (including partially)
Coming soon
Long-term goals
The starting point for the GFEP is the Xtext DSL Framework for Eclipse (http://www.eclipse.org/Xtext/). By converting the GF grammar into the appropriate Extended-BNF form required by the LL(*) ANTLR parser, the framework provides a good starting point for future plugin development, already including a variety of syntax checking tools and some cross-reference resolution support. The specific requirements of the GF language, particularly its special module hierarchy, mean that significant customisations to this generated base plugin are needed.
As of 1st October 2011, a first prototype of the GFEP has been released to GF developers to gather some initial feedback. This first release is not intended to be a mature development tool, but a showcase of some of the potential features that can be provided by developing GF grammars within a powerful desktop IDE. Reactions from within the GF developer community will guide the way forward, both in prioritizing the future tasks and in better gauging the person-month cost that an eventual mature version of the plugin would require.
 
cat definition for example will produce warnings and/or errors in the other modules.
 
 
recip in:
   Hello recip = {s = "hello" ++ recip.s ! Masc} ;
Masc works but ResEng.Masc does not.
It is typically the case that the writer of a GF concrete grammar is at least fluent in the language and has GF skills proportional to the complexity of the abstract syntax to implement. However, for a rather complex multilingual grammar comprising 5 or more languages, such as the MOLTO Phrasebook (reference...), which was first available in 14 languages and has a reasonably rich semantic interlingua, finding grammar developers is difficult. Even where such developers exist, their task can be made easier by automating where possible and by alleviating certain technicalities of GF programming that would slow down grammar development.
When writing an application grammar, one such problem is using the resource library to generate text for a given language with the help of the primitives already defined in the corresponding resource grammar. For this, however, one needs to be familiar with the almost 300 existing functions, assuming that the domain grammar writer is not the resource grammar writer, as is often the case.
In order to make the users' task easier, an API is provided so that the domain grammar writer only needs to know the GF categories and how they can be built from each other. This layer makes the interaction with the resource library smoother for users, and also makes it easier to make new constructions from the library available.
For example, the sentence "I talked to my friends about the book that I read", is parsed to the following abstract syntax tree:
UseCl (TTAnt TPast ASimul) PPos (PredVP (UsePron i_Pron) (ComplSlash (Slash3V3 talk_V3 (DetCN (DetQuant DefArt NumSg) (RelCN (UseN book_N) (UseRCl (TTAnt TPast ASimul) PPos (RelSlash IdRP (SlashVP (UsePron i_Pron) (SlashV2a read_V2))))))) (DetCN (DetQuant (PossPron i_Pron) NumPl) (UseN friend_N))))
If we use the API constructors, the abstract syntax tree is simpler and more intuitive:
mkS pastTense (mkCl (mkNP i_Pron) (mkVP (mkVPSlash talk_V3 (mkNP the_Art (mkCN (mkCN book_N) (mkRS pastTense (mkRCl which_RP (mkClSlash (mkNP i_Pron) (mkVPSlash read_V2))))))) (mkNP (mkQuant i_Pron) plNum friend_N)))
In this way, the domain grammar writer can just use the functions from the API and combine them with lexical terms from dictionaries and with functions from outside the core resource library that implement non-standard grammatical phenomena that do not occur in all languages.
One step further in the direction of automating the development of domain grammars is to have the possibility to enter function linearizations as a positive example of their usage. This is particularly helpful in larger grammars containing syntactically complicated examples that would challenge even the more experienced grammarians. If instead an example is provided, even though the grammar could return more than one parse tree, the user can select the good tree or take advantage of the probabilistic ranking and take the most likely one.
The example-based grammar writing system is still work in progress, but a basic prototype is already available and will be further developed and improved. The basic steps of the system are briefly described below, along with the directions for future work.
The typical scenario is a grammarian working on a domain concrete grammar for a given language - which we call X for convenience.
In this case, he would need at least a resource grammar for X. Preferably there should also be a large lexical dictionary and/or a larger-coverage GF grammar with probabilities. Currently, larger lexical resources exist for English, Swedish, Bulgarian, Finnish and French. For Turkish there exists a large lexicon also, but the resource grammar is not complete.
We also assume that the user already has an abstract syntax for the grammar and that the _lincats_ (namely, the representations of the abstract categories in the concrete grammar) are basic syntactic categories from the resource grammar (such as NP, S, AP).
Consequently, the functions from the abstract syntax are grouped in a topological order, where the ordering relation is defined by a < b <=> b takes a as argument in a non-recursive rule. There are no cycles in this chain of ordered elements, since a similar check is performed at the compilation stage. The elements are ordered in a list of lists, where every sub-list contains mutually incomparable elements. The user is first presented with the first sub-list and, after completing it, with the next ones.
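The grouping into a list of lists amounts to a layered topological sort. The following Python sketch shows one way to compute it under stated assumptions: the function name `linearization_batches`, the dictionary representation of dependencies, and the small example with Fish, Delicious, This and Is are hypothetical and not taken from the prototype.

```python
def linearization_batches(depends_on):
    """Group functions into batches so that every argument precedes its user.

    depends_on maps each function to the functions it takes as arguments
    in non-recursive rules. Dependencies on functions that are not keys of
    the mapping are ignored. Functions in one batch are mutually incomparable.
    """
    remaining = {
        fun: set(deps).intersection(depends_on)
        for fun, deps in depends_on.items()
    }
    batches = []
    while remaining:
        # Everything whose arguments are all linearized already is ready.
        ready = sorted(fun for fun, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError(
                "cyclic dependency among: " + ", ".join(sorted(remaining))
            )
        batches.append(ready)
        for fun in ready:
            del remaining[fun]
        for deps in remaining.values():
            deps.difference_update(ready)
    return batches


# Hypothetical fragment: This takes Fish, and Is takes This and Delicious.
example = {
    "Fish": set(),
    "Delicious": set(),
    "This": {"Fish"},
    "Is": {"This", "Delicious"},
}
batches = linearization_batches(example)
```

For this input the user would first be asked for Delicious and Fish, then for This, and finally for Is. The explicit cycle check is redundant if the grammar has already passed the compiler's check, but it keeps the sketch from looping forever on bad input.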
For each such function, an abstract tree from the domain grammar having as root the given function will be generated. The arguments are chosen among the functions already linearized. In case that another concrete grammar exists already, the user can also see a linearization of the tree in the other language, and also an example showing how the given construction fits into a context. For example, if the user needs to provide an example for Fish in a given grammar, say the tourist phrasebook, and there is an English grammar already then he would get a message asking him to translate fish as in "fish" / "This fish is delicious".
When providing the translation, the user will be made aware of the boundaries of the grammar, by the incremental parser of the resource grammar. If the example can be parsed and the number of parse trees is greater than 1, then either the user can pick the right one, or the system can choose the most probable tree as a linearization. From here, the system will also generalize the tree by abstracting over the arguments that the function could have. Finally the resulting partial abstract syntax tree will be translated to an API tree and written as linearization for the given function.
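The two decisions in this step, picking one parse tree and abstracting over the function's arguments, can be sketched in Python as follows. Trees are modelled as nested tuples, and the names `choose_tree` and `generalize`, the probabilities, and the example constructors are assumptions for illustration rather than the prototype's actual data structures.

```python
def choose_tree(parses, picked=None):
    """Return the tree to use as the basis of the linearization.

    parses is a list of (tree, probability) pairs from the parser.
    picked is the index chosen by the user, or None to fall back on
    the most probable tree.
    """
    if not parses:
        raise ValueError("the example is outside the coverage of the grammar")
    if picked is not None:
        return parses[picked][0]
    return max(parses, key=lambda pair: pair[1])[0]


def generalize(tree, arguments):
    """Replace each subtree that corresponds to an argument by its variable."""
    if tree in arguments:
        return arguments[tree]
    if isinstance(tree, tuple):
        return tuple(generalize(child, arguments) for child in tree)
    return tree


# Hypothetical parse results for an example sentence.
item = ("mkNP", "this_Det", "fish_N")
parses = [
    (("mkCl", item, "delicious_A"), 0.7),
    (("mkUtt", item), 0.2),
]
best = choose_tree(parses)
pattern = generalize(best, {item: "item", "delicious_A": "quality"})
```

Here `pattern` keeps the constructor of the chosen tree and replaces the two argument subtrees by the variables item and quality; translating such a partial tree to API constructors gives the body of the linearization rule.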
The key idea is parsing followed by compilation to the API. It provides considerable benefits, especially for idiomatic grammars such as the Phrasebook, where the abstract syntax trees differ considerably between languages. For example, when asking for a person's name in English, the question "What is your name" would be written using API functions as:
mkQS (mkQCl whatSg_IP (mkVP (mkNP (mkQuant youSg_Pron) name_N)))
which stands for the abstract syntax tree:
  UseQCl (TTAnt TPres ASimul) PPos (QuestVP whatSg_IP (UseComp 
    (CompNP (DetCN (DetQuant (PossPron youSg_Pron) NumSg) (UseN name_N)))))
On the other hand, in French the question would be translated to "Comment t' appelles tu" (literally translated to "How do you call yourself") which is parsed to:
  UseQCl (TTAnt TPres ASimul) PPos (QuestIAdv how_IAdv 
    (PredVP (UsePron youSg_Pron) (UseV appeler_V)))
and corresponds to the following API abstract tree:
mkQS (mkQCl how_IAdv (mkCl p.name appeler_V))
Currently, steps are being taken to integrate the system with the Web Editor and in this way combine the example-based methods with traditional grammar writing. In this case, the set of functions that can be linearized from example will be computed incrementally, depending on the state of the code.
 
A procedure similar to the one that determines which functions can be linearized from example can be used to find the functions that can be tested, namely functions that are already linearized and can be learned from example. In this way, the functions linearized in the editor, manually or by example, can also be tested by randomly generating an expression and linearizing it in the language under development and in one or more languages for which a concrete grammar exists. If the linearization is not correct, the user can ask for a new example or modify the linearization himself.
Other plans for future work, in addition to integrating the method with the GF Web Editor, include a thorough evaluation of the utility of the method for larger grammars and with grammarians of different levels of GF skills. Moreover, we plan to include handling of unknown words, which should make it easier for the user to build a small lexicon from examples.
As a solution to this, we devised the example-based grammar learning system, that is meant to automate a significant part of the grammar writing process and ease grammar development.
The system has two main uses: to reduce the amount of GF programming needed to develop a concrete grammar and, more importantly, to make it possible to learn certain features of a language for grammar development.
In recent years, the GF community has grown steadily, and so have the number of languages in the resource library and the number of domain grammars using them. The writer of a concrete domain grammar is typically not the writer of the resource grammar for the same language, has fewer GF skills, and is most likely unaware of the almost 300 constructors that the resource grammars implement for building various syntactic constructions; see http://www.grammaticalframework.org/lib/doc/synopsis.html.
We have presented two IDE's for GF. The web-based IDE is a stable system, which makes it easy to develop multilingual applications in the cloud. The Eclipse plugin brings GF to one of the leading desktop environments of software development. It is already usable for simple tasks such as syntax highlighting and cross-modular references, but more functionalities are being added; the further development of the Eclipse plugin will be sensitive to the actual users in the other sites of the MOLTO project. In addition to the IDE's, we have introduced the technique of example-based grammar writing, which has already been implemented as a desktop shell program and within the web-based IDE.
The IDE's are expected to make the use of GF more efficient for power users and more accessible for beginning users. The success in this will be monitored and evaluated in the case studies of the MOLTO project.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D3.1 Translation Tools API | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M18 | 
| Actual date of delivery: | December 2011 | 
| Type: | Document | 
| Status & version: | Final | 
| Author(s): | L. Carlson | 
| Task responsible: | UHEL | 
| Other contributors: | 
Abstract
Deliverable D3.1 explains the components of the MOLTO translation tool API. The intended audience for the tools are translators with no experience in grammar writing, but who are familiar with standard translation industry tools, such as translation memories and document managers. The API has two levels: the core API for a single author/translator, and the extended API for a community of authors, translators, and grammar engineers. We use the open source translation platform GlobalSight to handle the management.
MOLTO promises a translation tool based on Grammatical Framework, a programming language for multilingual grammars. The Grammatical Framework (GF) is used in the MOLTO project to build translation systems for EU languages. The user of the MOLTO translation tool need not know how to write GF grammars. She is supposed to use domain specific grammars developed by others to translate documents in the domains covered. Basic domain language coverage does not guarantee that all terms and idioms in the translatable document are covered. The MOLTO translation tool should handle lexical gaps in a way that benefits and benefits from a wider community of translators. It should also provide fallback solutions when a text is not covered by the available grammar(s).
This report explains the components of the MOLTO translation tool API. The API is the main value because end user translation environments and tools are many and change quickly with time, while MOLTO tools should have a more long lasting value. The components of the API include at least the following (The Core API):
Provisions for interfacing with further typical CAT tool components will be considered (The Extended API).
WP3 design diagram:

A model implementation of MOLTO translation tools based on the API will be demonstrated in the MOLTO prototype due as deliverable 3.2 in February 2012.
GF is a multilingual interlingual translation system geared toward multilingual generation. As a proof of concept, GF demos display immediate generation into dozens of languages from a tiny grammar. Extended to a more realistic case, this scenario could have a native-language editor producing text for more or less immediate multilingual distribution, for instance, a multilingual website. For this scenario to work, the translations should be acceptable as is without target language native revision.
The GF approach as such suits best an authoring/pre-editing scenario, where an original author or authorised content editor can choose or edit the original message to conform to a domain specific constrained language grammar, which GF is then expected to blind translate reliably to a number of languages. In real life situations, the text to translate is likely to be at least partially new to the system, and no guarantee can be given that the translation is correct in all the generated languages. It is specifically such extensions of the original MOLTO "fridge magnet" demo scenario that this document tries to address.
The current professional human translation scenario is quite different. It is a post-editing scenario. The roles of author (client) and translator are separated. The translator has quite restricted authority over the target text and almost none over the original, aside from obvious errors. The translation process is normally bilingual. This is because the translation is created or at least chosen by a human, and human translators rarely have professional competence in more than one or two languages besides their native language.
The preferred professional translation direction is from a second language to native language, because for humans language generation is more demanding than language understanding. In this direction, the translator can exploit external resources to understand the source text and use her native competence to check the quality of the target text. Even in this case, a native subject expert is usually needed to check the translation. The reviser need not know the source language or have the source text at hand.
The extended MOLTO translation scenarios considered here spread between these two extremes. We may still assume that the translator has some authority over the text to produce, i.e. she is the author or is authorized to adapt the text to better satisfy the constraints of the translation grammar. The MOLTO pre-editor/translator should be native or at least fluent in the source language, and familiar with the domain or at least its special language in order to know how the message can be (para)phrased. Thus the extended MOLTO scenario retains an element of constrained-language authoring or pre-editing.
But we may need to relax the blind generation assumption. Although the GF engine may give warnings or suggestions when it is unsure or knows the translation fails, there are likely to be cases where the translation is technically correct, but inadequate for human consumption. A native revision is then needed for one or more target language(s). As in the human translation case, the author/translator can at best serve as informant for one or a few target languages. For the rest, the translation needs to be distributed to a pool of revisers. In real life, a translation has to go out even if GF fails. There must be a way to override GF with human translations. If the translation were a one-off affair, that could end the process. However, in many real life scenarios, the same or very similar texts will come up for (re)translation, and in that case, the results of the revisions should get fed back to the translation cycle, to avoid making the same errors twice. In other words, we should make the MOLTO translation system consisting of GF and the human users an adaptive whole. This is the most demanding part to conceive here. Pre-editing MT has not been very successful in the past, probably partly just because not enough attention has been given to practical concerns like these.
As document automation progresses, professional translation is merging into localization, that is, the adaptation of software to a new locale (language and culture). Translation used to differ from localization in that translators were not expected to worry about formats or the document lifecycle. Texts were shipped to translators as raw text and returned as such. In an intermediate phase, a specialized localization industry developed to multilingualize software, preserving the source format. More recently, with multichannel publishing and document toolchains, there is again a push to separate form from content. The localization industry solution to these conflicting pressures is to separate content from form in a reversible fashion. Localization formats and tools like GNU gettext and XLIFF make provisions for extracting the translatable text from a document in a way that allows embedding the target text in the same document form.
The current GF translation engine as such is neutral about the format of the text it receives, but the existing resource grammars expect text to come in raw form. It should be technically possible to include document formatting in GF parsing and generation, and if suitably restricted, that might be the most efficient solution for the translation of inline tags. However, for the rest, it seems best to take advantage of existing content extraction technologies in the translation industry. We propose to use XLIFF as the MOLTO translatable document format in the extended API.
XLIFF is one of the OASIS LISA OAXAL standards. The Localization Industry Standards Association (LISA) has been insolvent since 28 February 2011, but the LISA standards continue to be used by the industry. The OASIS Open Architecture for XML Authoring and Localization (OAXAL) reference model comprises the following open standards:
For the extended scenario, we may add other industry standard CAT tools for MOLTO translators to use besides the core list above. There is a plethora of packages for CAT and translation project management/automation both commercial and open source. It seems best to borrow from existing open source packages that comply with translation industry standards, instead of reinventing the wheel. Examples of CAT packages are
SwordFish and HeartSome are commercial. Examples of translation project management and workflow automation packages are
Of the systems listed above, ProjectOpen and GlobalSight are open source, the rest are commercial.
From existing open source projects we can shop for properties generally expected from TM (http://en.wikipedia.org/wiki/Translation_memory), CAT (http://en.wikipedia.org/wiki/Computer-assisted_translation), and translation project management software. Some commercial systems also have open interfaces, e.g. Across (http://en.wikipedia.org/wiki/Across_Systems). Here are some translation tools listings from the Web.
For comparisons, see e.g. Wikipedia.
Here we study variants of the machine assisted translation process, to develop a version that suits MOLTO.
To have a point of comparison, we review the practices in the professional translation industry today. Going beyond the 90's single-user computer-assisted translation (CAT) setup with a translation editor, translation memory, and termbase, current translation management system (TMS) packages provide tools for managing complex translation industry projects involving clients, project managers, and a distributed pool of translators, reviewers, and subject experts. Many aspects of the workflow and the associated communication (notification, document transfer) can be automated in these systems. For an example of a translation industry workflow, we take the ]project-open[ Translation Workflow. Tasks typically covered by translation project and workflow management packages include
In ]project-translation[ five user roles can be defined.
GlobalSight has yet more default roles:
New roles can be invented at will in GlobalSight. As discussed in the MOLTO requirements document (Deliverable 9.1), the cast in MOLTO can include at least these roles:
The figure below shows schematically how the workflow proceeds:

Similarly for editors and proofreaders. Finally, the project manager retrieves the document and sends it to the customer. Alternatively, the project manager can allow the customer to download the files directly. In addition to tracking the status of a project at every stage, the system allows the project manager to allocate projects to the most suitable team and streamline the freelancers’ job.
GlobalSight (http://www.globalsight.com/) is an open source Translation Management System (TMS) released under the Apache License 2.0. Version 8.2 was released on Sept 15, 2011. As of version 7.1 it supports the TMX and SRX 2.0 standards of the Localization Industry Standards Association. It was developed in the Java programming language and uses the MySQL database and the OpenLDAP directory software. GlobalSight also supports computer-assisted translation and machine translation.
According to the documentation the software has the following features:
GlobalSight provides a web services API (http://www.globalsight.com/wiki/index.php/GlobalSight_Web_Services_API). It is used to integrate external systems with GlobalSight in order to submit content to the localization/translation workflow and to monitor its status. The web services API allows any client to connect and exchange data with GlobalSight, regardless of its implementation technology or operating system. The web service provides methods for
For convenience, we shall borrow parts of the MOLTO extended API from the GlobalSight translation management system.
The translation industry workflow is top-down controlled, built on email and file transfers. For a more collaborative bottom-up approach, we can look at web localization. Web platforms are getting localized by a collaborative translation workflow. Here, translation is typically crowdsourced to a pool of volunteers, who either translate manually online or download po files to work on with local tools. The website coordinates the effort. Different projects may have assigned managers that monitor the collaboration. It is exemplified by the Translate toolkit (http://en.wikipedia.org/wiki/Translate_Toolkit) used to collaboratively localize open source software packages.
An instance of the Translate Toolkit is Pootle (http://en.wikipedia.org/wiki/Pootle), an online translation management tool with a translation interface. It is written in the Python programming language using the Django framework and is free software originally developed and released by Translate.org.za in 2004. It was further developed as part of the WordForge project and the African Network for Localisation and is now maintained by Translate.org.za.
Pootle is intended for use by free software translators but is usable in other situations. Its main focus is on localization of applications' graphical user interfaces as opposed to document translation. Pootle makes use of the Translate Toolkit for manipulating translation files. The Translate Toolkit also offers offline features that can be used to manage the translation of Mozilla Firefox and OpenOffice.org in Pootle. Some of Pootle's features include terminology extraction, translation memory, glossary management and matching, goal creation and user management.
It can play various roles in the translation process. The simplest displays statistics for the body of translations hosted by the server. Its suggestion mode allows users to make translation suggestions and corrections for later review, thus it can act as a translation specific bug reporting system. It allows online translation with various translators and lastly it can operate as a management system where translators translate using an offline tool and use Pootle to manage the workflow of the translation.
The Translate Toolkit API is documented at http://translate.sourceforge.net/doc/api/. It is open source subject to the GPL licence.
The Google Translator Toolkit
Google provides a free service for translating webpages by post-editing Google MT results. The toolkit allows users to
The Google Translator Toolkit Data API allows client applications to access and update translation-related data programmatically. This includes translation document, translation memory, and glossary data stored with Google Translator Toolkit. The Google Translator Toolkit API is now a restricted API (http://code.google.com/apis/gtt/).
We now consider the MOLTO translation scenario. The MOLTO translation demo editor (see figure further below) supports a one-person workflow where the same person is the author(ised editor) of the source and the translator. Technically we can extend this to a more collaborative scenario where more actors are involved, as in the professional workflow above, by adding the usual project support tools to the toolkit. A more difficult part is to adjust the workflow so that the adaptivity goal above is satisfied. In the professional workflow, corrected translations accumulate in the translation memory, which helps translators avoid the same errors next time. In the MOLTO workflow, GF has an active role in generating translations, so it is GF that should learn from the corrections. Concretely, when a translator or reviser changes a wording, the correction should not go unnoticed, but should find its way back to the GF grammar, preferably through a round of community checks.
We next try a description of one round of the ideal MOLTO translation scenario.
Although it is possible that an author is ready to create and translate in one go (especially in a hurry), it is more normal to have some document(s) to start from. The document/s might be created in a GF constrained language editor in the first place. In that case, the only remaining step is translation. If translation coverage and quality have been checked, nothing more is needed. But frequently, some changes are needed to a previously translated document, or a new one is to be created from existing pieces and some new material. Conceivably, some of the parts come from different domains, and need to be processed with different grammars. Some such complications might be handled with document composition techniques in the manner of DocBook or DITA toolchains.
The strength of GF is that it ought to handle grammatical variation of existing sources well, so as to avoid manual patching of previous translations. Assume there is a previously GF translated document, and we want to produce a variant. Then it ought to be enough to load the document, make desired changes to it under the control of the GF grammar, and let GF generate the modified translations.
Is it necessary to show the translations to the user? Not unless the translator knows the target language(s). We should distinguish two profiles: blind translation, where the author does not know or is not responsible for the target languages herself but relies on outside revision, and plain translation, where the author/translator knows one or two of the target languages and wants to check the translations as she goes.
In the blind profile, the author has to rely on revisers, and the revision cycle is slower. The revisers can either notify the author that the source does not translate correctly in their language(s), or they may notify the grammar/lexicon developer(s) directly, or both. If time is short, the reviser/s should provide a correct translation directly for the author/publisher to use as canned text. In addition, they should notify the grammar developer/s of the revisions needed to GF. The notification/s could happen through messages, or be conveyed through a shared translation memory, or both. In this slower cycle, it may not be realistic to expect the author to change the source text and repeat the revision process many times over, for the same source and possibly a multiplicity of languages, to get everything to translate correctly before publication.
In the plain profile, a faster cycle of revision is called for. The author/translator can try a few variations of the input. If no variant seems to work, then she probably wants to use her own translation, but also to make sure that GF learns of and from the failure. The failure can be a personal preference, or a general fix that the community should profit from. If it is a personal preference, the user may want to save the corrected translation in her translation memory and/or glossary, but also she may want to tweak her GF grammar to handle this and similar cases to her liking next time. If it is just a lexical gap or missing fixed idiom, then there should be in GF translation API a service to modify the grammar without knowing GF. The modifications could happen at different levels of commitment. The most direct one would be to provide a modular PGF format which would allow advising the compiled user grammar on the fly. Such a runtime fix would make sure that the same error will not happen during the same translation session or subsequent ones at least until the domain grammar is recompiled.
The next level of commitment to a change would be to generate new GF source, possibly from example translations provided by the author/translator, compile them, and add the changed or extra modules to the user's GF grammar. The cycle involved here might be too slow to do during translation, but it could happen between translation sessions. If fully automatic grammar revision is too error prone, the author/translator could just go on with canned translations in this session, and commit change requests to the grammar developer community. In this case, the changes would be carried out in good time, with regression tests and multilingual revision cycles, especially if the changes affect the domain semantics (abstract grammar) and thereby all translation directions.
The MOLTO Translation Tools API exposes the most important operations used in translating with GF in MOLTO. It makes them available for programmers who want to create alternative accesses to GF translation tools, besides the MOLTO web translation demo platform. The API is divided into a Core API basically answering the needs of a single author/translator, and an Extended API addressing the needs of a community of authors, translators, and grammar engineers.
The components of the MOLTO TT Core API include at least the following:
The components of the MOLTO TT Extended API include the following:
The first five are extensions of the corresponding facilities in the Core API. The lexical resources API borrows from TermFactory. The translation memory and the reviewing/commenting facilities are adapted from GlobalSight. The last item is based on the GF grammar development tools API.
The core API basically provides for the one-editor/translator scenario, where an editor/translator creates or edits a source document under constraints of a selected GF grammar in PGF form and generates translations for the source. For lexical gaps (out-of-vocabulary items) there is a simple term editor which allows looking up concepts and adding equivalents.
The demo prototype translation editor
This section describes the translation editor developed by K. Angelov at UGOT.
To guide the development of a suitable translation editor API to support MOLTO translation needs, UGOT has created a prototype web-based translation editor. It is implemented in Google Web Toolkit. It is usable for authoring with small multilingual grammars. It doesn't require any downloads or use of command shells. All that is needed is a reasonably modern web browser.

The editor runs entirely in the web browser, so once you have opened the web page and have documents and grammars loaded, you can continue translation editing while you are offline.
Signing in should allow a user controlled access to her own and some (maybe not all) shared resources. Ideally, the same login should work throughout the different parts of the distributed toolkit. There should be some group scheme to set group level access restrictions.
The demo editor has a simple grammar manager that retrieves the user's grammars from a MySQL database via a ContentService implemented in Haskell, subject to a successful login through Google.
Available operations in ContentService:
The demo editor has a simple file database manager that uploads and requests the user's documents from a MySQL database using the same ContentService as the grammar manager.
The demo editor has a simple treegrid editor for searching and editing translation correspondences from the web of data, including TermFactory services. It is not yet connected to the GF grammar back end. The management of lexical resources and ontologies is detailed in connection with the extended API below.

The editor guides the text author by showing a set of fridge magnets and offers autocompletion to hint how a text can be continued within the limits of the current grammar. In the current version, there is a sign-in box and tabs for grammars, documents, editor, and terms, plus two to query and browse the loaded grammar.
The prototype gives a first rough idea of how a web based GF translation editor could work. While this editor is implemented in JavaScript and runs entirely in the web browser, we do not expect to create a full implementation of the MOLTO translation tools that runs in the web browser, but let the editor communicate with outside servers, including a TMS server (GlobalSight) and a GF server.
For more flexibility (as well as vendor independence), an open source LDAP (Lightweight Directory Access Protocol) based user management implementation can be used. There is one in GlobalSight. It allows distinguishing different roles and user groups, and controlling access to resources by roles.
The simple document manager of the demo editor will be complemented with a more sophisticated XLIFF based document manager built using the GlobalSight document management API. Document format conversions belong to the day's work in the translation business, and they can be assumed to be handled by the extended document manager, using XLIFF as a fixpoint.
XLIFF (XML Localisation Interchange File Format) is an XML-based format created to standardize localization. XLIFF was standardized by OASIS in 2002. Its current specification is v1.2, released on Feb-1-2008. The XLIFF Technical Committee is currently at work on XLIFF 2.0. The specification is aimed at the localization industry. It specifies elements and attributes to aid in localization.
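The idea of controlling access to resources by role can be sketched in a few lines. This is only an illustration in plain Python, not the GlobalSight or LDAP implementation: the role names and permission names below are assumptions chosen to match the actors mentioned in this document.

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "author": {"edit_source", "read_translation"},
    "reviser": {"read_translation", "edit_translation", "comment"},
    "grammar_engineer": {"read_translation", "edit_grammar"},
}


def allowed(user_roles, action):
    """Return True if any of the user's roles grants the requested action."""
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in user_roles)


# A user may hold several roles at once, e.g. an author who also revises.
author_reviser = ["author", "reviser"]
can_fix_translation = allowed(author_reviser, "edit_translation")
can_fix_grammar = allowed(author_reviser, "edit_grammar")
```

In a real deployment the role assignments would come from the directory service rather than from a table in the code.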
XLIFF cognizant open source editors and localization platforms include
Example 1: A simple XLIFF file with strings extracted from a Windows RC file. Here the skeleton (the data needed to reconstruct the original file) is stored in a separate file:
<?xml version="1.0" encoding="windows-1252" ?>
<xliff version="1.1" xml:lang='en'>
 <file source-language='en' target-language='fr' datatype="winres"
  original="Sample1.rc">
  <header>
   <skl><external-file href="Sample1.rc.skl"/></skl>
  </header>
  <body>
   <group restype="dialog" resname="IDD_DIALOG1">
    <trans-unit id="1" restype="caption">
     <source>Title</source>
    </trans-unit>
    <trans-unit id="2" restype="label" resname="IDC_STATIC">
     <source>&amp;Path:</source>
    </trans-unit>
    <trans-unit id="3" restype="check" resname="IDC_CHECK1">
     <source>&amp;Validate</source>
    </trans-unit>
    <trans-unit id="4" restype="button" resname="IDOK">
     <source>OK</source>
    </trans-unit>
    <trans-unit id="5" restype="button" resname="IDCANCEL">
     <source>Cancel</source>
    </trans-unit>
   </group>
  </body>
 </file>
</xliff>
Example 2: an XLIFF document storing text extracted from a Photoshop file (PSD file) and its translation in Japanese:
<xliff version="1.2">
 <file original="Graphic Example.psd"
  source-language="en-US" target-language="ja-JP"
  tool="Rainbow" datatype="photoshop">
  <header>
   <skl>
    <external-file uid="3BB236513BB24732" href="Graphic Example.psd.skl"/>
   </skl>
   <phase-group>
    <phase phase-name="extract" process-name="extraction"
     tool="Rainbow" date="20010926T152258Z"
     company-name="NeverLand Inc." job-id="123"
     contact-name="Peter Pan" contact-email="ppan@xyzcorp.com">
     <note>Make sure to use the glossary I sent you yesterday.
      Thanks.</note>
    </phase>
   </phase-group>
  </header>
  <body>
   <trans-unit id="1" maxbytes="14">
    <source xml:lang="en-US">Quetzal</source>
    <target xml:lang="ja-JP">Quetzal</target>
   </trans-unit>
   <trans-unit id="3" maxbytes="114">
    <source xml:lang="en-US">An application to manipulate and 
     process XLIFF documents</source>
    <target xml:lang="ja-JP">XLIFF 文書を編集、または処理
     するアプリケーションです。</target>
   </trans-unit>
   <trans-unit id="4" maxbytes="36">
    <source xml:lang="en-US">XLIFF Data Manager</source>
    <target xml:lang="ja-JP">XLIFF データ・マネージャ</target>
   </trans-unit>
  </body>
 </file>
</xliff>
XLIFF is bilingual (each translation unit offers a <source> and a <target> element). There are, however, ways to have multilingual XLIFF documents:
In the GF interlingual model, the source "language" can be the abstract syntax representation of a translation unit.
The above considerations entail some requirements for translation-time document management in the MOLTO Translation tools API:
Associated with the MOLTO Translation Tools API, there must be tools for extracting XLIFF content documents out of various types of original skeletons and putting translated content back into the skeleton. (These tools are outside of MOLTO proper, because many such tools already exist and because it is up to the provider of a new document type to also provide XLIFF support for it.)
There must be methods in the MOLTO Translation Tools API for extracting raw text from XLIFF source elements, feeding it into GF and inserting the translation into the XLIFF target element. The GF translation API should also have methods for handling XLIFF coded inline tags. The best solution for that could be a special purpose GF grammar, because the correct placement of inline tags can depend on the translation of the content.
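The extract, translate and insert cycle on XLIFF translation units can be sketched as follows. This is a minimal sketch using only the Python standard library, under several assumptions: the XLIFF document has no XML namespace (as in the examples above), the source elements contain plain text without inline tags, and `translate` is a hypothetical callback standing in for a call to a GF translation service.

```python
import xml.etree.ElementTree as ET


def translate_xliff(xliff_text, translate):
    """Fill in a <target> for every <trans-unit> that has source text.

    translate(text, source_lang, target_lang) is a caller-supplied function;
    here it stands in for a GF translation service (an assumption).
    """
    root = ET.fromstring(xliff_text)
    for file_el in root.iter("file"):
        src_lang = file_el.get("source-language")
        tgt_lang = file_el.get("target-language")
        for unit in file_el.iter("trans-unit"):
            source = unit.find("source")
            if source is None or not source.text:
                continue  # nothing to translate in this unit
            target = unit.find("target")
            if target is None:
                # Place the new <target> directly after <source>.
                target = ET.Element("target")
                unit.insert(list(unit).index(source) + 1, target)
            target.text = translate(source.text, src_lang, tgt_lang)
    return ET.tostring(root, encoding="unicode")


SAMPLE = (
    '<xliff version="1.2"><file original="Sample1.rc" source-language="en" '
    'target-language="fr" datatype="winres"><body>'
    '<trans-unit id="5"><source>Cancel</source></trans-unit>'
    "</body></file></xliff>"
)


def fake_gf(text, src_lang, tgt_lang):
    """Stand-in translator: a lookup table instead of a real GF call."""
    return {("Cancel", "fr"): "Annuler"}.get((text, tgt_lang), text)


translated = translate_xliff(SAMPLE, fake_gf)
```

Inline tags are deliberately left out here: as noted above, their correct placement can depend on the translation, so they need separate treatment.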
A key consideration for the usability of MOLTO translation is the ease with which its text coverage can be extended by a user community. We need to pay great attention to adaptability. The most important factor in extensibility is lexical coverage. Grammatical coverage can be developed and maintained with language engineering, and grammatical gaps can often be circumvented by paraphrasing. There are two cases to consider: either the abstract grammar misses concepts, or concrete grammars for some language/s are missing equivalents. In the first case, we need to extend the domain ontology and its abstract grammar. In the second case, we need to add terms.
For ontology and term management, we propose to bring to MOLTO the TermFactory concept of ontology based terminology management. TermFactory is a system of distributed multilingual term ontology repositories maintained by a network of collaborative management platforms. It has been described at length in the TermFactory Manual at http://www.helsinki.fi/~lcarlson/CF/TF/doc/TFManual_en.xhtml.
The user of the MOLTO translation editor has direct access to the treegrid editor for querying and editing term equivalents for concepts already in available ontologies, either already in TermFactory or 'raw' from the Web of Data, in particular, the OntoText services serving data from FactForge repository.
Say for instance there is no equivalent listed for cheese in some language's concrete grammar FooLang. The author/translator can use the treegrid editor to query for terms for the concept food:Cheese in TermFactory or do a search through OntoText services for candidate equivalents, or, if she knows the answer herself, submit equivalents through the treegrid editor. The new equivalent/s are saved in the user's own MOLTO lexicon, and submitted to TermFactory as term proposals for the community to evaluate.
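The flow just described (save the new equivalent locally, and queue it as a proposal for the community) can be sketched as below. The class and field names are hypothetical and the example does not call TermFactory; the proposal queue simply stands in for whatever submission service is used. The French term is only an illustrative value.

```python
from dataclasses import dataclass, field


@dataclass
class UserLexicon:
    """A user's own lexicon plus a queue of pending term proposals (hypothetical)."""

    entries: dict = field(default_factory=dict)    # (concept, lang) -> list of terms
    proposals: list = field(default_factory=list)  # proposals awaiting community review

    def lookup(self, concept, lang):
        """Return the known equivalents for a concept in a language (may be empty)."""
        return self.entries.get((concept, lang), [])

    def add_equivalent(self, concept, lang, term):
        """Save a new equivalent locally and queue it as a community proposal."""
        terms = self.entries.setdefault((concept, lang), [])
        if term not in terms:
            terms.append(term)
            self.proposals.append(
                {"concept": concept, "lang": lang, "term": term, "status": "proposed"}
            )


lexicon = UserLexicon()
missing = lexicon.lookup("food:Cheese", "fr")      # no equivalent yet: a lexical gap
lexicon.add_equivalent("food:Cheese", "fr", "fromage")
```

The translator can use the new term at once, while the community decides separately whether to accept the proposal.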
If there is a conceptual gap not easily filled in through the treegrid editor, there is the option of forwarding the problem to an appropriate TermFactory collaborative platform. This route is slower, but the quality has a better guarantee in the longer run, as inconsistency or duplication of work may be avoided. Say there is no concept in the domain ontology for the new notion that occurs in the source text. In easy cases, new concepts can be added through the treegrid editor, subclassing some existing concept in the ontology. In more complex cases, where negotiations are needed in the community, an ontology extension proposal is submitted through a TermFactory wiki. TermFactory offers facilities for discussing and editing ontologies and their terms. In due time, the modified ontology gets implemented in a new release of the GF domain abstract grammar.
Translation editing
The translation editor demo is a good prototype, but different scenarios and platforms may call for different combinations of its features. One way to go is to extend the demo with further tabs and facilities for CAT tool support. But there is also the opposite alternative to consider: calling MOLTO translation tool services from a third party editor. GlobalSight has two built-in translation editors, called popup editor and inline editor. The popup editor is a Trados TagEditor lookalike, while the inline editor has something of the look and feel of old Trados versions running WYSIWYG on Microsoft Word. The inline editor has been implemented in JavaScript using the FCKEditor library. It might just be feasible to embed MOLTO demo editor functionalities into the GlobalSight editor(s). In the GlobalSight setup, there is already support for importing cut-and-dried MT translations from an MT service, but here we are talking about something rather more intricate.
It is not immediately obvious which route would provide least resistance. From the point of view of GF usability, finding a neat way of embedding GF editing functions in third party translation editors could be a better sales position than trying to maintain a whole new MOLTO translation environment. (Unless of course, the new environment is clearly more attractive to targeted users than existing ones.) We may also try to have it both ways.
It was noted above that blind translation in the case of incomplete or inadequate coverage in resource grammars can occasion a round of reviewing and giving feedback on the translations before publication. This part of the process is in its main outlines familiar from the translation industry workflow, and can be implemented as a variation of it. In the MOLTO workflow, reviewer comments are not returned (just) to the human author/translator(s), but they should have repercussions in the ontology and grammar management workflows. This part requires modifying and extending the existing GlobalSight revisioning tools to communicate with the MOLTO lexical resources and grammar services. The GlobalSight revisioning tools now use email as the human-to-human communication channel. We probably want to use a webservice channel for machine-to-machine communication, and possibly some web commenting system as an alternative to email.
To the extent grammar engineering can be delegated to translation tool users, it must happen transparently without requiring knowledge of GF. One way to do this is through what is known as example-based grammar writing in GF. Example-based grammar writing is a new GF technique for backward-engineering GF source from example translations. It can play a significant role in the translation-to-grammar feedback cycle. This part of the TT API will be borrowed from the MOLTO Grammar Developer Tools API. See the last section of this document.
To develop the above outlined web-based translation environment further, or implement other usage scenarios, a web service interface to the MOLTO editor API will be useful. The interface consists of several parts.
The editor demo is in the MOLTO darcs repository. The services provided by the GF server are outlined in the MOLTO Grammar Tools API document. The GlobalSight WS API was described above. The TermFactory web services are documented in the TF Manual at http://www.helsinki.fi/~lcarlson/CF/TF/doc/TFManual_en.xhtml#Services .
The translation tools glue connects the different parts of the whole. It includes at least:
Here is a figure showing some of the connections in the design.

The above design generates a wishlist of requirements on the GF grammar and translation API.
Assume the GF translation goes to a reviser, working with or without another copy of the MOLTO translation tool. The corrected translation, in XLIFF form, should be brought to GF's attention. This calls for a new functionality from the GF grammar API: one which corrects the grammar and lexicon software to produce the output required by the corrected translation. This functionality is to be built on the GF example-based grammar writing methodology.
In order for the corrections to converge, revised translations must accumulate so that the newest corrections do not falsify earlier ones. The collection of manual corrections may become ambiguous or inconsistent, which situation should also be recognised and brought to the attention of a grammar engineer. Again, it is important to pay attention to user roles and rights.
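One simple way to recognise inconsistent corrections is to group them by source segment and language and to flag any group with more than one distinct target. The sketch below assumes corrections arrive as plain (source, language, target) triples; this data shape and the example strings are assumptions made for illustration.

```python
from collections import defaultdict


def find_conflicts(corrections):
    """Return the (source, lang) pairs that received more than one distinct target.

    corrections is an iterable of (source, lang, target) triples in
    chronological order; the order of the conflicting targets is preserved.
    """
    seen = defaultdict(list)
    for source, lang, target in corrections:
        if target not in seen[(source, lang)]:
            seen[(source, lang)].append(target)
    return {key: targets for key, targets in seen.items() if len(targets) > 1}


corrections = [
    ("the cheese", "fr", "le fromage"),
    ("the wine", "fr", "le vin"),
    ("the cheese", "fr", "du fromage"),   # disagrees with the first correction
    ("the wine", "fr", "le vin"),         # a repeat, not a conflict
]
conflicts = find_conflicts(corrections)
```

Flagged pairs would then be passed to a grammar engineer, since the conflict may reflect a genuine ambiguity rather than an error.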
We may want to provide ways to override GF translations with canned translations. At the translation tools level, this can happen by preferring TM translations over GF. We should also consider ways to override compositional translations on GF grammar level.
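At the translation tools level, preferring TM translations over GF amounts to a lookup with a fallback. The sketch below is one possible shape under stated assumptions: the translation memory is a plain dictionary keyed by (source, language), only exact matches are considered, and `gf_translate` is a hypothetical callback that returns None when GF cannot translate the segment.

```python
def translate_with_override(source, lang, tm, gf_translate):
    """Return (translation, origin), preferring the translation memory over GF.

    origin is "tm", "gf" or "untranslated", so that the caller can tell
    canned translations from generated ones.
    """
    key = (source, lang)
    if key in tm:
        return tm[key], "tm"
    result = gf_translate(source, lang)
    if result is None:
        return None, "untranslated"
    return result, "gf"


# Hypothetical data: one canned translation and a tiny stand-in for GF.
tm = {("the cheese is good", "fr"): "le fromage est bon"}


def toy_gf(source, lang):
    """Stand-in for a GF call; knows a single sentence and fails otherwise."""
    return {("the wine is good", "fr"): "le vin est bon"}.get((source, lang))


from_tm = translate_with_override("the cheese is good", "fr", tm, toy_gf)
from_gf = translate_with_override("the wine is good", "fr", tm, toy_gf)
failed = translate_with_override("the bread is good", "fr", tm, toy_gf)
```

Returning the origin alongside the text keeps it visible which segments bypassed the grammar, which is useful when the corrections are later fed back to GF.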
Another requirement is translation-time update of the grammar, at least of the lexicon, so that a translator's on-the-fly lexicon additions are possible.
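One way to picture such translation-time additions is a session layer placed on top of the compiled lexicon, so that additions take effect at once without touching the compiled data. The sketch below uses `collections.ChainMap` from the Python standard library; it illustrates the layering idea only and says nothing about how PGF grammars are actually stored or updated.

```python
from collections import ChainMap

# The compiled lexicon is treated as read-only during a session (an assumption).
compiled_lexicon = {("food:Cheese", "en"): "cheese"}

# Additions made by the translator during the session go into their own layer.
session_additions = {}

# Lookups try the session layer first, then fall back to the compiled lexicon.
# Writes through the ChainMap always land in the first mapping.
lexicon = ChainMap(session_additions, compiled_lexicon)

lexicon[("food:Cheese", "fr")] = "fromage"   # on-the-fly addition

english = lexicon[("food:Cheese", "en")]     # found in the compiled layer
french = lexicon[("food:Cheese", "fr")]      # found in the session layer
```

When the domain grammar is next recompiled, the contents of the session layer can be reviewed and merged or discarded.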
If we want to support translating formatted documents using XLIFF, the minimum requirement is that the GF translation API handles XLIFF inline tags.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D3.2 MOLTO translation tools prototype | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M24 | 
| Actual date of delivery: | March 2012 | 
| Type: | Prototype | 
| Status & version: | Final | 
| Author(s): | Lauri Carlson | 
| Task responsible: | UHEL | 
| Other contributors: | Thomas Hallgren, Krasimir Angelov, Seppo Nyrkkö, Lauri Alanko, Chunxiang Li, Inari Listenmaa | 
Abstract
Deliverable D3.2 consists of a prototype of the MOLTO translation tools, documentation of the translation scenario and instructions on the download and installation of the prototype.
MOLTO promises a translation tool based on Grammatical Framework, a programming language for multilingual grammars. The Grammatical Framework (GF) is used in the MOLTO project to build translation systems for EU languages. The user of the MOLTO translation tool need not know how to write GF grammars. She is supposed to use domain specific grammars developed by others to translate documents in the domains covered by the grammars. GF resource grammars offer basic grammar coverage in dozens of languages. A domain specialist is supposed to write an abstract grammar for a domain based on an ontology of the domain that provides the key concepts and their relationships. Language specific grammar engineers are supposed to map the common abstract grammar to the different resource grammars. Basic domain language and coverage does not guarantee that all terms and idioms found in a translatable document are covered. To be really usable, the MOLTO translation tool should handle lexical gaps in a way that benefits and benefits from a wider community of translators. It should also provide fallback solutions when a text is not covered by the available grammar(s).
This document builds on the Translation Tools API document, which lays out the translation scenario/s addressed by the prototype and describes the various programming APIs available for building the prototype. This document explains the installation, use, and limitations of the MOLTO translation tool prototype. The prototype integrates some but not yet all aspects of the MOLTO translation tools design in the TT API document. As in the API document, we single out a set of core tools for a standalone translator used by one translator, from an extended set of tools that are designed to support MOLTO translation communities.
The core MOLTO Translation Tool (TT) consists of these parts.
The core TT editor can be used standalone. In the prototype deliverable it is only loosely embedded in the rest of the MOLTO translation and ontology/terminology maintenance tools (the extended prototype), in particular the GlobalSight TMS and the TermFactory term ontology management; fuller integration is in progress. Eventually, both shall work together with the GF grammar maintenance and development machinery. To help orientation, we recapitulate the intended workflow from the Translation Tools API document.
The MOLTO TT (translation tools) editor supports a one-person workflow where the same person is the author(ised editor) of the source and the translator. The adaptation of the GlobalSight TMS to MOLTO allows embedding the editor in a more collaborative scenario where more actors are involved, as in the professional workflow described in the API document, by adding traditional CAT and translation project support tools to the toolkit. A more difficult part is to adjust the workflow so that the adaptivity goal is satisfied. In the professional workflow, corrected translations accumulate in the translation memory, which helps translators avoid the same errors next time. In the MOLTO workflow, GF has an active role in generating translations, so it is GF that should learn from the corrections. Concretely, when a translator or reviser changes a wording, the correction should not go unnoticed, but should find its way back to GF, preferably through a round of community checks. More generally, improvements should be shared by the community, so that the whole community acts adaptively.
We next try a description of one round of the ideal MOLTO translation scenario.
Although it is possible that an author is ready to create and translate in one go (especially in a hurry), it is more normal to have some document(s) to start from. The document/s might be created in a GF constrained language editor in the first place. In that case, the only remaining step is translation. If translation coverage and quality have been checked, nothing more is needed. But frequently, some changes are needed to a previously translated document, or a new one is to be created from existing pieces and some new material. Conceivably, some of the parts come from different domains, and need to be processed with different grammars. Some such complications might be handled with document composition techniques in the manner of DocBook or DITA toolchains.
The strength of GF is that it ought to handle grammatical variation of existing sources well, so as to avoid manual patching of previous translations. Assume there is a previously GF translated document, and we want to produce a variant. Then it ought to be enough to load the document, make desired changes to it under the control of the GF grammar, and let GF generate the modified translations.
Is it necessary to show the translations to the user? Not unless the translator knows the target language(s). We should distinguish two profiles: blind translation, where the author does not know or is not responsible for the target languages herself, but relies on outside revision, and plain translation, where the author/translator knows one or two of the target languages and wants to check the translations as she goes.
In the blind profile, the author has to rely on revisers, and the revision cycle is slower. The revisers can either notify the author that the source does not translate correctly in their language(s), or they may notify the grammar/lexicon developer(s) directly, or both. If there is a hurry, the reviser/s should provide a correct translation directly for the author/publisher to use as canned text. In addition, they should notify the grammar developer/s of the revisions needed to GF. The notification/s could happen through messages, or be conveyed through a shared translation memory, or both. In this slower cycle, it may not be realistic to expect the author to change the source text and repeat the revision process many times over for the same source and possibly a multiplicity of languages to get everything translated right before publication.
In the plain profile, a faster cycle of revision is called for. The author/translator can try a few variations of the input. If no variant seems to work, then she probably wants to use her own translation, but also to make sure that GF learns of and from the failure. The correction can reflect a personal preference, or it can be a general fix that the community should profit from. If it is a personal preference, the user may want to save the corrected translation in her translation memory and/or glossary, but she may also want to tweak her GF grammar to handle this and similar cases to her liking next time. If it is just a lexical gap or a missing fixed idiom, then the GF translation API should offer a service for modifying the grammar without knowing GF. The modifications could happen at different levels of commitment. The most direct one would be to provide a modular PGF format which would allow advising the compiled user grammar on the fly. Such a runtime fix would make sure that the same error will not happen during the same translation session or subsequent ones, at least until the domain grammar is recompiled.
The next level of commitment to a change would be to generate new GF source, possibly from example translations provided by the author/translator, compile them, and add the changed or extra modules to the user's GF grammar. The cycle involved here might be too slow to do during translation, but it could happen between translation sessions. If fully automatic grammar revision is too error prone, the author/translator could just go on with canned translations in this session, and commit change requests to the grammar developer community. In this case, the changes would be carried out in good time, with regression tests and multilingual revision cycles, especially if the changes affect the domain semantics (abstract grammar) and thereby all translation directions.
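The first, runtime level of commitment described above can be pictured as a session overlay consulted before the compiled lexicon. The following Python sketch is only an illustration of that idea, not the modular PGF format itself; the concept identifiers, the sample words and the function name are hypothetical.

```python
from collections import ChainMap

# Hypothetical compiled lexicon for one target language (concept -> word).
compiled_lexicon = {"food:Wine": "viini"}

# Session overlay: entries added by the translator at translation time.
session_overlay = {}

# Lookups consult the overlay first, then fall back to the compiled lexicon.
lexicon = ChainMap(session_overlay, compiled_lexicon)


def add_runtime_entry(concept, word):
    """Record a translator's fix without touching the compiled lexicon."""
    session_overlay[concept] = word


add_runtime_entry("food:Cheese", "juusto")
print(lexicon["food:Cheese"])  # found via the overlay
print(lexicon["food:Wine"])    # found in the compiled lexicon
```

Keeping the overlay separate from the compiled data means a runtime fix can later be reviewed and turned into GF source, which corresponds to the second level of commitment.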
Here is a figure of the overall design.
 
The MOLTO Translation Tools architecture is recapitulated here briefly. It consists of many largely independent components. There is a core basically answering the needs of a single author/translator, and an Extended API addressing the needs of a community of authors, translators, and grammar engineers.
The components of the MOLTO TT editor prototype currently include the following:
The components of the MOLTO TT extended prototype include the following:
The first five are extensions of the corresponding facilities in the core. The lexical resources API borrows from TermFactory. The translation memory and the reviewing/commenting facilities are adapted from GlobalSight. The last item is based on the GF grammar development tools API.
This section describes the code constituting the prototype. The code base of the translation tools extended prototype currently consists of the following parts.
This document describes the prototype's software packages, their installation, use and current limitations. The last two components are not discussed further in this document, because they are described in other MOLTO deliverables. The services currently provided by the GF server are outlined in the MOLTO Grammar Tools API document. The GlobalSight WS API was described in the MOLTO Translation Tools API document. TermFactory is documented at length in the TermFactory manual at http://www.helsinki.fi/~lcarlson/CF/TF/doc/TFManual_en.xhtml .
This section describes the GF translation editor originally developed by Bringert and Angelov at UGOT and reworked at UHEL.
To guide the development of a suitable translation editor API to support MOLTO translation needs, UGOT created a prototype web-based translation editor. It is implemented using the Google Web Toolkit and usable for authoring with small multilingual grammars. To use it from the web, all that is needed is a reasonably modern web browser. To install it locally, one needs in addition a web server, MySQL database and GF services.

The editor runs entirely in the web browser, so once you have opened the web page and have documents and grammars loaded, you can continue translation editing while you are offline.
In order to install the editor, you need to have the following components:
In this section we assume that the user has Apache, MySQL and GF server configurations done. Please see Appendix for instructions on background settings.
The prototype TT editor code is packaged as an Eclipse project archive http://tfs.cc/molto/molto-tt-0.9-linux-eclipse-20120529.zip ready for import in Eclipse (Helios).
Import the project in Eclipse. You should have the Google Web Toolkit plugin installed (tested with version 2.3.1).
The runtime editor files are found in TT-0.9/www/editor/. To install the runtime, the following files are placed under Apache2 server root (here /var/www) as shown.
/var/www/editor$ ls
grammars index.html org.grammaticalframework.ui.gwt.EditorApp WEB-INF
When you have placed the files under /var/www, then you can launch the project in Eclipse. Choose from the menu Run -> Run configurations -> Web Application -> (new configuration). In the tab Server untick Run built-in server. If you have put the files in directory /var/www/editor, then the launch address will be 127.0.0.1:8888/editor/index.html?gwt.codesvr=127.0.0.1:9997.
Web server: Apache2 fastcgi and action modules must be enabled for the services. See the installation notes at the end for a sample Apache2 virtual host that handles the services on port 8888 (the default).
GF server: The editor requires also an installation of GF server. The server binaries are content-service (for authentication and simple mysql database management) and pgf-service (for gf grammars). When compiling, the cabal option --global should be used; then the GF service binaries get installed in /usr/local/bin. They can be copied/linked under webserver (by default Apache2) fcgi-bin directory as follows.
/var/www/fcgi-bin$ ls -l
content-service -> /usr/local/bin/content-service
pgf-service -> /usr/local/bin/pgf-service
Database: The TT editor back end requires an installation of MySQL, HSQL and a Haskell library hsql-mysql by Krasimir Angelov. Further instructions how to create a database for MOLTO TT tools are in the installation notes.
The content service needs to read mysql database connection parameters from file /usr/local/bin/fpath. It should be in the same directory as content-service and contain four tokens, the mysql host and database names and the database owner credentials.
/usr/local/bin$ cat fpath
localhost moltodb moltouser moltopass
Then, the database is created by typing the following:
/usr/local/bin$ ./content-service fpath
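Before running content-service, the four-token fpath file can be sanity-checked with a short script. This is a Python sketch and not part of the MOLTO distribution; the function and field names are hypothetical, and it assumes that the tokens are separated by whitespace and appear in the order host, database, user, password.

```python
from typing import NamedTuple


class DbParams(NamedTuple):
    host: str
    database: str
    user: str
    password: str


def parse_fpath(text: str) -> DbParams:
    """Split the contents of an fpath file into its four tokens."""
    tokens = text.split()
    if len(tokens) != 4:
        raise ValueError(f"expected 4 tokens, found {len(tokens)}")
    return DbParams(*tokens)


params = parse_fpath("localhost moltodb moltouser moltopass")
print(params.host, params.database, params.user)
```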
Sign in: The prototype editor currently uses the Google authentication API for sign in. Authentication and authorization for Google APIs allow third-party applications to get limited access to a user's Google accounts for certain types of activities. A user needs to have a Google account to sign in to the application.
All back-end requirements are needed also for the user version.
Now, instead of opening the package in Eclipse, the only thing needed is to place the following files under Apache2 server root (here /var/www) as shown.
/var/www/editor$ ls
grammars index.html org.grammaticalframework.ui.gwt.EditorApp WEB-INF
Then, to run the editor, just type the address 127.0.0.1:8888/editor/index.html?gwt.codesvr=127.0.0.1:9997 into a browser.
Ideally, the same login should work throughout the different parts of the distributed toolkit. There should be some group scheme to set group level access restrictions. Eventually, we may want to provide MOLTO single-sign-on as a replacement for Google authentication.
The prototype editor has a simple grammar manager that is supposed to allow a user to upload her grammars to the editor's grammar cache under her name. The cache is kept on the editor server for reasons of speed and because of cross-site scripting (XSS) restrictions. The user chooses the current grammar from among the cached grammars using a drop-down list.
The grammar manager is not yet completed.
The prototype editor has a simple document manager that saves a translated document in, and retrieves one from, the mysql database using ContentService. The current document is saved in the database using a diskette icon on the editor page. The Documents tab shows the currently saved documents and allows the user to load a selected document for continued translation.
Naming of documents is not yet supported. Both the grammar manager and document manager remain to be linked to the TMS.
The TT editor includes a simple tabular equivalents editor for searching and editing translation correspondences from the web of data, including TermFactory services. The equivalents editor is an independent web application that may also be used standalone or as a plugin to other applications. When complete, the equivalents editor lets the user extend their GF grammars with terms entered in the term editor and/or upload them as term proposals to TermFactory.
The equivalents editor was built with the ExtJS javascript library. It can be downloaded from http://tfs.cc/molto/molto-term-editor.tgz. Unpack it and put the whole molto_term_editor directory under /var/www/ (or wherever your web server wants them, for example in Windows the path is probably C:\Program Files\Apache\htdocs). Open the file editor_sparql.html in a browser.
Note that this is also included in the complete editor as one of the tabs. As for function, the versions are identical. The screenshot below is from the standalone version.

The term editor consists of two tabular grids. In the first (left side) grid, enter a term in the text input and opt for wider or narrower concepts. In the latter case (the default) the editor shows on the right another grid of concepts that are classed narrower than the search term in the data source (by default, OntoText FactForge) and their designations in a predefined selection of languages. In the former case, the editor fills out the left side grid with concepts that are classed in the data source as wider than the search term. Clicking on one of them does a search for its subconcepts and terms, shown in the right side grid.
The term grid is editable and the editor remembers the user's edits to the cells in the grid.
The data source and choice of languages are not yet user definable. The editor is not yet connected to the TermFactory or GF grammar back ends.
In the current version, there is a sign-in box and tabs for grammars, documents, editor, and terms, plus two to query and browse the loaded grammar. The latter services are familiar from other GF front ends and based on the GF grammar Web API.
After sign in, the editor calls content-service to show the logged in user's grammars from the grammarusers mysql table in the grammar list. The user chooses a domain grammar. This brings to view the initial vocabulary known by the grammar as fridge magnets to choose from. Alternatively, the user can type or paste text in the editor window. At every new input, the active translation unit is sent to the back end for translation, and the set of fridge magnets is updated. When a translation unit is complete and translatable, it is simultaneously translated to all the available languages and the translations are shown on the screen (in blue). If an input is not parsable, the editor underlines the unparsable part. The user can back off to the point of deviation using backspace. In addition, there is a button for clearing the input.
The editor guides the text author by showing a set of fridge magnets and offers autocompletion to hint how a text can be continued within the limits of the current grammar.
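The interplay of fridge magnets and the underlining of unparsable input can be illustrated with a toy model. The Python sketch below is not how the editor or pgf-service works internally; it assumes, for illustration only, that the grammar is a small finite set of sentences, and all names and sample sentences are hypothetical.

```python
# A toy "grammar": the finite set of sentences it accepts.
SENTENCES = [
    "this cheese is fresh",
    "this cheese is expensive",
    "this wine is fresh",
    "that wine is italian",
]
TOKENIZED = [s.split() for s in SENTENCES]


def magnets(prefix):
    """Words that can follow the given token prefix in the toy language."""
    n = len(prefix)
    return sorted({s[n] for s in TOKENIZED if s[:n] == prefix and len(s) > n})


def parsable_prefix_length(tokens):
    """Length of the longest input prefix that some sentence starts with."""
    n = 0
    while n < len(tokens) and any(s[: n + 1] == tokens[: n + 1] for s in TOKENIZED):
        n += 1
    return n


print(magnets(["this"]))  # ['cheese', 'wine']
typed = "this cheese was fresh".split()
cut = parsable_prefix_length(typed)
print("underline from token", cut, ":", typed[cut:])
```

In this model, everything from the first deviating token onward is what the editor would underline, and backspacing to that point restores a state with valid magnets.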
The prototype gives a first rough idea of how a web based GF translation editor could work. At present, however, it remains oriented to a very small vocabulary (fridge magnets are not apt to work well with thousands of words). It is also doubtful that the setup is fast enough for the amount of interaction generated at the speeds of professional translation. A reconsideration of how the editor and the back end best play together is indicated. A related limitation is the strict left-to-right orientation of the parsing. UGOT seems to be working on a robust parser which allows other manners of combining parsing and editing. The proper disposition of the translation result is not worked out yet.
We now move on to the extended prototype. We first recapitulate how the extended translation tools extend the one-translator scenario to a community of translators collaboratively using and maintaining MOLTO translation tools.
For more flexibility (as well as vendor independence), the open source LDAP (The Lightweight Directory Access Protocol) based user management implementation from GlobalSight has been adapted for MOLTO. It allows distinguishing different roles and user groups, and controlling access to resources by roles. The GlobalSight user management solution has been conservatively extended for the needs of MOLTO TermFactory users. The following screenshot displays a user's roles as an ontology editor.

Term ontology management roles are defined per domain, where a domain is represented by a regular expression on ontology URIs. The MOLTO GlobalSight user management system lets a company project administrator create users and grant them MOLTO TermFactory ontology read and write permissions. The TermFactory back end GateService reads the permissions off the GlobalSight LDAP directory and database and controls access to TermFactory content accordingly. If a user's credentials are not sufficient, TermFactory Gate will not permit term ontology queries or commits. The MOLTO permissions come over and above any constraints that ontology endpoints may impose on the content they manage. They enable fine grained project level control on who is allowed to do what to shared or restricted TermFactory resources.
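The idea of a domain as a regular expression over ontology URIs can be sketched as follows. This Python fragment is not the GateService implementation; the data structure, the function name, the example URIs and the choice of matching the whole URI against the pattern are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    domain_pattern: str  # regular expression over ontology URIs
    can_read: bool
    can_write: bool


def is_allowed(grants, uri, action):
    """Return True if a grant whose domain matches the URI permits the action."""
    for grant in grants:
        if re.fullmatch(grant.domain_pattern, uri):
            if action == "read" and grant.can_read:
                return True
            if action == "write" and grant.can_write:
                return True
    return False


# Hypothetical user: may read everything under example.org, write only food terms.
grants = [
    Grant(r"http://example\.org/onto/.*", can_read=True, can_write=False),
    Grant(r"http://example\.org/onto/food#.*", can_read=True, can_write=True),
]
print(is_allowed(grants, "http://example.org/onto/food#Cheese", "write"))
print(is_allowed(grants, "http://example.org/onto/art#Painting", "write"))
```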

The simple document manager of the prototype editor remains to be upgraded to a more sophisticated XLIFF based document manager built using the GlobalSight document management API. See the MOLTO TT API document for more detail.
A key consideration for the usability of MOLTO translation is the ease with which its text coverage can be extended by a user community. We need to pay great attention to adaptability. The most important factor in extensibility is lexical coverage. Grammatical coverage can be developed and maintained with language engineering, and grammatical gaps can often be circumvented by paraphrasing. In contrast, paraphrasing is not a real option for special domain terms. There are two cases to consider: either the abstract grammar misses concepts, or concrete grammars for some language/s are missing equivalents. In the first case, we need to extend the domain ontology and its abstract grammar. In the second case, we need to add terms.
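The two cases can be told apart mechanically once the abstract concepts and the per-language lexicons are known. The Python sketch below assumes a simplified representation (a set of concept identifiers and one dictionary per language); the function name, the action labels and the sample data are hypothetical.

```python
def classify_gap(concept, abstract_concepts, concrete_lexicons):
    """Decide which kind of extension a missing translation calls for.

    Returns (action, languages): 'extend-ontology' if the abstract grammar
    lacks the concept, 'add-terms' with the languages lacking an equivalent,
    or 'covered' if nothing is missing.
    """
    if concept not in abstract_concepts:
        return "extend-ontology", []
    missing = sorted(
        lang for lang, lexicon in concrete_lexicons.items() if concept not in lexicon
    )
    if missing:
        return "add-terms", missing
    return "covered", []


abstract_concepts = {"food:Cheese", "food:Wine"}
concrete_lexicons = {
    "Eng": {"food:Cheese": "cheese", "food:Wine": "wine"},
    "Fin": {"food:Wine": "viini"},
}
print(classify_gap("food:Cheese", abstract_concepts, concrete_lexicons))
print(classify_gap("food:Bread", abstract_concepts, concrete_lexicons))
```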
For both ontology and term management, we bring to MOLTO the TermFactory ontology based terminology management concept. TermFactory is a system of distributed multilingual term ontology repositories maintained by a network of collaborative management platforms. It has been described at length in the TermFactory Manual at http://www.helsinki.fi/~lcarlson/CF/TF/doc/TFManual_en.xhtml.
The user of the MOLTO translation editor has direct access through the equivalents editor to querying and editing term equivalents for concepts already in available ontologies, either already in TermFactory or 'raw' from the Web of Data, in particular, the OntoText services serving data from FactForge repository.
Say for instance there is no equivalent listed for cheese in some language's concrete grammar FooLang. The author/translator can use the equivalents editor to query for terms for the concept food:Cheese in TermFactory or do a search through OntoText services for candidate equivalents, or, if she knows the answer herself, submit equivalents through the equivalents editor. The new equivalent/s are saved in the user's own MOLTO lexicon, and submitted to TermFactory as term proposals for the community to evaluate.
If there is a conceptual gap not easily filled in through the equivalents editor, there is the option of forwarding the problem to an appropriate TermFactory collaborative platform. This route is slower, but it gives a better guarantee of quality in the longer run, as inconsistency or duplication of work may be avoided. Say there is no concept in the domain ontology for a new notion that occurs in the source text. In easy cases, new concepts can be added through the equivalents editor, subclassing some existing concept in the ontology. In more complex cases, where negotiations are needed in the community, an ontology extension proposal is submitted through a TermFactory wiki. TermFactory offers facilities for discussing and editing ontologies and their terms. In due time, the modified ontology gets implemented in a new release of the GF domain abstract grammar.
TermFactory ontologies are extensible and support reasoning. Instead of implementing domain ontology-to-grammar bridges over and again for every new domain and application, it seems more promising to take advantage of the semantic network structure of (term) ontologies. Suppose verbalizations are already defined for a selection of upper or middle level ontologies. Special domain ontologies can subclass them and thereby also inherit the verbalizations that go with the superclasses and properties. UHEL is currently looking at the generalization of the MOLTO museum case ontology-to-grammar mapping in this direction.
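The inheritance idea can be sketched as a lookup that walks up the superproperty chain until a verbalization is found. This Python fragment is a simplified illustration and not the museum case mapping; it assumes single inheritance and string templates, and the property names and the template are hypothetical.

```python
def inherited_verbalization(prop, superproperty, templates):
    """Return the template of prop or of its nearest ancestor that has one."""
    current = prop
    visited = set()  # guards against cycles in the hierarchy
    while current is not None and current not in visited:
        visited.add(current)
        if current in templates:
            return templates[current]
        current = superproperty.get(current)
    return None


# A domain property subclasses an upper-level property that is already verbalized.
superproperty = {"museum:paintedBy": "upper:createdBy"}
templates = {"upper:createdBy": "{object} was created by {agent}"}

template = inherited_verbalization("museum:paintedBy", superproperty, templates)
print(template.format(object="the painting", agent="the artist"))
```

A domain ontology can still override the inherited wording by registering its own template for the more specific property.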
The TT translation editor is just a prototype. Different scenarios and platforms may call for different combinations of its features. One way to go is to extend the prototype with further tabs and facilities for CAT tool support. But there is also the opposite alternative to consider, that of calling MOLTO translation tool services from a third party editor. GlobalSight has two built in translation editors, called popup editor and inline editor. The popup editor is a Trados TagEditor lookalike, while the inline editor has something of the look and feel of old Trados versions running WYSIWYG on Microsoft Word. The inline editor has been implemented in javascript using the FCKEditor library. It might just be feasible to embed MOLTO prototype editor functionalities into the GlobalSight editor(s). In the GlobalSight setup, there is already support for importing cut-and-dried MT translations from an MT service, but here we are talking about something rather more intricate.
It is not immediately obvious which route would provide least resistance. From the point of view of GF usability, finding a neat way of embedding GF editing functions in third party translation editors could be a better sales position than trying to maintain a whole new MOLTO translation environment. (Unless of course, the new environment is clearly more attractive to targeted users than existing ones.) We may also try to have it both ways.
It was noted above that blind translation in the case of incomplete or inadequate coverage in resource grammars can occasion a round of reviewing and giving feedback on the translations before publication. This part of the process is in its main outlines familiar from the translation industry workflow, and can be implemented as a variation of it. In the MOLTO workflow, reviewer comments are not returned (just) to the human author/translator(s), but they should have repercussions in the ontology and grammar management workflows. This part requires modifying and extending the existing GlobalSight revisioning tools to communicate with the MOLTO lexical resources and grammar services. The GlobalSight revisioning tools now use email as the human-to-human communication channel. We probably want to use a webservice channel for machine-to-machine communication, and possibly some web commenting system as an alternative to email.
To the extent grammar engineering can be delegated to translation tool users, it must happen transparently without requiring knowledge of GF. One way to do this is through what is known as example-based grammar writing in GF. Example-based grammar writing is a new GF technique for backward-engineering GF source from example translations. It can play a significant role in the translation-to-grammar feedback cycle. This part of the TT API will be borrowed from the MOLTO Grammar Developer Tools API.
The following sections describe what parts of the above list are already in place in the prototype and what remains to do.
GlobalSight (http://www.globalsight.com/) is an open source Translation Management System (TMS) released under the Apache License 2.0. Version 8.2 was released on Sept 15, 2011. As of version 7.1 it supports the TMX and SRX 2.0 standards of the Localization Industry Standards Association. It was developed in the Java programming language and uses the MySQL database and OpenLDAP directory software. GlobalSight also supports computer-assisted translation and machine translation.
According to the documentation, GlobalSight has the following features:
The latest full Linux install version of GlobalSight is 7.1.0.x. It can be updated to the current version 8.2.2.0 using publicly available upgrade packages. The GlobalSight 7.2.0.0 base version and the upgrade packages are available from SourceForge. (Copies are available from tfs.cc under /srv/GlobalSight_backup/upgrade.) More detailed install instructions, including scripts to install LDAP for GlobalSight, can be found at http://tfs.cc/globalsight-molto-install/. A fully functional GlobalSight site also needs access to email services.
To upgrade from a working install of GlobalSight 8.2.2.0 to MOLTO GlobalSight, download, unpack and run http://tfs.cc/molto/GlobalSight_Installer_8.2.2.1.zip.
There is also a complete MOLTO GlobalSight eclipse project archive at http://tfs.cc/molto/molto-globalsight-8.2.2.1-linux-eclipse-20120529.zip containing the source as well as the runtime.
MOLTO GlobalSight differs from GlobalSight out of the box in two ways. First, MOLTO GlobalSight extends MOLTO user roles to terminology editing. It will be discussed in more detail below in connection with TermFactory. Second, GlobalSight has two built in translation editors, called popup editor and inline editor. The popup editor is a Trados TagEditor lookalike, while the inline editor has something of the look and feel of old Trados versions running WYSIWYG on Microsoft Word. The inline editor has been implemented in javascript using the FCKEditor library. MOLTO GlobalSight extends the selection by embedding the MOLTO TT editor as a third option on the editor menu:

Clicking the option opens the MOLTO TT editor in another window.
As yet, content from the document under translation is not automatically imported into the MOLTO TT editor. Content can be cut and pasted into the MOLTO TT editor.
The MOLTO TermFactory prototype consists of the generic TermFactory codebase plus MOLTO related ontology content. At present, such content comprises the English-Finnish WordNet ontology. Integration of the TermFactory back-end with the MOLTO KRI over JMS is underway.
The TermFactory codebase consists of
TermFactory is an architecture and a workflow for Semantic Web based, multilingual, collaborative terminology work. What this means in practice is that it applies Semantic Web and other document and language technology standards to the representation of multilingual special language terms and the related concepts, and provides a plan for how such terminologies can be collected, updated, and agreed about by professionals, not only terminology professionals, all over the globe, during their everyday work on virtual work platforms over the web. As a whole, TF could be termed a semantic web framework for multilingual terminology work.
TF provides
for people to work on terms jointly or separately, building on the results of the work of others, while maintaining quality and consistency between the different contributions.
As a prototype, there is a MediaWiki platform for human to human collaboration on collecting terminological data plus a TF editor plugin for conveying the results of the collaboration into TermFactory ontology format. Here is a snapshot of a random MOLTO TF concept in the Wiki.

MOLTO TermFactory Mediawiki is used in the usual way a wiki works. In the demo prototype, it has been populated with the Finnish-English Wordnet (ca. 100K concepts, 2 languages, ca. 200K terms per language). The pages are generated automatically on demand. A Wordnet page currently only consists of a set of iframes and links to related lexical resources on the web. In actual use, each category (Wordnet is one) may generate its own boilerplate page design to help users describe and discuss the concepts of a category and their designations in different languages. A commenting system is in place that can be shared between different platforms and applications. The discussion threads are indexed by the URI of the relevant resource.
The TermFactory ontology content related to a resource can be queried and edited on the Mediawiki platform using a TermFactory ontology editor extension, shown on top of the page as the Entry Editor tab. Below is a snapshot showing the TF editor opened to the TermFactory entry corresponding to the chosen WordNet term.

Instead of going by way of fill-in forms, the TermFactory approach is to support direct WYSIWYG editing of localized ontology triples in a HTML textarea editor. The TermFactory editor application uses the CKEditor javascript textarea editor for this purpose. TF adds to the CKEditor standard release a special purpose plugin that adds TermFactory specific action buttons and a menu to the standard issue.
While staying conceptually close to the original RDF format of the data, the TermFactory editor layout is quite versatile. With suitable parameters, it can be tweaked to show ontology content editable in shapes already familiar to professional terminologists. There is a customisable, schema-aware insertion menu to help inserting relevant content, plus customisable input and output layout templates. The editor is not limited to TermFactory ontologies, as it is built on a general purpose textarea editor using a generic RDF to HTML mapping.
A specialty of TermFactory is that it supports terminological reflexion. The metaterminology used in the editor is not fixed, but can be changed by giving it a TF term ontology as parameter. Using TF localization and bridge ontologies, not only the editor interface, but also the content shown can be localized to a user community's conceptualization, language and terminology. Here is the same editor page fetched after setting the Mediawiki language to Finnish. Note how the terminological metalanguage used in the entry is now shown in Finnish. (The localization is not complete, because the current localization ontology's coverage has some gaps.)

The TermFactory source code is on svn at svn.it.helsinki.fi/repos/termfactory. A username and password on the repository server is needed for checkout. 
To check out a path, choose an installation directory, go to it and run:
svn checkout https://<username>@svn.it.helsinki.fi/repos/termfactory/path
The compiled web archive files for TF are
| Archive | Contents |
|---|---|
| io/lib/tf-io.jar | The core library (offline tools) |
| ws/service/TFServices.aar | The Axis2 webservice archive |
| ws/servlet/TermFactory.war | The Tomcat webapp archive |
These three archives should be enough to deploy TF on Linux from binaries, on Tomcat running Axis2. Installations of MySQL and Jena TDB are needed for persistent storage of ontologies on the TermFactory server. File upload services require prior installation of WebDAV. Detailed TF source build and install instructions are available on request.
TermFactory MediaWiki is MediaWiki out of the box plus the TermFactory MediaWiki extension, downloadable from the TermFactory svn path fe/TermFactory. The extension requires the TF back end to be installed.
To enable the extension, add the line require("$IP/extensions/TermFactory/TermFactory.php"); to LocalSettings.php in the main directory.
User management between MediaWiki, TermFactory services, and TermFactory WebDAV is not fully in sync yet.
This section comments on the current status of the integration of the different parts.
Here is a figure showing some of the connections in the design.
 
This section repeats the wishlist of requirements from Translation Tools on the GF grammar and translation API.
Assume the GF translation goes to a reviser, working with or without another copy of the MOLTO translation tool. The corrected translation, in XLIFF form, should be brought to GF's attention. This calls for a new functionality from the GF grammar API: one which corrects the grammar and lexicon software to produce the output required by the corrected translation. This functionality is to be built on the GF example-based grammar writing methodology.
In order for the corrections to converge, revised translations must accumulate so that the newest corrections do not falsify earlier ones. The collection of manual corrections may become ambiguous or inconsistent, which situation should also be recognised and brought to the attention of a grammar engineer. Again, it is important to pay attention to user roles and rights.
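The requirement that inconsistent corrections be recognised can be illustrated with a minimal sketch. It assumes, purely for illustration, that each accumulated correction is stored as a (source segment, target language, corrected target) triple; the function name and the sample data are hypothetical and not part of the MOLTO tools.

```python
from collections import defaultdict

def find_conflicts(corrections):
    """Return the (source, language) pairs for which revisers have supplied
    more than one distinct corrected target text."""
    targets = defaultdict(set)
    for source, lang, target in corrections:
        targets[(source, lang)].add(target)
    # Only keys with two or more competing targets need a grammar engineer.
    return {key: sorted(vals) for key, vals in targets.items() if len(vals) > 1}

# Made-up corrections; the first two disagree on the same source segment.
corrections = [
    ("this pizza is very good", "it", "questa pizza è molto buona"),
    ("this pizza is very good", "it", "questa pizza è buonissima"),
    ("that wine is expensive", "it", "quel vino è caro"),
]
conflicts = find_conflicts(corrections)
for (source, lang), variants in conflicts.items():
    print(f"{lang}: '{source}' has {len(variants)} competing corrections")
```

A real implementation would also have to take user roles and rights into account, for instance by letting a trusted reviser's correction take precedence.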
We may want to provide ways to override GF translations with canned translations. At the translation tools level, this can happen by preferring TM translations over GF. We should also consider ways to override compositional translations on GF grammar level.
Another requirement is translation-time update of the grammar, or at least of the lexicon, so that the translator can add to the lexicon on the fly.
If we want to support translating formatted documents using XLIFF, the minimum requirement is that the GF translation API handles XLIFF inline tags.
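As a rough illustration of what handling inline tags involves, the sketch below separates XLIFF inline markup from the translatable text of a segment, so that only plain text would be passed to the grammar. It is an assumption-laden simplification: it covers only the g, x, bx and ex elements, which carry no native code content, and it does not attempt to put the tags back into the translation, which is the hard part. The function name is hypothetical.

```python
import re

# Matches opening, closing and empty forms of a few XLIFF 1.2 inline elements.
INLINE_TAG = re.compile(r"</?(?:g|x|bx|ex)\b[^>]*>")

def split_inline(segment):
    """Return the plain text of a segment and the inline tags found in it."""
    tags = INLINE_TAG.findall(segment)
    text = INLINE_TAG.sub("", segment)
    return text, tags

text, tags = split_inline('this <g id="1">pizza</g> is good<x id="2"/>')
print(text)
print(tags)
```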
The complete MOLTO TT prototype editor code is downloadable as an eclipse (Helios) project archive http://tfs.cc/molto/molto-tt-0.9-linux-eclipse-20120529.zip. The TT editor's database back-end Haskell source code is packaged as http://tfs.cc/molto/molto-tt-server-0.9-linux-20120529.zip.
Install apache and fastcgi with apt-get:
sudo apt-get install apache2 libapache2-mod-fastcgi
Below is a sample Apache2 virtual host that handles the MOLTO TT back end services on port 8888 (the default). The back end server is supposed to be in the same domain as the editor to avoid cross-domain scripting violations. Copy the text below to /etc/apache2/sites-available/default
<VirtualHost *:8888>
    ServerAdmin webmaster@localhost
    DocumentRoot /var/www
    AddDefaultCharset UTF-8
    <Directory />
        Options FollowSymLinks
        AllowOverride None
    </Directory>
    <Directory /var/www/>
        Options Indexes FollowSymLinks MultiViews
        AllowOverride None
        Order allow,deny
        allow from all
    </Directory>
    # Allow fastcgi services from fcgi-bin.
    <Directory "/var/www/fcgi-bin/">
        Options +ExecCGI
        AddDefaultCharset UTF-8
        SetHandler fastcgi-script
    </Directory>
    # Identify pgf-service as a fastcgi server
    FastCgiServer /var/www/fcgi-bin/pgf-service
    # Identify content-service as a fastcgi server
    FastCgiServer /var/www/fcgi-bin/content-service
    # Make action pgf-service handle pgf files
    Action pgf-service /fcgi-bin/pgf-service
    AddHandler pgf-service .pgf
    AddCharset UTF-8 .pgf
</VirtualHost>
After you have copied the above to /etc/apache2/sites-available/default, activate the changes:
sudo a2enmod fastcgi
sudo a2enmod actions
Finally, restart apache by typing sudo service apache2 restart.
Make the Apache group the owner group of /var/www: sudo chown -R root:www-data /var/www/ (the name of the group might vary; you can find yours by checking which group owns /var/www, e.g. with grep "/var/www" /etc/passwd).
Get the latest sources for GF from the darcs repository. The instructions are here: http://www.grammaticalframework.org/download/index.html
Assuming you have the source files, go to src/server and type sudo cabal install -f content --global. It is important to use the option --global, because by default they are installed in the home directory, and that doesn't go well with Apache. The binaries will be installed in /usr/local/bin. Because of Apache, you need to set their owner group to the apache group (depending on machine, e.g. www-data or apache).
The next step is to link the binaries to /var/www/fcgi-bin.
/var/www$ sudo ln -s /usr/local/bin/content-service fcgi-bin/
/var/www$ sudo ln -s /usr/local/bin/pgf-service fcgi-bin/
If the installation fails with an error such as
   gf-server-1.0 depends on fastcgi-3001.0.2.3 which failed to install.
install the FastCGI development library (sudo apt-get install libfcgi-dev), then try again to install PGF service and content service: sudo cabal install -f content --global
This applies only if you want to use the editor from Eclipse. Otherwise you don't need any of this.
To test the TT editor under Eclipse using GWT devMode, we found it necessary to recompile content-service to add the GWT code server port parameter to page URLs. To do so, activate the following lines in ContentService.hs. We have been using Eclipse 3.6 JEE with Google Web Toolkit version 2.3.1.
-- devModeScriptName = (liftM2 (++)) (getVarWithDefault "SCRIPT_NAME" "") (return "?gwt.codesvr=127.0.0.1:9997")
-- path <- devModeScriptName
A corresponding change is needed in the client code. Activate the following line in TT-0.9/src/org/grammaticalframework/ui/gwt/client/SettingsPanel.java
// String defaultUrl = "/fcgi-bin/content-service?gwt.codesvr=127.0.0.1:9997";
Launch settings for building and testing under devMode under eclipse (in $HOME/workspace/.metadata/.plugins/org.eclipse.debug.core/.launches):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<launchConfiguration type="com.google.gdt.eclipse.suite.webapp">
<booleanAttribute key="com.google.gdt.eclipse.core.RUN_SERVER" 
                  value="false"/>
<stringAttribute key="com.google.gdt.eclipse.core.SERVER_PORT"
                  value="80"/>
<stringAttribute key="com.google.gdt.eclipse.suiteMainTypeProcessor.PREVIOUSLY_SET_MAIN_TYPE_NAME"
                 value="com.google.gwt.dev.DevMode"/>
<booleanAttribute key="com.google.gdt.eclipse.suiteWarArgumentProcessor.IS_WAR_FROM_PROJECT_PROPERTIES"
                  value="true"/>
<listAttribute key="com.google.gwt.eclipse.core.ENTRY_POINT_MODULES">
<listEntry value="org.grammaticalframework.ui.gwt.EditorApp"/>
</listAttribute>
<stringAttribute key="com.google.gwt.eclipse.core.URL"
                 value="editor"/>
<listAttribute key="org.eclipse.debug.core.MAPPED_RESOURCE_PATHS">
<listEntry value="/TT-0.9-ORIGINAL"/>
</listAttribute>
<listAttribute key="org.eclipse.debug.core.MAPPED_RESOURCE_TYPES">
<listEntry value="4"/>
</listAttribute>
<stringAttribute key="org.eclipse.jdt.launching.CLASSPATH_PROVIDER"
                 value="com.google.gwt.eclipse.core.moduleClasspathProvider"/>
<stringAttribute key="org.eclipse.jdt.launching.MAIN_TYPE"
                 value="com.google.gwt.dev.DevMode"/>
<stringAttribute key="org.eclipse.jdt.launching.PROGRAM_ARGUMENTS"
                 value="-startupUrl editor -war $HOME/workspace/TT-0.9/www/editor -noserver -remoteUI &quot;${gwt_remote_ui_server_port}:${unique_id}&quot; -logLevel INFO -codeServerPort 9997 org.grammaticalframework.ui.gwt.EditorApp"/>
<stringAttribute key="org.eclipse.jdt.launching.PROJECT_ATTR" value="TT-0.9"/>
<stringAttribute key="org.eclipse.jdt.launching.VM_ARGUMENTS" value="-Xmx512m"/>
</launchConfiguration>
First you need to install HSQL. It's in hackage, so it can be installed by typing sudo cabal install hsql --global.
The public version of the Haskell MySQL package hsql-mysql-1.8.1 used by the TT content service appears to have a bug that prevents multiple successive mysql procedure calls. A debugged version of the package can be found at http://tfs.cc/molto/hsql-mysql-1.8.1-molto.zip.
Install the debugged version:
Unpack hsql-mysql-1.8.1-molto.zip.
Go to the directory that contains hsql-mysql.cabal.
Run sudo cabal install --global.
The package needs hsql version 1.8.1, which can be installed with sudo cabal install hsql-1.8.1 --global or by downloading the version from Hackage.
In addition to the above, you need a MySQL server. You can install one by typing sudo apt-get install mysql-server.
Content-service uses a mysql database to store users, grammars and documents.
To create the database connection you need to do the following steps:
Write host db user pwd to a file, in a format known to Haskell readFile.
Run content-service fpath.
Now create the database:
$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
...
mysql> CREATE DATABASE moltodb;
CREATE USER moltouser IDENTIFIED BY 'moltopass';
GRANT ALL on moltodb.* to moltouser;
Query OK, 1 row affected (0.02 sec)
mysql> Query OK, 0 rows affected (0.00 sec)
mysql> Query OK, 0 rows affected (0.00 sec)
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| moltodb            |
+--------------------+
2 rows in set (0.00 sec)
mysql> quit
Bye
Next, create the database files.
/usr/local/bin$ ./content-service fpath
And then log in to mysql with the user moltouser.
mysql -u moltouser -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
mysql> use moltodb;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+-------------------+
| Tables_in_moltodb |
+-------------------+
| Documents         |
| GrammarUsers      |
| Grammars          |
| Users             |
+-------------------+
4 rows in set (0.00 sec)
If you see these four tables, they were created correctly.
The tables are still empty. After you first sign in with your Google account, your user account will appear in the table Users. You can query any of the tables by writing select * from <table>.
To install the tabular equivalents editor source, do this:
Check out the source from https://svn.it.helsinki.fi/repos/molto/trunk/molto_term_editor/ and put it in any place that the Apache server can reach, e.g. /var/www/.
Download Ext JS from http://extjs.cachefly.net/ext-4.0.2a-gpl.zip, and uncompress it as extjs-4.0.3 under the directory molto_term_editor/.
Open http://localhost/molto_term_editor/editor_sparql.html in a browser, if you put the source code under /var/www/.
The GlobalSight installation values are kept in $HOME/workspace/GS-8.2.2.1/main6/tools/build/dist/GlobalSight/install/data/installValues.properties. The following shows the settings used for a MOLTO GlobalSight Eclipse installation (with $HOME replacing the installation directory and PASSWORD replacing the passwords):
#Mon May 28 22:52:03 EEST 2012
mailserver=mail.domain.com
system_log_directory_forwardslash=$HOME/workspace/GS-8.2.2.1/main6/tools/build/dist/GlobalSight/logs
install_data_dir_forwardslash=$HOME/workspace/GS-8.2.2.1/main6/tools/build/dist/GlobalSight/install/data
server_host=localhost
database_password=PASSWORD
database_server=localhost
database_username=globalsight
ldap_password=PASSWORD
ldap_install_dir=/var/lib/ldap
server_port=9090
ldap_host=localhost
gs_home=$HOME/workspace/GS-8.2.2.1/main6/tools/build/dist/GlobalSight
ldap_username=ldap_connection
admin_email=WelocalizeAdmin@domain.com
system4_admin_username=gsAdmin
ldap_base=globalsight.com
ldap_port=389
GS_HOME=$HOME/workspace/GS-8.2.2.1/main6/tools/build/dist/GlobalSight
cap_login_url=http\://127.0.1.1\:9090/globalsight
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D3.3 MOLTO translation tools / workflow manual | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M31 | 
| Actual date of delivery: | March 2013 | 
| Type: | Manual | 
| Status & version: | Draft | 
| Author(s): | Inari Listenmaa, Jussi Rautio | 
| Task responsible: | UHEL | 
| Other contributors: | Lauri Alanko, John Camilleri, Thomas Hallgren | 
Abstract
Deliverable D3.3 consists of a manual and a description of the workflow of the MOLTO translation tools. The document introduces the components: the open-source translation management system Pootle; the Simple Translation Tool, which supports many different translation methods; and the Syntax Editor, which allows text to be modified by manipulating abstract syntax trees.
The document presents two translation workflows. The first scenario integrates MOLTO tools into professional translation from a fixed source, using Pootle. MOLTO translations with GF grammars are added to the machine translation options. Another direction taken is populating the translation memory with GF-generated data. In the second workflow, the translator is authorised to make changes to the source. The tools used in this scenario are the Simple Translation Tool and the Syntax Editor.
This deliverable, D3.3, is a manual and a description of the workflow for the translation tools produced within WP3. As stated in the previous deliverables D3.2 and D3.1, the users of the translation tools are not required to know how to write GF grammars. They are either translators, whose job is to translate from a fixed source, or authors who are authorised to modify the source text so that it fits the structures covered by the domain-specific grammar(s). This document presents workflows for both scenarios.
In section 2, we present the components: the open-source translation management system Pootle, the Simple Translation Tool and the Syntax Editor. In section 3, we present the translation workflows. (Technical details, where?) In section 4 we talk about future work (and failures? e.g. professional scenario hard to adjust to the idea of non-fixed source. TF not really in use, so lexicon adding did not work as planned.).
We have changed the plan from deliverables D3.1 (translation tools API) and D3.2 (prototype). The previous deliverables use GlobalSight, a translation management system, and an external editor that supports GF.
Due to the changes, it is not necessary to include a summary. The sections of this document are self-contained, assuming that the reader is generally familiar with the MOLTO project.
The Simple translation tool (http://cloud.grammaticalframework.org/translator) is a translator's editor that supports manual and automatic translation. Documents consist of a sequence of segments that are translated independently. The user can import text in the source language and obtain automatically translated text in the target language. Imported text can be segmented based on punctuation. Optionally, one can also use line breaks or blank lines to indicate segmentation in imported text. Text can be edited after it has been imported.
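The three segmentation options described above can be sketched as follows. This is only an illustration of the idea, not the tool's actual code; the function name, the mode names and the rule that a segment ends at ".", "!" or "?" followed by whitespace are assumptions.

```python
import re

def segment(text, mode="punctuation"):
    """Split imported text into segments that are translated independently."""
    if mode == "punctuation":
        # Break after sentence-final punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text)
    elif mode == "lines":
        parts = text.splitlines()
    elif mode == "paragraphs":
        # Blank lines separate segments; single line breaks are kept inside.
        parts = re.split(r"\n\s*\n", text)
    else:
        raise ValueError(f"unknown segmentation mode: {mode}")
    return [p.strip() for p in parts if p.strip()]

print(segment("This pizza is good. Is that wine Italian? Yes!"))
print(segment("first line\nsecond line\n\nthird line", mode="paragraphs"))
```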
In the image below, the translator chooses the source and target languages and uploads a text in the source language.

The text can be displayed as parallel texts or segment by segment, as shown below.


The translator can choose a translation method for the whole document, or when needed, for each segment. The translation methods include various GF grammars, transfer-based machine translation by Apertium and manual translation. Other machine translation options can be added as well. The choice of grammars is shown in the picture above. Choosing a different grammar for different segments is relevant, if the body of the text is unrestricted, but there are passages where precision is required, such as formulas in a patent application. Then the unrestricted text can be translated with a method that has more coverage but less precision, and the formulaic parts with a specialised grammar.
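The choice of a document-wide method with per-segment overrides can be pictured with a small sketch. The two translator functions below are stand-ins invented for the example (a one-entry lookup table playing the role of a domain grammar, and a dummy wide-coverage system); they do not call GF or Apertium.

```python
def toy_grammar(segment):
    """Stand-in for a precise, low-coverage grammar: None means no parse."""
    table = {"this pizza is good": "questa pizza è buona"}
    return table.get(segment)

def toy_wide_coverage(segment):
    """Stand-in for a robust MT system that always returns something."""
    return "[MT] " + segment

def translate_document(segments, methods, default, overrides=None):
    """Translate each segment with the document default unless the
    translator has picked another method for that segment index."""
    overrides = overrides or {}
    result = []
    for index, seg in enumerate(segments):
        name = overrides.get(index, default)
        result.append((name, methods[name](seg)))
    return result

methods = {"grammar": toy_grammar, "wide": toy_wide_coverage}
segments = ["Some free commentary.", "this pizza is good"]
translated = translate_document(segments, methods, default="grammar", overrides={0: "wide"})
for name, text in translated:
    print(name, "->", text)
```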
The Syntax Editor (http://cloud.grammaticalframework.org/syntax-editor/editor.html), written by John Camilleri, is a tool for building and manipulating abstract syntax trees in GF.
The image below shows an initial view of the Syntax Editor. The chosen grammar is Phrasebook and the level of the construction is Action; a verb phrase whose tense and polarity are not fixed yet. This excludes some of the possibilities to construct a sentence in Phrasebook, for instance greetings and other fixed phrases.

A syntax tree can also be imported as raw text, as shown in the image below.

The editor maintains the structure of the abstract syntax tree (AST) and outputs linearizations in the languages of the concrete syntaxes. In the following three images, the first shows a change of an argument in the AST and the consequent change in the linearizations. Son is changed to daughter: in English only the word changes, but in French and Bulgarian, changing the gender of the argument also affects the agreement of the possessive pronoun.

The second image shows the change of the head of the phrase. Both AKnowPerson and ALove take two arguments of type Person, so from the point of view of the AST, they are interchangeable.

The third image manipulates the polarity of the sentence. As opposed to the previous two examples, this AST is complete, belonging to the start category of the grammar, that is, Phrase.

Finally, a tree created or modified in the editor can be exported as raw text.

Future work includes integration into the Simple Translation Tool. Further details about the plans are given in section 4.X.
Pootle (http://tfs.cc/pootle) is an open-source translation and project management platform that is implemented entirely as a Web service. The platform accepts most of the formats used in the translation industry, such as XLIFF, TMX and PO.
Pootle supports both standard translation memories and machine translation. Google Translate and Apertium MT systems are available in the standard installation, and Web server queries to other MT systems can be sent by modifying the source code of Pootle. GF translation via the GF Web API has been added to the Pootle translation environment as a proof of concept.
Unlike Google Translate and Apertium, which only require the translatable segment and the source and target language codes as input, the GF system also needs the name of the resource grammar(s) to be used. The selection of the grammar has been added to the Pootle project administration dialog, where the project manager normally selects the languages, file formats, translation memories and other resources to be used:

The translator user interface shows the buttons for different MT systems above the edit box (GF, Google and Apertium buttons are seen on the screenshot below).

When the GF button is clicked, the web browser sends the source segment to the GF server, which parses and linearizes it into the target language using the given grammar(s). The translated text is then sent back to the browser. The translation suggestion can then be edited, after which the segment is added to the project translation memory.
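The round trip described above can be sketched from the client side. The request parameters (command=translate, input, from, to), the /grammars/ path and the shape of the JSON response follow our reading of the GF web service and should be treated as assumptions to check against the GF Web API documentation; the server address, the grammar Foods.pgf and the sample response are made up for the example, and no network call is made.

```python
import json
from urllib.parse import urlencode

def translate_url(server, grammar, text, source, target):
    """Build the URL of a translation request to a PGF service."""
    query = urlencode({"command": "translate", "input": text,
                       "from": source, "to": target})
    return f"{server}/grammars/{grammar}?{query}"

def extract_translations(response_text):
    """Collect the linearized target strings from a JSON response."""
    results = json.loads(response_text)
    return [lin["text"]
            for result in results
            for translation in result.get("translations", [])
            for lin in translation.get("linearizations", [])]

url = translate_url("http://localhost:8888", "Foods.pgf",
                    "this pizza is good", "FoodsEng", "FoodsIta")

# Hand-written response in the assumed format, not real server output.
sample = ('[{"from": "FoodsEng", "translations": [{"tree": "Pred (This Pizza) Good", '
          '"linearizations": [{"to": "FoodsIta", "text": "questa pizza è buona"}]}]}]')
suggestions = extract_translations(sample)
print(url)
print(suggestions)
```

An empty list of suggestions would correspond to the case where the grammar cannot parse the source segment.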
This illustrates that the GF translation system can be added relatively easily to various translation editing applications via the Web API. The GF Web API can be modified to comply with any standard Web API, such as the one proposed by TAUS (https://tauslabs.com/interoperability/taus-translation-api).
This part consists of two separate workflow descriptions. The first workflow is that of a traditional professional translation, using the translation platform Pootle, with GF integrated into machine translation and translation memory. The second workflow describes a case where the translator is authorised to make changes to the source. The tools of choice are the Simple Translation Tool and the Syntax Editor.
The workflow of a professional translation is often fairly complex, including roles such as project manager, translator and reviewers of both content and language. Machine translation is used as the translator's aid, along with other tools such as dictionaries and translation memories. This is an established practice in computer-assisted translation; one of the main objectives of WP3 is to demonstrate that MOLTO tools can be adapted in a traditional translation workflow.
The translator is not allowed to modify the source text, which is a serious limitation for MOLTO translation, since it offers precision at the cost of coverage. However, in this scenario we assume that the translator necessarily knows both source and target languages. The role of the machine translation is not to provide publication-quality text for blind translation, but to help the translator to produce translations.
When does it make sense to use MOLTO tools? With free text of unrestricted domains, the most common case is that GF grammars do not produce any translation at all, due to missing words or constructions. When a professional translator is post-editing, GF grammars beat general-purpose MT in situations where the structure is crucial. This applies to any formulaic parts within unrestricted text, such as mathematical constructions (case study X) and chemical formulas (case study Y). A name such as (2S)-2-[(4S)-4-(2,2-difluorovinyl)-2-oxopyrrolidinyl]butanamide is built by elaborate rules, which can be expressed precisely with a GF grammar. Statistical machine translation, however, fails to capture the structure, and the result is worthless for post-editing: a change, addition or omission of even one element is enough to change the formula completely.
Thus, we have integrated GF as one of the translation options in Pootle. The technical side is handled by GF Web API calls, explained in more detail in section 2.3.
The Pootle translation environment implements the now industry-standard workflow where the translatable material, the translation memories and the editing tools reside on the same Web server. The system also includes real-time word-count reporting, user management and terminology asset handling. This greatly reduces the effort needed for a translation project, as all the tools and resources are centralized. As the translations are updated into a shared translation memory in real time, there is no need to create, update and document the memories after the project. Pootle also allows the necessary resources to be downloaded locally in cases where the translator does not have an always-on internet connection.
The translation project manager (PM) can upload the files to be translated to the Pootle server and define the language pairs, translation memories, glossaries (either general or project-specific) and file formats to be used in the translation. The system allows the use of standard translation file formats like XLIFF (for source material), TMX (translation memories) and TBX (terminology). In our GF machine translation enabled version of Pootle, the PM can also select the GF grammar or grammars used for translation (see Section 2.3 for an example).
When the translation assets have been configured, the PM can give the translators and reviewers the necessary access rights to them. The material can also be translated by crowd-sourcing, so that anyone with access to the Pootle server can participate in the translation. This method has been used in many open-source localization projects, for example in the OpenOffice suite.
The translators can then log in with their credentials onto the Pootle server, and see all the translation tasks assigned to them by the PM:

After clicking a project name, the Languages page shows the target languages the translator has been assigned to:

Clicking the language name opens the editing environment:

Any exact or fuzzy match ("Translation suggestion" in Pootle terminology) found in the translation memory can be selected and edited for a translation. As explained in the previous section, the translator can use machine translation services (including GF) by clicking the relevant button. The translator is thus able to use the suggestions from the translation memory or a choice of machine translations, or to translate the segment from scratch.
The Pootle editing tool includes automatic checking for quality issues, for example missing tags, variables or numbers, wrong capitalisation, punctuation and so on, so the translator gets instant feedback on possible formatting errors in the translation. The translator is also able to include comments to reviewers and PMs as separate field in the tool and add or review terms in the terminology.
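The kind of checks listed above can be illustrated with a small sketch. It is not Pootle's implementation: the regular expressions, the variable syntax (%s and {name}) and the wording of the messages are assumptions chosen for the example.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")
TAG = re.compile(r"<[^>]+>")
VARIABLE = re.compile(r"%\w|\{\w+\}")

def quality_issues(source, target):
    """Return a list of possible formatting problems in a translated segment."""
    issues = []
    for label, pattern in (("number", NUMBER), ("tag", TAG), ("variable", VARIABLE)):
        # Anything present in the source but absent from the target is suspicious.
        missing = sorted(set(pattern.findall(source)) - set(pattern.findall(target)))
        if missing:
            issues.append(f"missing {label}: {', '.join(missing)}")
    if source[:1].isupper() and target[:1].islower():
        issues.append("capitalisation differs at segment start")
    if source and source[-1] in ".!?" and target[-1:] != source[-1]:
        issues.append("final punctuation differs")
    return issues

for issue in quality_issues("Open <b>%s</b> in 10 seconds.", "avaa <b>tiedosto</b>"):
    print(issue)
```

Checks of this kind only flag candidates for the translator to look at; a difference in punctuation or capitalisation can be perfectly correct in the target language.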
During the translation, the PM can follow the progress of the project on the Projects page. When the translation of a component is ready, the PM gets a notification, and the translation can be sent to reviewers, who then check and correct the translations by accepting or rejecting the suggestions. After the review process, the PM is again notified, and the translated and reviewed file can be downloaded for post-processing.
We hold on to the assumptions stated in D3.1:
In a case of at least partially blind translation, the quality of MT needs to be excellent. External revisers can be added to this scenario as well, but we assume that the quality is good in general, that errors are due to bugs in grammars, and that grammar writers correct them. A concrete scenario could be a multilingual website, where the authorized users can create content in any of the languages, and it is updated simultaneously into all of them. Assuming there are users for every language, they can themselves work as reviewers, providing feedback in case there is an error in the grammar. Then a grammar writer fixes the grammar, and all the structures that had the same problem will be updated.
There might be a source document or it can be created from scratch. In any case, there is a need for guided authoring, to ensure that the produced text is recognized by the grammar. This is not currently implemented, but planned by UGOT and explained further in section 4.2.
The Simple Translation Tool (STT) offers the functionalities for pre- and post-editing of MT. When needed, machine translation can be completely overridden by manual translation. The functionalities of STT are demonstrated with a toy text about pizzas. More complex grammars produced within MOLTO include the patent grammars in WP7 and the mathematical grammar library in WP6, but they are not integrated to STT at the moment. We plan to produce a video demo with some real use cases.
In the first image, the text is uploaded into STT and a default translation method is chosen.

In the second image, we see three errors. The first one, indicated with number 1, is an error in the grammar: an unidiomatic word choice. This type of error is easiest to fix just by modifying the target, followed by a bug report to a grammar writer. Of course, spotting this type of error requires that the translator knows the target language. Where they do not, they just need to assume that the input is correct.

The second error manifests as no translation. The solution here is to paraphrase the source; in this case, changing the modifier "really" to "very", which is supported by the lexicon. This example is very simplified; in any realistic situation, the possible changes are numerous. Either we need to assume that the translator/author has good documentation on the allowed constructions in the restricted language, or the program needs to guide the translator. The latter is a planned feature; the former depends on the individual use case.
The third error also shows no translation, but it is due to the segment belonging to a totally different domain. In the example document of 5 phrases, the first is a commentary, written in completely unrestricted language, and the four remaining phrases are the sort of restricted language that translates with our chosen GF grammar. In a realistic situation, the first phrase could be generic instructions and the latter ones could be mathematical formulas, in order for the scenario to make sense. In any case, the error is corrected by changing the translation method for that segment. Instead of any GF grammar, we choose Apertium, with more coverage but no guarantee of quality. In case the translator spots errors in the output of the generic MT option, there is the source post-editing option.

Finally, all three errors have been corrected. The user can view the texts parallel and save the project.

UHEL will conduct experiments on creating translation memory data with GF grammars. Grammars of a given domain are used to generate bilingual aligned data, which can be converted into a translation memory. Then, when translating new material, the translation memory can provide fuzzy matches for cases where the constructions are similar, but words are different. This is one way to compensate for the lack of lexicon in a situation where adding new vocabulary is hard.
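The idea of fuzzy matching against grammar-generated data can be sketched as follows. The sentence pairs stand in for bilingual data generated from a grammar and are invented for the example; the word-level similarity from Python's difflib and the 0.7 threshold are assumptions, and real translation memory tools use their own match metrics.

```python
from difflib import SequenceMatcher

def fuzzy_matches(tm, segment, threshold=0.7):
    """Return (score, source, target) entries of the translation memory whose
    source is similar enough to the new segment, best match first."""
    scored = []
    for source, target in tm:
        # Compare word sequences, so a single swapped word costs little.
        score = SequenceMatcher(None, segment.split(), source.split()).ratio()
        if score >= threshold:
            scored.append((round(score, 2), source, target))
    return sorted(scored, reverse=True)

# Stand-in for aligned data generated with a domain grammar.
tm = [
    ("this pizza is very good", "questa pizza è molto buona"),
    ("that pizza is expensive", "quella pizza è cara"),
]
matches = fuzzy_matches(tm, "this wine is very good")
print(matches)
```

Here the new segment shares its construction with the first entry but contains a word ("wine") that the memory does not cover, which is the situation the fuzzy match is meant to help with.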
Jonson (2006) describes an experiment on synthesizing a corpus with GF for training speech recognition models. The idea is similar: use a grammar to generate reliable data for a data-driven approach.
By using GF translation suggestions in a translation memory it is also possible to use standard translation tools like Trados to generate pre-translation reports of exact and fuzzy matches. These percentages of different matches are easier to demonstrate to translation industry stakeholders, as the scientific metrics used in MT evaluation (BLEU, NIST, Rouge and so on) are not generally used or well understood within the industry.
Services in the GF cloud will be linked to each other by UGOT. Syntax editor (http://cloud.grammaticalframework.org/syntax-editor/editor.html) will be used within the Simple Translation Tool (http://cloud.grammaticalframework.org/translator/). As described in the DoW, there will be a mode for editing source text, where structural changes to the document can be made by manipulating abstract syntax trees. This functionality will be added to the Simple Translation Tool.
Simple Translation Tool will be extended to a bilingual, controlled language document authoring tool, with useful ways to enter and edit the source text too. Additions include a text input guided by word completion and syntax tree editing, by invoking the syntax editor (see section 2.2) on a source segment.
We plan to submit a system demo paper about a unique MT platform to the MT Summit (http://www.mtsummit2013.info/, deadline 22 April).
Grammar editing in the translator's tools is still an open question. One of the main shortcomings of MOLTO-type translation is its limited coverage, which is why it is important that a translator can easily extend and modify the lexicon. Just importing raw lexicon data, with or without TermFactory, is described in Listenmaa (2012). That is a question of adding more content; a separate question is modifying existing grammars, usually because of an error.
Some steps have been taken to address this issue. D11.2 presents a multilingual semantic wiki, where it is possible for every user to modify the grammar behind the wiki. This is still expert work, as the editing is done with raw GF code, but there are methods for guided grammar editing, such as the cloud-based IDE (see documentation). This environment offers an easy way of writing and editing multilingual grammars.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D4.1 Knowledge Representation Infrastructure | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | 1 Nov 2010 | 
| Actual date of delivery: | 1 Nov 2010 | 
| Type: | Regular Publication | 
| Status & version: | Final | 
| Author(s): | Petar Mitankin, Atanas Ilchev | 
| Task responsible: | ONTO ( WP4 ) | 
| Other contributors: | Borislav Popov, Reneta Popova, and Gergana Petkova | 
This document presents the specification of the Knowledge Representation Infrastructure (KRI), which is based on pre-existing products. The KRI ensures a mature basis for storage and retrieval of structured knowledge and content. The document provides a description of the technology building blocks, overall architecture, standards used, query languages and inference rules.
| Attachment | Size | 
|---|---|
| D4.1_reviewed.pdf | 1.07 MB | 
The purpose of this document is to describe the knowledge representation infrastructure in MOLTO. It clarifies the expectations concerning the back-end infrastructure, serving the various MOLTO knowledge engineering tasks. It is based on the summary and analysis of the requirements gathered from the case studies, from grammar development, and from the partners. The scope of the deliverable covers presentation of the requirements, specification and description of the MOLTO Knowledge Representation Infrastructure (KRI).
Blending these expectations with previous experience in knowledge engineering, and adding a pinch of common sense, we come up with the specification of the MOLTO Knowledge Representation Infrastructure. This KRI is the data modeling and manipulation backbone of the entire project. It aims to serve the semi-automatic creation of abstract grammars from ontologies, the derivation of ontologies from grammars, and the extraction of instance-level knowledge from natural language (NL). In terms of retrieval, NL queries will be transformed into semantic queries, and the resulting knowledge will be expressed back in NL.
The KRI is based on pre-existing products, and ensures a mature basis for storage and retrieval of both knowledge and content, covering all modalities of the data. This document provides descriptions of the technology building blocks, overall architecture, standards used, query languages and inference rules.
The KRI should allow for:
The objective of this section is to introduce the KRI specification - the technology building blocks, overall architecture, standards used, query languages and inference rules. It describes how to modify the KRI, how to change the default underlying ontology and database, and how to adjust the inference rule-set of the OWLIM semantic repository.
A demo of the KRI is running on http://molto.ontotext.com.
The KRI is responsible for the storage and retrieval of content metadata, background knowledge, upper-level ontology, and other possible data, if available (users and communities), and exposes convenient methods for interoperability between the stored knowledge and the toolkit for rendering natural language to machine readable semantic models (ontologies) and vice versa.
The KRI includes:
The major component of the KRI is OWLIM, a semantic repository based on full materialization that provides support for a fraction of OWL. It is implemented in Java and packaged as a Storage and Inference Layer (SAIL) for the Sesame RDF database. A detailed description of its architecture and supported semantics follows.
Semantic Repositories are tools that combine the characteristics of database management systems (DBMS) and inference engines. Their major functionality is to support efficient storage, querying and management of structured data. One major difference to DBMS is that Semantic Repositories work with generic physical data models (e.g. graphs). This allows them to easily adopt updates and extensions in the schemata, i.e. in the structure of the data. Another difference is that Semantic Repositories use ontologies as semantic schemata, which allows them to automatically reason about the data.
The two principal strategies for rule-based reasoning are:
Consider a repository that performs total forward-chaining, i.e. it ensures that after each update to the KB the inferred closure is computed and made available for query evaluation or retrieval. This strategy is generally known as materialization.
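As a minimal sketch of materialization, the following toy example computes an inferred closure by applying rules until no new statements appear. The two RDFS-style rules, the `ex:` identifiers and the function names are illustrative assumptions; this is not how OWLIM or TRREE is implemented.

```python
# Toy illustration of total forward-chaining (materialization).
# The rule set and the data are illustrative assumptions, not OWLIM's rules.

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def apply_rules(triples):
    """Return the triples entailed by one pass over the current set."""
    inferred = set()
    for (s, p, o) in triples:
        for (s2, p2, o2) in triples:
            # (A subClassOf B), (B subClassOf C) => (A subClassOf C)
            if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                inferred.add((s, SUBCLASS, o2))
            # (x type A), (A subClassOf B) => (x type B)
            if p == TYPE and p2 == SUBCLASS and o == s2:
                inferred.add((s, TYPE, o2))
    return inferred


def materialize(explicit):
    """Compute the inferred closure by applying the rules up to a fixpoint."""
    closure = set(explicit)
    while True:
        new = apply_rules(closure) - closure
        if not new:
            return closure
        closure |= new


explicit = {
    ("ex:Bank", SUBCLASS, "ex:Company"),
    ("ex:Company", SUBCLASS, "ex:Organization"),
    ("ex:acme", TYPE, "ex:Bank"),
}
closure = materialize(explicit)
```

After materialization, a query for all instances of `ex:Organization` can be answered by a plain lookup in `closure`, with no reasoning at query time. The price is that the closure has to be recomputed or updated whenever the explicit statements change.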
Sesame is a framework for storing, querying and reasoning with RDF data. It is implemented in Java as an open source project by Aduna and includes various storage back-ends (memory, file, database), query languages, reasoners and client-server protocols.
There are essentially two ways to use Sesame:
Sesame supports the W3C's SPARQL query language and Aduna's own query language SeRQL. It also supports most popular RDF file formats and query result formats. Sesame offers a JDBC-like user API, streamlined system APIs and a RESTful HTTP interface. Various extensions are available or are being developed by third parties. From version 2.0 onwards, Sesame requires a Java 1.5 virtual machine. All APIs use Java 5 features such as typed collections and iterators. Sesame version 2.1 added support for storing RDF data in relational databases. The supported relational databases are MySQL, PostgreSQL, MS SQL Server, and Oracle. As of version 2.2, Sesame also includes support for Mulgara (a native RDF database).
A schematic representation of Sesame's architecture is shown in Figure 1 below. Following is a brief overview of the main components.

Figure 1 - Sesame Architecture
The Sesame framework is a loosely coupled set of components, where alternative implementations can be exchanged easily. Sesame comes with a variety of Storage And Inference Layer (SAIL) implementations that a user can select for the desired behavior (in-memory storage, file-system, relational database, etc.). OWLIM is a plug-in SAIL component for the Sesame framework.
Applications will normally communicate with Sesame through the Repository API. This provides a high enough level of abstraction so that the details of particular underlying components remain hidden, i.e. different components can be swapped in without requiring modification of the application.
The Repository API has several implementations, one of which uses HTTP to communicate with a remote repository that exposes the Repository API via HTTP.
The SAIL API is a set of Java interfaces that support the storage and retrieval of RDF statements. The main characteristics of the SAIL API are:
Other proposals for RDF APIs are currently under development. The most prominent of these are the Jena toolkit and the Redland Application Framework. The SAIL shares many characteristics with both approaches. An important difference, however, is that the SAIL API specifically deals with RDFS on the retrieval side: it offers methods for querying class and property subsumption, and domain and range restrictions. In contrast, both Jena and Redland focus exclusively on the RDF triple set, leaving interpretation of these triples to the user application. In SAIL, these RDFS inferencing tasks are handled internally. The main reason for this is that there is a strong relationship between the efficiency of inference and the actual storage model being used. Since any particular SAIL implementation has a complete understanding of the storage model (e.g. the database schema in the case of an RDBMS), this knowledge can be exploited to infer, for example, class subsumption more efficiently.
Another difference between SAIL and other RDF APIs is that SAIL is considerably more lightweight: only four basic interfaces are provided, offering basic storage and retrieval functionality and transaction support. This minimal set of interfaces promotes flexibility and looser coupling between components.
The current Sesame framework offers several implementations of the SAIL API. The most important of these is the SQL92SAIL, which is a generic implementation for SQL92, ISO99 [1]. This allows for connecting to any RDBMS without having to re-implement a lot of code. In the SQL92SAIL, only the definitions of the data-types (which are not part of the SQL92 standard) have to be changed when switching to a different database platform. The SQL92SAIL features an inferencing module for RDFS, based on the RDFS entailment rules as specified in the RDF Model Theory [2]. This inferencing module computes the closure of the data schema and asserts these implications as derived statements. For example, whenever a statement of the form (foo, rdfs:domain, bar) is encountered, the inferencing module asserts that (foo, rdf:type, rdf:Property) is an implied statement. The SQL92SAIL has been tested in use with several DBMSs, including PostgreSQL and MySQL [3].
OWLIM is a high-performance semantic repository, implemented in Java and packaged as a Storage and Inference Layer (SAIL) for the Sesame RDF database. OWLIM is based on Ontotext's Triple Reasoning and Rule Entailment Engine (TRREE), a native RDF rule-entailment engine. The supported semantics can be configured through the definition of rule-sets. The most expressive pre-defined rule-set combines unconstrained RDFS and OWL-Lite. Custom rule-sets allow tuning for optimal performance and expressivity. OWLIM supports RDFS, OWL DLP, OWL Horst, most of OWL Lite and OWL2 RL.
The two editions of OWLIM are SwiftOWLIM and BigOWLIM. In SwiftOWLIM, reasoning and query evaluation are performed in-memory, while, at the same time, a reliable persistence strategy assures data preservation, consistency, and integrity. BigOWLIM is the high-performance "enterprise" edition that scales to massive quantities of data. Typically, SwiftOWLIM can manage millions of explicit statements on desktop hardware, whereas BigOWLIM can manage billions of statements and multiple simultaneous user sessions.
The KRI in MOLTO uses BigOWLIM Version 3.3.
OWLIM version 3.X is packaged as a Storage and Inference Layer (SAIL) for Sesame version 2.x and makes extensive use of the features and infrastructure of Sesame, especially the RDF model, RDF parsers and query engines.
Inference is performed by the TRREE engine, where the explicit and inferred statements are stored in highly-optimized data structures that are kept in-memory for query evaluation and further inference. The inferred closure is updated through inference at the end of each transaction that modifies the repository.
Figure 2 - OWLIM Usage and Relations to Sesame and TRREE
OWLIM implements the Sesame SAIL interface so that it can be integrated with the rest of the Sesame framework, e.g. the query engines and the web UI. A user application can be designed to use OWLIM directly through the Sesame SAIL API or via the higher-level functional interfaces such as RDFDB. When an OWLIM repository is exposed using the Sesame HTTP Server, users can manage the repository through the Sesame Workbench Web application, or with other tools integrated with Sesame, e.g. ontology editors like Protege and TopBraid Composer.
OWLIM is implemented on top of the TRREE engine. TRREE stands for "Triple Reasoning and Rule Entailment Engine". The TRREE performs reasoning based on forward-chaining of entailment rules over RDF triple patterns with variables. TRREE's reasoning strategy is total materialization, although various optimizations are used as described in the following sections.
The semantics used is based on R-entailment [4], with the following differences:
Further details of the rule language can be found in the corresponding OWLIM user guides. The TRREE can be configured via the rule-sets parameter, which identifies a file containing the entailment rules, consistency checks and axiomatic triples. The implementation of TRREE relies on a compile stage, during which custom rule-sets are compiled into Java code that is further compiled and merged into the inference engine.
The edition of TRREE used in SwiftOWLIM is referred to as "SwiftTRREE" and performs reasoning and query evaluation in-memory. The edition of TRREE used in BigOWLIM is referred to as "BigTRREE" and utilizes data structures backed by the file-system. These data structures are organized to allow query optimizations that dramatically improve performance with large data-sets, e.g. with one of the standard tests BigOWLIM evaluates queries against 7 million statements three times faster than SwiftOWLIM, although it takes between two and three times more time to initially load the data.
OWLIM offers several pre-defined semantics by way of standard rule-sets (files), but can also be configured to use custom rule-sets with semantics better tuned to the particular domain. The required semantics can be specified through the rule-set for each specific repository instance. Applications that do not need the complexity of the most expressive supported semantics can choose a less complex rule-set, which results in faster inference.
The pre-defined rule-sets are layered such that each one extends the preceding one. The following list is ordered by increasing expressivity:
- empty: no reasoning, i.e. OWLIM operates as a plain RDF store;
- rdfs: supports standard RDFS semantics;
- owl-horst: OWL dialect close to OWL Horst; the differences are discussed below;
- owl-max: a combination of most of OWL-Lite with RDFS.
Furthermore, the OWL2 RL profile [5] is supported as follows:
- owl2-rl-conf: fully conformant except for D-Entailment, i.e. reasoning about data types;
- owl2-rl-reduced: as above, but with the troublesome prp-key rule removed (this rule causes serious scalability problems).
OWLIM has an internal rule compiler that can be used to configure the TRREE with a custom set of inference rules and axioms. The user may define a custom rule-set in a *.pie file (e.g. MySemantics.pie). The easiest way to do this is to start by modifying one of the .pie files that were used to build the pre-compiled rule-sets; all pre-defined .pie files are included in the distribution. The syntax of the .pie files is easy to follow.
OWL compliance: OWLIM supports several OWL-like dialects: OWL Horst [4] (owl-horst), OWL Max (owl-max), which covers most of OWL-Lite and RDFS, and OWL2 RL (owl2-rl-conf and owl2-rl-reduced).
With the owl-max rule-set, which is represented in Figure 3, OWLIM supports the following semantics:
The differences between OWL Horst [4], and the OWL dialects supported by OWLIM (owl-horst and owl-max) can be summarized as follows:
owl-max). These are listed in the OWLIM user guides.
Even though the concrete rules pre-defined in OWLIM differ from those defined in OWL Horst, the complexity and decidability results reported for R-entailment are relevant for TRREE and OWLIM. More precisely, the rules in the owl-horst rule-set do not introduce new B-Nodes, which means that R-entailment with respect to them takes polynomial time. In KR terms, this means that owl-horst inference within OWLIM is tractable.
Inference using owl-horst is of lower complexity than other formalisms that combine DL formalisms with rules. In addition, it places no constraints on meta-modeling.
The correctness of the support for OWL semantics (for those primitives that are supported) is checked against the normative Positive- and Negative-entailment OWL test cases [6]. These tests are provided in the OWLIM distribution and documented in the OWLIM user guides.

Figure 3 - Owl-max and Other OWL Dialects
The RDFDB stores all knowledge artifacts - such as ontologies, knowledge bases, and other data if available - in the RDF form. It is the MOLTO store and query service. The RDFDB has the following features:
Using the RDF model to represent all system knowledge allows easy interoperability between the stored data and the conceptual models and instances. Therefore, if the latter are enriched, extended or partially replaced, it is not necessary to change the implementation considerably. However, the requirements for tracking provenance, versioning and meta-data about the stored knowledge make plain RDF triples insufficient. Therefore, we use RDF quintuples and a repository model that supports them.
We will expose an open API, based on the ORDI SG data model that can be implemented for integration with alternative back-ends. The ORDI SG model is presented here in further detail, as it is the basis for the RDFDB API.
The atomic entity of the ORDI SG tripleset model is a quintuple. It consists of the RDF data type primitives (URI, blank node and literal), as follows:
{S, P, O, C, {TS1, ..., TSn}}, where:
The ORDI data model is realized as a directed labeled multi-graph. For backward compatibility with RDF, SPARQL and other RDF-based specifications, a new kind of information is introduced to the RDF graph. The tripleset model is a simple extension of the RDF graph enabling an easy way for adding meta-data to the statements.
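The quintuple can be pictured as an ordinary RDF statement that carries two extra pieces of information. The sketch below is only an illustration, assuming the five elements are subject, predicate, object, context (named graph) and a set of tripleset identifiers; the class, field and function names are hypothetical and are not part of the ORDI or RDFDB API.

```python
from dataclasses import dataclass, field
from typing import FrozenSet


@dataclass(frozen=True)
class Quintuple:
    """One statement in a tripleset-style model (field names are assumptions)."""
    subject: str
    predicate: str
    obj: str
    context: str  # named graph the statement belongs to
    triplesets: FrozenSet[str] = field(default_factory=frozenset)

    def as_triple(self):
        """Drop context and triplesets to recover a plain RDF triple."""
        return (self.subject, self.predicate, self.obj)


def in_tripleset(statements, tripleset):
    """Select the statements that are members of the given tripleset."""
    return [st for st in statements if tripleset in st.triplesets]


statements = [
    Quintuple("ex:acme", "rdf:type", "ex:Bank", "ex:graph1",
              frozenset({"ex:imported", "ex:v1"})),
    Quintuple("ex:acme", "rdfs:label", '"ACME"', "ex:graph1",
              frozenset({"ex:v1"})),
]
imported = in_tripleset(statements, "ex:imported")
```

Because a statement can belong to several triplesets at once, meta-data such as provenance or version can be attached to a group of statements without changing the statements themselves, and `as_triple` shows how the plain RDF view is preserved.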
The tripleset is a new element in the RDF statement, which was previously expressed as a triple or a quadruple. It describes the relation between the statement and an identifiable group of statements. The new term is introduced to distinguish the new model from several similar, already existing RDF extensions and the terms associated with them:
The following is true for the tripleset model:
Figure 4 below is a diagram of the relationship between the major elements in the ORDI data model.

Figure 4 - Entity-Relationship Diagram of the ORDI Data Model
The RDFDB service is already available in the distribution and one could use it either by generating a JMS client or through the OpenRDF Sesame API.
Using your shell, navigate to the bin directory of the deployed platform and invoke the following commands:
mkproxy -proxy com.ontotext.platform.qsg.ServiceClass $PATH_TO_EXAMPLES/target/classes/
mkproxy -client com.ontotext.platform.qsg.ServiceClass $PATH_TO_EXAMPLES/target/classes/
Both commands write the generated code to standard output. Take the code, clean it up as appropriate and add it to your project's source code. After you build the project, the client is located in your project's target directory. The client implements the interface of the service.
In order to use RDF-DB clients, clients for the following services must be generated:
com.ontotext.rdfdb.ordi.OrdiService
com.ontotext.rdfdb.ordi.RdfStoreService
com.ontotext.rdfdb.ordi.RdfQueryService
By using the OpenRDF Sesame API one could manage and query the default repository through the Sesame Workbench, or operate over it using the HTTP Repository. The OpenRDF Sesame is integrated in the RDFDB. For more details about how to use it, please see the OpenRDF Sesame User Guide.
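As an illustration of the HTTP route, the sketch below builds a SPARQL query request following the Sesame 2 HTTP protocol, in which a repository is addressed as `<server>/repositories/<id>` and the query is passed in the `query` parameter. The server URL and the repository id are assumptions and must be replaced with the values of the actual deployment; the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed values: adjust to the actual deployment.
SESAME_SERVER = "http://localhost:8080/openrdf-sesame"
REPOSITORY_ID = "molto"  # hypothetical repository id


def build_query_request(sparql, server=SESAME_SERVER, repo=REPOSITORY_ID):
    """Build a GET request for a SPARQL query against a Sesame repository."""
    params = urlencode({"query": sparql, "queryLn": "sparql"})
    url = f"{server}/repositories/{repo}?{params}"
    # Ask for SPARQL results in JSON rather than the default XML format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_query_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5")

# To execute the query against a running server:
#     with urlopen(request) as response:
#         print(response.read().decode("utf-8"))
```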
This section describes the conceptual models, ontologies and knowledge bases, used in the MOLTO KRI as a context background in the RDFDB component.
Most applications of the KRI require extending the conceptual models with domain ontologies and the underlying knowledge base with domain specific entities and facts.
The PROTON ontology provides coverage of the most general concepts, with a focus on named entities (people, locations, organizations) and concrete domains (numbers, dates, etc.).
The design principles can be summarized as follows:
The ontology is originally encoded in a fragment of OWL Lite and split into four modules: System, Top, Upper, and KM (Knowledge Management), shown in Figure 5 below.

Figure 5 - PROTON Ontology
The System module consists of a few meta-level primitives (5 classes and 5 properties). It introduces the notion of 'entity', which can have aliases. The primitives at this level are usually the few things that have to be hard-coded in ontology-based applications. Within this document and in general, the System module of PROTON is referred to via the "protons:" prefix.
The Top module is the highest, most general, conceptual level, consisting of about 20 classes. These ensure a good balance of utility, domain independence, and ease of understanding and usage. The top layer is usually the best level to establish alignment to other ontologies and schemata. Within this document and in general, the Top module of PROTON is referred to via the "protont:" prefix.
The Upper module has over 200 general classes of entities, which often appear in multiple domains (e.g. various sorts of organizations, a comprehensive range of locations, etc.). Within this document and in general, the Upper module of PROTON is referred to via the "protonu:" prefix.
The KM module has 38 classes of slightly specialized entities that are specific for typical Knowledge Management tasks and applications. Within this document and in general, the PROTON KM module is referred to via the "protonkm:" prefix.
The default KB contains numerous instances of PROTON Upper Module classes like: Public Company, Company, Bank, IndustrySector, HomePage, etc. It covers the most popular entities in the world such as:
The named entities (NEs) are represented with their semantic descriptions via:
The last build of the KRI KB contained 29104 named entities: 6006 persons, 8259 organizations, 12219 locations and 2620 job titles.
Although the KRI presented in this document comes directly from preexisting products that have not been developed specially for the needs of the MOLTO project, it provides the basic semantic functionality required by some MOLTO-specific applications. As an illustration, we present the following screen-shots of the KRI web UI. It allows the user to enter a natural language query, as shown in Figure 6. The natural language query is converted into a SPARQL query and the SPARQL query is executed by OWLIM through the RDFDB layer.
The conversion of the natural language query into a SPARQL query is out of the scope of this document, but will be described in detail in Deliverable 4.3 Grammar-Ontology Interoperability.

Figure 6 - Natural Language Query
The web UI shows the results of the executed SPARQL query, as shown in Figure 7 below.

Figure 7 - Results from the SPARQL Query
The user can see and edit the SPARQL query or enter a new SPARQL query. The KRI web UI also gives the user the possibility to browse the underlying ontology and database.
A virtual image of the KRI is available on sftp://ftp.ontotext.com. Table 1 below presents some of its basic characteristics.
| Property | Value |
|---|---|
| vmdk files | VM/MOLTO.tar |
| operating system | Ubuntu 10.04.1 |
| user name | onto |
| password | guest |
| rdfdb folder | /home/onto/rdfdb |
| knowledge base folder | /home/onto/rdfdb |
| owlim configuration file | /home/onto/rdfdb/config/rdfdb.ttl |
Table 1 - KRI as a Virtual Image
The user starts the RDFDB (and respectively OWLIM) by /home/onto/rdfdb/bin/rdfdb start.sh and stops it by /home/onto/rdfdb/bin/rdfdb stop.sh. To change the knowledge base one has to:
/home/onto/rdfdb/bin/populated
The reasoning rule-set used by OWLIM is set in the owlim configuration file as the value of the owlim:ruleset parameter. The default rule-set is owl-horst, but it can be changed to owl-max, for example. If a custom rule-set is needed, it has to be specified in a *.pie file, which is part of the knowledge base. The KRI internally uses a lighttpd server. It is started by cd /home/onto/gf-src-3.1.6/src/server followed by lighttpd -f lighttpd.conf.
The KRI web UI is accessible via a Tomcat server on port 8080. The Tomcat server is started by sudo /etc/init.d/tomcat6 start and stopped by sudo /etc/init.d/tomcat6 stop.
In this deliverable we have presented the requirements for the MOLTO KRI and defined its specification, starting with the architecture of the major KRI component, the OWLIM semantic repository. We have continued with the presentation of the RDFDB, which provides remote access to the ORDI-SG layer over OWLIM via JMS, with the main emphasis on its use. We have also described the default KRI data sources.
The Knowledge Representation Infrastructure will enable MOLTO's baseline and use case driven knowledge modeling with the necessary expressivity of metadata-about-metadata descriptions for provenance of the diverse sources of structured knowledge (upper-level, domain specific and derived (from grammars) ontologies; thesauri; domain knowledge bases; content and metadata).
[1] ISO. Information Technology - Database Language SQL. Standard No. ISO/IEC 9075:1999, International Organization for Standardization (ISO), 1999. (Available from American National Standards Institute, New York, NY 10036, (212) 642-4900.)
[2] HAYES, P. RDF Model Theory. Working draft, World Wide Web Consortium, September 2001. See http://www.w3.org/TR/rdf-mt/.
[3] BROEKSTRA, J.; KAMPMAN, A.; VAN HARMELEN, F. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. International Semantic Web Conference, Sardinia, Italy, 2002.
[4] TER HORST, H. J. Combining RDF and Part of OWL with Rules: Semantics, Decidability, Complexity. In Proc. of ISWC 2005, Galway, Ireland, Nov. 6-10, 2005. LNCS 3729, pp. 668-684.
[5] MOTIK, B.; GRAU, B. C.; HORROCKS, I.; WU, Z.; FOKOUE, A.; LUTZ, K. OWL 2 Web Ontology Language, 2009. See http://www.w3.org/TR/owl2-overview/.
[6] CARROLL, J. J.; DE ROO, J. OWL Web Ontology Language: Test Cases. W3C Recommendation, 10 Feb. 2004. See http://www.w3.org/TR/owl-test/.
API Application Programming Interface
DBMS Database Management System
JMS Java Message Service
KRI Knowledge Representation Infrastructure
RDF Resource Description Framework
SAIL Storage and Inference Layer
SPARQL SPARQL Protocol and RDF Query Language
TRREE Triple Reasoning and Rule Entailment Engine
UI User interface
Draft -
The following three tasks can be based on this interoperability:
We could follow the scheme:
Ramona Enache has applied a similar approach for SUMO using the patterns that go with this ontology.
It is possible to apply the pattern approach to the translation of natural language queries into SPARQL, but the resulting natural language will be very restricted. I developed a demo in Java for such a restricted natural language over PROTON. The small ontology subgraphs encode information about persons, locations, organizations and job titles. The table below shows some examples from the demo. The left column contains the natural language queries and the right column their SPARQL translations.
| Natural language | SPARQL |
|---|---|
| Give me all persons associated with the organization X. | select distinct ?person where { ?s rdf:type prt:Organization . ?s rdfs:label "X" . ?j prt:withinOrganization ?s . ?person prt:hasPosition ?j . } |
| Give me all persons and related job titles associated with the organization X. | select distinct ?person ?job_title where { ?s rdf:type prt:Organization . ?s rdfs:label "X" . ?j prt:withinOrganization ?s . ?j prt:holder ?person . ?j pru:hasTitle ?job_title . } |
| Give me all organizations associated with the location North America. | select distinct ?organization where { ?s rdf:type prt:Location . ?s rdfs:label "North America" . ?organization prt:locatedIn ?s . ?organization rdf:type prt:Organization . } |
| Give me all organizations associated with the person X. | select distinct ?organization where { ?s rdf:type prt:Person . ?s rdfs:label "X" . ?j prt:holder ?s . ?j prt:withinOrganization ?organization . } |
| Give me all job titles associated with the person X. | select distinct ?job_title where { ?s rdf:type prt:Person . ?s rdfs:label "X" . ?j prt:holder ?s . ?j pru:hasTitle ?job_title . } |
| Give me all job titles and related organizations associated with the person X. | select distinct ?job_title ?organization where { ?s rdf:type prt:Person . ?s rdfs:label "X" . ?j prt:holder ?s . ?j pru:hasTitle ?job_title . ?j prt:withinOrganization ?organization . } |
| Give me all organizations and related job titles associated with the person X | select distinct ?organization ?job_title where { ?s rdf:type prt:Person . ?s rdfs:label "X" . ?j prt:holder ?s . ?j pru:hasTitle ?job_title . ?j prt:withinOrganization ?organization . } |
The demo I developed does not use GF, because the available GF resource API grammars are too restrictive and cannot parse the desired sentences. Of course, a more robust solution is needed, and for that we need suitable GF grammars. If we have them, it is possible to handle the correspondence between the ontology subgraphs and the trees that result from parsing the input queries with GF.
The same holds for information extraction. If we have suitable GF grammars, it is possible to handle the correspondence between the ontology subgraphs and the trees that result from parsing the input queries with GF.
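The pattern approach to translating restricted natural language queries into SPARQL can be sketched as follows. This is not the Java demo itself, whose code is not shown here; it is a minimal illustration in which each fixed English template is paired with a SPARQL template. The two query templates are copied from the examples given for the demo, while the regular expressions and the function name `to_sparql` are assumptions.

```python
import re

# Each entry pairs a fixed English template with a SPARQL template.
# The regexes and names are illustrative assumptions.
PATTERNS = [
    (
        re.compile(r"^Give me all persons associated with the organization (.+?)\.?$"),
        'select distinct ?person where {{ ?s rdf:type prt:Organization . '
        '?s rdfs:label "{0}" . ?j prt:withinOrganization ?s . '
        '?person prt:hasPosition ?j . }}',
    ),
    (
        re.compile(r"^Give me all job titles associated with the person (.+?)\.?$"),
        'select distinct ?job_title where {{ ?s rdf:type prt:Person . '
        '?s rdfs:label "{0}" . ?j prt:holder ?s . '
        '?j pru:hasTitle ?job_title . }}',
    ),
]


def to_sparql(question):
    """Return the SPARQL for a supported question, or None if no pattern matches."""
    for regex, template in PATTERNS:
        match = regex.match(question.strip())
        if match:
            # Escape double quotes so the label stays a valid SPARQL literal.
            label = match.group(1).replace('"', '\\"')
            return template.format(label)
    return None


query = to_sparql("Give me all job titles associated with the person X.")
unsupported = to_sparql("Who works for X?")
```

The second call returns `None`, which shows the restriction in practice: any phrasing that is not literally covered by a template is rejected, and this is the limitation that suitable GF grammars are expected to lift.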
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D4.3A Appendix to D4.3 Grammar ontology interoperability | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | April 2013 | 
| Actual date of delivery: | |
| Type: | Regular Publication | 
| Status & version: | Draft | 
| Author(s): | Maria Mateva, Laura Tolosi, Ramona Enache, Aarne Ranta, Inari Listenmaa | 
| Task responsible: | Ontotext | 
| Other contributors: | 
During the review on March 20, 2012, an appendix was requested to document the heuristics, namely the rules expressing the interoperability, underlying the automated tools. How to retrieve the software tools, their limitations and their usage are documented in an Appendix to D4.3. In 2012 we provided a renewed version of D4.3.
In 2013, Ontotext decided to deliver an annex to D4.3 that extends it and summarizes our overall experience, and that of our consortium partners, with grammar-ontology interoperability. The document will also address the reviewers' remarks and recommendations from the M24 MOLTO review report, for example on possible steps towards integrating TermFactory and the KRI, and on the degree of automation achieved in this field within MOLTO.
Next, this annex will provide a brief summary of the techniques we used to build our MOLTO prototypes, giving the required technical details. It will also present the Grammar Ontology helper that was built as part of the GF Eclipse Plugin in the scope of WP2. Finally, we will give a short summary of current NL-to-ontology approaches.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D6.1. Simple Drill Grammar Library | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M18 | 
| Actual date of delivery: | September 2011 | 
| Type: | Prototype | 
| Status & version: | Final (evolving document) | 
| Author(s): | J. Saludes, et al. | 
| Task responsible: | UPC | 
| Other contributors: | 
Abstract
The present paper is the cover of deliverable D6.1 as of WP6. It gives installation instructions for the Mathematical Grammar Library and a short manual.
The living end of the library is publicly available using subversion as:
     svn co svn://molto-project.eu/mgl
A stable version can be found at:
    svn co svn://molto-project.eu/tags/D6.1
The mgl library consists of the following files and directories:
At the same time, the library can be organized in three layers of increasing complexity:
Inside the mgl directory:
    make
will compile the top (Operations) layer and produce Test.pgf. To compile only the OpenMath layer:
    make om
An online version of the mathbar demo is http://www.grammaticalframework.org/demos/minibar/mathbar.html.
The library compiles for the following EU languages: Bulgarian, Catalan, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, Swedish.
Regression testing of the OpenMath productions is possible through a treebank containing about 140 productions from this layer. At the present moment it contains linearizations for English, German, Polish and Spanish. At the time of writing this report, the entries of these languages (except for Polish) had been corrected by fluent speakers of the respective language. To allow for discrepancy, earlier corrections are also stored in the treebank, tagged with author and revision number.
The structure of the treebank is described in the evaluating document.
To test the library, make sure you have an up-to-date OpenMath.pgf. You can recreate it by issuing:
     make om
and then, in the test directory:
     ./tbm table
That will make a table indexed by treebank entry and testing language (English, German and Spanish), showing the number of differences between the actual linearization and the corrected one.
Each time a new revision is committed to the repository, the output of this command is saved into test/table. Comparing different revisions of this file makes it possible to measure the progress of the bug-fixing effort.
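To illustrate the kind of count such a table can hold, here is a minimal Python sketch that compares a current linearization with its corrected version token by token. How tbm actually measures differences is not stated in this document; the function name count_differences and the token-level comparison are assumptions made only for this example. The sample sentences are taken from the Spanish corrections listed further down this page.

```python
import difflib

def count_differences(actual, corrected):
    """Count the tokens that must be replaced, inserted or deleted
    to turn the actual linearization into the corrected one."""
    a, b = actual.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced run counts as many tokens as its longer side.
            diffs += max(i2 - i1, j2 - j1)
    return diffs

actual = "hay x en C tal que y divida a x"
corrected = "existe x en C tal que y divide a x"
print(count_differences(actual, corrected))
```

A count of zero for an entry would mean that the current linearization already agrees with the stored correction.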
To review the current defects for language L:
     ./tbm review -lL
It will walk all the defects showing the differences, the stored corrected concretes, the abstract and the current linearization. For a list of available sub-commands press
.
Krasimir Angelov, Olga Caprotti, Ramona Enache, Thomas Hallgren, Alba Hierro, Inari Listenmaa, Aarne Ranta, Ares Ribo, Adam Slaski, Shafqat Virk and Sebastian Xambó.
Imperative mode forces "!" at the end?
Not what we want for exercises.
        
      
Test> l DoComputeF DefineV (Var2Fun f)
define f !
        
      
We want to express:
"x gleich y"
or
"x hoch y"
        
      
mkAdA : Str → AdA
        
      
It doesn't exist
Example
for all z , r , it isn't true that r and if p , then , it isn't true that r
für alle z, r , ist es nicht wahr daß r und wenn p dann ist es nicht wahr daß r
        
      
I think it would be better to write "gilt", "gilt nicht" (English "holds", "does not hold") instead of "es ist nicht wahr", "es ist wahr", "it is true", "it isn't true":
        
      
for all z , r , r does not hold and if p , then , r does not hold
für alle z, r , r gilt nicht und wenn p dann r gilt nicht
        
      
exist (BaseVarNum x) (Var2Set C) (mkProp (divides (Var2Num y) (Var2Num x)))
map y (factorial (Var2Num x)) (suchthat (Var2Set A) x r)
set (BaseValNum (Var2Num y) (Var2Num z))
l exist (BaseVarNum x) (Var2Set C) (mkProp (divides (Var2Num y) (Var2Num x)))
hay x en C tal que y divida a x
        
      
hay→existe
divida → divide
el conjunt amb element únic el cub de pi
        
      
 
 
 
         DefNPwithbaseElem : CN → MathObj → MathObj =
        
      
\cn,o → DefSgNP (mkCN cn (prepAdv with_Prep (mkNP (mkCN (mkCN (mkA "únic") element_CN) o)))) ;
        
      
        Problem:
        
      
I cannot write "d'element únic" because if I replace with_Prep by a possess_Prep or a part_Prep (of), the preposition is omitted! Why?
        
      
cartesian_product (BaseValSet (Var2Set A) (Var2Set B))
imaginary (Var2Num y)
lcm (BaseValNum (Var2Num y) (Var2Num z))
root2 (real (Var2Num x))
Problem: and_Conj in Spanish does not cover the variant "e", as in "x e y". It should be
and_Conj = {s1 = [] ; s2 = etConj.s ; n = Pl} ;
For the moment, we have created a new
myAnd_Conj = and_Conj ;
in MathI.gf and redefined it as
and_Conj = {s1 = [] ; s2 = etConj.s ; n = Pl} ;
in MathSpa.gf.
This should be fixed in StructuralSpa.gf.
        
      
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D6.2. Prototype of commanding CAS | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M23 | 
| Actual date of delivery: | February 2012 | 
| Type: | Prototype | 
| Status & version: | Final (evolving document) | 
| Author(s): | Jordi Saludes, Ares Ribó | 
| Task responsible: | UPC | 
| Other contributors: | 
The present paper is the cover of deliverable D6.2 as of WP6. It describes the executables included in this deliverable and gives installation instructions for them.
The following table describes what is needed in order to use the executables. In all cases you'll need GF and Sage.
gfsage is the simple dialog executable, shell denotes the component that allows using natural language inside Sage, and shell-complete is the same with auto-completion of commands.
| Component | O. S. | Extra requirement | Spoken output | autocompletion | 
|---|---|---|---|---|
| gfsage | Mac OS X, Linux Ubuntu | ghc, curl | OSX1, Linux | yes | 
| shell | all2 | — | no | |
| shell-complete | Linux | gf python bindings | yes | 
Depending on your permission settings, you might have to run some of these commands with sudo. For all of them, you first have to check out the Mathematics Grammar Library from:
svn co svn://molto-project.eu/mgl
Be warned that development will continue for some time in this HEAD branch. For a frozen version of it, check out from:
svn co svn://molto-project.eu/tags/D6.2
You'll find detailed instructions for installing each executable in the following pages. For the moment, note that some files in your Sage installation must be modified for these executables to run. Usually these changes have to be made just once. The first time, the installation procedure will warn you about it:
Please add 'sage.nlgf' to /usr/local/sage-4.7.2/devel/sage/setup.py
Since ours is not a regular Sage package, we must add a package reference manually by editing the setup.py file named above (notice that yours may have a different path). This is a Python file that Sage reads to configure the system using the setup command. Please find the call to setup in the file; mine is at line 882 and looks like this:
code = setup(name = 'sage',
The setup command lists several items. Please locate packages (which is a Python list) and add 'sage.nlgf' (quotes included) among the other packages listed there. Python is picky about indentation and doesn't like spaces and tabs mixed, so please check that you're using the same spacing as the rest of the file.
The installation has been tested on Sage 4.7.1, 4.7.2 and 4.8.
The goal of this work is to develop a command-line tool able to take commands in natural language and have them executed by Sage, a collection of Computer Algebra packages presented in a uniform way. We present here instructions on how to build the interface and examples of its intended use.
You'll need:
cabal, as in the Haskell Platform
Sage (the sage command; it is assumed to be in your PATH)
You can get the source version of GF by:
cabal install gf
We can install the other dependencies too by:
cabal install json curl
Checkout the mathematics grammar library from:
 svn co svn://molto-project.eu/mgl
This is the active branch. For the fixed one use:
svn co svn://molto-project.eu/tags/D6.2
Go into the mgl/sage directory (D6.2/sage if you're using the fixed branch) and make it:
cd mgl/sage
make
The first time you run make, it will fail, asking you to make modifications in the Sage installation. Please refer to the installation page.
Now try to build gfsage again. All these build operations will ask Sage to "rebuild" itself. Be warned that the first rebuild takes some time:
make
The system has been tested on Mac (OS X 10.7) and Linux (Ubuntu).
Run the tool as:
./gfsage english
giving the input language as argument. It will take some seconds to start the server. After that it will reply with some server information and will show the prompt:
    sage>
You can then enter your query:
    sage> compute the product of the octal number 12 and the binary number 100.
    (3) 40
    answer: it is 40 .
To show that a CAS is actually behind the scene, let's try something symbolic:
    sage> compute the greatest common divisor of x and the product of x and y.
    (4) x
    answer: it is x .
and compare it with:
    sage> compute the greatest common divisor of x and the sum of x and y.
    (5) 1
    answer: it is 1 .
Sage does the right thing in both cases, x and y being unbound numeric variables.
    sage> compute the second iterated derivative of the cosine at pi.
    (6) 1
    answer: it is 1 .
Exit the session by pressing Ctrl+D. This way the server exits cleanly.
Just another example in a different language:
    ./gfsage spanish
    Login into localhost at port 9000
    Session ID is c1ef10dfd49e4fdb3214fa6d3a3b9c92
    waiting... EmptyBlock 2
    finished handshake. Session is c1ef10dfd49e4fdb3214fa6d3a3b9c92
    sage> calcula la parte imaginaria  de la derivada de la exponencial en pi.
    (4) 0
    answer: es 0 .
More recent examples involving integer literals and integration:
    sage> compute the sum of 1, 2, 3, 4 and 5.
    (3) 15
    answer: it is 15 .
   
    sage> compute the summation of x when x ranges from 1 to 100.
    (4) 5050
    answer: it is 5050 .
    sage> compute the integral of the cosine from 0 to the quotient of pi and 2.
    waiting... (5) 1
    answer: it is 1 .
    sage> compute the integral of the function mapping x to the square root of x from 1 to 2.
    (6) 4/3*sqrt(2) - 2/3
    answer: it is 4 over 3 times the square root of 2 minus the quotient of 2 and 3 .
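Some of the answers in the sessions above can be cross-checked without Sage. The following sketch uses only the Python standard library; the helper midpoint_integral is a hypothetical name introduced here, and the numerical integration is an approximation used as a sanity check, not the way Sage computes the result.

```python
import math

# Octal 12 is 10 and binary 100 is 4, so the product should be 40.
product = int("12", 8) * int("100", 2)

# Summation of x when x ranges from 1 to 100.
summation = sum(range(1, 101))

def midpoint_integral(f, a, b, n=100000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

# Closed form reported by Sage for the integral of sqrt(x) from 1 to 2.
closed_form = 4 / 3 * math.sqrt(2) - 2 / 3
numeric = midpoint_integral(math.sqrt, 1.0, 2.0)

print(product, summation)
print(abs(numeric - closed_form) < 1e-8)
```

The symbolic examples (greatest common divisors of expressions in x and y) need a computer algebra system and are not covered by this check.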
Use English:
gfsage      
Use LANGUAGE:
gfsage LANGUAGE
General invocation:
gfsage [OPTIONS]
where OPTIONS are:
| short form | long form | description |
|---|---|---|
| -h | --help | Print usage page |
| -i LANGUAGE | --input-lang=LANGUAGE | Make queries in LANGUAGE |
| -o LANGUAGE | --output-lang=LANGUAGE | Give answers in LANGUAGE |
| -V LEVEL | --verbose=LEVEL | Set the verbosity LEVEL |
| -t FILE | --test=FILE | Test samples in FILE |
| -v[VOICE] | --voice[=VOICE] | Use voice output. To list voices use ? as VOICE. |
| -F | --with-feedback | Restate the query when answering. |
This condition is signaled by the message:
gfsage: Connecting CurlCouldntConnect 
I used a Linux virtual machine to reproduce this condition and found that it sometimes takes about 10 retries for the server to start responding, but then it keeps running fine for hours. My guess is that it is related to some timeout limit in the server. Killing the orphaned python processes from the previous retries might help too (killall python).
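If you script around this problem, a bounded retry loop is one way to handle it. The sketch below is not part of gfsage: connect_with_retries is a hypothetical wrapper, and flaky_connect is a stand-in for whatever step actually opens the connection to the server.

```python
import time

def connect_with_retries(connect, attempts=10, delay=1.0):
    """Call connect() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise          # give up after the last attempt
            time.sleep(delay)  # wait before trying again

# Stand-in for the real connection step: it fails three times and then
# succeeds, mimicking a server that is slow to come up.
calls = {"n": 0}

def flaky_connect():
    calls["n"] += 1
    if calls["n"] <= 3:
        raise ConnectionError("could not connect")
    return "session"

print(connect_with_retries(flaky_connect, attempts=10, delay=0.0))
```

The limit of 10 attempts mirrors the number of retries observed above; a real script would probably also want to clean up orphaned processes between attempts.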
realsets.py is a Sage module to support subsets of the real field consisting of intervals and isolated points. It was developed to demonstrate the set operations of the MGL Set1 module.
It is based on previous work from Interval1Sage, adding integration on real sets and real intervals.
An object in this module consists of a list of disjoint open intervals plus a list of isolated points (not belonging to these intervals). Notice that Infinite is acceptable as interval bound. Therefore, one can define:
Represent a set that can be the union of some intervals and isolated points. It consists of:
A closed interval:
? RealSet.cc_interval(1,4); 
[ 1 :: 4 ]
A single point:
? RealSet.singleton(1)
{1}
Unions of intervals are supported and can be nested:
? I = RealSet.co_interval(1, 4)
? J = RealSet.co_interval(4, 5)
? M = RealSet.oc_interval(7, 8)
? I.union(J).union(M)
[ 1 :: 5 [ ∪ ] 7 :: 8 ]
? I.intersection(J)
()
? I.intersection(RealSet.cc_interval(2,5))
[ 2 :: 4 [
Is a point in the set?
? I = RealSet.oo_interval(1, 3)
? 2 in I
True
? 3 in I
False
Is a set discrete (i.e., does it contain no intervals)?
? RealSet.oo_interval(0,1).discrete
False
? RealSet(points=(1,2,3)).discrete
True
The size of a discrete set is the number of points:
? RealSet(points=range(5)).size
5
? RealSet.oo_interval(0,3).size
+Infinity
Is A a subset of B?
? A = RealSet.oo_interval(0,1)
? B = RealSet.cc_interval(0,1)
? RealSet().subset(A)
True
? B.subset(A)
False
? A.subset(B)
True
? A.subset(A)
True
? A.subset(A, proper=True)
False
Return the infimum (greatest lower bound)
? RealSet(points=range(3)).infimum()
0
? RealSet.oo_interval(1,3).infimum()
1
The opposite of a set: –A = {-x | x ∈ A}
? -RealSet.oo_interval(1,2)
] -2 :: -1 [
Return the supremum (least upper bound)
? RealSet(points=range(3)).supremum()
2
? RealSet.oo_interval(1,3).supremum()
3
The complement of a set:
? RealSet.oo_interval(2,3).complement()
] -Infinity :: 2 ] ∪ [ 3 :: +Infinity [
? RealSet(points=range(3)).complement()
] 0 :: 1 [ ∪ ] 1 :: 2 [ ∪ ] 2 :: +Infinity [ ∪ ] -Infinity :: 0 [
The set difference of A and B: {x ∈ A | x ∉ B}
? I = RealSet.oo_interval(2,+Infinity)
? J = RealSet.oo_interval(-Infinity, 5)
? I.setdiff(J)
[ 5 :: +Infinity [
? J.setdiff(I)
] -Infinity :: 2 ]
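The representation described at the top of this page (a list of disjoint open intervals plus a list of isolated points) can be sketched in plain Python as follows. This is not the code of realsets.py: the class SimpleRealSet is a hypothetical illustration that borrows the names oo_interval, cc_interval, discrete and size from the examples above and implements only membership and size.

```python
import math

class SimpleRealSet:
    """Disjoint open intervals plus isolated points (illustrative only)."""

    def __init__(self, intervals=(), points=()):
        self.intervals = list(intervals)  # pairs (a, b) meaning ]a, b[
        self.points = set(points)         # isolated points

    @classmethod
    def oo_interval(cls, a, b):
        return cls(intervals=[(a, b)])

    @classmethod
    def cc_interval(cls, a, b):
        # A closed interval is the open one plus its two endpoints.
        return cls(intervals=[(a, b)], points=[a, b])

    def __contains__(self, x):
        return x in self.points or any(a < x < b for a, b in self.intervals)

    @property
    def discrete(self):
        return not self.intervals

    @property
    def size(self):
        return len(self.points) if self.discrete else math.inf

I = SimpleRealSet.oo_interval(1, 3)
print(2 in I, 3 in I)
print(SimpleRealSet(points=range(5)).size)
```

Storing a closed endpoint as an isolated point keeps every interval open, which is why the half-open and closed constructors can share one representation.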
gfsage is a prototype to demonstrate two-way natural language communication between a user and a Sage  system.
When you invoke the gfsage command interactively:
The details of these components are given below.
A GF module acts as a post office translating messages between the different parties (nodes) composing a dialog. This section is more a description of a proposed design strategy for a generic postoffice interface based on GF. The actual code implements ideas of this design, but, for instance, it contains no edges or nodes as explicit entities.
gfsage deals with just 2 agents:
In case the input language differs from the output language, we may consider a third node (the output user).
There is a unique pgf module containing all the GF information the dialog system needs to work: Commands.pgf. Each node has a language (a GF concrete module) assigned: the user uses a natural language (e.g., CommandsEng for English).
A node reacts to received messages by sending a reply. The chain of messages between two nodes is called a dialog. An active node, such as the user, can start a dialog by sending a message. A passive node, like the Sage system here, just replies to the received messages.
A node can receive:
A no_parse message from the postoffice, telling that a previous outgoing message cannot be parsed.
An is_ambiguous message from the postoffice related to a previous message sent by the node, specifying that it was ambiguous and carrying additional info for the node to decide among the possible meanings. To respond to this, the node must send a disambiguate message to the postoffice (see below).
A node can send:
A disambiguate message sent in response to an ambiguous message. In this message the node chooses one of the options or aborts the transaction.
A regular message between two given nodes corresponds to a fixed GF category. In the case of gfsage it is Command for messages traveling from User to Sage and Answer for messages going the other way.
A regular message from node N1 to node N2 goes through the following steps:
no_parse message is sent back to the sending node. If it contains more than one entry, an is_ambiguous message is sent. In the previous cases, the process stops here; only when the computed set contains just one entry is it pushed downstream to the node N2.
For Sage to work alongside GF, we need an HTTP server listening to Sage commands and some scripts to set up the environment and respond to the type of queries that can be expressed in the Mathematics Grammar Library (MGL).
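The routing rule for regular messages described above (no parse gives no_parse, several entries give is_ambiguous, a single entry is forwarded) can be sketched in a few lines of Python. This is an illustration of the design, not the gfsage code: route and toy_parse are hypothetical names, and the tree strings are made up rather than real MGL abstract trees.

```python
def route(message, parse):
    """Decide what the postoffice does with a message from N1 to N2.

    `parse` is a hypothetical function returning the list of abstract
    trees found for the message in the sender's language.
    """
    trees = parse(message)
    if not trees:
        return ("no_parse", message)    # goes back to the sender
    if len(trees) > 1:
        return ("is_ambiguous", trees)  # the sender must disambiguate
    return ("forward", trees[0])        # pushed downstream to N2

# A toy parser with made-up trees, only to exercise the three cases.
toy_grammar = {
    "compute the factorial of 5": ["Compute (factorial 5)"],
    "compute it": ["Compute It1", "Compute It2"],
}

def toy_parse(message):
    return toy_grammar.get(message, [])

for text in ("compute the factorial of 5", "compute it", "hello"):
    print(route(text, toy_parse)[0])
```

In the first two outcomes the process stops at the postoffice; only the forward case reaches the receiving node.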
A Sage process is started in the background by the start-nb.py script in -python mode. This script starts a Sage notebook, as described in Simple server API, listening on port 9000 and ready for requests in HTTP format. It also installs a handler for cleanly disposing of the notebook object whenever the parent process terminates.
The parent process then sends an initial request to load some functions and variables that we'll need in the dialog system, defined in prelude.sage, and goes into the main evaluation loop.
realsets.py
Set1 module of the MGL. (See the page about it)
prelude.sage
OS X has voice output built in, usable from the shell by way of the say command. You can use several voices in English or download more for other languages.
 
  
 
mgl/sage as described previously.
     gfsage Use english
     gfsage LANGUAGE Use this language   
     gfsage [OPTIONS] where OPTIONS are:
     -h --help print this page
     -i INPUT --input-lang=INPUT Make queries in LANGUAGE
     -o OUTPUT --output-lang=OUTPUT Give answers in LANGUAGE
     -v[VOICE] --voice[=VOICE] use voice output. To list voices use ? as VOICE.
     -F --with-feedback Restate the query when answering.
The options relevant here are -v and -F. Use the first to select voice output. With no argument it will pick the first available voice for the OUTPUT language selected:
./gfsage -i english -v
Voiced by Agnes
... It will use Agnes as the English voice. Notice that if you do not give a -o option, the OUTPUT language is assumed to be the same as the INPUT language.
To list the available voices use:
./gfsage -i english -v?
Agnes, Albert, Alex, Bahh, Bells, Boing, Bruce, Bubbles, Cellos, Daniel, Deranged, Fred, Hysterical, Junior, Kathy, Princess, Ralph, Trinoids, Vicki, Victoria, Whisper, Zarvox
It will list the English voices. To use a specific voice write:
./gfsage -i german -vYannick
Voiced by Yannick
The option -F is to make the system paraphrase your query on answering. First, get a simple answer:
./gfsage -i english
Login into localhost at port 9000
Session ID is df7ad7c769f2faac68b6bb9489bb97e2
waiting... EmptyBlock 3
sage> compute the factorial of 5.
(4) 120
answer: it is 120 .
... and now the same with paraphrasing:
./gfsage -i english -F
Login into localhost at port 9000
Session ID is 88549994a28940fe0657eb9e506a5e84
waiting... EmptyBlock 3
sage> compute the factorial of 5.
(4) 120
answer: the factorial of 5 is 120 .
So, to experience voice output in its full glory you have to use both -v and -F.
Following a suggestion from Aarne, I found a Google service for speech input, but the experiments are not encouraging:
I recorded "Compute this" into an mp4 file using QuickTime Player on the Mac.
Converted it to flac using:
sox compute.m4a compute.flac rate 16k
And submitted it to the service by:
curl -H "Content-Type:audio/x-flac; rate=16000" "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US" -F "myfile=@compute.flac"
But got:
 `{"status":0,"id":"56bdb158dd66b25fc2e221364004e620-1","hypotheses":[{"utterance":"coffee lol","confidence":0.46219563}]}`
Other examples:
"I like pickles" ⇒ "I like turtles"
"The determinant of x" ⇒ "new york" (with confidence 0.88!)
"Compute this" ⇒ "coffee lol"
Of course I'm not a native English speaker, but I expected a better performance.
To help with regression testing, I recently added a test option to gfsage for batch-testing the system by reading dialog samples from a file.
The samples must be in a text file consisting of a sequence of dialogs, each of which is a sequence of query/response exchanges with the Sage system. Notice that a dialog might carry state in the form of assumptions that are asserted or variables that are assigned. Each dialog, however, is completely independent of the others.
Each dialog starts with a BEGIN or BEGIN language line. It specifies the beginning of dialog triplets and the natural language for these triplets. The dialog runs until an END line. The language specified becomes the current language. Dialogs with no given language are assumed to be in the current language. At the start of a testing suite, the current language is English.
A triplet is a sequence of 3 lines:
    BEGIN spanish
    calcula el factorial del número octal 11.
    362880
    es 362880 .
    END
    BEGIN english
    let x be 4 .


    compute the sum of x and 5 .
    9
    it is 9 .
    compute the sum of it and 5 .
    14
    it is 14 .
    END
Notice that blank lines are relevant: they mark that Sage responded nothing to the query. Therefore, blank lines must not be inserted between triplets or between dialogs.
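To make the file format concrete, the following Python sketch reads such a file into dialogs and triplets. It is not the reader used by gfsage; parse_dialogs is a hypothetical name. The sketch assumes that a triplet is a query line, the Sage response and the natural language answer, and that an empty response and an empty answer are written as two empty lines. Both are readings of the description above, not something it states outright.

```python
def parse_dialogs(text, default_language="english"):
    """Split a test file into (language, [(query, response, answer), ...])."""
    dialogs = []
    language = default_language
    lines = text.split("\n")
    i = 0
    while i < len(lines):
        header = lines[i].split()
        i += 1
        if not header or header[0] != "BEGIN":
            continue  # e.g. the empty string left by a trailing newline
        if len(header) > 1:
            language = header[1]  # becomes the current language
        triplets = []
        while i < len(lines) and lines[i] != "END":
            triplets.append(tuple(lines[i:i + 3]))
            i += 3
        i += 1  # skip the END line
        dialogs.append((language, triplets))
    return dialogs

sample = "\n".join([
    "BEGIN spanish",
    "calcula el factorial del número octal 11.",
    "362880",
    "es 362880 .",
    "END",
    "BEGIN english",
    "let x be 4 .",
    "",
    "",
    "compute the sum of x and 5 .",
    "9",
    "it is 9 .",
    "END",
])

for language, triplets in parse_dialogs(sample):
    print(language, len(triplets))
```

Because triplets are consumed three lines at a time, an empty line inside a triplet is read as data, which is why stray blank lines elsewhere would shift every following triplet.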
gfsage --test=FILE
will test the dialogs in FILE and report the differences. You get a summary of the results:
Dialog 'compute Gamma....' failed
18 out of 19 dialogs successful.
By defining new Sage interfaces we can command the Sage shell and notebook server using natural language.
Move to the sage directory and build sage-shell:
cd mgl/sage
make sage-shell
The first time you build it, you may run into a warning as in the installation section of the front page, or:
Please add nlgf components to the interfaces list in /usr/local/sage-4.7.2/devel/sage/sage/interfaces/all.py
We must inform Sage that there are some new interfaces for it: We open interfaces/all.py (Notice that your actual path might be different), go to the end of the file and add something like this:
from nlgf import english, spanish
interfaces.extend(['english', 'spanish'])
The first line asks the system to load the interfaces for commanding Sage using English and Spanish. The next line adds these to the list of available interfaces.
Now retry building:
make sage-shell
At the time of writing, the module nlgf provides catalan, english, german, and spanish interfaces.
In some systems you can have commands in the Sage shell auto-completed by pressing the tab key. This is experimental and the installation has to be done completely by hand.
First you have to build the Python bindings for GF which, for the moment, only work on Linux. You'll find there a shared library called gf.so. Copy or move it into one of the directories that Python scans when resolving imports. Note that the Python instance run by Sage may differ from the one your machine runs by default. To be sure, do as follows:
sage -python -c 'import sys; print sys.path'
it will list all the directories that Sage/python scans.
You'll know it's all right when:
sage -python -c 'import gf'
exits with no complaint. The next time you enter the Sage shell you'll have autocompletion for the GF interfaces.
Start a Sage shell:
sage
and switch to one of the defined natural language interfaces:
sage: %english
will reply with:
--> Switching to Gf <-- 
If you didn't install autocompletion (which is the usual case, auto-completion being experimental), a warning will appear:
No autocompletion available
Now you're ready to issue sage commands in English:
english: compute the summation of x when x ranges from 1 to 100.
5050
english: add 3 to it.
5053
english: let x be the factorial of 6.
720
english: let y be the factorial of 5.
120
english: compute the greatest common divisor of x and y.
120
english: compute the least common multiple of x and y.
720
Go back to the standard interface by pressing Ctrl+D or typing quit.
Sage has a notebook interface that gives a more flexible way to interact with it. To use it, start the shell as above and then:
sage: notebook(secure=true, interface='')
The notebook files are stored in: sage_notebook.sagenb
****************************************************
*                                                  *
* Open your web browser to https://localhost:8000  *
*                                                  *
****************************************************
There is an admin account.  If you do not remember the password,
quit the notebook and type notebook(reset=True).
2012-02-13 12:48:19+0100 [-] Log opened.
...
In some systems a browser will open simultaneously. Now you can use Sage from the browser.
Click on New Worksheet. You'll be asked to rename the worksheet (this is optional). A single cell will be ready for your input. Write your command and press evaluate. Notice that a cell can contain more than one command, separated by newlines.
Start a new cell by writing:
%english
and add one or more new lines with commands in English.

| Attachment | Size | 
|---|---|
| sage-notebook.jpg | 95.69 KB | 
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | Assistant for solving word problems | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | December 2012 | 
| Actual date of delivery: | May 2013 | 
| Type: | Prototype | 
| Status & version: | Final | 
| Author(s): | Jordi Saludes | 
| Task responsible: | Jordi Saludes | 
| Other contributors: | 
We introduce a prototype for dealing with simple arithmetical problems involving concepts of the physical world (word problems). The first software component allows an author to state a word problem by writing sentences in several languages and converting it into Prolog code. The second component takes this code and presents the problem in the student's language. Then it provides step-by-step assistance in natural language for writing equations that correctly model the given problem.
This software deliverable is a prototype of a word problem solver, namely a system that interactively poses a word problem (in many languages), then constructs a solution and a reasoning context for it. The overall architecture relies on third-party, open-source software components to provide the reasoning infrastructure for the system; these components are not distributed with this deliverable.
This document describes:
how to install the prototype;
how to create word problems involving simple arithmetic;
how to assist a student in finding the equations related to them.
The first component is a Scala library (http://www.scala-lang.org) to be used inside the Scala Interpreter shell (in a Read-Evaluate-Print Loop), while the second component is a dialog system which runs in the command line. Both components were developed within the framework of the MOLTO project.
The source code for this deliverable can be downloaded from the MOLTO svn repository by:
  svn co svn://molto-project.eu/mgl/wproblems
It will appear in the wproblems directory. But prior to building the system for the natural language interpretation of the word problems, you need to install the external components that handle the computational aspects of the system.
Third-party software components provide the following functionalities in the system:
SWIPL_LIBDIR to the path to the swi-prolog library. The prototype employs Prolog as domain reasoner for certain schemata of word problems.
In addition, the system requires the jpl library for accessing Prolog from Java code. It is installed by the Prolog installer; check that jpl.jar exists and write down its path (it will be needed for the configuration step below).
Project-related software components:
GF (in our architecture, version 3.4). GF is used to provide the natural language parsing and generation in the dialog system.
gf-java to use the GF web services from Java; (distributed under lib, version gf-java-0.8.1.jar ).
Install all the components as directed. 
Now configure by passing SWIPL_LIBDIR for the path to the swi-prolog library and JPL_LIBDIR for the path to the directory containing jpl.jar. In our case:
  ./configure SWIPL_LIBDIR=/opt/local/lib/swipl-6.2.2/lib/i386-darwin11.3.0/ JPL_LIBDIR=/opt/local/lib/swipl-6.2.2/lib/
and then build the system
  make
Having finished the installation step, we are now ready to use the system [1].
In this document, word problem means a mathematical problem that requires writing the equations describing all the relevant information needed to get the solution. We will split the solution of such a problem into:
There are many applications for helping students with the solving step but only a few for the modeling step. We present here a prototype that addresses the modeling step for problems requiring just elementary arithmetic.
The system allows two modes of usage, for authors (teachers) and for students:
The first application runs inside the Scala REPL and consists of a library implementing the class Problem, with resources for constructing problems from natural language sentences. Problems are saved as Prolog clauses with comments used to reconstruct the originating sentences. 
The second application is a Scala executable that loads a saved problem and engages the student in a natural language dialog leading to a correct model of the problem.
Both applications use a Prolog database to reason about the problem. The basic difference is that for the author tool, the system constructs the model automatically in order to check if the problem is consistent (it does not contain contradictions) and complete (it has enough information to give a single solution), while for the student tool, the model construction is driven by the sentences proposed by the student. The system leads the student through several discovery steps (see next section) and checks that the proposed sentences are correct and relevant.
Invoke the author tool by:
./create
Create a new problem to be saved into file fruit.pl
scala: val p = new Problem("fruit.pl")
p: wp.Problem = Problem with 0 statements
We could use the Statement class to add new statements to the problem with the += operator. 
However, it is more convenient to define a statement factory for entering them in natural language (denoted by its 3-letter ISO code):
scala: val en = new StatementFactory("Eng")
We can now use a predictive parser to enter a new sentence into the problem:
scala: p += en.read
Eng: John has seven fruit .
Notice the final period. We can keep track of how many statements our problem has by:
scala: p
res1: wp.Problem = Problem with 1 statements
Let us add some more facts:
scala: p += en.read
Eng: John has two apples , some oranges and three bananas .
scala: p += en.read
Eng: how many oranges does John have ?
To take a look at the internal representation of the problem, use print:
scala: p.print
We can check if the problem is consistent (it does not contain contradictory statements) or complete (it has a single solution) by using the methods consistent and complete:
scala: p.complete
res3: Boolean = true
Remember to save the problem:
scala: p.save()
Saved to 'fruit.pl'
and now we can exit:
:q
We can now try to solve our problem, by calling model with the file containing the problem:
 ./model fruit.pl 
It shows us the statement of the problem:
John has seven fruit .
John has two apples , some oranges and three bananas .
how many oranges does John have ?
and displays the prompt:
 ?
We can always press return at the prompt (or type help) for the system to suggest the proper action:
you must assign a variable to the oranges that John has .
But we do not know how to assign variables. Let us ask for an example:
? give me an example
let $x$ denote the animals that Mary has
Using this template we can now compose a definition for the variable x:
? let x denote the oranges that Mary has
you must assign a variable to the oranges that John has .
I forgot that we were dealing with John's fruit, not Mary's:
? let x denote the oranges that John has
it is right .
Press return again for the next suggestion:
you must split the fruit that John has .
This means that we have to specify how John's fruit are split in different classes:
 ? the fruit that John has are the apples that John has and the bananas that John has
 you must consider oranges .
Yes, there are oranges too. Let us correct it:
 ? the fruit that John has are the apples that John has , the bananas that John has and the oranges that John has
 it is right .
Good. Next suggestion:
you must write an equation which says that the fruit that John has are the bananas that John has , 
the oranges that John has and the apples that John has .
What about this?
? y plus 2 plus 3 is equal to 7
it doesn't follow .
This means that the proposed equation can not be deduced from the statement of the problem. Let us see what is wrong with the variable y:
? tell me about y
nothing is known about it .
Perhaps we used a different variable to denote the amount of oranges:
? tell me about the oranges that John has
the oranges that John has are $x$ oranges .
So we used x for it. Just to confirm it:
  ? tell me about x
  $x$ denotes the oranges that John has .
We rewrite the equation using x:
? x plus 2 plus 3 is equal to 7
it is right .
Now the problem is correctly modeled. The next action will give us the solution:
the oranges that John has are two oranges .
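The final step can be checked by hand: the accepted equation x + 2 + 3 = 7 has the single solution x = 2. The following Python lines spell out that arithmetic. They are only a worked check; the actual system derives the solution with its Prolog reasoner, and the function names here are hypothetical.

```python
def solve_for_unknown(total, known_parts):
    """Solve x + sum(known_parts) = total for the single unknown x."""
    return total - sum(known_parts)

def satisfies(x, total, known_parts):
    """Check a candidate value against the same equation."""
    return x + sum(known_parts) == total

# John has seven fruit: two apples, x oranges and three bananas.
oranges = solve_for_unknown(7, [2, 3])
print(oranges, satisfies(oranges, 7, [2, 3]))
```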
To run the same problem but in Spanish, add the 3-letter-ISO code of the language as second argument:
./model examples/fruit.pl spa
...
Juan tiene siete frutas .
Juan tiene dos manzanas , algunas naranjas y tres plátanos .
¿ cuantas naranjas tiene Juan ?
Asking for help:
? 
debes asignar una variable a las naranjas que Juan tiene .
Asking for an example:
? dame un ejemplo
denota las cartas que María tiene por $z$
[1] The system will start and stop the GF-java service for you, but if you run into trouble you can check the state of the service with bin/wpserver status and stop it with bin/wpserver stop.
The current prototype allows stating word problems of the following form:
   John|Mary has|owns  one|two|...|seven|some fruit|apples|oranges|bananas|animals|rabbits|cows
in the languages: English, Catalan, Swedish and Spanish.
The building block for the reasoning is the amount: a relaxed version of a set in which one does not have access to the composing elements, but can know the number of elements in it.
An amount is constructed by:
Giving the cardinal and the class of its elements (e.g., three oranges). Notice that the cardinal may be undefined (e.g., some oranges);
The own predicate binding an individual and a class (e.g., the apples that John has);
Disjoint unions of these constructions (e.g., three apples and two oranges).
Available sentences express the equality between two amounts (e.g., The fruit that John has are two apples and some oranges).
The modeling process implies transforming a set of propositions into another set in which the numerical interpretation is evident.
We consider two grammars to express these facts:
The plain language is for direct communication with the user;
The core language is for the reasoner to work with.
This is how we express the amount "the apples that John has" in plain (Prolog concrete):
own(john, apple)
while in core:
p(X, apple, own(john,X))
The latter is better suited to reasoning.
Another step in normalizing an amount (making it core) is to disaggregate sums. In this way a statement like John has three apples and six bananas is converted into: John has three apples and John has six bananas.
Another case is the conversion of questions such as how many apples does Mary have? which are represented in plain as:
find(own(mary,apple))
into the core expression:
find(X, apple, own(mary,X))
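The plain-to-core steps described above can be illustrated with a small sketch. It is written in Python rather than in the prototype's Prolog, and it is not the prototype's code: terms are modelled as nested tuples, and the helper names (`fresh_vars`, `amount_to_core`, `question_to_core`, `disaggregate`, `show`) are hypothetical. The `-`, `*`, `unit` and `some` functors merely copy the shape of the core clauses stored in the problem files.

```python
from itertools import count

def fresh_vars():
    """Supply of fresh variable names, standing in for Prolog variables."""
    return (f"_{i}" for i in count(1))

def amount_to_core(plain, fresh):
    """('own', owner, cls)  ->  ('p', X, cls, ('own', owner, X))."""
    functor, owner, cls = plain
    assert functor == "own"
    x = next(fresh)
    return ("p", x, cls, ("own", owner, x))

def question_to_core(plain, fresh):
    """('find', ('own', owner, cls))  ->  ('find', X, cls, ('own', owner, X))."""
    functor, (inner, owner, cls) = plain
    assert functor == "find" and inner == "own"
    x = next(fresh)
    return ("find", x, cls, ("own", owner, x))

def disaggregate(owner, parts, fresh):
    """Split 'owner has a and b' into one core statement per summand.

    Each part is (cls, quantity); quantity None stands for 'some'.
    """
    statements = []
    for cls, qty in parts:
        lhs = amount_to_core(("own", owner, cls), fresh)
        rhs = ("some", cls) if qty is None else ("*", qty, ("unit", cls))
        statements.append(("-", lhs, rhs))
    return statements

def show(term):
    """Render a nested tuple as a Prolog-like term."""
    if isinstance(term, tuple):
        return f"{term[0]}({', '.join(show(t) for t in term[1:])})"
    return str(term)

if __name__ == "__main__":
    fresh = fresh_vars()
    print(show(amount_to_core(("own", "john", "apple"), fresh)))
    print(show(question_to_core(("find", ("own", "mary", "apple")), fresh)))
    for stmt in disaggregate("john", [("apple", 3), ("banana", 6)], fresh):
        print(show(stmt))
```

Passing the variable supply explicitly keeps the sketch deterministic; in Prolog the same effect comes for free from logical variables.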
A set of statements in core language is what is needed to process a word problem. This is what the create tool saves: A Prolog file consisting of:
A GF abstract tree for the plain sentence of a problem. This is written as a Prolog comment.
Core statements in Prolog format that correspond to the plain expression.
As an example, this is a complete problem in core Prolog clauses. The comments contain the GF abstract tree corresponding to the original plain expression:
% abs:fromProp (E1owns john (gen Fruit n7))
% Eng:John has seven fruit .
-(p(_1, fruit, own(john, _1)), *(7, unit(fruit))).
% abs:fromProp (E1owns john (aplus (ConsAmount (gen Apple n2) (BaseAmount (some Orange) (gen Banana n3)))))
% Eng:John has two apples , some oranges and three bananas .
-(p(_5, apple, own(john, _5)), *(2, unit(apple))).
-(p(_6, banana, own(john, _6)), *(3, unit(banana))).
-(p(_7, orange, own(john, _7)), some(orange)).
% abs:fromQuestion (Q1owns john Orange)
% Eng:how many oranges does John have ?
find(_19, orange, own(john, _19)).
When the model tool is started on a word problem file, the system uses the GF abstract lines to display the statement of the problem in the selected language.
Now the student must go through a sequence of steps to have the problem correctly modeled:
Assigning variables. At the beginning the student must choose variables to designate unknowns that are relevant to the problem. This includes the target unknowns (they appear as arguments of find clauses) and expressions like some apples.
Discovering relations. In this step the student has to combine information from different statements into new relations. For example, decomposing the fruit that John has into the apples and bananas that John has.
Stating equations. In the next step, the student converts the relations uncovered in the previous step into numerical equations. This step finishes when there are enough equations to determine the unknowns of the problem. The system checks that the student's equations are consistent and are entailed by the problem information.
Final. At the last step, the system displays the solution for the unknowns of the problem and exits.
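The check performed in the "Stating equations" step can be pictured with a deliberately simplified sketch. It is in Python, it is not the prototype's Prolog reasoner, and it assumes that the problem information determines a unique value for every assigned variable, so that "entailed" reduces to "holds under that solution". The function names and the representation of an equation as a list of summands plus a right-hand side are hypothetical.

```python
from fractions import Fraction

def solve_single(coeff, const, total):
    """Solve coeff * x + const = total for x."""
    return Fraction(total - const, coeff)

def follows(equation, assigned, solution):
    """Decide whether a student's equation follows from the problem.

    equation -- (terms, rhs); each term is a number or a variable name
    assigned -- variables the student has already introduced
    solution -- value of each assigned variable under the problem facts
    """
    terms, rhs = equation
    total = Fraction(0)
    for term in terms:
        if isinstance(term, str):
            if term not in assigned:
                return False  # nothing is known about this variable
            total += solution[term]
        else:
            total += term
    return total == rhs

if __name__ == "__main__":
    # Fruit problem: 7 fruit = 2 apples + x oranges + 3 bananas.
    assigned = {"x": "the oranges that John has"}
    solution = {"x": solve_single(1, 2 + 3, 7)}
    print(follows((["y", 2, 3], 7), assigned, solution))  # y was never assigned
    print(follows((["x", 2, 3], 7), assigned, solution))
```

A full treatment would test entailment against the whole set of core statements rather than against a precomputed solution; the sketch only mirrors the dialogue in which the equation over y is rejected and the one over x is accepted.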
The current prototype is a proof of concept aiming to demonstrate that the semantics of word problems can be handled given a formalization of the specific domain, a decision procedure on the resulting model, and a natural language application in which the facts describing the specific world instance can be expressed and semantically interpreted.
In this work we have considered problems of a specific kind but we maintain that, in the e-Learning scenario, which is our target area of application, word problems can be classified according to schemes which can be formalized along the lines shown here. In many problems, understanding of the natural language formulation is translated to facts in the knowledge base, where two seemingly independent facts are put in a relation and become a new assumption for solving the problem (if A is an animal tamer, then A is not afraid of animals; if F is the father of S, then F is older than S; every orange is a fruit). Construction of the correct assumption can be done in an exhaustive way only under a finite world assumption (what is known is what is explicitly stated).
Replacing the Prolog engine by a proof assistant. These systems deliver proofs of propositions in a theory. Using a theory that supports a kind of word problem, and forcing the problem author to express the problem as a valid theorem in this theory, would have the benefit of uncovering hidden assumptions in the problem statement.
Also, since these systems are more expressive than Prolog clauses and support complex tactics for automatic proving, they could ease the maintenance of the system.
On the student side, the discovery of new facts is converted into asserting propositions that can either be proved transparently by tactics or be presented to the student to deal with. This would allow the same problem to be reused at different educational levels, according to what is assumed and what is proved by the student.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D8.2 Multilingual grammar for museum object descriptions | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | 1 Mar 2012 | 
| Actual date of delivery: | 16 Mar 2012 | 
| Type: | Prototype | 
| Status & version: | Draft | 
| Author(s): | D. Dannélls et al. | 
| Task responsible: | UGOT | 
| Other contributors: | All | 
| Attachment | Size | 
|---|---|
| WP8-D2.pdf | 241.34 KB | 
| d8.2-grammars.tar.gz | 9.21 KB | 
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D9.1. MOLTO test criteria, methods and schedule | 
| Security (distribution level): | Confidential | 
| Contractual date of delivery: | M7 | 
| Actual date of delivery: | October 2010 | 
| Type: | Report | 
| Status & version: | Draft (evolving document) | 
| Author(s): | L. Carlson et al. | 
| Task responsible: | UHEL | 
| Other contributors: | 
Abstract
The present paper is the summary of deliverable D 9.1 as of M6. Workpackage materials can be found at the UHEL MOLTO website (https://kitwiki.csc.fi/twiki/bin/view/MOLTO/WebHome). This document also links to the MOLTO official website (http://www.molto-project.eu/).
(The official MOLTO website is the prime place for coordinating the project as (long as) material on it is uncluttered, reliable and up to date. For local work, informal project communication and creative planning, the UHEL MOLTO website is open to all MOLTO partners.)
This paper is structured into an introduction followed by sections per workpackage. The WPs are divided into the front end WPs (WP3 and use cases) and the back end ones (WPs 2,4,5). For each WP we survey promises from DoW, ongoing work, and derive requirements from them, followed by evaluation plans or recommendations. Text in brackets refers to the source. Action points are in boldface.
The wealth of cited content aims to bring different strains of documented work planned or in progress together, in order to get an updated view of the ongoing MOLTO process, and thus cover the bases for making the tool and user WP requirements meet. We take as base what the technology offers and scale user expectations from that.
We go over the later WP9 tasks first:
D 9.1 is to define the requirements for both the generic tools and the case studies in a coherent way that can lead to maximal synergy between work packages. To do this we need to detail the project plan and schedule. This then implies the main outline of the evaluation schedule.
The MOLTO dependency chart only shows dependencies for WP 9 with the use cases WP 6-8 plus the dissemination WP 10. The boldfaced bits above entail that there are dependencies to the tools workpackages as well.
By the MOLTO timetable, WPs 2,4,9 (tools, ontology, req/eval) started at once. Translation tools WP3 and use case WPs 5-6 start at m7 (Varna). Patents use case WP7 has not started due to failure of partner.
By the DoW, MOLTO aims to have working prototypes on the way. So far, each partner has been providing their own demos. Progressively, there will be more need for integration; WP3 in particular will use most of the rest as components. In the best case, integration can be just plugging in APIs, with at most local bilateral negotiation between a provider and a user. But to ensure this, we must agree in time what the APIs will provide.
As suggested in the DoW text (but not spelled out in the schedule), specification/version checkpoints should be agreed more often between the tools WPs. At Varna, we get the first update of the tools and ontology workpackages. We should get together to fix times and expected contents for the remaining internal checkpoints as well. It would help to add checkpoint dates plus time dependencies to the above schedule (turn it into a proper Gantt chart; the “Gantt chart” in the DoW is more like a PERT chart). It also helps to be clear about just what capabilities each release is planned to offer. Proposals for what to insert into the project schedule are made along the way below.
Checkpoints can be constructed from the deliverables list and the milestones table.
The deliverables list implies these checkpoints with implications to the evaluation timetable:
Milestone MS3 may need updating relative to the deliverables list. No important deliverables are scheduled between M6 and M12 that would motivate a demonstrator there. A more appropriate place for the next version of the translation tool (after the Phrasebook) is after M18. M18 should make available ontology interoperability and, along with that, new lexical tools.
Having fixed the schedule some, we go through the WP 9.1 tasks boldfaced from the WP9 statement of purpose.
[From DoW] The work will start with collecting user requirements for the grammar development IDE (WP2), translation tools (WP3), and the use cases (WP6-8). We will define the evaluation criteria and schedule in synchrony with the WP plans (D9.1). We will define and collect corpora including diagnostic and evaluation sets, the former, to improve translation quality on the way, and the latter to evaluate final results.
We have not been able to do much interviewing here because the patent user partner (WP7) is missing and the two others have not started their WPs yet. We have not got real end users in the use cases. In the mathematics case, the end users could be math teaching platform developers; in the patent case, patent office staff; in the museum case, museum workers. These are content professionals with more than average technical facility.
The use cases were scheduled as follows.
This problem was implicit in the original timetable, which expected WP9 to work on the use cases before the use case WPs started working. This was noted in the kickoff meeting, and it was agreed that this task would be rescheduled as necessary.
Pending user input, we decided to derive requirements from MOLTO's promises and compare them to the tools resources. The promises made by MOLTO from DoW are summarised below.
[DoW 5]
The single most important S&T innovation of MOLTO will be a mature system for multilingual on-line translation, scalable to new languages and new application domains. The single most important tangible product of MOLTO is a software toolkit, available via the MOLTO website. The toolkit is a family of open-source software products:
A helpful list of quality dimensions relevant to MOLTO evaluation can be derived from the DoW list of links between the main objectives and the tasks in WP’s:
Here are some measurable expected outcomes. Most of them are directly applicable as testable quantitative evaluation measures. It is another thing how many test rounds we can do, given the need of fresh test subjects.
| Feature | Current | Projected | Remarks | 
|---|---|---|---|
| Languages | up to 7 | up to 15 | languages treated simultaneously | 
| Domain size | 100’s of words | 1000’s of words | 4 domains with substantial applications (“substantial” not quantified here) | 
| Robustness | none | open-text capability | translation quality: “complete” or “useful” on the TAUS scale (Translation Automation Users Society) | 
| Development per domain | months | days | |
| Development per language | days | hours | |
| Learning (grammarians) | weeks | days | |
| Learning (authors) | days | hours | source authoring: the MOLTO tool for writing translatable controlled text can be learned in less than one hour, the speed of writing translatable controlled text is in the same order of magnitude as writing unlimited plain text | 
The number 18 of grammar library languages is the minimum number of languages we expect to be available at the end of MOLTO. The number 3 to 15 is the number of languages actually implemented in MOLTO’s domain grammars (3 in WP7, 15 in WP6 and WP8).
The measurements of all these features are performed within WP9 in connection to the project milestones. The advisory group will confirm the adequacy and accuracy of the measurements.
The objects of evaluation – even the translated texts – vary considerably per WP. We detail some criteria per WP below. Evaluation criteria and methods have been collected on the UHEL MOLTO website (esp. https://kitwiki.csc.fi/twiki/bin/view/MOLTO/EvaluationCookbook).
Not much could be done here (yet). We have not got patent corpora. The mathematicians have yet to collect their word problems. We got a small museum text corpus (approx. 25000 words in Swedish, a set of 9 short passages translated into English presumably by non-native speakers) from Gothenburg.
We have translated parts of this corpus both manually and using MT for test material in BLEU evaluation. A pilot comparing BLEU scores on this material to a manual error analysis is under way.
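For readers unfamiliar with the metric, the following is a minimal sketch of how a BLEU score is computed. It is not the evaluation tooling used in the project: it scores a single sentence against a single reference, uses whitespace tokenization and uniform weights over 1- to 4-grams, and applies no smoothing, whereas BLEU proper is defined over a whole test corpus and may use several references. The function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference."""
    cand, ref = candidate.split(), reference.split()
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Modified precision: clip each n-gram count by its reference count.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precision += math.log(overlap / total) / max_n
    # Brevity penalty punishes candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)

if __name__ == "__main__":
    reference = "John has seven fruit ."
    print(bleu("John has seven fruit .", reference))
    print(bleu("John has seven fruit", reference))
    print(bleu("Mary owns three cows !", reference))
```

Because the unsmoothed score drops to zero as soon as one n-gram order has no match, sentence-level figures on short texts are only indicative, which is one reason to compare them with a manual error analysis.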
A small test GF grammar for a sample of the corpus has been written (link). It has helped to make the requirements on grammar-ontology interoperability more concrete (below).
We have also fetched the usual EU multilingual corpora on our test platform (hippu.csc.fi).
We have found time to install an evaluation platform, collect and test standard issue translation quality evaluation tools, to develop forthcoming MOLTO lexicon tools, to learn GF and develop ideas about the ontology to grammar interface. The IQmt evaluation platform was tested on a small sample of machine and human translated text (English into Finnish) (see https://kitwiki.csc.fi/twiki/bin/view/MOLTO/EvaluationCookbook).
UHEL also took part in the MOLTO phrasebook task, a demo for translating touristic phrases between 14 European languages: Bulgarian, Catalan, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Romanian, Spanish, Swedish. This experiment presents one way to evaluate the effort required for adding new language versions (more on this below).
We divide the rest of the paper by WPs into the front end: translation tool, the use cases and associated lingware (ontologies and grammars), and the back end: the translation system (WPs 2,4,5), presented in this order. We also try to form an idea about what the WPs are currently about, to see how they are construing their tasks. Information about this (at least task titles) was found on the MOLTO website.
The MOLTO workflow is a break with tradition in the professional translation business as well as at the consumer end, in that it merges the roles of content author and translator. In professional translation, a document is authored at source and the translator's work on the source is read-only. At the consumer end, MT is largely used for gisting from unknown languages to familiar ones.
The main impact is expected to be on how the possibilities of translation are viewed in general. The field is currently dominated by open-domain browsing-quality tools (Google translate and Systran), and domain-specific high-quality translation is considered expensive and cumbersome.
MOLTO will change this view by making it radically easier to provide high-quality translation on its scope of application—that is, where the content has enough semantic structure—and it will also widen this scope to new domains. Socioeconomically, this will make web content more available in different languages, including interactive web pages.
At the end of MOLTO, the technology will be illustrated in case studies that involve up to 15 languages with a vocabulary of up to 2,000 special terms (in addition to basic vocabulary provided by the resource grammar).
The generic tools developed in MOLTO will moreover make it possible for third parties to create such translation systems with very little effort. Creating a translation system for a new language covering an unlimited set of documents in a domain will be as smooth (in terms of skill and effort) as creating an individual translation of one document.
(The last sentence sounds like a tall order. But probably it just points out that once MOLTO has been primed for one text it can translate any number of (sufficiently) similar ones.)
The MOLTO change of roles will also entail a change of scenarios.
The translator's new role (parallel to WP3: Translator's tools) will be designed and described in the D9.1 deliverable. Most current translator's workbench software treats the original text as a read-only source. The tools to be developed within WP3 (+ 2) will lead towards a more mutable role for the source text. The translation process will resemble structured document editing or multilingual authoring more than transformation from a fixed source to a number of target languages.
Since the MOLTO scenario implies major differences from the received translation workflow and from the current roles and requirements of translation client, translator, revisor etc., MOLTO is not likely to impact the translation business at large in the near future. Instead, it has its chances in entering and creating new workflows, in particular in multilingual web publishing. Multilingual websites are currently developed by means of crowdsourcing translation with tools borrowed from the software localization business (links). MOLTO could complement or replace this workflow with its new role cast of a content producer or technical editor who generates multilingual content from a single language source. Applications may include multilingual Wikipedia articles, e-commerce sites, medical treatment recommendations, tourist phrasebooks, social media, SMS.
The introductory scenario of this proposal is the multilingual Wiki system presented in (Meza Moreno and Bringert 2008). In this system, users can add and modify reviews of restaurants in three languages (English, Spanish, and Swedish). Any change made in any of the languages gets automatically translated to the other languages.
As for CAT in general, the advantages of MOLTO can be particularly clear in versioning of already existing sites.
We next review user requirements by type of user and the expected expertise of each. The role cast around MOLTO includes at least these:
• Author
• Editor
• Translator
• Checker
• Ontologist
• Terminologist
• Grammarian
• Engineer
So far, all of these roles are merged. Different use scenarios may separate some and merge others. Peculiar to MOLTO is the merge of the author/editor/translator roles. In the MOLTO scenario, the editor-translators cannot be expected to know (all) the target language(s). The target checker(s) and terminologist(s)-grammarian(s) are likely to be different from them, possibly a widely distributed crowd.
The translator's tool serves primarily for author/editor/translator/checker roles. It links to TF which serves ontologist/terminologist roles (and connects them to the former). Presumably, the Grammar IDE supports the last four roles on the above list.
The author is likely to be some sort of an expert on the subject matter, but not necessarily an expert on ontology work. The editor, if separate from the author, could be less of a subject expert but possibly more of an ontologist. How much of a difference there need be between these roles depends on the cleverness of the MOLTO tools.
Say an author types away and MOLTO counters with questions caused by the underlying ontology (of type do you mean this or that?) Unless the author agrees with the ontology, he may be hard put to answer, while an editor/ontologist (familiar with the ontology and/or the way MOLTO works) may know how to proceed – to choose the right thing or to realize the right alternative is missing and how to fix it.
Analogous comments can be made of the relations between author, translator, checker and terminologist. It is all very well for the author to immediately see translations in umpteen languages he does not know. He has no way of knowing whether they are correct (unless MOLTO provides some way for him to check – say back translation with paraphrase?). Also, concrete grammars may ask awkward questions (of the type do you mean male or female, familiar or polite?). To get things right, the author would need to know whether one should be familiar or polite in language N. Here, he needs (to be) a translator or native checker. Considerations like this need to be taken into account in WP3 requirements analysis.
The following lengthy quote from DoW recaps the main ingredients of the translation tools made available to WP3 by WP2.
[9 Translator’s tools in DoW]
For the translator’s tools, there are three different use cases:
• restricted source
   • production of source in the first place
   • modifying source produced earlier
• unrestricted source
Working with restricted source language recognizable by a GF grammar is straightforward for the translating tool to cope with, except when there is ambiguity in the text. The real challenge is to help the author to keep inside the restricted language. This help is provided by predictive parsing, a technique recently developed for GF (Angelov 2009). Incremental parsing yields word predictions, which guide the author in a way similar to the T9 method in mobile phones. The difference from T9 is, however, that GF’s word prediction is sensitive to the grammatical context. Thus it does not suggest all existing words, but only those words that are grammatically correct in the context.
Predictive parsing is a good way to help users produce translatable content in the first place. When modifying the content later, e.g. in a wiki, it may not be optimal ... This is where another utility of the abstract syntax comes in: [syntax editing]. in the abstract syntax tree, all that is changed is the noun, and the regenerated concrete syntax string automatically obeys all the agreement rules. This functionality is implemented in the GF syntax editor (Khegai & al. 2003).
The predictive parser of GF does not try to resolve ambiguities, but simply returns all alternatives in the parse chart. This is not always a problem, since it may be the case that the target language has exactly the same ambiguity and then it remains hidden in the translation. In practice this happens often in closely related languages. But if the ambiguity makes a difference in translation, it has to be resolved. There are two complementary approaches: using statistical models for ranking or using manual disambiguation. … For users less versed in abstract syntax, however, a better choice is to show the ambiguities as different translation results. Then the user just has to select the right alternatives. The choice is propagated back in the abstract syntax, which has the cumulative effect that a similar ambiguity in a third language gets fixed as well. This turns out to be very useful in a collaborative environment such as Wikipedia.
Both predictive parsing and syntax editing are core functionalities of GF and work for all multilingual grammars. While the MOLTO project will exploit these functionalities with new grammars, it will also develop them into tools fitting better into users’ work flows. Thus the tools will not require the installation of specific GF software: they will work as plug-ins to ordinary tools such as web browsers, text editors, and professional translators’ tools such as SDL and WordFast.
The snapshot in Figure 2 is from an actual web-based translation prototype using GF. It shows a slot in an HTML page, built by using JavaScript via the Google Web Toolkit (Bringert & al. 2009). The translation is performed in a server, which is called via HTTP. Also client-side translators, with similar user interfaces, can be built by converting the whole GF grammar to JavaScript (Meza Moreno and Bringert 2008).
To deal with unrestricted legacy input, such as in the patent case study, predictive parsing and syntax editing are not enough. The translator will then be given two alternatives: to extend the grammars, or to use statistical translation.
For grammar extension, some functionalities of the grammar writer’s tools are made available to the translator—in particular, lexicon extension (to cope with unknown words) and example-based grammar writing (to cope with unknown syntactic structures). In statistical translation, the worst-case solution is to fall-back to phrase-based statistical translation. In MOLTO, we will study the ways to specialize this to translation in limited domains, so that the quality is higher than in general-purpose phrase-based translation. We will also study other methods to help translators with unexpected input.
WP3 has its main deliverables at months 18, 24 and 30.
| Del. no | Del. title | Nature | Date | 
|---|---|---|---|
| D 3.1 | MOLTO translation tools API | P | M18 | 
| D 3.2 | MOLTO translation tools prototype | P | M24 | 
| D 3.3 | MOLTO translation tools / workflow manual | RP, Main | M30 | 
[WP3 in DoW]
The standard working method in current translation tools is to work on the source and translation as a bilingual text. Translation suggestions, sought from TM (Translation Memory) based on similarity or generated by an MT system, are presented for the user to choose from and edit manually. The MOLTO translator tool extends this with two additional constrained-language authoring modes, a robust statistical machine translation (UPC) mode, plus vocabulary and grammar extension tools (UGOT), including: (i) mode for authoring source text while context-sensitive word completion is used to help in creating translatable content; (ii) mode for editing source text using a syntax editor, where structural changes to the document can be performed by manipulating abstract syntax trees; (iii) back-up by robust and statistical translation for out-of-grammar input, as developed in WP5; (iv) support of on-the-fly extension by the translator using multilingual ontology-based lexicon builder; and (v) example-based grammar writing based on the results of WP2.
The WP will build an API (D3.1, UHEL) and a Web-based translator tool (D3.2, by Ontotext and UGOT). The design will allow the usage of the API as a plug-in (UHEL) to professional translation memory tools such as SDL and WordFast. We will apply UHEL’s ContentFactory for distributed repository system and a collaborative workflow for multilingual terminology.
This is what we say about the eventual translation platform in DoW (section numbering 1.2.5 seems a random error):
1.2.5 Multilingual services
MOLTO will provide a unique platform for multilingual document management, satisfying the five desired features listed in Section 1.1. [?] It will enable truly collaborative creation and maintenance of content, where input provided in any language of the system is immediately ported to the other languages, and versions in different languages are thereby kept in synchrony. This idea has had previous applications in GF (Dymetman & al. 2000, Khegai & al. 2003, Meza Moreno and Bringert 2008). In MOLTO, it will be developed into a technology that can be readily applied by non-experts in GF to any domain that allows for an ontology-based interlingua.
The methodology will be tested on three substantial domains of application: mathematics teaching material, patents, and museum object descriptions. These case studies are varied enough to show the generalisability of the MOLTO technology, and also extensive enough to produce useful prototypes for end users of translations: mathematics students, intellectual property researchers, and visitors to museums. End users will have access in their own languages to information that may be originally produced in other languages.
This does not actually say that all three use cases use one and the same platform (unless 'unique' means just one). It is not even sure they want the same features. The mathematicians are likely to need some math editing tool and perhaps access to a computational algebra solver. Patent translators may need access to patent corpora and databases. Museum people may need to work with images. Future MOLTO users may have their own favourite platforms with such facilities in place.
Rather, the WP3 translation tools deliverable should be a set of plugins usable in many different platforms, in turn variously using the common GF back-end plugins listed above.
Still, we need a flagship demonstrator for the project. The flagship demonstrator should be a generic web editing platform. Minimally, it can be an extension of the existing GF web translation demo. In the best case, it could be installed as a set of plugins to some existing web platform like Mediawiki, Drupal and/or some open source CAT tool(s).
The demonstrator should be able to have at least the following plugins:
• GF translation editor (including autocompletion and syntax editing)
• GF grammar IDE
• TF ontology/lexicon manager
• Ontotext ontology tools (if separate from above)
• SMT translator (if separate from above)
• TM (translation memory)
The TM on the list is a stand-in for tools to support non-constrained editing. (It appears that some use cases will need to mix GF translation with manual (CAT or SMT supported) translation.)
All or parts of some existing web translation/localization platform(s) could be taken as starting point. Or conversely, some existing CAT tool components could be plugged into ours. (The latter plan may now seem more promising.)
Translator’s tools promised by WP2 include
• text input + prediction (= autocompletion from grammar)
• syntax editor for modification
• disambiguation
• on the fly extension
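To make the first item concrete, here is a toy sketch of grammar-driven word prediction. It does not use GF or its predictive parser, which works incrementally on real grammars: the sketch simply enumerates the sentences of a small finite slot grammar and reads off the possible next words. The slot lists are an illustrative subset of the word-problem sentence pattern, and the names `SLOTS`, `sentences`, `predict` and `complete` are hypothetical.

```python
from itertools import product

# Toy slot grammar (assumption): subject, verb, quantity, noun.
SLOTS = [
    ["John", "Mary"],
    ["has", "owns"],
    ["two", "three", "seven", "some"],
    ["apples", "oranges", "bananas"],
]

def sentences():
    """Enumerate every word sequence accepted by the toy grammar."""
    return [list(words) for words in product(*SLOTS)]

def predict(prefix):
    """Words that may follow `prefix`, a list of complete words.

    Returns an empty list when the prefix is complete or ungrammatical.
    """
    k = len(prefix)
    return sorted({s[k] for s in sentences() if len(s) > k and s[:k] == prefix})

def complete(prefix, partial):
    """Narrow the predictions to words starting with the typed characters."""
    return [word for word in predict(prefix) if word.startswith(partial)]

if __name__ == "__main__":
    print(predict([]))
    print(predict(["John"]))
    print(complete(["John", "has"], "s"))
    print(predict(["has"]))  # outside the grammar: no suggestions
```

Unlike dictionary-based completion, the suggestions depend on the position in the sentence: after "John" only verbs are offered. Brute-force enumeration is feasible only for such a tiny finite language; it stands in here for the incremental parsing that makes the same service practical for full grammars.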
The MOLTO workflow and role play must be spelled out in the grammar tool manual (D 2.3) and the MOLTO translation tools / workflow manual (D 3.3). We should start writing these manuals now, to fix and share our ideas about the user interfaces.
The main claims to fame in MOLTO are to produce high automatic translation quality, particularly in view of faithfulness, into multiple languages from one pre-editable source, and as a way to that, practically (= economically) feasible multilingual online translation editing with a minimum of training:
[DoW]
The expertise needed for using the translation system will be minimal, due to the guidance provided by MOLTO.
| Feature | Current | Projected | 
|---|---|---|
| Learning (authors) | days | hours | 
These claims should then be among the items to evaluate.
Quantified evaluation of translation tool features makes sense starting with the translation tool prototype developed in WP3 (M24). The tests can be developed and calibrated on the initial demonstrator at M18.
We distinguish below between evaluating the translation result and evaluating the translation process.
3a. Evaluating the translation result
We argue below that there is little sense for WP9 to quantitatively measure MOLTO translation quality with standard MT eval tools except at the end of MOLTO (D 9.2). On the way there, WPs (in particular the GF grammar and SMT WPs) should institute their own progress evaluation schedules. They may then outsource translation quality evaluations to WP9 when appropriate. What we want to avoid is an externally imposed evaluation drill during WP work which can produce skewed results and cause useless delays on the way.
We have created a UHEL MOLTO TWiki website to coordinate our workpackages internally (link). The website is open for other MOLTO partners as well.
We have installed standard SMT evaluation tools (hippo.csc.fi). A pilot study on measuring translation fidelity has been conducted in a PhD project associated with MOLTO (Maarit Koponen).
This is what MOLTO promised in the DoW about translation quality assessment:
To measure the quality of MOLTO translations, we compare them to
(i) statistical and symbolic machine translation (Google, SYSTRAN); and
(ii) human professional translation.
We will use both
automatic metrics (IQmt and BLEU; see section 1.2.8 for details (???)) and
TAUS quality criteria (Translation Automation Users Society)
As MOLTO is focused on information-faithful grammatically correct translation in special domains, TAUS results will probably be more important.
Given MOLTO’s symbolic, grammar-based interlingual approach, scalability, portability and usability are important quality criteria.
These criteria are quantified in (D9.1) and reported in the final evaluation (D9.2).
In addition to the WP deliverables, there will be continuous evaluation and monitoring with internal status reports according to the schedule defined in D9.1.
The criteria (scalability, portability, and usability) mean that MOLTO should have wider coverage, be easier to extend and need less expertise than similar (symbolic, grammar-based, interlingual) solutions heretofore.
[12 Translation quality]
We will compare the results of MOLTO to other translation tools, by using both automatic metrics (BLEU, Bilingual Evaluation Understudy, Papineni & al. 2002) and, in particular, the human evaluation of “utility”, as defined by TAUS. The comparison is performed with the freely available general-purpose tools Google translate and Systran. While the comparison is “unfair” in the sense that MOLTO is working with special-purpose domain grammars, we want to perform measurements that confirm that MOLTO’s quality really is essentially better. Comparisons with domain-specific systems will be performed as well, if any such systems can be found. Domain-specific translation systems are still rare and/or not publicly available.
Regarding automatic metrics for MT, lexical n-gram based metrics (WER, PER, BLEU, NIST, ROUGE, etc.) have been the usual practice over the last decade. However, recent studies have shown that lexical metrics are limited in capturing certain kinds of linguistic improvements and in ranking heterogeneous MT systems appropriately (Callison-Burch et al. 2006; Callison-Burch et al. 2007; Callison-Burch et al. 2008; Giménez 2008). These studies have fostered research on more sophisticated metrics, which can combine several aspects of syntactic and semantic information. The IQmt suite, developed by the UPC team, is one example in this direction (Giménez and Amigó 2006; Giménez and Màrquez 2008). In IQmt, a number of automatic metrics for MT, which exploit linguistic information from morphology to semantics, are available for the English language and will be extended to other languages (e.g., Spanish) soon. These metrics are able to capture more subtle improvements in translation and show high correlation with human assessments (Giménez and Màrquez 2008; Callison-Burch et al. 2008). We plan to use IQmt in the development cycle whenever it is possible. For languages not covered in IQmt, we will rely on BLEU (Papineni et al. 2002).
Regarding human evaluation, the TAUS method is the more appropriate one for the MOLTO tasks, since we are aiming for reliable rendering of information. It consists of inspection of a significant number of source/target segments to determine the effectiveness of information transfer. The evaluator first reads the target sentence, then reads the source to determine whether additional information was added or misunderstandings identified.
The scoring method is as follows:
4. Complete: All of the information in the source was available from the target; reading the source did not add to information or understanding.
3. Useful: The information in the target was correct and clear, but reading the source added some additional information or understanding.
2. Marginal: The information in the target was correct, but reading the source provided significant additions or clarifications.
1. Poor: The information in the target was unclear and/or incorrect; reading the source would be necessary for understanding.
We aim to reach “complete” scores in mathematics and museum translation, and “useful” scores in patent translation.
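As a minimal sketch of how ratings on the four-point scale above could be tallied per use case: the function name and the sample ratings below are hypothetical and are not part of any MOLTO or TAUS tooling.

```python
from collections import Counter
from statistics import mean

# The four-point information-transfer scale listed above.
TAUS_LABELS = {4: "Complete", 3: "Useful", 2: "Marginal", 1: "Poor"}

def summarise(scores):
    """Return the mean score, the label distribution, and the
    share of segments rated 'Useful' or better."""
    if not scores:
        raise ValueError("no scores given")
    if any(s not in TAUS_LABELS for s in scores):
        raise ValueError("scores must be integers from 1 to 4")
    counts = Counter(TAUS_LABELS[s] for s in scores)
    useful_or_better = sum(1 for s in scores if s >= 3) / len(scores)
    return {
        "mean": mean(scores),
        "counts": dict(counts),
        "useful_or_better": useful_or_better,
    }

# Hypothetical evaluator ratings for ten source/target segments.
example = summarise([4, 4, 3, 4, 2, 4, 3, 4, 4, 1])
```

A per-domain target such as "complete" for mathematics could then be checked against the label distribution rather than the mean alone, since a mean hides how many segments fall to "Marginal" or "Poor".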
Dimensions not mentioned in the TAUS scoring are “grammaticality” and “naturalness” of the produced text. The grammar-based method of MOLTO will by definition guarantee grammaticality; failures in this will be fixed by fixing the grammars. Some naturalness will be achieved in the sense of “idiomaticity”: the compile-time transfer technique presented in Section 1.2.3 will guarantee that forms of expression which are idiomatic for the domain are followed. The higher levels of text fluency reachable by Natural Language Generation techniques such as aggregation and referring expression selection have been studied in some earlier GF projects, such as (Burke and Johannisson 2005). Some of these techniques will be applied in the mathematics and cultural heritage case studies, but the main focus is just on rendering information correctly. On all these measures, we expect to achieve significant improvements in comparison to the available translation tools, when dealing with in-grammar input.
Applying BLEU and similar methods, which compare MT output to human model translations, promises to be laborious in the case of MOLTO because we have a large number of less common target languages and lack corpora related to the use cases. Although we do not yet know exactly which corpora we will have access to, they are unlikely to provide a wealth of human model translations (preferably several in parallel) for comparison in our special domains:
• We expect the mathematics WP to involve a small number (tens or hundreds) of short (one-paragraph) examples.
• The museum corpus (at least so far) is not much larger (25K words in all). The largest subset is Swedish only.
• We do not know yet what to expect from the patent partner.
The main difficulty for automatic comparison measures is ambiguity in natural languages: usually, there is more than one correct translation for a source sentence; there are ambiguities in the choice of synonyms as well as in the order of the words. Allowance for free variation through synonymy and paraphrase (free translation in general) is made by supplying more comparison text. For instance, the NIST evaluation campaign uses four parallel translations (into the same language) of texts on the order of 15-20K words.
What is more to the point, BLEU results are not likely to prove MOLTO's strengths, because they are not sensitive to fidelity, being in this respect like the n-gram SMT methods they simplify. Preliminary tests to this effect have been conducted by Maarit Koponen (links).
BLEU and similar tests have been developed in the context of SMT and for the assimilation (gisting) scenario. Most of the weight in BLEU or WER like measures comes from matched words and shorter n-grams. These measures point in the right direction as long as translation quality is low (as long as long distance dependencies and fidelity do not matter).
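The point about matched words and short n-grams can be illustrated with a small sketch. It computes only the clipped ("modified") n-gram precision that BLEU builds on, against a single reference, and leaves out the brevity penalty and the geometric mean; the example sentences are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "the man bit the dog"
candidate = "the dog bit the man"   # grammatical roles reversed

p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
```

Here the candidate reverses who bit whom, yet every word matches and three of its four bigrams match, so a score dominated by short n-grams stays high although fidelity is lost.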
The distinction between fluency and fidelity in human evaluation measures is not made for automatic evaluation measures. Each such measure is considered to judge the overall quality of a candidate sentence or system, rather than the quality with respect to certain aspects. Leusch (link) shows that some measures have preferences for certain aspects: the unigram PER correlates with adequacy to a higher degree than the bigram PER, whereas the reverse holds for fluency. This observation remains to be exploited.
To evaluate fidelity as well as fluency, more grammar-sensitive measures are needed. In smaller use cases, human evaluation is likely to be the cost-effective solution (link). An innovative approach suggested by work in Koponen (to appear) would be to develop the MOLTO evaluation methodology using MOLTO's own technology. The idea is to use simplified (MOLTO or other) parsing grammars to test fidelity and domain ontologies to test fluency.
Fidelity (preservation of grammatical relations) would be gauged by using simplified grammars to parse summaries of text and comparing MOLTO translations of summaries with summaries of translations. The assumption is (as it implicitly is in BLEU) that the translator is more reliable with shorter bits (and there are more of them).
Acceptability of lexical variation in the target text would be checked (not against parallel human translations but) against multilingual domain ontologies (e.g., use vessel or boat instead of ship).
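A minimal sketch of such a check, assuming the domain ontology can be flattened into a table from concepts to acceptable labels. The concept names and label sets below are invented for illustration; a real check would query the multilingual ontology itself.

```python
# Hypothetical mini "ontology": concept -> acceptable English labels.
DOMAIN_LABELS = {
    "Watercraft": {"ship", "vessel", "boat"},
    "Painting": {"painting", "picture"},
}

def concepts_of(word):
    """Concepts for which the word is an acceptable label."""
    return {c for c, labels in DOMAIN_LABELS.items() if word.lower() in labels}

def acceptable_variant(expected, produced):
    """True if both words label at least one common concept."""
    return bool(concepts_of(expected) & concepts_of(produced))

ok = acceptable_variant("ship", "vessel")       # same concept
not_ok = acceptable_variant("ship", "picture")  # different concepts
```

Words unknown to the table are simply rejected here; in practice they would more likely be flagged for human review.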
Note the analogy here to BLEU's use of n-grams as a simplification of SMT methods to compare SMT to human targets. Work developing these ideas is in progress in a PhD project associated with MOLTO (Koponen to appear). The planned GF/SMT hybrid system is interesting here. It suggests analogous ideas for hybridizing statistical and grammar-based evaluation measures.
At the evaluation phase towards the end of MOLTO, a comparison of (say) the patent case output to competing methods using generic tools like the SMT evaluation tools and TAUS criteria is worth doing, and has been promised in the DoW. On the way there, however, we prefer developing and applying MOLTO specific evaluation methods.
UHEL needs to synchronise evaluation plans with the SMT workpackage.
3b. Evaluating the translation process
WP9 aims to set requirements and evaluate the MOLTO translation workflow from the beginning. We argue below that evaluating the translation workflow and translator productivity is particularly important in MOLTO. For related work in other projects, see https://kitwiki.csc.fi/twiki/bin/view/MOLTO/EvaluationCookbook. Our initial proposals follow below.
The MOLTO pre-editing strategy lets an author or technical editor modify the text, the translator enrich the vocabulary, and the grammarians perfect the grammar until the translation result is acceptable. Therefore the success criterion for the MOLTO approach must be how much effort it takes to get a translation from initial state to a break-even point (as defined by the use case). A translation can always be made better with more work on the tool, but the crux is when the result pays the effort. The DoW sets these quantitative expectations on source editing:
1. source authoring: the MOLTO tool for writing translatable controlled text can be learned in less than one hour, the speed of writing translatable controlled text is in the same order of magnitude as writing unlimited plain text
“Of the same order” mathematically means that writing with MOLTO is not ten times slower than writing without it. We should clock this.
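Clocking this comes down to a simple ratio. The sketch below assumes we time the same author writing comparable text with and without the MOLTO tool; the function names and timings are hypothetical.

```python
def slowdown(controlled_minutes, plain_minutes):
    """Ratio of controlled-text authoring time to plain-text authoring time."""
    if controlled_minutes <= 0 or plain_minutes <= 0:
        raise ValueError("durations must be positive")
    return controlled_minutes / plain_minutes

def same_order_of_magnitude(controlled_minutes, plain_minutes):
    """True if controlled authoring is less than ten times slower."""
    return slowdown(controlled_minutes, plain_minutes) < 10

# Hypothetical timings for one test text, in minutes.
ratio = slowdown(45, 10)
within = same_order_of_magnitude(45, 10)
too_slow = same_order_of_magnitude(120, 10)
```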
We pick up this discussion again under WP2 in connection with measuring the vocabulary and grammar extension effort.
The description of this case study in the DoW and on the MOLTO website makes it apparent that the math use case demonstrator is not so much a translation editor as a natural language front end to computer algebra.
Leader: jordi.saludes
Timeline: July, 2010 - May, 2012
The ultimate goal of this package is to have a multilingual dialog system able to help the math student in solving word problems.
The UPC team, being a main actor in the past development of GF mathematical grammars and having ample experience in mathematics teaching, will be in charge of the tasks in this work package with help from UGot and UHEL on technical aspects of GF and translator’s tools, along with Ontotext on ontology representation and handling. We will start by compiling examples of word problems. In parallel, we will take the mathematical multilingual GF library which was developed in the framework of the WebALT project and organize the existing code into modules, remove redundancies and format them in a way acceptable for enhancement by way of the grammar developer’s and translator’s tools of work packages 2 and 3 (D6.1). The next step will be writing a GF grammar for commanding a generic computer algebra system (CAS) by natural language imperative sentences and integrating it into a component (D6.2) to transform the commands issued to the CAS (maybe as a browser plugin). For the final deliverable (D6.3), we will use the outcome of work package 4 to add small ontologies describing the word problem. We will end with a multilingual system able to engage the student in a dialog about the progress being made in solving the problem. It will also help in performing the necessary computations.
The impression is confirmed by an email from Jordi Saludes:
"The simplest implementation will be a terminal-based question/answer
system like ELIZA, but focused on solving word problems. It will start by
giving the statement of the problem, then it will do computations for the
student/user, list unknowns, list relations between unknowns, state the
progress of the resolution and, maybe, give hints. 
We are thinking about the kind of word problems which require solving a system
of (typically two) linear equations. In Spain these are addressed to first or
second year high school students." 
On the way to the demonstrator, the plan is to devise small ontologies describing math word problems and verbalise them using the MOLTO platform and the WebALT project's math GF grammars. These phases of the work can be evaluated along the lines indicated under WP2-3. Since the corpus is small, manual quality evaluation using TAUS criteria is appropriate. We need to buy the TAUS criteria if we are not getting them from the patent partner.
| ID | Task leader | Status | New comments |
| 6.0 | | Hold | |
| 6.1 | | Planned | |
| 6.2 | | Planned | |
| 6.3 | | Ongoing | |
| 6.4 | | Planned | |
| 6.5 | | Planned | |
| 6.6 | | Planned | |
| 6.7 | | Planned | |
| 6.8 | | Planned | |
| ID | Due date | Dissemination level | Nature | Publication |
| D6.1 | 1 June, 2011 | Public | Prototype | |
The description of this use case is on hold pending a new partner. There is another EU project about translating patents, PLuTO. One way to assess MOLTO could be to compare our results with theirs.
PLuTO will develop a rapid solution for patent search and translation by integrating a number of existing components and adapting them to the relevant domains and languages. CNGL bring to the target platform a state-of-the-art translation engine, MaTrEx, which exploits hybrid statistical, example-based and hierarchical techniques and has demonstrated high quality translation performance in a number of recent evaluation campaigns. ESTeam contributes a comprehensive translation software environment to the project, including server-based, multi-layered, multi-domain translation memory technology. Information retrieval expertise is provided by the IRF which also provides access to its data on patent search use-cases and a large scale, multilingual patent repository. PLuTO will also exploit the use-case holistic machine translation expertise of Cross Language, who have significant experience in the evaluation of machine translation, while WON will be directly involved in all phases of development, providing valuable user feedback. The consortium also intends to collaborate closely with the European Patent Office in order to profit from their experience in this area.
| WP No | Leader | Start | End | WP Title |
| 8 | UGOT | M13 | M30 | Case Study: Cultural Heritage |
The objective is to build an ontology-based multilingual grammar for museum information starting from a CRM ontology for artefacts at Gothenburg City Museum, using tools from WP4 and WP2. The grammar will enable describing museum objects and answering queries over them, covering 15 languages for baseline functionality and 5 languages with more complete coverage. We will moreover build a prototype of a cross-language retrieval and representation system to be tested with objects in the museum, and automatically generate Wikipedia articles for museum artefacts in the 5 languages with extensive coverage.
The work is started by a study of the existing categorizations and metadata schemas adopted by the museum, as well as a corpus of texts in the current documentation which describe these objects (D8.1, UGOT and Ontotext). We will transform the CRM model into an ontology aligning it with the upper-level one in the base knowledge set (WP4) and modeling the museum object metadata as a domain specific knowledge base. Through the interoperability engine from WP4 and the IDE from WP2, we will semi-automatically create the translation grammar and further extend it (D8.2, UGOT, UHEL, UPC, Ontotext). The final result will be an online system enabling museum (virtual) visitors to use their language of preference to search for artefacts through semantic (structured) and natural language queries and examine information about them. We will also automatically generate a set of articles in the Wikipedia format describing museum artefacts in the 5 languages with extensive grammar coverage (D8.3, UGOT, Ontotext).
| Del. no | Del. title | Nature | Date | 
| D 8.1 | Ontology and corpus study of the cultural heritage domain | O | M18 | 
| D 8.2 | Multilingual grammar for museum object descriptions | P | M24 | 
| D 8.3 | Translation and retrieval system for museum object descriptions | P,Main | M30 | 
The CIDOC Conceptual Reference Model (CRM) is a high-level ontology that enables information integration for cultural heritage data and their correlation with library and archive information. The CIDOC CRM is now in the process of becoming an ISO standard.
The CIDOC CRM analyses the common conceptualizations behind data and metadata structures to support data transformation, mediation and merging. It is property-centric, in contrast to terminological systems. It is now in a very stable form, and contains 80 classes and 130 properties, both arranged in multiple isA hierarchies.
The Semantic Computing Research Group (SeCo, Eero Hyvönen) maintains MAO, an ontology for the museum domain used for describing content such as museum items. MAO is ontologically mapped to the Finnish General Upper Ontology YSO and was created as part of the FinnONTO project. The most important application of MAO is Kulttuurisampo, the Semantic Portal for Finnish Culture. SeCo specialises in indexing websites with ontologies. They are currently translating their ontologies into Finnish and Swedish.
To be completed...
The deliverables promised from WP2:
| ID | Due date | Dissemination level | Nature | Publication |
| D2.1 | 1 March, 2011 | Public | Prototype | |
| D2.2 | 1 September, 2011 | Public | Prototype | |
| D2.3 | 1 March, 2012 | Public | Regular Publication | |
[this comes from the MOLTO website:]
The objective is to develop a tool for building domain-specific grammar-based multilingual translators. This tool will be accessible to users who have expertise in the domain of translation but only limited knowledge of the GF formalism or linguistics. The tool will integrate ontologies with GF grammars to help in building an abstract syntax. For the concrete syntax, the tool will enable simultaneous work on an unlimited number of languages and the addition of new languages to a system. It will also provide linguistic resources for at least 15 languages, among which at least 12 are official languages of the EU.
The top-level user tool is an IDE (Integrated Development Environment) for the GF grammar compiler. This IDE provides a test bench and a project management system. It is built on top of three more general techniques: the GF Grammar Compiler API (Application Programmer’s Interface), the GF-Ontology mapping (from WP4), and the GF Resource Grammar Library. The API is a set of functions used for compiling grammars from scratch and also for extending grammars on the fly. The Library is a set of wide-coverage grammars, which is maintained by an open source project outside MOLTO but will be via MOLTO efforts made accessible for programmers on lower levels of linguistic expertise. Thus we rely on the available GF resource grammar library and its documentation, available through digitalgrammars.com/gf/lib. The API is also used in WP3, as a tool for limited grammar extension, mostly with lexical information but also for example-based grammar writing. UGOT designs APIs and the IDE, coordinates work on grammars of individual languages, and compiles the documentation. UHEL contributes to terminology management and work on individual languages. UPC contributes to work on individual languages. Ontotext works on the Ontology-Grammar interface and contributes to the ontology-related part of the IDE.
Here we try to make a bit clearer what the functionalities of the WP2 tools are, and how they relate to the translator's tool.
We surmise that the grammar compiler's IDE is meant primarily for grammarian/engineer roles, i.e. for extending the system to new domains and languages. But it may contain facilities or components which are also relevant for the translation tool. In many scenarios, we must allow the translator to extend the system, i.e. switch to some of the last four roles. Just how the translation tool is linked to the grammar IDE needs specifying.
What the average user can do to fix the translation depends on how user friendly we can get. Minimally, a translator only supplies a missing translation on the fly, and all necessary adaptation is handled by the system. Maximally, an ontology or grammar needs extending as a separate chore by hand, using the grammar IDE.
An author/editor/translator can be expected to translate with the given lingware. The next level of involvement is extending the translation. This may cause entries or rules to be added to a text, company, or domain specific ontology/lexicon/grammar. If the tool is used in an organization, roles may be distributed to different people and questions of division of labor and quality control (as addressed in TF) already arise.
For it is not only, or even primarily, a question of being able to change the grammar technically, but of managing the changes. A change in the source may cause improvement in some languages and deterioration in others. The author cannot possibly check the repercussions in all languages. Assume each user site makes its own local changes. How many different versions of MOLTO lingware will there be? One for each website maintained with MOLTO? How can sites share problems and solutions? A picture of a MOLTO community not unlike the one envisaged for multilingual ontology management in TF starts to form. The challenge is analogous to ontology evolution. There are hundreds of small university ontologies in Swoogle. Quality can be created in the crowd, but there must be an organisation for it (cf. Wikipedia).
The MOLTO workflow and role play must be spelled out in the grammar tool manual (D 2.3) and the MOLTO translation tools / workflow manual (D 3.3). We should start writing these manuals now, to fix and share our ideas about the user interfaces.
Disambiguation currently works as follows: translating a vague source into a finer-grained target generates the alternative translations, each with disambiguating metatext to help the user choose the intended meaning. (Try “I love you” in http://www.grammaticalframework.org/demos/phrasebook/, and compare Boitet et al.'s 1993 dialogue-based MT system Lidia, e.g. http://www.springerlink.com/content/kn8029t181090028/.)
This facility could link to the ontology as a source of disambiguating metatext, taken either from meta comments or verbalised directly from the ontology.
Some of the GF 3.2 features, such as parse ranking and example-based grammar generation, are enabling technologies with consequences for front-end design.
[11 Productivity and usability]
Our case studies should show that it is possible to build a completely functional high-quality translation system for a new application in a matter of months—for small domains in just days.
The effort to create a system dynamically applicable to an unlimited number of documents will be essentially the same as the effort it currently takes to manually translate a set of static documents.
The expertise needed for producing a translation system will be low, essentially amounting to the skills of an average programmer who has practical knowledge of the targeted language and of the idiomatic vocabulary and syntax of the domain of translation.
1. localization of systems: the MOLTO tool for adding a language to a system can be learned in less than one day, and the speed of its use is in the same order of magnitude as translating an example text where all the domain concepts occur
The role requirements for extending the system remain quite high, not because of the requirements on the individual skills, but because it is less common to find their combination in one person.
The user requirements entail an important evaluation criterion: the guidance provided by MOLTO. It should also lead to system requirements, like online help, examples, profiling capabilities.
One part of MOLTO adaptivity is meant to come from the grammar IDE. Another part should come from ontologies. While the former helps extending GF “internally”, the latter should allow bringing in semantics and vocabulary from OWL ontologies. We discuss these two parts in this order.
[8 Grammar engineering for new languages in DoW]
In the MOLTO project, grammar engineering in GF will be further improved in two ways:
• An IDE (Integrated Development Environment), helping programmers to use the RGL and manage large projects.
• Example-Based Grammar Writing, making it possible to bootstrap a grammar from a set of example translations.
The former tool is a standard component of any library-based software engineering methodology. The latter technique uses the large-coverage RGL for parsing translation examples, which leads to translation rule suggestions.
The task of building a new language resource from scratch is currently described in http://grammaticalframework.org/doc/gf-lrec-2010.pdf. As this is largely a one-shot language engineering task outside of MOLTO (MOLTO was supposed to have its basic lingware done ahead of time), it should not call for evaluation here.
Building a multilingual application for a given abstract domain grammar by way of applying and extending concrete resource grammars can use a lighter process. The proposed example-based grammar writing process is described in the Phrasebook deliverable (http://www.molto-project.eu/node/1040). The tentative conclusions were:
• The grammarian need not be a native speaker of the language. For many languages, the grammarian need not even know the language, native informants are enough. However, evaluation by native speakers is necessary.
• Correct and idiomatic translations are possible.
• A typical development time was 2-3 person working days per language.
• Google translate helps in bootstrapping grammars, but must be checked. In particular, we found it unreliable for morphologically rich languages.
• Resource grammars should give some more support e.g. higher-level access to constructions like negative expressions and large-scale morphological lexica.
Effort and Cost
Based on this case study, we roughly estimated the effort used in constructing the necessary sources for each new language and compiled the following summarizing chart.
| Language | Language skills | GF skills | Informed development | Informed testing | Impact of external tools | RGL Changes | Overall effort | 
| Bulgarian | ### | ### | - | - | ? | # | ## | 
| Catalan | ### | ### | - | - | ? | # | # | 
| Danish | - | ### | + | + | ## | # | ## | 
| Dutch | - | ### | + | + | ## | # | ## | 
| English | ## | ### | - | + | - | - | # | 
| Finnish | ### | ### | - | - | ? | # | ## | 
| French | ## | ### | - | + | ? | # | # | 
| German | # | ### | + | + | ## | ## | ### | 
| Italian | ### | # | - | - | ? | ## | ## | 
| Norwegian | # | ### | + | - | ## | # | ## | 
| Polish | ### | ### | + | + | # | # | ## | 
| Romanian | ### | ### | - | - | # | ### | ### | 
| Spanish | ## | # | - | - | ? | - | ## | 
| Swedish | ## | ### | - | + | ? | - | ## | 
The Phrasebook deliverable is one simple example of what can be done to evaluate the grammar workpackage's promises. The results from the Phrasebook experiment may be positively biased because the test subjects were very well qualified. But this and similar tests can be repeated with more “ordinary people”, and changes in the figures can be followed as the grammar IDE develops.
It could be instructive to repeat the exact same test with different subjects and compare the solutions, to see how much creativity was involved in them. The less variation there is, the better the chances of automating the process. Even failing that, analysis of the variant solutions could help suggest guidelines and best practices for the manual. Possible variation here also raises the issue of managing changes in a community of users.
Ontotext contributions to MOLTO through WP4 are
• Semantic infrastructure
• Ontology-grammar interoperability
The semantic infrastructure in MOLTO will also act as a central multi-paradigm index for (i) conceptual models—upper-level and domain ontologies; (ii) knowledge bases; (iii) content and metadata as needed by the use cases (mathematical problems, patents, museum artefact descriptions); and provide NL-based and semantic (structured) retrieval on top of all modalities of the data modelled.
In addition to the traditional triple model for describing individual facts,
<subject, predicate, object>
the semantic infrastructure will build on quintuple-based facts,
<subject, predicate, object, named graph, triple set>
The infrastructure will include an inference engine (TRREE), a semantic database (OWLIM), a semantic data integration framework (ORDI) and a multi-paradigm semantic retrieval engine, all of which are previous work resulting from private (Ontotext) and public (TAO, TripCom) funding. This approach will enable MOLTO’s baseline and use case driven knowledge modelling with the necessary expressivity of metadata-about-metadata descriptions for provenance of the diverse sources of structured knowledge (upper-level, domain-specific and grammar-derived ontologies; thesauri; domain knowledge bases; content and its metadata).
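To make the quintuple model concrete, here is a minimal sketch of how the two extra fields could carry provenance. It is an illustration only and does not use the ORDI or OWLIM APIs; the identifiers, graph names and triple-set names are invented.

```python
from typing import NamedTuple

class Fact(NamedTuple):
    """A quintuple: an ordinary triple plus two provenance fields."""
    subject: str
    predicate: str
    object: str
    named_graph: str   # which source the statement came from
    triple_set: str    # which group of sources it belongs to

# Hypothetical facts from two different sources.
facts = [
    Fact("ex:integer", "rdf:type", "owl:Class", "graph:math-ontology", "set:ontologies"),
    Fact("ex:div", "rdfs:domain", "ex:integer", "graph:math-ontology", "set:ontologies"),
    Fact("ex:vase42", "rdf:type", "ex:Vase", "graph:museum-kb", "set:knowledge-bases"),
]

def triples_from(facts, named_graph):
    """Plain triples that originate from one named graph."""
    return [
        (f.subject, f.predicate, f.object)
        for f in facts
        if f.named_graph == named_graph
    ]

museum_triples = triples_from(facts, "graph:museum-kb")
```

Keeping the source with every statement is what lets a query be restricted to, say, the museum knowledge base without losing the ability to reason over everything at once.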
From Ontotext webpages, we can guess that the infrastructure builds on the following technologies:
• KIM is a platform for semantic annotation, search, and analysis
• OWLIM is the most scalable RDF database with OWL inference
• PROTON is a top ontology developed by Ontotext.
Milestone MS2 says the knowledge representation infrastructure is opened for retrieval access to partners at M6. The infrastructure deliverable D4.1 is due at M8.
[7 Grammar-ontology interoperability for translation and retrieval in DoW]
At the time of the TALK project, an emerging topic was the derivation of dialogue system grammars from OWL ontologies. A prototype tool for extracting GF abstract syntax modules from OWL ontologies was thereby built by Peter Ljunglöf at UGOT. This tool was implemented as a plug-in to the Protégé system for building OWL ontologies and intended to help programmers with OWL background to build GF grammars. Even though this tool remained as a prototype within the TALK project, it can be seen as a proof of concept for the more mature tools to be built in the MOLTO project.
A direct way to map between ontologies and GF abstract grammars is a mapping between OWL and GF syntaxes.
In slightly simplified terms, the OWL-to-GF mapping translates OWL’s classes to GF’s categories and OWL’s properties to GF’s functions that return propositions. As a running example in this and the next section, we will use the class of integers and the two-place property of being divisible (“x is divisible by y”). The correspondences are as follows:
Class(pp:integer ...)  <==>  cat integer ;
ObjectProperty(pp:div domain(pp:integer) range(pp:integer))  <==>  fun div : integer -> integer -> prop ;
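The syntax-directed correspondence above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Protégé plug-in or any MOLTO tool: the `ObjectProperty` record, the function `owl_to_gf_abstract` and the module name `Arith` are hypothetical, and the OWL input is assumed to be already parsed into class names and property records.

```python
from dataclasses import dataclass


@dataclass
class ObjectProperty:
    """Hypothetical record for an already parsed OWL object property."""
    name: str    # property name without prefix, e.g. "div"
    domain: str  # class of the first argument
    range: str   # class of the second argument


def owl_to_gf_abstract(module, classes, properties):
    """Render OWL classes as GF categories and properties as
    two-place functions that return propositions."""
    lines = [f"abstract {module} = {{"]
    lines.append("  cat prop ;")  # category of propositions
    for cls in classes:
        lines.append(f"  cat {cls} ;")
    for p in properties:
        lines.append(f"  fun {p.name} : {p.domain} -> {p.range} -> prop ;")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # The running example: integers and divisibility.
    div = ObjectProperty("div", "integer", "integer")
    print(owl_to_gf_abstract("Arith", ["integer"], [div]))
```

The sketch covers only the two constructs named in the text (classes and object properties with a domain and a range); anything else in an ontology would need its own rule.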
Less syntax-directed mappings may be more useful, depending on what information is relevant to pass between the two formalisms. The mapping is then also less generic, as it depends on the intended use and interpretation of the ontology. The mapping through SPARQL queries below is one example. A mapping over TF could be another one.
The GF-Protégé plug-in brings us to the development cost problem of translation systems. We have noticed that in the GF setting, building a multilingual translation system is equivalent to building a multilingual GF grammar, which in turn consists of two kinds of components:
• a language-independent abstract syntax, giving the semantic model via which translation is performed;
• for each language, a concrete syntax mapping abstract syntax trees to strings in that language.
In MOLTO, GF abstract syntax can also be derived from sources other than OWL (e.g. from OpenMath4 in the mathematical case study) or even written from scratch and then possibly translated into OWL ontologies, if the inference capabilities of OWL reasoning engines are desired. The CRM ontology (Conceptual Reference Model5) used in the museum case study is already available in OWL.
MOLTO’s ontology-grammar interoperability engine will thus help in the construction of the abstract syntax by automatically or semi-automatically deriving it from an existing ontology. The mechanical translation between GF trees and OWL representations then forms the basis of using GF for translation in the Semantic Web context, where huge data sets become available in RDF and OWL in initiatives like Linked Open Data (LOD).
The interoperability between GF and ontologies will also provide humans with natural ways of interaction with knowledge based systems in multiple languages, expressing their need for information in NL and receiving the matching knowledge expressed in NL as well:
Human -> NL -> GF -> ontology -> GF -> NL -> Human
providing an entirely new dimension to the usability of semantics-based retrieval systems, and opening extensive structured bodies of knowledge in human understandable ways.
Note that the OWL-to-GF mapping also allows a wider human input to GF. OWL ontologies are written by humans (at present at least, by many more humans than GF grammars).
The MOLTO website details what is going to be delivered first by way of ontology-GF interoperability. The first round uses a GF grammar to translate NL questions into the SPARQL query language (http://www.molto-project.eu/node/987).
The ontology-GF mapping here is an NL interface to PROTON ontologies, by way of parsing (fixed) NL to (fixed) GF trees and transforming the trees into SPARQL queries to run on the ontology DB.
Indirectly, this does define a mapping between (certain) GF trees and RDF models, using SPARQL in the middle. SPARQL is not RDF, but a SPARQL query does retrieve an RDF model from a dataset, and that model depends on the dataset. With an OWL reasoner thrown in, we can get OWL query results.
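The tree-to-query step can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: a GF tree is reduced to a flat `(property, subject, object)` triple, arguments starting with `?` are treated as query variables, and the prefix IRI is a placeholder. It is not the Ontotext query grammar.

```python
def tree_to_sparql(tree, prefixes):
    """Turn a simplified (property, subject, object) tree into a SPARQL query.

    Arguments starting with '?' are query variables. With at least one
    variable the result is a SELECT query, otherwise an ASK query.
    """
    prop, subj, obj = tree
    # Keep each variable once, in order of appearance.
    variables = list(dict.fromkeys(t for t in (subj, obj) if t.startswith("?")))
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    form = "SELECT " + " ".join(variables) if variables else "ASK"
    lines.append(f"{form} WHERE {{ {subj} {prop} {obj} . }}")
    return "\n".join(lines)


if __name__ == "__main__":
    # "Which x is divisible by six?" with a placeholder namespace.
    prefixes = {"pp": "http://example.org/pp#"}
    print(tree_to_sparql(("pp:div", "?x", "pp:six"), prefixes))
```

Because the query is built from a fixed tree shape, each new question form needs its own rule, which is the reuse limitation discussed for this kind of mapping.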
What WP3 had in mind is a tool to translate between OWL models and GF grammars, i.e. to convert OWL ontology content into GF abstract syntax. According to the MOLTO presentation slides (http://www.molto-project.eu/node/1008), this tool is to come next.
This was confirmed by email from Petar (https://kitwiki.csc.fi/twiki/bin/view/MOLTO/MoltoOntologyEvaluationPlanWP4).
The translation tools work package (WP3) will consider using the TermFactory (TF) multilingual ontology model and tools as middleware between (non-linguistic) ontology and GF grammar. The idea is to (semi)automatically match or bridge third party ontologies to TF, a platform for collaborative development of ontology-based multilingual terminology. It then remains to define an automatic conversion between TF and GF.
The Varna meeting should adjudicate between WP3 and WP4 here.
A concrete subtask that arises here is to define an interface between the knowledge representation infrastructure (due Nov 2010) and TF (finished in ContentFactory project end of 2010).
Since the aims relate more to use cases and framework development than to enhancing the performance of existing technologies, the evaluation done during the project will be qualitative rather than quantitative.
The evaluation of these features should reflect and demonstrate the multiple possibilities of GF that are gained through inter-operation with external ontologies. The evaluation of progress will exploit proof-of-concept demos and plans for further development. For further discussion, see https://kitwiki.csc.fi/twiki/bin/view/MOLTO/MoltoOntologyEvaluationPlanD91
[From DoW]
The goal is to develop translation methods that complement the grammar-based methods of WP3 to extend their coverage and quality in unconstrained text translation. The focus will be placed on techniques for combining GF-based and statistical machine translation. The WP7 case study on translating Patents text is the natural scenario to test the techniques developed in this package. Existing corpora for the WP7 will be used to adapt SMT and grammar-based systems to the Patents domain. This research will be conducted on a variety of languages of the project (at least three).
| Del. no | Del. title | Nature | Date | 
|---|---|---|---|
| D 5.1 | Description of the final collection of corpora | RP | M18 | 
| D 5.2 | Description and evaluation of the combination prototypes | RP | M24 | 
| D 5.3 | WP5 final report: statistical and robust MT | RP,Main | M30 | 
[10 Robust and statistical translation methods in DoW]
The concrete objectives in this proposal around robust and statistical MT are:
Most of the objectives depend on the Patents corpus. Even the languages of study depend on the data that the new partner provides. To compensate for the resulting delay in WP5, and mainly in WP7, we have started working on hybrid approaches here. The methodology now is to develop hybrid methods independently of the domain and data sets used, so that they can later be adapted to patents.
Bilingual corpora are needed to create the necessary resources for training/adapting statistical MT systems and to extend the grammar-based paradigm with statistical information (1 and 2). We will compile and annotate general-purpose large bilingual and monolingual corpora for training basic SMT systems. This compilation will rely on publicly available corpora and resources for MT (e.g., the multilingual corpus with transcriptions of European Parliament Sessions).
Domain specific corpora will be needed to adapt the general purpose SMT system to the concrete domain of application in this project (Patents case study). These corpora will come from the compilation to be made in WP7, led by Mxw.
We already have the European Parliament corpus compiled and annotated for English and Spanish. Languages will probably finally be English, German, and Spanish or French, so as soon as this is confirmed the final general-purpose corpus can be easily compiled. The depth of the annotation will depend on the concrete languages and the available linguistic processors.
Combination of grammar-based and statistical paradigms is a novel and active research line in MT. (...) We plan to explore several instantiations of the fallback approach. From simple to complex:
• Independent combination: in this case, the combination is set as a cascade of independent processors. When Grammar-based MT does not produce a complete translation, the SMT system is used to translate the input sentence. This external combination will be set as the baseline for the rest of combination schemes.
• Construction of a hybrid system based on both paradigms. In this case, a more ambitious approach will be followed, which consists of constructing a truly hybrid system which incorporates an inference procedure able to deal with multiple proposed fragment translations, coming from grammar-based and SMT systems. Again we envision several variants:
    • Fix translation phrases produced by the partial GF analyses in the SMT search. In this variant we assume that the partial translations given by GF are correct, so we can fix them and let SMT fill the remaining gaps and do the appropriate reordering. This hard combination is easy to apply but not very flexible.
    • Use translation phrase pairs produced by the partial GF analyses, together with their probabilities, to form an extra feature model for the Moses decoder (probability of the target sentence given the source).
    • Use tree fragment pairs produced by the partial GF analyses, together with their probabilities, to feed a syntax based SMT model, such as the one by Carreras and Collins (2009). In this case the search process to produce the most probable translation is a probabilistic parsing scheme.
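The independent combination, i.e. the baseline cascade, is simple enough to sketch. The Python below is a minimal illustration under stated assumptions: `gf_translate` and `smt_translate` are hypothetical stand-ins for the two systems, and the grammar-based system is assumed to signal an incomplete translation by returning `None`.

```python
from typing import Callable, Optional, Tuple


def cascade_translate(
    sentence: str,
    gf_translate: Callable[[str], Optional[str]],
    smt_translate: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the grammar-based system first and fall back to SMT.

    Returns the translation and the name of the system that produced it.
    """
    result = gf_translate(sentence)
    if result is not None:
        return result, "GF"
    return smt_translate(sentence), "SMT"


# Toy stand-ins, for illustration only.
_GF_COVERAGE = {"x is divisible by y": "x est divisible par y"}


def toy_gf(sentence: str) -> Optional[str]:
    # Complete translation only for sentences inside the grammar's coverage.
    return _GF_COVERAGE.get(sentence)


def toy_smt(sentence: str) -> str:
    # A robust system always returns something; here just a marked copy.
    return f"<smt:{sentence}>"


if __name__ == "__main__":
    print(cascade_translate("x is divisible by y", toy_gf, toy_smt))
    print(cascade_translate("an uncovered sentence", toy_gf, toy_smt))
```

Recording which system produced each output makes it straightforward to report the GF coverage rate alongside the quality scores of the combined baseline.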
The previous text describes the hybrid MT systems we are considering. The baseline is clear. In fact, one can define three baselines: a raw GF system, a raw SMT system and the naïve combination of both. Regarding real hybrid systems there is much more to explore. Here we list four approaches to be pursued:
Hard integration. Force fixed GF translations within an SMT system.
Soft integration I. Led by SMT. GF partial output, as phrase pairs, is integrated as a discriminative probability feature model in a phrase-based SMT system.
Soft integration II. Led by SMT. GF partial output, as tree fragment pairs, is integrated as a discriminative probability model in a syntax-based SMT system.
Soft integration III. Led by GF. Complement with SMT options the GF translation structure and perform statistical search to find the final translation.
At the moment, we are able to obtain phrases and alignments from a GF-generated synthetic corpus. This is a first step for the hard integration of both paradigms, and also for the soft integration methods led by SMT. We are currently going deeper into the latter, as it is a domain independent study.
In the evaluation process, these families of methods will be compared to the baseline(s) introduced above according to several automatic metrics.
WP5 is going to have its own internal evaluation complementary to that of WP9. Since statistical methods need fast and frequent evaluations, most of the evaluation within the package will be automatic. For that, one needs to define the corpora and the set of automatic metrics to work with.
Statistical methods are linked to patents data. This is the quasi-open domain where the hybridization is going to be tested. The languages of the corpus are not yet fully defined, but judging from other work with patents they will probably be English, German, and French or Spanish.
Besides the large training corpus, we need at least two smaller data sets, one for development purposes and another one for testing. The order of magnitude of these sets is usually around 1,000 aligned segments or sentences. We expect to reach this size, but the final amount will depend on the available data.
BLEU (Papineni et al. 2002) is the de facto metric used in most machine translation evaluation. We plan to use it together with other lexical metrics such as WER or NIST in the development process of the statistical and hybrid systems.
Lexical metrics have the advantage of being language-independent, since most of them are based on n-gram matching. However, they are not able to catch all the aspects of a language and they have been shown not to always correlate well with human judgements. So, whenever it is possible, it is a good practice to include syntactic and/or semantic metrics as well.
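To make the notion of a lexical metric concrete, here is a minimal Python sketch of two of the ingredients mentioned above: word error rate (WER) and the clipped n-gram precision that BLEU is built from. It assumes whitespace tokenisation and a single reference, and it is not the implementation of any evaluation package; the function names are chosen for this illustration.

```python
from collections import Counter


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def ngram_precision(reference: str, hypothesis: str, n: int) -> float:
    """Clipped n-gram precision of the hypothesis against one reference."""
    ref, hyp = reference.split(), hypothesis.split()
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    print(wer(ref, "the cat sat on mat"))
    print(ngram_precision(ref, "the cat on the mat", 2))
```

Both functions only compare word sequences, which is why such metrics are language-independent and also why they miss syntactic and semantic adequacy.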
The IQmt package1 provides tools for (S)MT translation quality evaluation. For a few languages, it provides metrics to do this deep analysis. At the moment, the package supports English and Spanish, but other languages are planned to be included soon. We will use IQmt for our evaluation on the supported language pairs.
1 http://www.lsi.upc.es/~nlp/IQMT/
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D9.1A Appendix to MOLTO test criteria, methods and schedule | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | April 2012 | 
| Actual date of delivery: | April 2012 | 
| Type: | Report | 
| Status & version: | Final | 
| Author(s): | Lauri Carlson, Inari Listenmaa, Seppo Nyrkkö et al. (UHEL) | 
| Task responsible: | UHEL | 
| Other contributors: | 
During the review on March 20, 2012, an appendix was requested to better specify the methodology that MOLTO intends to adopt to carry out the evaluation of the work and results related to each workpackage. This document tries to clarify the goals and how they will be achieved in Workpackage 9.
Requirements of the addendum:
The first year review recommendation states:
The second year review recommendation adds:
The scope of applicability for MOLTO translation is a function of the domain and language coverage. The locale and grammar coverage at the start of the project was fixed by the contributed GF resource grammar library. One of the main tasks of the MOLTO project is to provide tools for extending domain coverage and the associated lexical coverage by MOLTO translation users themselves. The tools should make it feasible for user communities to extend MOLTO translation to new domains and vocabularies. The market segment that can be targeted by MOLTO tools by the end of the project is in turn a function of the availability and efficiency of these tools and thereby the potential coverage of MOLTO translation. We are aiming at making it feasible to build and use domain specific grammars with lexicons in the order of thousands of words (instead of hundreds).
The two properties, restricted coverage and predictable input, restrict the market segment to production (dissemination). The constrained language property means MOLTO will not offer a replacement for CAT, i.e. translation tools that help human translators with complex third party authored documents which they are not allowed to modify. But MOLTO translation can be added as an additional facility in the CAT toolkit. Conversely, traditional TMS facilities may add value to the application and extension of MOLTO methods. These ideas are explored in WP 3.
MOLTO remains at the core a tool for constrained language multilingual generation. Its potential strengths are 1) multiple simultaneous target languages and 2) reliable enough quality for blind translation (translation from a known language to unknown languages). Strength 2) can only be obtained if the quality is higher than human translation. In practice, some level of human revision is probably going to be needed, but the need can be significantly less than in current workflows.
From this, we conclude that the most promising market segment for MOLTO translation is constrained language content localization. In current translation industry, there is a more or less clear split between interface localization, which involves translation of fixed short strings from a list of interface messages by professional or volunteer translators, and content translation, which is mostly done outside of the website using CAT tools.
MOLTO targets an as yet less explored and little exploited niche between them, viz. multilingual content localization of constrained language content. Typical use cases are a webstore inventory, a museum guide, rule generated correspondence, or formulaic parts of a more complex document type (say descriptions of chemical formulas in a patent). Here the content is already regulated and predictable. There are further such scenarios beyond those included in MOLTO use cases, typically involving some database generated information (e.g. product descriptions, user guides, chemical manufacturer's data sheets, job tickets, medical reports). In some such scenarios, real time blind translation to multiple languages would be a major selling point.
In the MOLTO translation scenario related to this market segment, there is a close interaction between some database/ontology and a human/ruleset that generates the text to translate, and the translation process itself. The content to translate can co-evolve with the grammar by which it is translatable. Such use cases will be tested in the MOLTO semantic wiki platform.
If or as the vision of Linked Data becomes reality, there is bound to be a growing demand for natural language verbalization of the web of linked data ontologies. The Web of Data is supposed to become an additional layer of the web that is tightly interwoven with the classic document Web and has many of the same properties:
In particular,
The growing linked data cloud can create a growing market segment for a matching linked cloud of multilingual MOLTO ontology verbalizers. Ontotext's GF based natural language query interface into Ontotext linked data is a first application of MOLTO resources in this direction. As the review points out, a generalization of the ad hoc ontology/GF mappings in the KRI and museum cases gets a high priority here.
The new workpackage 11 aims to use GF to extend AceWiki to a multilingual constrained language semantic wiki. Like the original AceWiki, it allows users to express in natural language logical constraints that are subject to automated reasoning. AceWiki already has facilities for extending the lexicon. A subset of the constraints expressible in ACE are intertranslatable with the OWL ontology language. In the scenario envisaged here, the multilingual semantic wiki works as a tool for extending a special domain ontology through natural language verbalization. This platform supports the scenario where a special domain ontology and its verbalizations are extended simultaneously.
In one natural scenario, a special domain expert expresses the constraints in unconstrained natural language as comments in the wiki. One or more ontology experts refine the description into a set of simpler statements in a constrained subset that maps to OWL, using already existing ontologies as base and creating the missing ontology resources and their verbalizations in a common natural language using the lexicon editor. The domain experts can test the conceptualization by asking questions of the ontology. The questions are answered in natural language using the wiki's reasoners. When the coverage of the ontology and its verbalization in the chosen language/s is sufficient, the lexicon is extended for the remaining languages, using existing term ontologies as a base, by target language experts.
More traditional translation projects can also contain parts which can be handled with constrained language translation. The MOLTO patents case has shown that certain sections of patent text, in particular complex chemical compound descriptions, are not well covered by SMT. The MOLTO translator tools workpackage looks into ways of embedding MOLTO constrained language translation as one tool in the toolkit of a more traditional CAT platform. In this use case, we also test the ability of a translation community (company) to collaboratively extend coverage of the fragment handled with MOLTO tools. This sort of a hybrid SMT+MOLTO+CAT workflow is tested with the patents use case in the MOLTO Translators tools platform as described in D 3.1. Note that the two scenarios are not exclusive. In an overarching scenario, a domain translation is developed in the first scenario and it is applied in production translation in the MOLTO CAT scenario. Actors in some of the supporting roles of the MOLTO CAT scenario may use the wiki tool in their work.
The CAT scenario is described in more detail below under WP 3.
The following details the MOLTO use cases relating them to the scenarios above. Each section lists the evaluation criteria, measures and methods applied in the use case.
The grammar developers tools promise to enable quick development of a new domain and language. This promise is best tested directly by measuring the time and expertise taken
The measures are taken for a system with a coverage in the order of a) 100 concepts b) 1000 concepts. The platforms used in carrying out the tests include the multilingual semantic wiki (tasks 1 and 2), the TermFactory platform (tasks 1 and 4) and the grammar editing tools (tasks 2, 3, 4). To test these claims, we need to fix one or more domains to create/extend. We haven't got a great many domains to choose from yet. We would do well to extend in the direction of known 'good' ontologies.
Baseline evaluation figures prior to the use of MOLTO tools for a domain of smaller size were obtained in the phrasebook exercise reported in Ranta et al. 2010 [9]. For comparability, the same criteria and measures are to be applied in subsequent evaluations.
The MOLTO CAT scenario is designed to serve a translation community that carries out translation projects using MOLTO tools as an additional CAT tool. The translation community members are assigned different roles. What they may do depends on the role. Roles are assigned in the translation management system. In the MOLTO demonstration system, the TMS is GlobalSight. The TMS manages the resources of a project. The resources include
A MOLTO CAT translation project is composed of a collection of resources and a community of actors playing different roles in the project. One actor can play more than one role.
The roles include
The TMS manages the project workflow, that is, routes documents through different steps between the actors. The actions include
The typical envisaged workflow is this. A translator in a multilingual translation project works on a structured multipart document, some of whose parts are marked as amenable to translation with the MOLTO editor. The rest is translated with traditional CAT tools. A subsection appropriate for MOLTO translation is opened in the MOLTO translation editor. The appropriate GF grammar and terminology are specified in the project resources. If the section is properly within the fragment covered by the grammar, the section should parse and translate correctly without translator intervention. This is the default if the MOLTO marked section has been created in scenario A. However, until the domain grammar has been fully tested for blind translation in all target languages, a target language translator or revisor must check that the target text is correct.
If the grammar coverage is not complete, the translation editor shows some parts of the section marked as untranslatable.
In the easy case, the coverage problem can be fixed by a conservative paraphrase or, if the translator's brief permits pre-editing, by a more creative rewrite of the section source to bring it under the coverage of the MOLTO grammar. The original source and its paraphrase get stored in the translation memory as an instance of source rewrite, and will be available for other translators as a model solution of the coverage problem. If a rewrite is not possible, the next move depends on the workflow.
As indicated in the MOLTO CAT system design, the MOLTO translation editor is integrated as a plugin to the translation management system alongside more traditional CAT editors. The MOLTO CAT scenario sets the following requirements on the editor and its integration to the TMS.
The development of the translation editor to satisfy these requirements is taken over by UGOT, as it is closely coupled to the ongoing development of the GF robust parsing and grammar extension services.
These requirements remain the responsibility of UHEL.
The TermFactory term management specification and query/editing API is a Tomcat Axis2 webservice API for querying, editing, and storing small RDF/OWL ontologies representing concepts and multilingual expressions/terms associated with the concepts. TermFactory contains a term ontology schema that follows professional terminology standards, but the tools can also be used to edit any RDF/OWL ontologies through an XHTML representation of RDF. The XHTML representation is extremely configurable. It can be parametrized for the presentation layout (concept oriented, lemma oriented), filtered for content, and even localized with another TF term ontology so that names of properties and classes shown to the user are chosen from the localization ontology. The term ontology editor is a pluggable JavaScript editor that is offered as a standalone Tomcat servlet as well as a MediaWiki extension. A simpler tabular editor exists for the common task of adding different language equivalents to an existing ontology term.
TermFactory is to be integrated with the MOLTO KRI over the JMS transport interface provided in the KRI. Besides the Ontotext repositories, TermFactory also talks to Jena RDB and triple set repositories. TermFactory user management is planned to happen through the GlobalSight API.
The GlobalSight translation management system forms a platform to test the MOLTO TT scenario that combines traditional CAT tools with the MOLTO translation editor. The best dataset for testing the full MOLTO CAT scenario should be the patents, since it already uses hybrid methods and generates a translation of less than 100% coverage. To have a complete use case of the mixed scenario, a pure GF grammar for chemical compounds could be applied to translate chemical compound definitions in the patent text.
The MOLTO CAT review workflow will be used to manage translation quality evaluation of the multilingual translations produced in the other use cases. This exercise in itself also serves to test the usability of MOLTO scenario B.
The second year review considered Deliverable 4.2 and Deliverable 4.3 insufficient and they were not approved by the reviewers in their current status. The objectives of WP4 are, as stated in the DoW:
(i) research and development of two-way grammar-ontology interoperability bridging the gap between natural language and formal knowledge; (ii) infrastructure for knowledge modeling, semantic indexing and retrieval; (iii) modeling and alignment of structured data sources; (iv) alignment of ontologies with the grammar derived models.
D4.2 should contain a report on the Data Models, Alignment Methodology, Tools and Documentation. More specifically, it should contain information about the aligned semantic models and instance bases. While D4.2 contains information about Reason-able views and the key principles constituting these views are stated in the document, it does not state how these key principles have been implemented in the MOLTO project. D4.2 does not comply with the key principle stating “Clean up, post-process and enrich the datasets if necessary, and do this in a clearly documented and automated manner.” D4.2 should contain all details about the automation process of multiple ontologies, so that this knowledge and technique can be re-used to integrate new ontologies with the existing ones.
D4.3 should clarify the issue of the two-way interoperability between ontologies and GF grammars. This is still unclear, although objective (i) of WP4 is clear that this is a research-intensive part of MOLTO. Based on the WP4 presentation given in the review, this process requires the manual writing of mapping rules (NL Query -> GF, GF -> SPARQL query), which means limited potential for further re-use. The partners must clarify the degree of automation that can be achieved. What is required for porting this to a new application? Concrete steps should be provided making clear what can be automated and what cannot with the provided infrastructure. Details about mapping rule induction etc. should be provided.
As for the ontology/grammar mappings, here is what we have concretely got so far:
The examples show that the OWL-to-GF mapping need not be difficult in any given case. What seems open is how to generalize these examples for the general case of generating a mapping for a new domain. In particular, we want a solution that allows the reuse of ontology to GF mappings to create more complex grammars from existing parts. The modularity of both OWL and GF suggests ways of approaching this goal.
One approach to a more general solution is to use the term ontologies developed in TermFactory to also store parts of mappings needed for GF verbalization. In a TermFactory term ontology, a term is a pair of a general language expression and a special language concept. In this approach, an ontology concept would map to an abstract grammar term. Individual language expressions and terms associated with the concept map to concrete grammar terms. A term or expression would inherit GF grammar properties from classes to which it belongs (say, exp:Noun). Grammatical properties common to all uses of a given general language expression would be stored as properties of the expression. GF terms or grammatical properties that are specific to a domain GF grammar would be stored as properties of a domain specific term.
Instead of having to define a new grammar and create concept to grammar associations from scratch, a grammar would be compiled from appropriate choices of resource from the term ontology plus a language and/or domain specific syntactic base. To extend a vocabulary, we add a new term (expression, concept) instance, typed in the appropriate categories, and add to it any further GF properties that are relevant to its correct linearization. The concrete expression associated to a compositional abstract grammar term need not be specified in the ontology, if it can be compositionally derived from the GF abstract syntax associated to the concept and other resources in the ontology. The above does not claim to do more than propose a way to decompose the ontology to grammar mapping into reusable parts.
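The proposed decomposition can be illustrated with a small Python sketch. All names here are hypothetical: the `Term` record, the `CLASS_PARADIGM` table and the category `Kind` are invented for the example, and the paradigm names are treated as opaque strings (they are chosen to resemble the `mkN` and `mkA` smart paradigms of the GF resource grammar library). It shows only how abstract and concrete rules could be compiled from stored term properties, not how TermFactory stores them.

```python
from dataclasses import dataclass

# Hypothetical: the GF paradigm a term inherits from its expression class.
CLASS_PARADIGM = {"exp:Noun": "mkN", "exp:Adjective": "mkA"}


@dataclass
class Term:
    concept: str     # ontology concept, used as the abstract function name
    category: str    # GF category of the abstract function (hypothetical)
    expr_class: str  # expression class the term belongs to, e.g. "exp:Noun"
    lemmas: dict     # language code -> lemma of the associated expression


def abstract_rules(terms):
    """One abstract syntax function per ontology concept."""
    return [f"fun {t.concept} : {t.category} ;" for t in terms]


def concrete_rules(terms, lang):
    """One linearization rule per concept that has an expression in `lang`."""
    rules = []
    for t in terms:
        if lang in t.lemmas:
            paradigm = CLASS_PARADIGM[t.expr_class]
            rules.append(f'lin {t.concept} = {paradigm} "{t.lemmas[lang]}" ;')
    return rules


if __name__ == "__main__":
    terms = [Term("Painting", "Kind", "exp:Noun", {"en": "painting", "fi": "maalaus"})]
    print(abstract_rules(terms))
    print(concrete_rules(terms, "fi"))
```

Extending the vocabulary then amounts to adding one `Term` instance; languages without an expression for a concept simply get no linearization rule yet.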
If the approach seems useful, UHEL is prepared to invest effort in building a test case using the museum case as a starting point.
The research goal was to develop translation methods that complement the grammar-based methods of WP3 to extend their coverage in unconstrained text translation. Specifically, WP 5 promised to create a commercially viable prototype of a system for MT and retrieval of patents in the bio-medical and pharmaceutical domains, (ii) allowing translation of patent abstracts and claims in at least 3 languages, and (iii) exposing several cross-language retrieval paradigms on top of them.
WP5 has its own internal evaluation complementing that of WP9. Since statistical methods need fast and frequent evaluations, most of the evaluation within the package is automatic. The WP7 case study on translating Patents text is the use scenario to test the techniques developed in this package. Ultimately, Ontotext will examine the feasibility of the prototype as a part of a commercial patent retrieval system (D7.3).
Statistical methods are linked to patents data. This is the quasi-open domain where the hybridization is going to be tested. The languages of the corpus are English, German, and French, the official languages of the European Patent Office (EPO).
Besides the large training corpus, we need at least two smaller data sets, one for development purposes and another one for testing. The order of magnitude of these sets is usually around 1,000 aligned segments or sentences. For this, we have used a subset of MAREC patents (http://www.ir-facility.org/prototypes/marec), and a collection of 66 patents provided by the EPO. The concrete figures are explained in WP5 and summarised in the table below.
| Set | Seg DE-EN | Seg FR-EN | Seg FR-DE |
|---|---|---|---|
| dev MAREC | 993 | 993 | 993 |
| test MAREC | 1,008 | 1,008 | 1,008 |
| test EPO | 847 | 858 | 831 |
BLEU [3] is the de facto metric used in most machine translation evaluation. We plan to use it together with other lexical metrics such as WER or NIST in the development process of the statistical and hybrid systems. Lexical metrics have the advantage of being language-independent, since most of them are based on n-gram matching. However, they are not able to catch all the aspects of a language and they have been shown not to always correlate well with human judgments. So, whenever it is possible, it is a good practice to include syntactic and/or semantic metrics as well. The Asiya package provides tools for (S)MT translation quality evaluation. For a few languages, it provides metrics to do this deep analysis. At the moment, the package supports English and Spanish, but other languages are planed to be included soon. We will use Asiya for our evaluation on the supported language pairs.
Final translations will also be evaluated manually. This is the most reliable way to quantify the quality of a translation, since automatic metrics cannot capture all the aspects that a human evaluator takes into account, as noted in the previous section.
We now propose to follow the ranking for evaluation that is used in patent offices such as the EPO. It can be applied to sentences but also to full patents. Automatic metrics will therefore also be adapted to full-patent evaluation, so that we can see how they correlate with the human ranking and perform a deeper study.
Quality level: Ranking for human evaluation
The translation is understandable and actionable, with all critical information accurately transferred. Most of the text is well written using a language consistent with patent literature.
The translation is understandable and actionable, with most critical information accurately transferred. Some text is well written using a language consistent with patent literature.
The translation is not entirely understandable and actionable, with some critical information accurately transferred. The text is of the text is well written using a language consistent with patent literature.
Possibly understandable and actionable (given enough context and/or time to work it out), with some information stylistically or grammatically odd, but the language may still reflect a sound content to a patent professional. Most of the text written using a language consistent with patent literature.
Absolutely not comprehensible and/or little or no information is transferred accurately.
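One way to check how an automatic metric relates to this human ranking is a rank correlation. The sketch below computes Spearman's coefficient with tie handling in plain Python. It assumes, purely for illustration, that the quality levels above are coded 1 (best) to 5 (worst), so a well-behaved metric would show a strongly negative correlation with the level; the function names are hypothetical and no project data is implied.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(human_levels, metric_scores):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(human_levels), average_ranks(metric_scores))
```

The same function can be applied at sentence level or at full-patent level, as long as each item has one human level and one metric score.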
The math use case remains as it was, except that the use case may assume that premises requiring encyclopedic knowledge needed to frame word problems are given. Assuming that the math scenario will be embedded in the semantic wiki, the background premises may be given by the author of the problem in the facts database where the problems are formulated.
The mathematics use cases involve a problem author, a student and a teacher. The usability of the scenario is tested with realistic subjects playing each of these roles, and the evaluation is collected with a questionnaire and/or a journal. In addition, we should try to estimate the savings from the system when scaled up to a larger user base and a wider variety of languages, since these are the novelties in the MOLTO solution.
WP 6 has developed a treebank based method for doing regression testing on the translations produced by the math grammar. A treebank entry consists of:
A Changeset has:
A defect is a difference between the actual linearization of an entry and the sample in the last changeset.
The procedure is as follows.
See http://www.molto-project.eu/wiki/living-deliverables/d61-simple-drill-grammar-library/5-testing for further discussion.
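The defect definition above can be sketched in a few lines. This is a hedged illustration, not the WP6 implementation: the field names (`tree`, `changesets`, `author`, `samples`) are assumptions, and `linearize` is a hypothetical stand-in for whatever call produces the actual linearization of a tree in a given language.

```python
from dataclasses import dataclass, field


@dataclass
class Changeset:
    # Hypothetical fields: who approved the samples, and one approved
    # linearization per language code.
    author: str
    samples: dict


@dataclass
class TreebankEntry:
    tree: str
    changesets: list = field(default_factory=list)


def find_defects(entries, linearize):
    """Return (tree, language, expected, actual) for every mismatch.

    A defect is a difference between the actual linearization of an entry
    and the sample in its last changeset. `linearize(tree, lang)` stands in
    for a call to the grammar.
    """
    defects = []
    for entry in entries:
        if not entry.changesets:
            continue  # nothing approved yet, so nothing to compare against
        latest = entry.changesets[-1]
        for lang, expected in latest.samples.items():
            actual = linearize(entry.tree, lang)
            if actual != expected:
                defects.append((entry.tree, lang, expected, actual))
    return defects
```

Running such a check after every grammar change gives a regression test: an empty result means that all approved samples are still reproduced.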
The first year review recommended that WP7 work should focus on the major issues examined in MOLTO, especially in relation to the grammar-ontology interoperability rather than chemical compound splitting. Specific scenarios are needed for the exploitation of MOLTO tools in this case study. It was recommended to include such scenarios in a new version of deliverable D9.1.
In response, two use case scenarios were described: UC-71 and UC-72.
WP7 corresponds to the Patents Case Study. Its objective is to build a multilingual patents retrieval prototype. The prototype consists of three main modules: the multilingual retrieval system, the patents translation and the user interface. This document proposes a methodology to evaluate these modules within the MOLTO framework.
The automatic translations included in the retrieval database have been produced by the machine translation systems developed within WP5. Hence, the evaluation related to this module is the same as the one described for the WP5 systems.
The IR-facility organizes the TREC Chemical IR Evaluation campaign (http://www.ir-facility.org/trec-chem-2011-cfp). The campaign has three tracks, one of which is closely related to our objective in this WP:
- Technology Survey: given an information need (from the bio-chemistry domain) expressed in natural language, retrieve all patents and scientific articles which may satisfy this need.
Following the guidelines described in the TREC campaign, the methodology proposed to evaluate the patents retrieval system is as follows.
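For the ranked-retrieval part of such an evaluation, metrics such as average precision (AP) [1] and NDCG [2] are commonly used. The sketch below is an illustration of these two metrics only, not the project's evaluation code: the function names are hypothetical, the ranked list is assumed to contain no duplicate documents, and the NDCG variant uses the common log2 position discount and normalises against the ideal ordering of the returned gains.

```python
import math


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return total / len(relevant)


def dcg(gains):
    """Discounted cumulated gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(ranked_gains, k=None):
    """DCG of the top k results divided by the DCG of the ideal ordering."""
    k = len(ranked_gains) if k is None else k
    ideal = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal if ideal > 0 else 0.0
```

Averaging `average_precision` over all queries of a test collection gives mean average precision (MAP), see [4].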
User interfaces are usually evaluated by means of their usability. According to ISO 9241-11, usability is the "Extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use."
Hence, to get a complete picture of the usability, we need to measure user satisfaction (the users' reaction to the interface), effectiveness (can people complete their tasks?) and efficiency (how long do people take?).
The three measures of usability are effectiveness, efficiency and satisfaction. They are independent, and all three must be measured to get a rounded measure of usability.
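The three measures can be computed from simple session logs and questionnaire answers. The sketch below is an illustration under stated assumptions: effectiveness is taken as the task completion rate, efficiency as the mean time on task, and satisfaction as the score of the ten-item SUS questionnaire [6] with answers on a 1 to 5 scale. The function names are hypothetical, and the CSUQ questionnaire [5] would need a different scoring function.

```python
from statistics import mean


def effectiveness(completed):
    """Share of tasks completed, given one boolean per attempted task."""
    return sum(1 for done in completed if done) / len(completed)


def efficiency(seconds_per_task):
    """Mean time on task in seconds."""
    return mean(seconds_per_task)


def sus_score(responses):
    """Score one SUS questionnaire (ten answers, each from 1 to 5).

    Odd-numbered items contribute (answer - 1), even-numbered items
    contribute (5 - answer); the sum is scaled to the range 0..100.
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS expects ten answers on a 1-5 scale")
    total = 0
    for item, answer in enumerate(responses, start=1):
        total += (answer - 1) if item % 2 == 1 else (5 - answer)
    return total * 2.5
```

Reporting the three numbers side by side, rather than merging them into one score, keeps the independence of the measures visible.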
The experiment setting may consist of two scenarios: a closed one (i.e., specifying the information that must be obtained) and an open one (i.e., letting the user search for any type of information). The users are requested to complete both scenarios, and the order in which they are done must be balanced (i.e., half of them will do the open scenario first). They must answer the questionnaire twice, once after each scenario.
The potential users might be of two types: MOLTO participants and related people (internal users) and external users. The internal users can serve as a control group. External participants can be engaged through tools such as the Mechanical Turk Requester [8].
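Balancing the scenario order can be done mechanically. The following sketch shuffles the participants and alternates the two orders, so that the two groups differ in size by at most one; the function name, the participant identifiers and the fixed random seed are assumptions made for the example.

```python
import random


def assign_orders(participants, seed=0):
    """Map each participant to a scenario order, balanced across the group."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    orders = (("closed", "open"), ("open", "closed"))
    # Alternating over the shuffled list gives a random but balanced split.
    return {person: orders[i % 2] for i, person in enumerate(shuffled)}
```

For example, `assign_orders(["u1", "u2", "u3", "u4"])` places two users in each order.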
The museum grammar creates multilingual descriptions from a museum ontology using GF grammar for the verbalization. The GF grammar provides a direct verbalization of the triples and different types of complex discourse patterns: a text generated by the grammar has necessary elements painting, painting type and painter, and as optional information year, museum, colour, size and material. For a detailed description, see D8.2 (Ranta et al. 2012).
An abstract syntax for the direct verbalization grammar can be generated automatically from the ontology. The discourse patterns have been written by hand, and they can be reused for different language versions and for more objects. For example, the type of a complete painting is described in the abstract syntax as follows:
cat CompletePainting Painting PaintingType Painter OptYear OptMuseum OptColour OptSize OptMaterial ;
CompletePainting is a type constructor that takes type parameters to construct a type for a painting. A painting from Gothenburg City Museum has the following type:
data GSM940042ObjPainting : CompletePainting GSM940042Obj MiniaturePortrait JKFViertel (MkYear (YInt 1814)) (MkMuseum GoteborgsCityMuseum) (MkColour Grey) (MkSize (SIntInt 349 776)) (MkMaterial Wood) ;
In the concrete syntax all this complexity is hidden. Porting the grammar to a new language requires only writing the concrete syntax. However, the underlying ontology makes sure that the grammar generates only valid descriptions and not random combinations of paintings, painters and other properties.
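The split between necessary and optional elements can be mirrored in an ordinary record type. The sketch below is only an analogy to the GF types above, written in Python: the class reuses the name CompletePainting and the identifiers of the example object, but the field names and the English wording produced by `describe` are invented for illustration and are not the output of the museum grammar.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CompletePainting:
    # Necessary elements
    painting: str
    painting_type: str
    painter: str
    # Optional elements
    year: Optional[int] = None
    museum: Optional[str] = None
    colour: Optional[str] = None
    size: Optional[Tuple[int, int]] = None
    material: Optional[str] = None


def describe(p):
    """Join the mandatory clause with whichever optional facts are present."""
    parts = [f"{p.painting} is a {p.painting_type} by {p.painter}"]
    if p.year is not None:
        parts.append(f"painted in {p.year}")
    if p.museum is not None:
        parts.append(f"displayed at {p.museum}")
    if p.colour is not None:
        parts.append(f"colour {p.colour}")
    if p.size is not None:
        parts.append(f"size {p.size[0]} by {p.size[1]}")
    if p.material is not None:
        parts.append(f"material {p.material}")
    return ", ".join(parts) + "."
```

Unlike this template, a GF concrete syntax also takes care of agreement, articles and word order in each language, which is why only the concrete syntax needs to be rewritten when porting.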
As of March 2012, the translation of the museum objects and the additional lexicon (painting materials, colours) needs to be done manually. The future plan is to combine tools developed in WP3 to make the lexicon extension automatic, by using multilingual lexicon harvesting from term ontologies or other reliable sources (DBPedia, TermFactory).
D8.2 has promised to increase the coverage from 5 languages to 10 languages, and to extend the grammar and the lexicon for at least 5 languages. The GF grammar can be tested continuously during development with the treebank method described earlier in this document. A grammar developer should be fluent in the language for which she is developing the concrete syntax, and the treebank testing should be thorough. If the testing is done properly in the grammar development phase, there should be no need for separate translation quality evaluation experiments. The best way to spot problems is through real usage, so UHEL is offering a bug tracking platform where users can report all kinds of issues, including language errors.
The idea is not to translate existing texts, but to generate descriptions in response to user queries. As described in D8.2,
D8.2: The grammar presented here allows to generate well-formed multilingual natural language descriptions about museum artefacts with the aim of empowering users who wish to access cultural heritage information through different computing devices.
Another question is how to evaluate the use of the queries. Currently the grammar has one discourse pattern with optional elements; the variety comes from adding or leaving out some information. One possibility discussed in D8.2 is to include more variety in the generated text. A qualitative evaluation study with non-expert human subjects would serve this purpose. The aspects to test in this experiment would be the ease of querying and whether the results answer the query. However, as long as this plan is not certain, we are not designing any concrete test methods.
A third question is the ease of the grammar writing and the reusability of the grammar -- is it possible for other museums to use the grammar if they have their own standards? Currently a prerequisite for the museum grammar is an ontology that follows the CIDOC-CRM standard. This is an important aspect if MOLTO tools are to be used outside the test cases within the project. The step from a specified format to verbalizations is well defined; more thought should now be given to the first step of the process, the conversion of an arbitrary museum database to a CRM format. We could, as a part of the evaluation, interview some domain specialists and survey the needs and interests for this kind of system, and whether the first step is a big enough threshold to prevent them from using the system.
The main goal of the proposed work-package is to build an engine for a multilingual semantic wiki, where the involved languages are precisely defined (controlled) subsets of the 15 languages that are studied in the MOLTO project.
The wiki engine would allow the input and presentation of the wiki content in all the languages, and perform formal logic based reasoning on the content in order to enable e.g. natural language based question answering. The users of the wiki can contribute to the wiki in any of the supported languages by adding statements to the wiki, as well as extending its concept lexicon. The wiki would integrate a "predictive editor" that helps the user cope with the restricted syntax of the input languages, so that explicit learning of the syntactic restrictions is not required. Ideally, the wiki would also integrate semantics-support, e.g. a paraphraser and a consistency-checker that could be used to enhance the quality of the wiki articles. The wiki engine is going to be implemented by combining the resources and technologies developed in the MOLTO project (GF grammar library, tools for translation and smart text input) with the resources and technologies developed in the Attempto project (Attempto Controlled English, AceWiki).
The task of WP11 will be to combine the technologies developed in the MOLTO project with ACE and AceWiki, concretely:
In this document, the list of application domains to evaluate multilingual semantic wiki becomes longer, since we envisage using the multilingual wiki as a common testbed for those MOLTO use cases where an ontology and its verbalization are developed in parallel. This can include some or all of the following cases:
It is too early to describe the evaluation of this case in detail, pending a description of the use case itself. But we can suggest that the beInformed use case could be framed and tested as an instance of the multilingual semantic wiki scenario, if the business logic reasoning rules can be expressed in the semantic wiki database.
[1] AP. E. M. Voorhees and D. K. Harman, editors. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
[2] NDCG. K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422--446, 2002.
[3] BLEU. K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL-2002: 40th Annual Meeting of the Association for Computational Linguistics, pp. 311--318, 2002.
[4] IR Metrics. http://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision
[5] IBM CSUQ. http://hcibib.org/perlman/question.cgi?form=CSUQ
[6] SUS. http://www.usabilitynet.org/trump/methods/satisfaction.htm
[7] Word Cloud. Usability. http://www.userfocus.co.uk/articles/satisfaction.html
[8] Mechanical Turk Requester. https://requester.mturk.com/
[9] A. Ranta, R. Enache, and G. Détrez. Controlled Language for Everyday Use: the MOLTO Phrasebook. Controlled Natural Languages Workshop (CNL 2010). http://www.molto-project.eu/sites/default/files/everyday.pdf
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D9.2 MOLTO evaluation and assessment report | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M36 | 
| Actual date of delivery: | March 2013 | 
| Type: | Report | 
| Status & version: | Draft | 
| Author(s): | Jussi Rautio, Maarit Koponen | 
| Task responsible: | UHEL | 
| Other contributors: | UPC | 
Abstract
The impact of MOLTO is not just about individual use cases. During the 3 years of the project, we have developed methods of efficient grammar writing, dividing the task so that grammar experts and domain experts each get to do what they do best. These guidelines are documented in D2.3, Best practices.
We have conducted a grammar evaluation survey for people who have written grammars. The results of the survey and an overview of the practices are documented in Part 1.
We have also noted the time and measures for correcting grammars. Since the release of the first MOLTO demo (D10.2, tourist phrasebook), we have collected feedback and bug reports, and corrected the bugs. Part 2 describes these bugs and the effort that has been needed to fix them.
The Best practices document was published in October 2012, but many of the grammars were written before that. We first give an overview of the best practices and of whether the grammars were written accordingly.
(This summary is copied from the Best practices document.)
The following tools are standard and well-tested in MOLTO’s and other applications:
It has two modules: Sentences, which contains phrases that can be defined by a functor over the resource grammar API, and Words, which contains the phrases that are likely to have different implementations in different languages.
Semantic validity is handled with a simple, restrictive abstract syntax. For example, an abstract syntax function like
HowFarBy : Place -> ByTransport -> Question
guarantees that we can say "How far is the church by taxi" but not "How far is John by beer": the arguments need to be a place and a transport.
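The effect of such a restrictive signature can be imitated in other languages. The sketch below is a Python analogy with hypothetical class and function names: each category becomes a small class, and the constructor function refuses arguments of the wrong category. The important difference is that GF rejects ill-typed trees when the tree is type-checked, whereas this sketch only checks at run time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str


@dataclass(frozen=True)
class ByTransport:
    name: str


@dataclass(frozen=True)
class Person:
    name: str


@dataclass(frozen=True)
class Question:
    text: str


def how_far_by(place, transport):
    """Counterpart of HowFarBy : Place -> ByTransport -> Question."""
    if not isinstance(place, Place):
        raise TypeError("first argument must be a Place")
    if not isinstance(transport, ByTransport):
        raise TypeError("second argument must be a ByTransport")
    return Question(f"How far is {place.name} by {transport.name}?")
```

With these definitions, `how_far_by(Place("the church"), ByTransport("taxi"))` yields a question, while passing `Person("John")` as the place raises a `TypeError`.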
Module structure: Common constructions with a functor
The starting point for the grammar was a test corpus of sentences we want to express in the grammar. These sentences are used as documentation for the abstract syntax:
AHasAge : Person -> Number -> Action ; -- I am seventy years
AHasChildren : Person -> Number -> Action ; -- I have six children
AHasName : Person -> Name -> Action ; -- my name is Bond
ACE-GF: based on Attempto Controlled English. (ACE is ____.)
Acewiki working on ACE (acewiki subset), grammars for Cat, Dan, Dut, Eng (not ACE), Est, Fin, Fre, Ger, Ita, Lav, Nor, Pol, Ron, Rus, Spa, Swe, Urd (https://github.com/Attempto/ACE-in-GF/tree/master/grammars/acewiki_aceowl).
Grammar modules: ACE base, in addition domain lexicons (Geography).
(in AceWiki also normal grammars, not ace. But unrelated to ACE grammar.)
Questionnaire Basic information: Use of development tools: Diagnostic tools Compilation diagnostics: Grammar display modes: Testing Tools for generation and testing: RGL Resource grammar tools: Grammar writing Starting point for your grammar: Basic unit of the grammar: Semantic control: Module structure: Concrete syntax:
Analysis of answers: ....
Some things answered in "Other", not in Best practices(?):
Other method for treebanks: Haskell code to store, edit and show differences in treebanks.
Other development tool: Haskell and shell scripts generating grammars
Examples of grammar modification
case study: Phrasebook
The Phrasebook was published as deliverable 10.2 in June 2010, the third month of MOLTO. Initially it translated between 14 European languages (now 20 languages) and was written by 8 authors. These include people with varied GF skills, from a 2-day GF course to major developers of GF. Some of the language versions were written by people with no skills in the language, using example-based grammar writing (see the report for more information).
During the 2.5 years, we have received feedback and bug reports. The issues can be divided into Phrasebook errors and resource grammar library (RGL) errors. Both of course show as errors in the application grammar, but the error needs to be fixed at a different level. The time spent fixing the problem and the expertise required of the grammar writer also differ between the two error types.
Feedback has been given in various ways. There is a feedback button in the demo for anonymous feedback; this has gone to ____ (WHERE) and has been assigned to ____ (WHO). The Phrasebook demo has been shown in various presentations, and sometimes during the presentation an audience member or the presenter has noticed a problem. The problem has either been fixed by the presenter or, where the presenter lacks the time, language skills or GF skills to fix the bug, been given to someone with the skills and time.
Initially there was no project-wide reporting system, but since autumn 2012, UHEL has provided one at http://tfs.cc/trac. Each application grammar has an owner who gets a notification about new tickets, and can fix the bug or assign the job to someone.
Crowdsourcing is another possible source for bug detection. However, in order to profit from that we would need a large number of people browsing the site and our apps, which is not realistic. Most of the bug reports come from people already involved in MOLTO.
Here I list issues that I know of. This is not necessarily a complete list.
The difference between an application grammar issue and an RGL issue can be unclear; for instance, incorrect morphology in the application grammar may result from using the wrong RGL functions or from there not being a correct RGL function in the first place. In a case where a correct RGL function exists but the user has chosen a wrong one, I have classified the error as an application grammar issue, as the fix has been made in the application grammar.
Spanish:
1) HowFar, HowFarFrom, HowFarBy and HowFarFromBy
2) Plane
mkN "avión" masculine.
3) Fish
fish_N, and its meaning is live fish.
mkN "pescado".
4) Adjectives ending in consonant inflect wrong
mkA. With smart paradigms this means choosing the right number of arguments, which in this case is 5 as opposed to 1. Applied to 8 adjectives in the application grammar.
Catalan:
1) HowFar, HowFarFrom, HowFarBy and HowFarFromBy
Finnish:
1) Locative cases for geographical names
Spanish and Catalan:
1) Negative imperatives
ImpNeg function in Spanish and Catalan RGL and used it in the application grammar.
2) Adjectives ending in consonant inflect wrong
French:
1) Wrong agreement in French superlative forms
DetNP, which only produces masculine versions.
DetNPFem for all Romance languages, have the application grammar a construction based on the gender of the noun
Finnish:
1) Vowel harmony of possessive suffixes
2) Wrong word forms in Finnish genitive+possessive suffix http://tfs.cc/trac/ticket/34
3) Pronoun problems with the modal verb "must" in Finnish http://tfs.cc/trac/ticket/23
4) Incorrect plural stem for "children" in Finnish http://tfs.cc/trac/ticket/27
5) Translation of modal verb + a location not working for Finnish. Modal verb problems also in Italian, Catalan and Russian. http://tfs.cc/trac/ticket/15
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D10.1 Dissemination plan with Monitoring and Assessment | 
| Security (distribution level): | Confidential | 
| Contractual date of delivery: | M3 | 
| Actual date of delivery: | 1 Jun 2010 | 
| Type: | Report | 
| Status & version: | Draft | 
| Author(s): | Olga Caprotti and Aarne Ranta | 
| Task responsible: | UGOT ( WP10 ) | 
| Other contributors: | Lluís Màrquez, Borislav Popov, and Jordi Saludes | 
This deliverable describes the range of dissemination activities planned for MOLTO. It also formally introduces the Advisory Board, the Steering Group members, and their deputies as ratified during the kickoff meeting of the project.
This section lists dissemination activities per year. The list is intended for planning purposes. According to our workplan: "Dissemination on conferences, symposiums and workshops will be in the areas of language technology and translation, semantic technologies, and information retrieval and will include papers, posters, exhibition booths and sponsorships (by Ontotext at web and semantic technology conferences like ISWC, WWW, SemTech), and academic/professional events such as the Information Retrieval Facility Symposium. We will also organize a set of MOLTO workshops for the expert audience, featuring invited speakers and potential users from academy and industry"
MOLTO research and results will be published in conferences in the fields of computational linguistics, statistical machine translation, artificial intelligence and machine translation in general but also in specialized areas related to the domain case studies. We envision the possibility to showcase the results of the MOLTO studies in meetings on mathematical user interfaces, patent translation (information retrieval symposium for patent and scientific content), semantic web and OWL technologies, and natural language processing.
A small sample of dissemination activities has already taken place at the beginning of MOLTO:
Possible future conferences include:
The Association for Computational Linguistics and the European Association for Machine Translation maintain lists of relevant events at http://www.eacl.org and at http://www.eamt.org.
MOLTO also plans to disseminate at a regional level by presenting the work in meetings organized by national organizations, or at the university and professional level in each partner country, e.g.:
As an example, MOLTO will be presented at the meeting La Indústria de la Traducció entre Llengües Romàniques, 1st Workshop on The Industry of Translation of Romance Languages, organized by the Polytechnical University of Valencia (UPV) on September 8, 2010. See http://www.upv.es/contenidos/JORTRAD/info/indexnormalc.html.
Aside from proceedings of conferences and special issues arising in connection with presentations given at international conferences, MOLTO expects to publish results of the work in scientific journals such as:
We are monitoring also aggregation sites such as eLanguageNet.
Events organized by MOLTO in primis include the project meetings. A preliminary schedule is the following:
| Title | Date | Location | 
|---|---|---|
| Kickoff meeting | 8-11 March, 2010 | Barcelona, Spain | 
| 1st Project Meeting | Sept. 2010 | Varna, Bulgaria | 
| 2nd Project Meeting | 2nd week Mar 2011 | UGOT | 
| 3rd Project Meeting | Sept. 2011 | UHEL | 
| 4th Project Meeting | Mar. 2012 | UZH | 
| 5th Project Meeting | Sept. 2012 | BI | 
| 6th Project Meeting | Mar. 2013 | UGOT | 
MOLTO also envisions the possibility to organize specific events targeted to its user groups: either because of the case studies, or because of the scientific results. Training activities such as hands-on sessions and special courses will be organized during the final year of the project's lifetime. Likely venues for such events are the major conferences as well as graduate schools in linguistics, e.g. GSLT, Graduate School in Language Technology (http://www.gslt.hum.gu.se). Project members visiting partners' nodes will be encouraged to present their work to a wider audience in departmental seminars, tutorials and intensive courses. For example, A. Ranta and R. Enache from UGOT gave a GF tutorial during the exchange visit to UHEL on May 4-5 2010.
In addition, MOLTO is planning to organize a high-profile scientific meeting on machine translation in connection to a major event, attracting prominent speakers from the field.
The target audience of press releases should be the general public. Press releases need to present the goals and results of the project in a way that popularizes the area of machine translation and informs the public about it.
Press releases will be produced on a yearly basis and circulated using each partner's channels, in addition to publication on the website. The yearly release will also be circulated in bulletins of professional associations, such as that of the European Chapter of the ACL (EACL), the primary professional association for computational linguistics in Europe.
Sample activities that have already taken place include:
UGOT has a specific office in charge of public relations and will be contacted to distribute the news of the project. Helena Åberg keeps us informed of the coverage of MOLTO in the media. The project has been prominently featured at its beginning as the following list shows:
MOLTO plans to establish contact with related projects in machine translation such as EuroMatrix and T4ME to organize joint meetings in the future. Initial discussions with members of these projects took place during LREC 2010 in May; the discussion with representatives of EuroMatrix identified the development of hybrid translation systems as a common interest. UPC is a partner both in MOLTO and in the FAUST (Feedback for User adaptive Statistical Translation) project. The ICT-FY project HATS (Highly Adaptable and Trustworthy Software using Formal Models) has approached MOLTO to discuss possible future cooperation in the area of translation between formal and informal software specifications. ATLAS (Applied Technology for Language-Aided CMS) is another EU project whose aims are similar to those of MOLTO. We will monitor their work to evaluate possible overlaps and areas of cooperation.
Furthermore, as a research and technology development project, MOLTO is entitled to join META-SHARE, an effort to set up a pool of language resources and technologies. META is the Multilingual Europe Technology Alliance network, see http://www.meta-net.eu.
A number of personal contacts and email exchanges have already taken place during this first trimester. Below is a short summary that gives an overview of which target groups are interested in the project's results.
MOLTO will use the web as a main channel of dissemination. The web site will initially be designed mainly to assist the management of the project with a section for registered users not available to anonymous readers.
News of the project will appear regularly and is available either by subscription (members only) or through the RSS feed: http://www.molto-project.eu/news/rss.xml. The RSS feed is public and distributes the public news. To be able to read the internal news, members have to be authenticated.
Subscriptions allow members to fine tune the type of information that is sent automatically from the website. It can be personalized from the profile pages.
Registered project members can post news items by either:
All these types of content appear in the news flow, while the Event is also added to the calendar. If an item is meant only for the Consortium, its access controls can be modified accordingly.
Events of interest will be advertised via newsletters (examples ....) and social sites, in particular:
Partners are featuring the MOLTO project on their websites.
The project can be reached for questions through a contact form accessible online, and questions will be answered in the FAQ: http://www.molto-project.eu/view/faq. Any registered user can add questions and answers to the FAQ. Current categories include:
- Goals and Promises
- Technology
- People and Organization
Access and usage of the MOLTO website is monitored via Google Analytics. Reports are available to interested project members.
The MOLTO management structure comprises two bodies whose function is to monitor the progress of the project: an internal and an external one.
MOLTO has set up a Steering Group to help the management of the project. The Steering Group is composed of the Coordinator, assisted by the Project Manager, and a representative of each site. During the kickoff meeting, each partner nominated a Site Leader to be active in the Steering Group as follows:
Work package leaders, who were also nominated during the kickoff meeting, are listed in the online work plan.
The Steering Group holds monthly calls (usually during the 3rd week of the month) via Skype and extraordinary calls when the need arises. The minutes of these calls are posted on the confidential pages of the web site at http://www.molto-project.eu/node/867.
The Steering Group convenes at project meetings, during which a Business Meeting is called to ratify major decisions and to resolve conflicts.
The task of the MOLTO Advisory Board is to perform independent quality assurance and assess the progress of the work. It is composed of leading scientists who are outside the MOLTO Consortium. This choice serves two purposes: to obtain an independent opinion on the research and approaches taken by MOLTO, and to disseminate the work done by the project to related scientific communities.
The final composition of the Advisory Board is the following:
Members of the Advisory Board are expected to attend the second project meeting of each year, namely project meetings 2, 4, and 6, to learn about the yearly outcome of MOLTO. Their travel costs will be funded by MOLTO.
The Advisory Board will write an assessment report which will be delivered as part of the yearly report to the Commission. This report will evaluate the results and, if desirable, suggest ways to improve them.
In the MOLTO workplan, workpackage WP9 Requirements and evaluation runs throughout the entire project's lifetime. In the beginning it will define the requirements for both the generic tools and the case studies, later it performs evaluation and delivers feedback including bug fixing.
The liaison person from UHEL is Mirka Hyvärinen, who will be in contact with other project members. UHEL has also set up an internal working wiki "MOLTO kitwiki" (https://kitwiki.csc.fi/twiki/bin/view/MOLTO/WebHome), open to all project members who request access.
D9.1 MOLTO test criteria, methods and schedule, due on 1st September, 2010, will contain the detailed schedule and plan for the quality evaluation workflow.
Planning exploitation of MOLTO results.
See e.g. the text describing the monument in this screenshot of an Art Guide for Genova. It is rather simple and one can easily imagine the information content that should be carried by the underlying ontology. This is a generalization of the work done in WP8. Note that to produce this the ontology must contain also information on location, opening hours, etc.

and as it happens, there is no such info in English.

In general, I think this would be a contribution to the area of Geographical Information Systems.
| Attachment | Size | 
|---|---|
| Screen Shot 2012-10-07 at 3.12.27 PM.png | 245.4 KB | 
| Screen Shot 2012-10-07 at 3.19.06 PM.png | 213.85 KB | 
A very interesting scenario of exploitation of WP8 cultural heritage description is for online auctions, see for instance the catalogue listing on such a site in Skåne:

It is not clear what kind of knowledge base the auction houses adopt, namely whether they use the same metadata as the museums. In any case, these are potential customers for WP8 results.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D10.2 MOLTO web service, first version | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M3 | 
| Actual date of delivery: | 2 June 2010 | 
| Type: | Prototype | 
| Status & version: | Final | 
| Author(s): | Krasimir Angelov, Olga Caprotti, Ramona Enache, Thomas Hallgren, Inari Listenmaa, Aarne Ranta, Jordi Saludes, Adam Slaski | 
| Task responsible: | UGOT | 
| Other contributors: | UPC, UHEL | 
This phrasebook is a program for translating touristic phrases between 14 European languages included in the MOLTO project (Multilingual On-Line Translation): Bulgarian, Catalan, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Romanian, Spanish, Swedish. A Russian version is not yet finished but will be added later. Also other languages may be added.
The phrasebook is implemented in the GF programming language (Grammatical Framework). It is the first demo for the MOLTO project, released in the third month (by June 2010). The first version is a very small system, but it will be extended in the course of the project.
The phrasebook is available as open-source software, licensed under GNU LGPL, at http://code.haskell.org/gf/examples/phrasebook/.
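The MOLTO phrasebook is a program for translating touristic phrases between 14 European languages included in the MOLTO project (Multilingual On-Line Translation): Bulgarian, Catalan, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Romanian, Spanish, Swedish.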
The phrasebook is implemented in the GF programming language (Grammatical Framework). It is the first demo for the MOLTO project, released in the third month (by June 2010). The first version is a very small system, but it will be extended in the course of the project.
The phrasebook has the following requirement specification:
- high quality: reliable translations to express yourself in any of the languages
- translation between all pairs of languages
- runnable in web browsers
- runnable on mobile phones (via web browser; Android stand-alone forthcoming)
- easily extensible by new words (forthcoming: semi-automatic extensions by users)
The phrasebook is available as open-source software, licensed under GNU LGPL. The source code resides in ftp://code.haskell.org/gf/examples/phrasebook/
We consider both the end-user perspective and the content producer perspective.
The phrasebook is available as open-source software, licensed under GNU LGPL. The source code resides in http://code.haskell.org/gf/examples/phrasebook/. Below is a short description of the source files.
- Sentences: general syntactic structures implementable in a uniform way. Concrete syntax via the functor SentencesI.
- Words: words and predicates, typically language-dependent. Separate concrete syntaxes.
- Greetings: idiomatic phrases, string-based. Separate concrete syntaxes.
- Phrasebook: the top module putting everything together. Separate concrete syntaxes.
- DisambPhrasebook: disambiguation grammars generating feedback phrases if the input language is ambiguous.
- Numeral: resource grammar module directly inherited from the library.
The module structure image is produced in GF by
    > i -retain DisambPhrasebookEng.gf
    > dg -only=Phrasebook*,Sentences*,Words*,Greetings*,Numeral,NumeralEng,DisambPhrasebookEng
    > ! dot -Tpng _gfdepgraph.dot > pgraph.png
The abstract syntax defines the ontology behind the phrasebook. Some explanations can be found in the ontology document, which is produced from the abstract syntax files Sentences.gf and Words.gf by make doc.
Based on this case study, we roughly estimated the effort used in constructing the necessary sources for each new language and compiled the following summarizing chart.
| Language | Language skills | GF skills | Informed development | Informed testing | Impact of external tools | RGL Changes | Overall effort | 
|---|---|---|---|---|---|---|---|
| Bulgarian | ### | ### | - | - | ? | # | ## | 
| Catalan | ### | ### | - | - | ? | # | # | 
| Danish | - | ### | + | + | ## | # | ## | 
| Dutch | - | ### | + | + | ## | # | ## | 
| English | ## | ### | - | + | - | - | # | 
| Finnish | ### | ### | - | - | ? | # | ## | 
| French | ## | ### | - | + | ? | # | # | 
| German | # | ### | + | + | ## | ## | ### | 
| Italian | ### | # | - | - | ? | ## | ## | 
| Norwegian | # | ### | + | - | ## | # | ## | 
| Polish | ### | ### | + | + | # | # | ## | 
| Romanian | ### | ### | - | - | # | ### | ### | 
| Spanish | ## | # | - | - | ? | - | ## | 
| Swedish | ## | ### | - | + | ? | - | ## | 
Legend
Language skills
GF skills
Informed Development/Informed testing
Impact of external tools
RGL changes (resource grammars library)
Overall effort (including extra work on resource grammars)
The figure presents the process of creating a Phrasebook using an example-based approach for a language X, in our case Danish, Dutch, German or Norwegian, for which we had to employ informed development and testing by a native speaker other than the grammarian.

Remarks: The arrows represent the main steps of the process, whereas the circles represent the initial and final results after each step of the process. Red arrows represent manual work and green arrows represent automated actions. Dotted arrows represent optional steps. For every step, the estimated time is given. This is variable and greatly influenced by the features of the target language and the semantic complexity of the phrases, and would only hold for the Phrasebook grammar.
Initial resources :
The first step assumes an analysis of the resource grammar and extracts the information needed by the functions that build new lexical entries. A model is built so that the proper forms of the word can be rendered, and additional information, such as gender, can be inferred. The script applies these rules to each entry that we want to translate into the target language, and one obtains a set of constructions.
The generated constructions are given to an external translator tool (Google translate) or to a native speaker for translation. One needs the configuration file even if the translator is human, because formal knowledge of grammar is not assumed.
The translations into the target language are then processed further: first, the linearizations of the categories are built by decoding the information received. Then, with the words in the lexicon, one can parse the translations of functions with the GF parser and generalize from them.
The resulting grammar is tested with the aid of the testing script that generates constructions covering all the functions and categories from the grammar, along with some other constructions that proved to be problematic in some language. A native speaker evaluates the results and if corrections are needed, the algorithm runs again with the new examples. Depending on the language skills of the grammar writer, the changes can be made directly into the GF files, and the correct examples given by the native informant are just kept for validating the results. The algorithm is repeated as long as corrections are needed.
The time spent preparing the configuration files for a grammar will not be needed again in the future, since the files are reusable for other applications. The time for the second step can be saved if automatic tools such as Google Translate are used. This is only possible for languages with simpler morphology and syntax and with large corpora available. Good results were obtained for German and Dutch with Google Translate, but for languages like Romanian or Polish, which are both complex and lack sufficient resources, the results are discouraging.
If the statistical oracle works well, the only step where a human translator is needed is the evaluation and feedback step. For the languages in our experiment, 2 rounds of about 4 hours each were needed on average. More effort may be needed for more complex languages.
Further work will build a more comprehensive tool for testing and evaluating the grammars. It will also analyse the impact of external machine translation tools for translating from English to various target languages, so that future work on grammars can be automated to a higher degree.
Extending Words by hand or (semi)automatically for items related to the categories of food, places, and actions will immediately increase the expressiveness of the phrasebook. The basic things "everyone" can do are:
- Words and greetings in Greetings

The missing concrete syntax entries are added to the WordsL.gf files for each language L. The morphological paradigms of the GF resource library should be used. Actions (prefixed with A, as AWant) are a little more demanding, since they also require syntax constructors. Greetings (prefixed with G) are pure strings.
Some explanations can be found in the implementation document, which is produced from the concrete syntax files SentencesI.gf and WordsEng.gf by make doc.
Here are the steps to follow for contributors:
1. Run darcs pull.
2. Run make present in gf/lib/src/.
3. Work in gf/examples/phrasebook/.
4. Run make pgf.
5. Run darcs record . (in the phrasebook subdirectory).
6. Run darcs send -o my_phrasebook_patch to create a patch, which you can send to GF maintainers.
7. Test the phrasebook on a local GF server:
a. Go to gf/src/server/ and follow the instructions in the project Wiki.
b. Make sure that Phrasebook.pgf is available to your GF server (see project wiki).
c. Launch lighttpd (see project wiki).
d. Now you can open gf/examples/phrasebook/www/phrasebook.html and use your phrasebook!

Finally, a few good practice recommendations:
The grammarian need not be a native speaker of the language. For many languages, the grammarian need not even know the language; native informants are enough. However, evaluation by native speakers is necessary.
Correct and idiomatic translations are possible.
A typical development time was 2-3 person working days per language.
Google translate helps in bootstrapping grammars, but must be checked. In particular, we found it unreliable for morphologically rich languages.
Resource grammars should give more support, e.g. higher-level access to constructions like negative expressions, and large-scale morphological lexica.
Acknowledgments
The user interface is kept slim so as to also be usable from portable devices, e.g. mobile phones. These are the buttons and their functionality:
The symbol &+ means binding of two words. It will disappear in the complete translation.
The translator is slightly overgenerating, which means you can build some semantically strange phrases. Before reporting them as bugs, ask yourself: could this be correct in some situation? is the translation valid in that situation?
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - Multilingual Online Translation | 
| Deliverable: | D10.3 MOLTO web service, final version | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | M39 | 
| Actual date of delivery: | May 2013 | 
| Type: | Prototype | 
| Status & version: | Final | 
| Author(s): | Thomas Hallgren, Olga Caprotti et al. | 
| Task responsible: | UGOT | 
| Other contributors: | UPC, UHEL, Ontotext | 
In this deliverable we document the web services provided by the MOLTO project. Many of them have been released with dedicated deliverables; for those we do not go into specific details. Instead, we focus on the web services powering some of the MOLTO flagships at the end of the project's lifetime.
The GF Cloud Service API exposes any compiled PGF grammar as a web service via the PGF web service API. It also provides the functionality of some GF shell commands, along with services for grammar compilation and persistent storage of files in the cloud. These features are used, for instance, in the implementation of the GF Simple Editor (http://cloud.grammaticalframework.org/gfse/), developed as a translators' tool during MOLTO.
The service is available from http://cloud.grammaticalframework.org/. The source code for hosting a local version of the webservice is distributed in the GF distribution, hence users that have GF installed on their own computer can also run the service locally by starting GF with the parameter
--server[=port]           Run in HTTP server mode on given port (default 41296).
Requests are made via HTTP with the GET or POST method. (The examples below show GET requests, but POST is preferred for requests that change the state on the server.) Data in requests is in the application/x-www-form-urlencoded format (the format used by default by web browsers when submitting form data). Data in responses is usually in JSON format. The HTTP response code is usually 200, but can also be 204 (after file upload), 404 (file to download or remove was not found), 400 (for unrecognized commands or missing/unacceptable parameters in requests) or 501 (for unsupported HTTP request methods). Unrecognized parameters in requests are silently ignored.
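A client may want to branch on the response codes listed above. The following is a minimal sketch; the function name describeStatus and the message texts are illustrative assumptions and not part of the GF API.

```javascript
// Map the documented HTTP response codes to short descriptions.
// describeStatus and its message texts are hypothetical, for illustration.
function describeStatus(code) {
  switch (code) {
    case 200: return "ok";
    case 204: return "ok, no content (e.g. after file upload)";
    case 400: return "unrecognized command or missing/unacceptable parameter";
    case 404: return "file to download or remove was not found";
    case 501: return "unsupported HTTP request method";
    default:  return "undocumented response code";
  }
}

console.log(describeStatus(204));
```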
More details on how to run the service are given in Deliverable 2.3 on page 7 under "Building a web application".
The GF Cloud Service supports a set of PGF service requests, for example, a request like
http://cloud.grammaticalframework.org/grammars/Foods.pgf?command=random
might return a result like
[{"tree":"Pred (That Pizza) (Very Boring)"}]
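As a rough illustration, such a request URL can be assembled from the grammar URL, the command name and optional parameters. The helper pgfRequestUrl below is hypothetical (it is not part of pgf_online.js) and relies only on the standard URL class.

```javascript
// Build the URL for a PGF service request.
// pgfRequestUrl is a hypothetical helper, not part of the GF distribution.
function pgfRequestUrl(grammarUrl, command, params = {}) {
  const url = new URL(grammarUrl);
  url.searchParams.set("command", command);
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(name, String(value));
  }
  return url.toString();
}

const randomUrl = pgfRequestUrl(
  "http://cloud.grammaticalframework.org/grammars/Foods.pgf", "random");
console.log(randomUrl);

// The response body is JSON; this is the example result quoted above.
const trees = JSON.parse('[{"tree":"Pred (That Pizza) (Very Boring)"}]');
console.log(trees[0].tree);
```

Using URL and its searchParams keeps the parameters in the application/x-www-form-urlencoded format that the service expects.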
The PGF Service in the GF Cloud is the application that exposes the PGF API as a web service. The application uses FastCGI as the communication protocol to talk to the web server. The data protocol is JSON. Information on how to compile and install the service can be found here.
Compiled GF grammars can be used in web applications in the same way as JSP, ASP or PHP pages. The compiled PGF file is simply placed somewhere in the web site directory. When a .pgf file is requested, the web server redirects the request to the GF web service. The service knows how to load the grammar and interpret the parameters given in the URL.
If my_grammar.pgf is a grammar placed in the root folder of the web site for localhost then the grammar could be accessed using this URL:
http://localhost/my_grammar.pgf
Since no parameters are passed in this case, the web service responds with some general information about the grammar, encoded in JSON format. To perform a specific command, you have to say which command you want. The command is encoded in the parameter command, i.e.:
http://localhost/my_grammar.pgf?command=cmd
where cmd is the name of the command. Usually every command also requires a specific list of other arguments, which are encoded as parameters as well. The list of all supported commands follows:
This command provides some general information about the grammar. This command is also executed if no command parameter is given.
| Parameter | Description | Default | 
|---|---|---|
| command | grammar | - | 
Object with five fields:
| Field | Description | 
|---|---|
| name | the name of the abstract syntax in the grammar | 
| userLanguage | the concrete language in the grammar which best matches the default language, set in the user's browser | 
| categories | list of all abstract syntax categories defined in the grammar | 
| functions | list of all abstract syntax functions defined in the grammar | 
| languages | list of concrete languages available in the grammar | 
Every language is described by an object with these two fields:
| Field | Description | 
|---|---|
| name | the name of the concrete syntax for the language | 
| languageCode | the two-character language code according to the ISO standard, e.g. en for English, bg for Bulgarian, etc. | 
The language codes should be specified in the grammar because they are used to identify the user language. The web service receives the code of the language set in the browser and compares it with the codes defined in the grammar. If there is a match then the service returns the corresponding concrete syntax name. If no match is found then the first language in alphabetical order is returned.
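The selection rule just described can be sketched as follows. This is only an illustration, not the service's actual code: the function name pickUserLanguage is hypothetical, and comparing the codes exactly (ignoring case) is an assumption, since the text does not say how the match is computed.

```javascript
// Pick the concrete syntax whose languageCode matches the browser language;
// otherwise fall back to the first language in alphabetical order.
// Assumption: codes are compared exactly, ignoring case.
function pickUserLanguage(languages, browserCode) {
  const wanted = browserCode.toLowerCase();
  const match = languages.find(
    (lang) => wanted !== "" && lang.languageCode.toLowerCase() === wanted);
  if (match) return match.name;
  const names = languages.map((lang) => lang.name).sort();
  return names[0];
}

const languages = [
  { name: "FoodsSwe", languageCode: "sv-SE" },
  { name: "FoodsEng", languageCode: "en-US" },
  { name: "FoodsBul", languageCode: "" },
];
console.log(pickUserLanguage(languages, "sv-SE")); // a code defined in the grammar
console.log(pickUserLanguage(languages, "de-DE")); // no match: alphabetical fallback
```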
This command parses a string and returns a list of abstract syntax trees.
| Parameter | Description | Default | 
|---|---|---|
| command | parse | - | 
| cat | the start category for the parser | the default start category for the grammar | 
| input | the string to be parsed | empty string | 
| from | the name of the concrete syntax to use for parsing | all languages in the grammar will be tried | 
| limit | limit how many trees are returned (gf>3.3.3) | no limit is applied | 
List of objects, one per input language, each representing the analyses for that language. The objects have four fields:
| Field | Description | 
|---|---|
| from | the concrete language used in the parsing | 
| brackets | the bracketed string from the parser | 
| trees | list of abstract syntax trees | 
| typeErrors | list of errors from the type checker | 
The abstract syntax trees are sent as plain strings. The type errors are objects with two fields:
| Field | Description | 
|---|---|
| fid | forest id which points to a bracket in the bracketed string where the error occurs | 
| msg | the text message for the error | 
The current implementation returns either a list of abstract syntax trees or a list of type errors. Checking whether the field trees is not null therefore shows whether the type checking was successful.
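A client could process one entry of a parse response along the following lines. This is a sketch under assumptions: the bracketed string is taken to be a nested object with cat, fid, children and token fields, as in the Foods examples later in this document; the helper names are hypothetical, and the type error in the sample data is invented for illustration.

```javascript
// Find the bracket with the given forest id (fid) in a bracketed string.
function findBracket(bracket, fid) {
  if (bracket.fid === fid) return bracket;
  for (const child of bracket.children || []) {
    const found = findBracket(child, fid);
    if (found) return found;
  }
  return null;
}

// Collect the tokens covered by a bracket, left to right.
function tokensOf(bracket) {
  if (bracket.token !== undefined) return [bracket.token];
  return (bracket.children || []).flatMap(tokensOf);
}

// Summarize one entry of a parse response: either trees or located errors.
function summarize(entry) {
  if (entry.trees) return { ok: true, trees: entry.trees };
  const errors = (entry.typeErrors || []).map((err) => {
    const where = findBracket(entry.brackets, err.fid);
    return { msg: err.msg, span: where ? tokensOf(where).join(" ") : "" };
  });
  return { ok: false, errors };
}

// Bracketed string for "that pizza is very boring" in the Foods grammar.
const brackets = { cat: "Comment", fid: 10, index: 0, children: [
  { cat: "Item", fid: 7, index: 0, children: [
    { token: "that" },
    { cat: "Kind", fid: 6, index: 0, children: [{ token: "pizza" }] }] },
  { token: "is" },
  { cat: "Quality", fid: 9, index: 0, children: [
    { token: "very" },
    { cat: "Quality", fid: 8, index: 0, children: [{ token: "boring" }] }] }] };

const okEntry = { from: "FoodsEng", brackets,
                  trees: ["Pred (That Pizza) (Very Boring)"] };
// Invented error, only to show how fid points into the bracketed string.
const badEntry = { from: "FoodsEng", brackets, trees: null,
                   typeErrors: [{ fid: 9, msg: "hypothetical type error" }] };

console.log(summarize(okEntry));
console.log(summarize(badEntry));
```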
This command takes an abstract syntax tree and produces a string in the specified language(s).
| Parameter | Description | Default | 
|---|---|---|
| command | linearize | - | 
| tree | the abstract syntax tree to linearize | - | 
| to | the name of the concrete syntax to use in the linearization | linearizations for all languages in the grammar will be generated | 
| Field | Description | 
|---|---|
| to | the concrete language used for the linearization | 
| text | the output text | 
Translation is a two-step process. First the input sentence is parsed with the source language, and then the output sentence(s) are produced via linearization with the target language(s). For that reason, the input and output of this command are the union of the inputs and outputs of the parsing and linearization commands.
| Parameter | Description | Default | 
|---|---|---|
| command | translate | - | 
| cat | the start category for the parser | the default start category for the grammar | 
| input | the input string to be translated | empty string | 
| from | the source language | all languages in the grammar will be tried | 
| to | the target language | linearizations for all languages in the grammar will be generated | 
| limit | limit how many parse trees are used (gf>3.3.3) | no limit is applied | 
The output is a list of objects with these fields:
| Field | Description | 
|---|---|
| from | the concrete language used in the parsing | 
| brackets | the bracketed string from the parser | 
| translations | list of translations | 
| typeErrors | list of errors from the type checker | 
Every translation is an object with two fields:
| Field | Description | 
|---|---|
| tree | abstract syntax tree | 
| linearizations | list of linearizations | 
Every linearization is an object with two fields:
| Field | Description | 
|---|---|
| to | the concrete language used in the linearization | 
| text | the sentence produced | 
The type errors are objects with two fields:
| Field | Description | 
|---|---|
| fid | forest id which points to a bracket in the bracketed string where the error occurs | 
| msg | the text message for the error | 
The current implementation returns either a list of translations or a list of type errors. Checking whether the field translations is not null therefore shows whether the type checking was successful.
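The nested output described above can be flattened into one row per linearization, which is often more convenient for display. The following is a minimal sketch; the function name flattenTranslations is hypothetical, and the sample response is abbreviated from the Foods example (the brackets field is omitted).

```javascript
// Flatten a translate response into rows of {from, tree, to, text}.
// Entries without translations (e.g. with type errors) contribute no rows.
function flattenTranslations(response) {
  const rows = [];
  for (const entry of response) {
    for (const translation of entry.translations || []) {
      for (const lin of translation.linearizations) {
        rows.push({ from: entry.from, tree: translation.tree,
                    to: lin.to, text: lin.text });
      }
    }
  }
  return rows;
}

// Abbreviated sample response.
const response = [{
  from: "FoodsEng",
  translations: [{
    tree: "Pred (That Pizza) (Very Boring)",
    linearizations: [{ to: "FoodsSwe", text: "den där pizzan är mycket tråkig" }],
  }],
}];

console.log(flattenTranslations(response));
```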
This command generates a random abstract syntax tree whose top-level function is of the specified category. The categories of the sub-trees are determined by the type signature of the parent function.
| Parameter | Description | Default | 
|---|---|---|
| command | random | - | 
| cat | the start category for the generator | the default start category for the grammar | 
| limit | maximal number of trees generated | 1 | 
The output is a list of objects with only one field:
| Field | Description | 
|---|---|
| tree | the generated abstract syntax tree | 
The length of the list is limited by the limit parameter.
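To illustrate how type signatures drive the generation, here is a toy generator. It is not the algorithm used by the PGF library: the signature table is a hand-written assumption modelled on the Foods example grammar, the depth bound is added only to guarantee termination, and the names signatures and randomTree are hypothetical.

```javascript
// Function signatures: argument categories and result category.
// Hand-written table modelled on the Foods grammar (an assumption).
const signatures = {
  Pred:   { args: ["Item", "Quality"], result: "Comment" },
  This:   { args: ["Kind"], result: "Item" },
  That:   { args: ["Kind"], result: "Item" },
  Pizza:  { args: [], result: "Kind" },
  Wine:   { args: [], result: "Kind" },
  Very:   { args: ["Quality"], result: "Quality" },
  Boring: { args: [], result: "Quality" },
  Warm:   { args: [], result: "Quality" },
};

// Generate a tree of category cat. choose(n) must return an index in [0, n).
// Once maxDepth is used up, functions without arguments are preferred,
// so that generation terminates.
function randomTree(cat, choose, maxDepth = 3) {
  let candidates = Object.keys(signatures)
    .filter((fun) => signatures[fun].result === cat);
  if (maxDepth <= 0) {
    const leaves = candidates.filter((fun) => signatures[fun].args.length === 0);
    if (leaves.length > 0) candidates = leaves;
  }
  const fun = candidates[choose(candidates.length)];
  // The argument categories come from the signature of the chosen function.
  const args = signatures[fun].args.map((argCat) => {
    const sub = randomTree(argCat, choose, maxDepth - 1);
    return sub.includes(" ") ? `(${sub})` : sub;
  });
  return [fun, ...args].join(" ");
}

const uniform = (n) => Math.floor(Math.random() * n);
console.log(randomTree("Comment", uniform));
```

Passing the choice function as a parameter keeps the sketch deterministic when needed, for example by always choosing the first candidate.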
Word completion is a special case of parsing. If there is an incomplete sentence then it is first parsed and after that the state of the parse chart is used to predict the set of words that could follow in a grammatically correct sentence.
| Parameter | Description | Default | 
|---|---|---|
| command | complete | - | 
| cat | the start category for the parser | the default start category for the grammar | 
| input | the string to the left of the cursor that is already typed | empty string | 
| from | the name of the concrete syntax to use for parsing | all languages in the grammar will be tried | 
| limit | maximal number of words returned | all words will be returned | 
The output is a list of objects with two fields which describe the completions.
| Field | Description | 
|---|---|
| from | the concrete syntax for this word | 
| text | the word itself | 
This command renders an abstract syntax tree into an image in PNG format.
| Parameter | Description | Default | 
|---|---|---|
| command | abstrtree | - | 
| tree | the abstract syntax tree to render | - | 
| format | output format (gf>3.3.3) | PNG | 
By default, the output is an image in PNG format. The content-type is set to image/png, so the easiest way to visualize the generated image is to add an HTML <img> element that points to the URL for the visualization command, i.e.:
<img src="http://localhost/my_grammar.pgf?command=abstrtree&tree=..."/>
The output can also be in GIF ('image/gif'), SVG ('image/svg+xml') or GV (graphviz) format by setting the 'format' option.
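Since the tree contains spaces and parentheses, it has to be URL-encoded before it is placed in the src attribute. The sketch below builds such a URL with the standard URL class. The helper name treeImageUrl is hypothetical, and the lowercase format value "svg" is an assumption about how the format option is spelled.

```javascript
// Build the URL of a tree visualization request.
// treeImageUrl is a hypothetical helper; the "svg" spelling is assumed.
function treeImageUrl(grammarUrl, command, tree, format) {
  const url = new URL(grammarUrl);
  url.searchParams.set("command", command);
  url.searchParams.set("tree", tree);
  if (format) url.searchParams.set("format", format);
  return url.toString();
}

const tree = "Pred (That Pizza) (Very Boring)";
const pngSrc = treeImageUrl("http://localhost/my_grammar.pgf", "abstrtree", tree);
const svgSrc = treeImageUrl("http://localhost/my_grammar.pgf", "abstrtree", tree, "svg");
console.log(pngSrc);
console.log(svgSrc);
```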
This command renders the parse tree that corresponds to a specific abstract syntax tree. The generated image is in PNG format.
| Parameter | Description | Default | 
|---|---|---|
| command | parsetree | - | 
| tree | the abstract syntax tree to render | - | 
| from | the name of the concrete syntax to use in the rendering | - | 
| format | output format (gf>3.3.3) | png | 
| options | additional rendering options (gf>3.4) | - | 
The additional rendering options are: noleaves, nofun and nocat (booleans, false by default); nodefont, leaffont, nodecolor, leafcolor, nodeedgestyle and leafedgestyle (strings, have builtin defaults).
By default, the output is an image in PNG format. The content-type is set to 'image/png', so the easiest way to visualize the generated image is to add an HTML <img> element that points to the URL for the visualization command, i.e.:
<img src="http://localhost/my_grammar.pgf?command=parsetree&tree=..."/>
The output can also be in GIF ('image/gif'), SVG ('image/svg+xml') or GV (graphviz) format by setting the 'format' option.
This command renders the word alignment diagram for some sentence and all languages in the grammar. The sentence is generated from a given abstract syntax tree.
| Parameter | Description | Default | 
|---|---|---|
| command | alignment | - | 
| tree | the abstract syntax tree to render | - | 
| format | output format (gf>3.3.3) | PNG | 
| to | list of languages to include in the diagram (gf>3.4) | all languages supported by the grammar | 
By default, the output is an image in PNG format. The content-type is set to 'image/png', so the easiest way to visualize the generated image is to add an HTML <img> element that points to the URL for the visualization command, i.e.:
<img src="http://localhost/my_grammar.pgf?command=alignment&tree=..."/>
The output can also be in GIF ('image/gif'), SVG ('image/svg+xml') or GV (graphviz) format by setting the 'format' option.
This service lets you execute arbitrary GF shell commands. Before you can do this, you need to use the /new command to obtain a working directory (which also serves as a session identifier) on the server, see below.
/gfshell?dir=...&command=i+Foods.pgf
/gfshell?dir=...&command=gr
  Pred (That Pizza) (Very Boring)
/gfshell?dir=...&command=ps+-lextext+%22That+pizza+is+very+boring.%22
  that pizza is very boring .
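The encoding of shell commands in these URLs follows the application/x-www-form-urlencoded rules, which the standard URLSearchParams class implements. A minimal sketch follows; the helper name gfshellUrl is hypothetical, and the server address and working directory are example values.

```javascript
// Build a /gfshell request URL for a given working directory and command.
// gfshellUrl is a hypothetical helper; server and dir are example values.
function gfshellUrl(server, dir, command) {
  const query = new URLSearchParams({ dir, command });
  return `${server}/gfshell?${query}`;
}

const shellUrl = gfshellUrl(
  "http://localhost:41296",
  "/tmp/gfse.123456",
  'ps -lextext "That pizza is very boring."');
console.log(shellUrl);
```

The command part of the result has the same form as the ps example above, with spaces as + and double quotes as %22.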
For documentation of GF shell commands, see:
/new
  Returns a new working directory on the server, e.g. /tmp/gfse.123456. Most of the cloud service commands require that a working directory is specified in the dir parameter. The working directory is persistent, so clients are expected to remember and reuse it. Access to previously uploaded files requires that the same working directory is used.
/parse?path=source
/cloud?dir=...&command=upload&path1=source1&path2=source2&...
/cloud?dir=...&command=make&path1=source1&path2=source2&...
  Example response:
    { "errorcode":"OK", // "OK" or "Error"
      "command":"gf -s -make FoodsEng.gf FoodsSwe.gf FoodsChi.gf",
      "output":"\n\n" // Warnings and errors from GF
    }
/cloud?dir=...&command=remake&path1=source1&path2=source2&...
  Like command=make, except you can leave the sourcei parts empty to reuse previously uploaded files.
/cloud?dir=...&command=download&file=path
/cloud?dir=...&command=ls&ext=.pgf
  Example response: ["Foods.pgf","Letter.pgf"].
/cloud?dir=...&command=rm&file=path
/cloud?dir=...&command=link_directories&newdir=...
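A client might prepare the data for an upload, make or remake request and check the result roughly as follows. This is a sketch under a stated assumption: the pattern path1=source1 is read as one form parameter per file, with the file path as the parameter name and the source text as its value. The helper names and the sample source text are hypothetical.

```javascript
// Build the form-encoded data for a /cloud upload, make or remake request.
// Assumption: each file is one parameter named by its path, whose value is
// the source text (empty to reuse an uploaded file with command=remake).
function cloudRequestBody(dir, command, files) {
  const body = new URLSearchParams({ dir, command });
  for (const [path, source] of Object.entries(files)) {
    body.append(path, source);
  }
  return body.toString();
}

// Check the JSON response of command=make or command=remake.
function makeSucceeded(response) {
  return response.errorcode === "OK";
}

const makeBody = cloudRequestBody("/tmp/gfse.123456", "make",
  { "FoodsEng.gf": "placeholder source text" });
console.log(makeBody);

const sampleResponse = {
  errorcode: "OK",
  command: "gf -s -make FoodsEng.gf FoodsSwe.gf FoodsChi.gf",
  output: "\n\n",
};
console.log(makeSucceeded(sampleResponse));
```

Since these requests change the state on the server, the encoded data would be sent in the body of a POST request rather than in the URL.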
GF can be used interactively from the GF Shell. Some of the functionality available in the GF shell is also available via the GF web services API.
The GF Web Service API page describes the calls supported by the GF web service API. Below, we illustrate these calls by examples, and also show how to make these calls from JavaScript using the API defined in pgf_online.js.
Note that pgf_online.js was initially developed with one particular web application in mind (the minibar), so the server API was incomplete. It was simplified and generalized in August 2011 to support the full API.
Each example below shows three things in turn: what the call looks like in the JavaScript API defined in pgf_online.js, the corresponding URL sent to the PGF server, and the JSON (JavaScript data structure) returned by the PGF server. The JSON is passed to the callback function supplied in the call.
// Select which server and grammars to use:
var server_options = {
  grammars_url: "http://www.grammaticalframework.org/grammars/",
  grammar_list: ["Foods.pgf"] // It's ok to skip this
};
var server = pgf_online(server_options);
// Get the list of available grammars
server.get_grammarlist(callback)
http://localhost:41296/grammars/grammars.cgi
["Foods.pgf","Phrasebook.pgf"]
// Select which grammar to use
server.switch_grammar("Foods.pgf")
// Get list of concrete languages and other grammar info
server.grammar_info(callback)
http://localhost:41296/grammars/Foods.pgf
       {"name":"Foods",
        "userLanguage":"FoodsEng",
        "startcat":"Comment",
        "categories":["Comment","Float","Int","Item","Kind","Quality","String"],
        "functions":["Boring","Cheese","Delicious","Expensive","Fish","Fresh",
                           "Italian","Mod","Pizza","Pred","That","These","This","Those","Very",
                           "Warm","Wine"],
        "languages":[{"name":"FoodsBul","languageCode":""},
                            {"name":"FoodsEng","languageCode":"en-US"},
                            {"name":"FoodsFin","languageCode":""},
                            {"name":"FoodsSwe","languageCode":"sv-SE"},
                            ...]
          }
// Get a random syntax tree
server.get_random({},callback)
http://localhost:41296/grammars/Foods.pgf?command=random
       [{"tree":"Pred (That Pizza) (Very Boring)"}]
// Linearize a syntax tree
server.linearize({tree:"Pred (That Pizza) (Very Boring)",to:"FoodsEng"},callback)
http://localhost:41296/grammars/Foods.pgf?command=linearize&tree=Pred+(That+Pizza)+(Very+Boring)&to=FoodsEng
       [{"to":"FoodsEng","text":"that pizza is very boring"}]
server.linearize({tree:"Pred (That Pizza) (Very Boring)"},callback)
http://localhost:41296/grammars/Foods.pgf?command=linearize&tree=Pred+(That+Pizza)+(Very+Boring)
       [{"to":"FoodsBul","text":"онази пица е много еднообразна"},
        {"to":"FoodsEng","text":"that pizza is very boring"},
        {"to":"FoodsFin","text":"tuo pizza on erittäin tylsä"},
        {"to":"FoodsSwe","text":"den där pizzan är mycket tråkig"},
        ...
       ]
// Parse a string
server.parse({from:"FoodsEng",input:"that pizza is very boring"},callback)
http://localhost:41296/grammars/Foods.pgf?command=parse&input=that+p...
       [{"from":"FoodsEng",
         "brackets":{"cat":"Comment","fid":10,"index":0,
         "children":[{"cat":"Item","fid":7,"index":0,
         "children":[{"token":"that"},{"cat":"Kind","fid":6,"index":0,
         "children":[{"token":"pizza"}]}]},    
        {"token":"is"},{"cat":"Quality","fid":9,"index":0,
          "children":[{"token":"very"},{"cat":"Quality","fid":8,"index":0,
          "children":[{"token":"boring"}]}]}]},
          "trees":["Pred (That Pizza) (Very Boring)"]}]
// Translate to all available languages
server.translate({from:"FoodsEng",input:"that pizza is very boring"},callback)
...
// Translate to one language
server.translate({input:"that pizza is very boring", from:"FoodsEng", to:"FoodsSwe"}, callback)
http://localhost:41296/grammars/Foods.pgf?command=translate&input=th...
      [{"from":"FoodsEng",
        "brackets":{"cat":"Comment","fid":10,"index":0,
        "children":[{"cat":"Item","fid":7,"index":0,
        "children":[{"token":"that"},{"cat":"Kind","fid":6,"index":0,
        "children":  [{"token":"pizza"}]}]},{"token":"is"},{"cat":"Quality","fid":9,"index":0,
        "children":[{"token":"very"},{"cat":"Quality","fid":8,"index":0,"children":[{"token":"boring"}]}]}]},
        "translations":
        [{"tree":"Pred (That Pizza) (Very Boring)",
          "linearizations":
           [{"to":"FoodsSwe",
              "text":"den där pizzan är mycket tråkig"}]}]}]
// Get completions (what words could come next)
server.complete({from:"FoodsEng",input:"that pizza is very "},callback)
http://localhost:41296/grammars/Foods.pgf?command=complete&input=tha...
      [{"from":"FoodsEng", "brackets":{"cat":"_","fid":0,"index":0,
        "children":[{"cat":"Item","fid":7,"index":0,
        "children":[{"token":"that"},{"cat":"Kind","fid":6,"index":0,
        "children":[{"token":"pizza"}]}]},{"token":"is"},{"token":"very"}]},
        "completions":["boring","delicious","expensive","fresh","Italian","very","warm"],
        "text":""}]
// Get info about a category in the abstract syntax
server.browse({id:"Kind"},callback)
http://localhost:41296/grammars/Foods.pgf?command=browse&id=Kind&...
      {"def":"cat Kind", "producers":["Cheese","Fish","Mod","Pizza","Wine"],
       "consumers":["Mod","That","These","This","Those"]}
// Get info about a function in the abstract syntax
server.browse({id:"This"},callback)
http://localhost:41296/grammars/Foods.pgf?command=browse&id=This&...
       {"def":"fun This : Kind -> Item","producers":[],"consumers":[]}
// Get info about all categories and functions in the abstract syntax
server.browse({},callback)
http://localhost:41296/grammars/Foods.pgf?command=browse&format=json
       {"cats":{"Kind":{"def":"cat Kind",
             "producers":["Cheese","Fish","Mod","Pizza","Wine"],
             "consumers":["Mod","That","These","This","Those"]},
     ...},
         "funs":{"This":{"def":"fun This : Kind -> Item","producers":[],"consumers":[]},
     ...}
     }
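The parse and completion responses shown earlier in this listing nest the input tokens inside category nodes under "brackets". As a minimal sketch (the helper name `bracketTokens` is our own and not part of the GF JavaScript API), the tokens can be recovered with a depth-first walk:

```javascript
// Collect the tokens of a bracketed parse structure, left to right.
// A node is either a leaf {token: "..."} or a category node with "children".
// bracketTokens is a hypothetical helper, not a function of the GF library.
function bracketTokens(node) {
  if (typeof node.token === "string") {
    return [node.token];
  }
  return (node.children || []).flatMap((child) => bracketTokens(child));
}

// Bracket structure copied from the completion example for "that pizza is very ".
const brackets = {
  cat: "_", fid: 0, index: 0,
  children: [
    { cat: "Item", fid: 7, index: 0,
      children: [
        { token: "that" },
        { cat: "Kind", fid: 6, index: 0, children: [{ token: "pizza" }] },
      ] },
    { token: "is" },
    { token: "very" },
  ],
};

console.log(bracketTokens(brackets).join(" ")); // that pizza is very
```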
// Convert an abstract syntax tree to JSON
server.pgf_call("abstrjson",{tree:"Pred (That Pizza) (Very Boring)"},callback)
http://localhost:41296/grammars/Foods.pgf?command=abstrjson&tree=Pred+(That+Pizza)+(Very+Boring)
       {"fun":"Pred","fid":4,
        "children":[{"fun":"That","fid":1,
          "children":[{"fun":"Pizza","fid":0}]},
         {"fun":"Very","fid":3,
          "children":[{"fun":"Boring","fid":2}]}]}
At the beginning of the project, we published the MOLTO Phrasebook as an example application grammar. For the final version of our online service, we show all the relevant GF application grammars that have been developed in the various work-packages as supporting grammars for larger applications. Each example in this collection can be used by a new GF grammar developer as a starting point that can be further extended. In this deliverable we briefly document the grammars and the online applications that use them, and give quick hints on where they could be extended in future work.
This grammar was originally developed for the semantic multilingual wiki system AceWiki-GF, as documented in Deliverable D11.3. The grammar can be used online at http://attempto.ifi.uzh.ch/acewiki-gf/.
It currently supports 3 languages: ACE, German and Spanish, where ACE is a formal language used for automated reasoning. A 500-word geography domain vocabulary has been created to describe Europe.
ACE is represented by two languages, Ace and Ape. Ape linearizations contain explicit lexical entries so that the ACE parser (APE) can be used to map the sentences of this grammar to OWL. The wiki shows how this mapping works.
A snapshot of the grammar is available at http://www.molto-project.eu/biblio/software/geographypgf.
The MOLTO Phrasebook has been the first demonstrator of the features of the Grammatical Framework technology, online since M3 of the project's lifetime. The application grammar was designed to serve as a model for best practices. It shows a modular approach to the definition of abstract types and functions from the domain of travelers' phrasebooks, covering natural language for giving directions, ordering a meal, and greeting friends. It has categories for Citizenship, Country, Currency, Date and week Day, Digits, DrinkKind and MassKind, Languages, Greetings and many more. The supported languages are Eng, Bul, Cat, Dan, Dut, Fin, Fre, Ger, Hin, Ita, Lav, Nor, Pes, Pol, Ron, Rus, Spa, Swe, Tha, Urd. It has a module that handles disambiguation in Eng and in Ron.
The final version is online at http://www.molto-project.eu/cloud/gf-application-grammars and can be tried by selecting Phrasebook.pgf as the application.
The repository for the grammar file itself is at http://www.molto-project.eu/biblio/software/phrasebookpgf.
MathBar.pgf is the application grammar developed for the mathematical natural language domain. It supports the following languages: Fre, Cat, Spa, Eng and Fin. More languages are available, but their quality has not been checked. The Mathematical Grammar Library (MGL) is a specialized language in which textual fragments are interspersed with formal fragments represented in the typesetting language LaTeX.
The source files are distributed via svn (URL: svn://molto-project.eu/mgl; Repository Root: svn://molto-project.eu; Repository UUID: 54d65b75-f25a-4862-968f-dc0a3298bc6b; Revision: 2432).
The compiled PGF grammar is available from http://www.molto-project.eu/biblio/software/mathbarpgf.
Commands.pgf is the application grammar developed for natural language I/O to the Sage computer algebra system. It translates input queries and output answers into mathematical natural language. Users can ask for computations related to arithmetic, domain and range of functions, differentiation and integration. It also supports reference via the pronoun "it", which links to the previous result in a session of sequential computations. English, German and Spanish are currently supported.
Dialog.pgf translates the natural language interactions of the word problems prototype documented in Deliverable D6.3. It is used to give hints in the student's language and to formalize the students' answers or commands as Prolog statements that can be reasoned with automatically. It is an example of how a description of a specific world situation (owning fruit, animals on a farm) can be interpreted and formalized. Catalan, English, Spanish and Swedish are currently supported. The programming language Prolog is also supported. SVN info for compilation from source: URL: svn://molto-project.eu/mgl/wproblems; Repository Root: svn://molto-project.eu; Repository UUID: 54d65b75-f25a-4862-968f-dc0a3298bc6b; Revision: 2432. GF version used for compilation: Grammatical Framework (GF) version 3.4-darcs.
The version archived and deployed on the MOLTO cloud is http://www.molto-project.eu/biblio/software/dialogpgf.
The work-package dealing with the domain of cultural heritage has focused on the description of museum artefacts, in particular paintings. While the description of the subject matter of a painting is an open domain, the other characteristics of a painting can be described by a constrained natural language tightly coupled with the underlying knowledge representation used by museum curators. The design of this grammar has been based on sample descriptions of paintings retrieved from the Gothenburg City Museum and has been further applied to generate descriptions of artefacts stored on public web pages, such as DBPedia.
One major discussion has concerned the identification of entity names: museum names, as well as the names of famous painters or masterpieces, are often translated ad hoc. For such cases it is hard to create grammar-based translation rules; consider for instance Mona Lisa, in Italian often referred to as La Gioconda. The approach taken in this work package has been not to translate the entity names found in the knowledge base, while investigating whether historically there could be a given title or name that could be taken as a universally valid identifier for that entity. Since, to our knowledge, there is no agreement among museum curators on unique resource identifiers (whereas, for instance, in the publishing world there have been efforts to index published material uniquely), we have adopted a naming based on the resource descriptors we retrieved in our samples. In terms of future web application building, we are aware that resource identification and/or retrieval by common name is not as sound as by unique ID.

This grammar is also modularly designed and assembles categories that are used to represent location, material, color, dimension, type of work, and painter's biographical data. The most relevant feature of this grammar is the construction of a description as a sequence of phrases related to the same artefact, using referential chains to build up a coherent discourse. Please see the list of publications tagged with WP8 for further information about the comparative study of texts in the cultural heritage domain and about the background knowledge base underlying the ontology from which texts in 15 languages are generated.
The grammar files are available on svn: molto-project.eu/wp8/d8.3/grammars/
The demo webpage is available at: http://museum.ontotext.com/
The version of the grammar on display at the MOLTO Application Grammar web service (TextPainting.pgf) features:
The following start categories. The main category is Description. There are 9 semantic categories which represent the ontology classes: Colour, Material, Museum, Painter, Painting, PaintingType, Size, Title, and Year. Of these categories, 5 are optional, hence the additional 'Opt' categories. There are also 3 category types (String, Int, Float) and 1 grammatical category for creating nested colour strings (ListColour).
Support for 15 languages: Bulgarian (Bul), Catalan (Cat), Danish (Dan), Dutch (Dut), English (Eng), Finnish (Fin), French (Fre), Hebrew (Heb), Italian (Ita), German (Ger), Norwegian (Nor), Romanian (Rom), Russian (Rus), Spanish (Spa), Swedish (Swe).
Generation of texts up to three sentences long, where each sentence may be constructed with different semantic categories. For example, consider two variants of the first sentence of a description:
Forest[PAINTING] was painted by Paul Cezanne[PAINTER] in 1902[YEAR].
Forest[PAINTING] was painted on canvas[MATERIAL] by Paul Cezanne[PAINTER] in 1902[YEAR].
Variation of the syntactic form of the referring expression in sentence-initial position, e.g.
Forest was painted by Paul Cezanne in 1902. It[Pronoun] is painted in green and blue.
Forest was painted by Paul Cezanne in 1902. This painting[NounPhrase] is displayed at the National Gallery of Canada.
As mentioned above, the names of the paintings and painters have been left untranslated. Since museum names have been translated automatically, some translations are missing. As a result, untranslated names of two or three words contain underscores.
In Hebrew texts, names that are missing translations cause the words of a sentence to be ordered incorrectly.
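The interplay of optional categories and referring expressions described above can be illustrated with a minimal sketch. It uses plain English string templates and does not use the TextPainting grammar; the function name `describePainting` and the field names are assumptions made for this illustration only.

```javascript
// Build a short English description of a painting from optional attributes.
// Plain string templating that mimics the examples above; it is NOT the
// TextPainting grammar, and all names here are hypothetical.
function describePainting(p) {
  // Optional category: the material is only mentioned when it is known.
  const material = p.material ? ` on ${p.material}` : "";
  const sentences = [
    `${p.title} was painted${material} by ${p.painter} in ${p.year}.`,
  ];
  if (p.colours && p.colours.length > 0) {
    // Second mention: refer back to the painting with a pronoun.
    sentences.push(`It is painted in ${p.colours.join(" and ")}.`);
  }
  if (p.museum) {
    // Later mention: refer back to the painting with a noun phrase.
    sentences.push(`This painting is displayed at the ${p.museum}.`);
  }
  return sentences.join(" ");
}

console.log(describePainting({
  title: "Forest",
  painter: "Paul Cezanne",
  year: 1902,
  material: "canvas",
  colours: ["green", "blue"],
}));
// Forest was painted on canvas by Paul Cezanne in 1902. It is painted in green and blue.
```

In the grammar itself these choices are made per language by the concrete syntaxes; the sketch only shows how up to three sentences are assembled from whichever categories are present.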
This grammar is used to translate user queries into SPARQL. It contains 4 languages: English, German, French and a concrete syntax corresponding to SPARQL. Since the grammar is adapted to the patents domain, the constructors from the abstract syntax describe individual queries that depend on the domain. So, the SPARQL mappings are written in a gap-filling fashion, by specifying the query with spaces for the arguments.
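The gap-filling idea can be sketched as follows. The constructor names, the `ex:` prefix and the SPARQL patterns are hypothetical and are not taken from the patents grammar; in the grammar the patterns live in the SPARQL concrete syntax, whereas this sketch imitates them with numbered gaps in JavaScript strings.

```javascript
// Hypothetical query constructors with gap-filled SPARQL patterns.
// Names, prefix and properties are illustrative assumptions only.
const sparqlTemplates = {
  PatentsByInventor: 'SELECT ?patent WHERE { ?patent ex:inventor "{0}" . }',
  PatentsFiledBetween:
    "SELECT ?patent WHERE { ?patent ex:filingYear ?y . FILTER (?y >= {0} && ?y <= {1}) }",
};

// Fill the numbered gaps {0}, {1}, ... of a constructor's pattern with arguments.
function fillQuery(name, ...args) {
  const template = sparqlTemplates[name];
  if (typeof template !== "string") {
    throw new Error(`unknown query constructor: ${name}`);
  }
  return template.replace(/\{(\d+)\}/g, (match, index) => {
    const value = args[Number(index)];
    if (value === undefined) {
      throw new Error(`missing argument ${index} for ${name}`);
    }
    return String(value);
  });
}

console.log(fillQuery("PatentsByInventor", "Smith"));
// SELECT ?patent WHERE { ?patent ex:inventor "Smith" . }
```

Because each constructor fixes the shape of its query, only the arguments vary; a real deployment would additionally need to escape argument values before placing them into a query.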
More details are available in the deliverables released by Work-package 7.
The sources are in svn://molto-project.eu/wp7/query/grammars.
The Words300-grammar was produced to evaluate the correctness of the multilingual translation of ACE sentences offered by the ACE-in-GF grammar. The grammar contains ~300 words from the GF resource grammar library (RGL), namely the words from the ACE word classes common noun, transitive verb and proper name. Currently, most of the RGL languages are included, altogether 21 languages. For the description of the evaluation, see D11.3.
Note that the English sentences that this grammar produces are not always valid ACE sentences, due to "spaces in content words", which are not allowed in ACE. For example, the grammar supports For which computer does John wait? while ACE requires Which computer does John wait-for?.
The grammar can be used in a wiki at: http://attempto.ifi.uzh.ch/acewiki-gf/gf/Words300/main/
A snapshot of the grammar is available at http://www.molto-project.eu/biblio/software/words300pgf.
The Web Application Description Language, WADL (http://www.w3.org/Submission/wadl/), is a specification language of HTTP-based Web applications that can be read and processed automatically to generate web service clients. In combination with an API platform, such as Apigee (http://apigee.com), it is possible to expose the API of a web service to developers of third-party web applications so they can quickly integrate with further services, for instance authentication, logging data, performance monitoring.
For a GF grammar developer, writing a WADL specification for the grammar is a quick way to expose the translation command invocation details in a machine-processable way. Any PGF-compiled GF grammar can be fed to the GF Web Service along with specific commands and query parameters to provide, for instance, parsing, linearization, and random tree generation according to the GF Web Service API. The documentation is available at http://code.google.com/p/grammatical-framework/wiki/GFWebServiceAPI. The web application running the GF web service is distributed in the regular GF distribution. A Java frontend was developed during the MOLTO project, http://www.molto-project.eu/biblio/software/gf-java-master, and is being maintained at Github, https://github.com/Kaljurand/GF-Java.
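As a minimal sketch of how a command and its query parameters are combined into a request, the helper below builds request URLs. It uses only commands and parameters that appear in the request examples earlier in this document (browse, complete, abstrjson) and the local Foods.pgf address from those examples; the helper name `gfRequestUrl` is our own.

```javascript
// Build a request URL for a grammar served by the GF web service.
// `base` is the grammar URL; `command` and `params` become the query string.
// gfRequestUrl is a hypothetical helper, not part of the GF distribution.
function gfRequestUrl(base, command, params = {}) {
  const query = new URLSearchParams({ command, ...params });
  return `${base}?${query.toString()}`;
}

// Local server address as used in the examples of this document.
const foods = "http://localhost:41296/grammars/Foods.pgf";

console.log(gfRequestUrl(foods, "browse", { id: "Kind" }));
console.log(gfRequestUrl(foods, "complete", { from: "FoodsEng", input: "that pizza is very " }));
console.log(gfRequestUrl(foods, "abstrjson", { tree: "Pred (That Pizza) (Very Boring)" }));
```

`URLSearchParams` percent-encodes the parentheses of the tree, so the last URL is spelled differently from the hand-written one shown earlier but decodes to the same parameter value.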
The example WADL specification file for web services powered by the TextPainting.pgf grammar hosted on the Grammatical Framework cloud server and deployed on Apigee, as seen in the figure below, is available at http://www.molto-project.eu/biblio/web-service/textpaintingpgf. It exposes the GET command for retrieving the grammar information and the GET command for retrieving a random production in any of the available categories.

The designer of the web service for translating painting descriptions might decide to expose a very specific command, for instance only parsing of descriptions in Italian. This is possible by selecting what to describe in the WADL specification in a careful way, by not exposing the full generality of the grammar. Grammars that are stable only in certain categories, for instance because of increasing complexity in their modular stepwise development, can in this way be deployed while under development, provided the only web services exposed are the stable ones.
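To illustrate how a specification can expose only a narrow slice of a grammar, the sketch below generates a small WADL description with a single GET method whose command and source language are fixed. The resource layout, the method id and the concrete syntax name `TextPaintingIta` are assumptions made for this illustration; the sketch does not reproduce the WADL file archived by the project and performs no XML escaping.

```javascript
// Generate a minimal WADL description exposing only the listed methods.
// Each method has `fixed` query parameters (not changeable by the client)
// and `free` query parameters (supplied by the client).
// Layout and names are illustrative assumptions; values are not XML-escaped.
function wadlFor(base, grammar, methods) {
  const methodXml = methods.map((m) => [
    `      <method name="GET" id="${m.id}">`,
    `        <request>`,
    ...Object.entries(m.fixed).map(
      ([name, value]) => `          <param name="${name}" style="query" fixed="${value}"/>`),
    ...m.free.map(
      (name) => `          <param name="${name}" style="query" required="true"/>`),
    `        </request>`,
    `      </method>`,
  ].join("\n"));
  return [
    `<application xmlns="http://wadl.dev.java.net/2009/02">`,
    `  <resources base="${base}">`,
    `    <resource path="${grammar}">`,
    ...methodXml,
    `    </resource>`,
    `  </resources>`,
    `</application>`,
  ].join("\n");
}

// Expose parsing of Italian descriptions only (concrete syntax name assumed).
const italianParsingOnly = wadlFor("http://localhost:41296/grammars/", "TextPainting.pgf", [
  { id: "parseItalian", fixed: { command: "parse", from: "TextPaintingIta" }, free: ["input"] },
]);
console.log(italianParsingOnly);
```

A client generated from such a description can only issue the parse request; the other commands of the grammar remain available on the server but are simply not advertised.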
GF compiled grammars deployed as web services seem to be able to offer valuable translation and parsing functionality to developers of online applications. With the work done during the MOLTO project we have only begun to experiment with the usage of GF powered web services and the results have been positive.
To further the adoption of GF and MOLTO technologies for high-quality translation of web applications, it would be important to be able to obtain the machine-processable specification of the services, for instance as WADL or SOAP, directly through an export command in the GF Web Service API. The client applications for the web services exposed by the application grammar could then be generated automatically, allowing very fast prototyping. Software that generates web clients based on SOAP or WADL already exists.
| Contract No.: | ICT-FP7-ICT-247914 and 288317 | 
|---|---|
| Project full title: | MOLTO - Enlarged EU, Multilingual Online Translation | 
| Deliverable: | 10.4 | 
| Security (distribution level): | PU | 
| Contractual date of delivery: | M39 | 
| Actual date of delivery: | 30 May 2013 | 
| Type: | Report | 
| Status & version: | Final | 
| Author(s): | O. Caprotti, B. Popov, J. van Aart | 
| Task responsible: | UGOT | 
| Other contributors: | 
The final dissemination and exploitation report discusses how the MOLTO project has informed the public of its results. The industrial partners of the Consortium, Ontotext and Be Informed, are the main contributors to the exploitation plan for the technologies developed by MOLTO. Exploitation of MOLTO aims to pursue sustainability for the tools and technologies and to further their uptake.
In the MOLTO initial plan for dissemination, we proposed to carry out the task in the following way:
> Dissemination on conferences, symposiums and workshops will be in the areas of language technology and translation, semantic technologies, and information retrieval and will include papers, posters, exhibition booths and sponsorships (by Ontotext at web and semantic technology conferences like ISWC, WWW, SemTech), and academic/professional events such as the Information Retrieval Facility Symposium. We will also organize a set of MOLTO workshops for the expert audience, featuring invited speakers and potential users from academy and industry.
Here we report on what has been done to popularize the work done in MOLTO and make the language translation community aware of the project.
Additionally, this deliverable contains a plan of further exploitation of the project's results. In the longer run, as outlined by the Strategic Research Agenda for Multilingual Europe in 2020 by the META Technology Council, Language Technology is expected to enable forms of knowledge evolution, knowledge transmission, and knowledge exploitation that speed up scientific, social, and cultural development. Any exploitation of MOLTO results will have to take into account the themes of this research agenda. It is already clear that the trends have started. For instance, Theme 1, the translation cloud is the fitting trend for the MOLTO web services living in the cloud. Some of the MOLTO application grammars in the cloud do indeed provide "services for instantaneous reliable spoken and written translation among all European and major non-European languages".
During the lifetime of the project we have pursued many ways of informing the relevant stakeholders about the progress of the research and development of MOLTO tools. The user community for MOLTO technologies comprises academicians working in the areas of computational linguistics and the semantic web, but also members of industry offering services such as translation of web pages and of online content, from e-Government and business logic to cultural heritage and patents in pharmacology, as well as creators of resources for e-learning of mathematics.
Below are the ways in which this broad user community has been targeted.
Because of the Open Access Clause, we had to make sure that the copyright policy for the proceedings of chosen conferences and meetings would allow distribution of the publication also on the partners' Open Access Servers. This is the list of Open Access servers that are also distributing the MOLTO publications:
Here below is the list obtained from the web pages by fetching publications registered by the authors as Conference Papers.
Journal publication is a longer process than publication in conference proceedings, so one might expect it to occur after the end of a project's lifetime, as an archival medium for those results which are considered long-lasting and of permanent value. In MOLTO we can already list the following journal publication:
Books and proceedings has also been published and is currently being translated to Chinese,
Free/Open-Source Rule-Based Machine Translation, Online Proceedings of the FreeRBMT12, the Third International Workshop on Free/Open-source Rule-based Machine Translation, June 2012 Gothenburg, Sweden. Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Technical Report Number 2013:03, (2013). 1652-926X (ISSN).
Controlled Natural Language, Third International Workshop, CNL 2012, Zurich, Switzerland, August 29-31, 2012. Proceedings Editors: Tobias Kuhn, Norbert E. Fuchs ISBN: 978-3-642-32611-0 (Print) 978-3-642-32612-7 (Online)
The following work has appeared as chapters of books:
Jeroen van Grondelle, Christina Unger: A 3-Dimensional Paradigm for Conceptually-scoped Language Technology [to appear], Towards the Multilingual Semantic Web, Springer, Autumn 2013.
Several junior members of the MOLTO team have completed their studies and written master's theses or dissertations related to tasks carried out as part of some work package. Some are continuing work that began in MOLTO as part of their thesis research. They include:
Enache, Ramona (2011, Licentiate). Automating the development of multilingual grammars. Göteborg : University of Gothenburg.
Angelov, Krasimir (2011, PhD). The Mechanics of the Grammatical Framework. Göteborg: Chalmers University of Technology. Diss. ISBN/ISSN: 978-91-7385-605-8
Virk, Shafqat (2012, Licentiate). Computational Grammar Resources for Indo-Iranian Languages. Göteborg : University of Gothenburg.
Dannélls, Dana (2012, PhD). Multilingual text generation from structured formal representations. Data Linguistica. University of Gothenburg.
Listenmaa I. 2012. Ontology-based lexicon management in a multilingual translation system – a survey of use cases. Department of Modern Languages, University of Helsinki.
Ramona Enache (PhD, Forthcoming). Frontiers of Multilingual Grammar Development (prelim.). Göteborg : University of Gothenburg.
The project meetings were held every six months and always included a day which was open to participants from outside the Consortium: the MOLTO Open Day. The talks delivered by the MOLTO project members were targeted to a generic audience with no specific background assumed except for interest in the goals of MOLTO. The presentations are all available from the project's web site.
Additionally the project organized focused meetings:
The GF Summer School is a biennial event; the following editions were partly sponsored by the project:
A number of GF tutorials were held in recent years, during which the MOLTO tools were shown and actively used:
Press releases were issued at the beginning of the project and have already been reported in Deliverable 10.1. Each public event has been publicized by the local organizers via their official channels, so that announcements have appeared in calendars, bulletins and mailing lists.
Liaison with other EU-funded projects in the area of Computational Linguistics took place at international meetings where MOLTO was presented. The most relevant result to report however is joint work carried out with the MONNET project after organizing a joint meeting and a joint workshop reported before.
On December 13 and 14, 2012, PortDial members from Bielefeld met with Aarne Ranta (Grammatical Framework), Jeroen van Grondelle, Frank Smit and Jouri Fledderman from Be Informed, and John McCrae (lemon) in order to discuss the mapping from ontology-lexica to grammars, as well as the modular combination of induced domain grammars with dialog task grammars. The meeting gave rise to new ideas for the top-down grammar induction process being implemented. Moreover, the MOLTO-MONNET cooperation crystallized in the joint project proposal 611008- ADOPT coordinated by MONNET's coordinator Paul Buitelaar for combining the approaches, submitted as FP7-ICT-2013-SME-DCA but not granted.
MOLTO is a member of META-NET (http://www.meta-net.eu/) and more specifically of META-Share. The MOLTO language technologies, resources, and tools are being distributed to members of the computational linguistics community under LGPL, consistent with the collaboration agreement signed with META-Share. As part of the liaison activities within META-NET, MOLTO also gave feedback for the final version of the strategy document for the META-NET agenda for Multilingual Europe 2020 (http://www.meta-net.eu/sra-en).
In January 2012, A. Ranta presented MOLTO at Xerox Research Centre Europe, Grenoble, in a seminar that has been video recorded and is published online at http://videos.xrce.xerox.com/index.php/videos/index/618.
MOLTO hosted the FreeRBMT conference in June 2012, with a special workshop day devoted to exploring the possible cooperation between Apertium (http://www.apertium.org/) and MOLTO: results are already tangible, especially with respect to the adoption of the lexicons from Apertium.
MOLTO used the World Wide Web as its main channel for continuous dissemination and archiving. The project's web site, registered at http://www.molto-project.eu, has been designed mainly to support the internal management of the project, and several sections are open only to registered members of the Consortium. Gradually, as work progressed and results became available, we added some public sections; however, we have leveraged the possibility of being present on popular social sites, most prominently Twitter and LinkedIn, to push news to readers outside the Consortium. Recently we added a Consortium-only Google community page, which could in the future help maintain informal ties among the Consortium members and those interested in the future of the MOLTO technologies. It is not yet clear for how long the URL of the project's website will be maintained, but we plan to freeze the contents shortly after the end of the project and to produce an archival version. The most important documents will also be stored as a multimedia showcase, as required in Appendix X to Annex I.
In addition to the project's site, MOLTO has published multimedia content on:
Screencasts for some of the MOLTO tools have appeared on Screenr (http://www.screenr.com/user/MOLTOproject).
Events of interest have been advertised via newsletters and mailing lists (MT, EAMT) and social sites, in particular:
Partners have featured the MOLTO work on their websites (searching link:molto-project.eu yields about 45 hits).
The project coordinator and the workpackage leaders have been reachable for questions via a contact form accessible online. Recurring questions have been answered in the FAQ, http://www.molto-project.eu/view/faq, which is jointly edited by all registered users.
The publication list appearing on the website is an extensive reference list of the results of the project. It includes also software and other media. The RSS feed for the publications appearing in MOLTO is http://www.molto-project.eu/biblio and currently lists 224 items, many of which are the slide presentations delivered during the project's related events.
Given the general direction of the field of language technologies, as outlined in the strategic research agenda for LT2020, the exploitation of MOLTO results will focus on the high-quality translation services in the cloud. These cloud services may serve the public sector as well as a more technical audience. The case studies have shown the versatility of the MOLTO technologies in terms of domain of application, scalability, and target audience.
Exploitation of the project's results and acquired experience also goes towards furthering the field of language technologies. This has already been demonstrated by e.g. the work on the lexicon resources also with respect to usability of publicly available semantic web resources.
Several of the project's deliverables are of interest for further upkeep and will be maintained and developed in the future. MOLTO tools and technologies that have been released include several multilingual translation web services, grammar writing IDEs, guidelines and tutorials, a translation platform integrating the editing tools, and sample multilingual software such as a dialog system, query interfaces, and a multilingual semantic wiki. In addition, the events that have been organized during the project's lifetime aimed at capacity building, both in academia and in industry. Young academicians worked with the commercial partners on very concrete problems and had to learn how to communicate with non-experts. Similarly, the industrial partners had to identify the tasks and issues that could best be solved by asking the academic partners.
In what follows, the commercial partners outline the areas in which these newly created cooperation ties may in future be consolidated.
We identified three different strategies concerning exploitation:
Open Source Strategy: the project has adopted this strategy for the release of the final products. All software and tools are available under LGPL. Some of the technologies are under continuous development, as it is usual in the open source community, and can be adopted and commercially further developed by branching the repositories.
Spin off Strategy: this strategy is currently under discussion; interested parties are evaluating whether to set up a spin-off consultancy firm to provide GF and grammar/multilingual knowledge that companies might not have readily available.
Commercialization Strategy: this is the strategy of the commercial partners, outlined in the following sections.
The project members discussed and agreed upon the use of the method from Stähler(1) to develop commercial exploitation plans. This results in the following contributions from each industrial partner:
For each promising opportunity the method from Stähler was followed to develop a plan to exploit outcomes. This results in the following sections corresponding to the method phases depicted in the figure below.

(1)Stähler, Patrick; Geschäftsmodelle in der digitalen Ökonomie: Merkmale, Strategien und Auswirkungen. Josef Eul Verlag, Lohmar, 2001
Be Informed is an internationally operating, independent software vendor. The Be Informed business process platform transforms administrative processes. Thanks to Be Informed’s unique semantic technology and solutions, business applications become completely model-driven, allowing organizations to instantly execute on new strategies and regulations. Organizations using Be Informed often report cost savings of tens of percent. Further benefits include a much higher straight-through processing rate leading to vastly improved productivity, and a reduction in time-to-change from months to days.
The role of Be Informed in MOLTO is to make sure that the solutions developed in the project can indeed be readily integrated into their solutions (the Be Informed Business Process Platform in particular). Be Informed will build on its strong expertise in its domain to guide the project and make sure that the results are exploitable from a commercial point of view in the mid-term. Dissemination to, and feedback from, its client base, as part of the use case development in WP12, will increase the degree of suitability for exploitation.
Be Informed's exploitation strategy is tightly linked to its goal of quickly commercializing MOLTO results, and calls for a rapid and continuous flow of information to its sales force, existing client base and potential future customers. In addition, as an innovative company, Be Informed plans academic talks and publications.
The outcome of MOLTO is relevant for Be Informed's Business Process Platform. For both client and server product components, the translation services based on the GF-based prototype can offer translation support at design time as well as at runtime. This would enable several usage scenarios dealing with the verbalization of customers' business models and other artefacts.
For more detailed information about this product and its solutions see www.beinformed.com.
The main approach of Be Informed Research and Innovation is based on co-innovation with customers, partners, and other third parties. These activities usually result in a working prototype. Prototypes which seem promising to get enough traction with customers are handed over to Be Informed Product and Solution development.
The MOLTO deliverables will be promoted to our clients and partners in the public sector. The prototype of the MOLTO multilingual verbalization component for integration with the Be Informed Business Process Platform will be made available as an optional product component.
In this section we present a concise overview of relevant public sector trends and views within and across European Union Member States on future public services. This background information is not only necessary to understand the societal and political context in which multilingual public sector services take place, but also to detect synergies (and potential divergences) between visions about ontology driven services, language aspects and current developments within the public sector. The presented overview is inter alia based upon recent studies by the OECD (Towards Smarter and more transparent Government, e-government status spring 2010; OECD e-Government project; 25 March 2010; GOV/PGC/EGOV(2010)) and research results from the CROSSROAD Project (A Participative Roadmap for ICT Research in Electronic Governance and Policy Modelling; a support action under the European Commission 7th Framework Programme. http://crossroad.epu.ntua.gr/the-project/objectives/FP7-ICT-4-248458).
Within the context of this project we are dealing with public sector services that provide information and advice and perform transactions between citizens or companies and administrations. By using ontologies which contain concepts, their relations and respective rules, public sector services become decision centric and goal driven. This enables the public sector to become more agile, customer centric, efficient, effective and accountable as well.
In this section we will use the concepts of Governments and Public Sector interchangeably. Political institutions and administrative structures of countries are diverse, but regardless of their shape, they are all part of the Public Sector ecosystem that provides public sector services to citizens and companies or institutions. Governments in Europe face an increasing number of challenges such as ageing populations, immigration, climate change and globalization, further reinforced by the financial crisis. The globalization trend has limited the freedom of governments to manage their national economies and new challenges such as immigration and an ageing population seem to fundamentally affect the scope of public sector activities. At the same time, society’s expectations of public service delivery have by no means diminished as citizens from the 1980s onwards have become more concerned with choice and service quality. The paradox faced is one of open-ended demand versus a capped or falling resource share for actual delivery. Consequently, public administrations are under constant pressure to modernize their practices to meet new societal demands with reduced budgets.
In the Visionary Scenarios Design of the CROSSROAD Project, the researchers present a summary of the main trends with respect to ICT for governance and policy making in the wider context of an evolving public sector. They define a set of core policy trends across the governance and policy modelling domain, which also resonate with the use case settings of the MOLTO project.
Within the context of this project we are dealing with public sector services which provide information and advice and perform transactions between citizens or companies and administrations. This type of service is decision centric by nature: it deals with rights, permissions and obligations, for instance in the domain of permits and grants. The activities that the services have to support are knowledge intensive and event driven. This makes them well suited to semantically enabled services, with ontologies at their core.
We have to take into account that ontology support for public services is not only positioned at the end of the service chain, where government and citizen meet, but throughout the whole service chain. Treating a request for a permit and deciding upon it is based upon the same rules as advising whether one is entitled to acquire the permit. The concepts and rules used in ontologies therefore apply equally to citizens' interactions and to administrative officials' interactions. The need for localization can, however, differ between these two target groups. In a traditional view, public sector services are positioned at the execution and enforcement layer of the public sector infrastructure. This layer deals with policy implementation. For reasons of scoping, we also focus on this policy implementation layer at this stage of the project.
We foresee, however, a trend in which the use of ontologies moves further upstream, towards the policy making process, since this is where they can contribute most to the outcome.
The main beneficiaries of the MOLTO outcomes are domain experts using Be Informed in an international context in the public sector. These public sector services provide information and advice and perform transactions between citizens or companies and administrations. By nature, this type of service relies heavily on interaction and communication on the one hand and on the execution of regulations on the other. The quality of both aspects must be guaranteed. We will briefly describe scenarios in which public sector actors, such as domain experts, are confronted with localization aspects of the services they provide or intend to provide. These scenarios are:
In all scenarios we can see that, although policy making and implementation seem to be mostly a local (national) issue, there are very often also international aspects that have to be taken into account.
A very common pattern worldwide is the provision of public sector services in the field of immigration. Immigration services have to be provided to immigrants who want to work and/or live in another country and to companies or organizations who want to hire labour from another country. A specific kind of stakeholder is the group that wants to bring family members to the country they live in. The main process is the issuing of permanent or temporary/provisional permits for admission and residence. A crucial characteristic of this process is that the rules for admission and residence change frequently and sometimes at short notice. Since immigration offices communicate with ‘the whole world’, one cannot expect them to translate their services into all languages. Normally they will use the language or languages of their own country and perhaps one or a few other languages that can be understood by the majority of their customers. In specific cases, they will want to translate part of their information into a specific target language. This can be the case, for instance, when due to a certain incident a new group of immigrants from an individual country ‘threatens’ to flood the country. They therefore need a process that supports the translation of services into the current languages on a regular and flexible basis, and an approach to deal with incidents that require instant translations into the non-current languages. In all cases it is a challenge to translate the complicated immigration laws and procedures into comprehensible services for national and international users.
An example of a government agency that has to cooperate internationally is the Dutch Emission Authority (NEA). Emissions trading is a flexible policy instrument which governments use to improve the living environment. In the Netherlands there are two emissions trading systems, one for emissions of carbon dioxide (CO2) and one for emissions of nitrogen oxides (NOx). Emission trading requires an infrastructure for issuing permits, monitoring and allocating emission allowances. Emissions trading is inevitably an international business that requires cross boundary cooperation, information and communication. The public services of the Dutch Emission Authority must therefore be available and accessible in more than one language. In this case NEA wants to make its service also available in the English language.
Trading requires international agreement on standards and preferably also on service patterns. By using one information concept it becomes easier to exchange information and to innovate. In such a case the ontology supported infrastructure of a frontrunner in the specific domain, such as NEA, could be used as a basis for internationalisation and standardisation.
The times of splendid isolation are over (if they ever existed); we are living in a dynamic international world and an increasingly global market. One of the government parties that is affected daily by this trend is Customs. They have to deal not only with local laws, but also with common market regulations, international trade regulations, etc. The regulations they have to comply with and enforce change frequently, based upon incidents, new insights and political developments. And within a set of regulations, the priorities for enforcement can change too.
Customs have to deal with international treaties about the traffic of goods between countries and the limitation thereof. For example, to import certain goods from China, one has to apply for an export license in China, which is transformed into an import permit in the country of destination. This leads to multilingual public services that are delivered in different countries of the world. Depending on the types of goods there might be an additional import tax to protect a country’s internal market from being ‘flooded’ with low price goods from low cost countries.
In order to be able to levy additional tax on certain goods, one must be able to classify these goods. The EU defined the Combined Nomenclature, which is in fact a taxonomy of goods and their codes that can be used to classify goods that enter a country. This taxonomy is available in all official languages of the European Union. The taxonomy is based on the Harmonized Commodity Description and Coding System, which is run by the World Customs Organization. The harmonized system is used by 137 countries and the European Union.
Many countries are bilingual or multilingual. This means that all official publications and services have to be provided in more than one language. Often the pilot language, the language in which a document is written first, depends on the preferred language of the author. By using an ontology, the meaning of the document in the pilot language can be expressed abstractly and unambiguously in concepts and rules. These can then be translated into a particular language, expressing the meaning in the vocabulary and syntax of that language.
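The idea of expressing meaning once, abstractly, and rendering it per language can be illustrated with a minimal sketch. It uses plain string patterns rather than GF grammars, and the `Requirement` type, the concept identifiers, the lexicon entries and the sentence patterns are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """Language-neutral rule: `subject` needs `obj` (both are concept identifiers)."""
    subject: str
    obj: str


# Hypothetical per-language lexicons: concept identifier -> noun phrase.
LEXICON = {
    "en": {"applicant": "the applicant", "residence_permit": "a residence permit"},
    "nl": {"applicant": "de aanvrager", "residence_permit": "een verblijfsvergunning"},
}

# Hypothetical per-language sentence patterns for a Requirement.
PATTERNS = {
    "en": "{subject} needs {obj}.",
    "nl": "{subject} heeft {obj} nodig.",
}


def verbalize(rule: Requirement, lang: str) -> str:
    """Render the abstract rule in the vocabulary and word order of `lang`."""
    words = LEXICON[lang]
    sentence = PATTERNS[lang].format(subject=words[rule.subject], obj=words[rule.obj])
    return sentence[0].upper() + sentence[1:]


rule = Requirement("applicant", "residence_permit")
for lang in ("en", "nl"):
    print(lang, verbalize(rule, lang))
```

The rule is authored once and never mentions a natural language; adding a language means adding a lexicon and a pattern, not rewriting the rule. A grammar formalism such as GF replaces the fixed patterns with real syntax and morphology.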
Be Informed captures policy in ontologies. These ontologies are used throughout the policy lifecycle from choosing/deciding on policy, communicating the agreed upon policy to all stakeholders to running the supporting applications. As a consequence, verbalizations of these ontologies could be used in a number of scenarios throughout that policy lifecycle.
For the ontologies to be used as the basis of actual applications, it is crucial that they contain a correct representation of the requirements and constraints. Review and validation before deployment, and the ability to provide feedback on the model after deployment, are very important. A natural language representation of the models can help stakeholders to carry out these tasks. Special verbalization choices might have to be made to create texts that are effective in this specific scenario.
The most effective way of business user involvement is of course allowing them to create models themselves or, often more realistic, to maintain and alter existing models. In [EKAW2010] we explored editors that do use a textual metaphor to present models to the users, but that do not use typing text as editing metaphor.
Typically, systems need to be well documented for IT organizations to be able to support production use and perform maintenance. The online, navigational access to the models is then often not acceptable, and conventional documentation sets need to be generated.
Classically, business applications have used tables of data to present detailed information that is available in a business process. When customers are involved in business processes, they find it hard to interpret the data. Verbalization into natural language can be a great way to present, for instance, process progress data to laymen, as the data can be presented in a self-explanatory way.
The ontologies capturing legislation and policy are used to drive decision services, applying the policy to actual cases. The decisions taken are communicated to the stakeholders and need to be documented and explained. Verbalization of the model could be extended to verbalization of the decisions based on the models.
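As a rough illustration of presenting process progress data as text instead of a table row, consider the sketch below. It is template based and not the MOLTO or Be Informed verbalization engine; the record fields (`case_type`, `steps_done`, `steps_total`, `current_step`) are assumptions made up for the example.

```python
def describe_progress(record: dict) -> str:
    """Turn one row of (hypothetical) process progress data into a sentence."""
    done = record["steps_done"]
    total = record["steps_total"]
    if done >= total:
        status = "has been completed"
    else:
        status = f"is at step {done} of {total} ({record['current_step']})"
    return f"Your {record['case_type']} application {status}."


# A row as it might appear in a classical table-based view.
row = {
    "case_type": "permit",
    "steps_done": 2,
    "steps_total": 4,
    "current_step": "document check",
}
print(describe_progress(row))
```

The same record could be rendered in several languages by swapping the fixed English strings for language-specific generation.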
The proposed exploitation path would increase revenues of existing products like the Be Informed Business Process Platform. Be Informed will offer the MOLTO verbalization engine as an optional product component. It is difficult to predict the size of the increase at this stage of development.
Ontology translation systems are usually created using general-purpose programming languages, such as LISP or Java, and the mappings between expressions in the source and target languages are neither well-documented nor explained. Integrated tooling as part of Be Informed’s Business Process Platform is at this stage unique.
Ontotext’s business model combines the development of products (including some open source versions) with the provision of research, consultancy and development services. Many commercial projects combine all four elements. For Ontotext, MOLTO brings a unique opportunity to strengthen its position in the semantic technologies and knowledge-driven text analytics market, through the development and adoption of intelligent methods that support ontology-based multilinguality. This is possible because MOLTO adds the GF formalism to the semantic technologies; GF operates as an interlingua at the language level and thus localizes the ontologies in appropriate ways. More precisely, the main directions of future development will be as follows:
The business strategies will be as follows:
Ontotext AD is the strongest semantic technologies company in Europe and a world-leading supplier of core semantic technology, text mining and web mining solutions.
We have an unmatched portfolio of world-class technology and expertise in:
The main differentiator between Ontotext and other semantic technology vendors is that we deliver robust technology, proven in multiple high-profile projects that demonstrate its maturity and usability. The best example is the use of OWLIM (our RDF database engine) in the BBC FIFA World Cup 2010 website, where most of the pages were generated dynamically through queries to OWLIM: millions of requests per day and hundreds of updates per hour, handled by a cluster of a few servers. Following the success of this project, the BBC extended the use of Ontotext technology to the BBC Sport website and the London Olympics 2012 website.
Ontotext’s clients span several sectors:
Considering the substantial number of Ontotext clients in the UK, we run regular open training courses in London, “Semantic Technologies with OWLIM”, usually scheduled roughly once per quarter.
On the one hand, the research goes into products through the traditional ways:
On the other hand, the technology developed within the project is applicable to other related areas of NLP services. It can be used either as a stand-alone application or integrated into larger and more complex architectures. Both business opportunities have significant added value.
The first direction is exemplified by the envisaged use case domains: patents in the medical domain and artefacts in the cultural heritage domain. The second direction goes to areas that rely strongly on Question Answering, Information Retrieval and MT, such as Publishing, Social Media and Pharma. The related products are highly commercial, and thus precision and relevance of the retrieved information are crucial features for the clients. The GF formalism would be useful for smoothing the multilingual retrieval and translation results. It must be noted that the component shared by all targeted products of Ontotext is the ontology-based knowledge that relates to LOD and multilingual settings.
All the EU research projects that Ontotext has been involved in have led to the improvement of the current technology as well as to the creation of new products that have been explored in commercial projects. In this way, we might view Research as an Investigation, Preparation and Compilation phase, and the applications in Industry as an Adaptation, Harmonization and Real Setting evaluation phase. Some synergies of this kind are given below:
RENDER is an ongoing project that aims at providing a comprehensive conceptual framework and technological infrastructure for enabling, supporting, managing and exploiting information diversity in Web-based environments. It also would leverage very large amounts of content and metadata: news, blog and microblog streams, content and logs from Wikipedia, news archives, multimedia content and reader comments, discussion forums, etc. This data is managed by a highly scalable data management infrastructure, and enriched with machine-understandable descriptions and links referring to the Linked Open Data Cloud. This development would lead OWLIM and KIM technologies to handle diverse data, which would widen their data coverage and management.
CUBIST is an ongoing project that aims at Combining and Uniting Business Intelligence and Semantic Technologies, with a special focus on unstructured data mining. Central to the project goals, the semantic technology supports a persistent layer, a semantic Data Warehouse. The project contributes to better Visual Analytics, whose improved characteristics would be important for providing more competitive user interfaces in industry.
MediaCampaign is innovative in ontology creation for cross-media modelling of media presence and campaigns; semantic cross-market product data interlinking; and identification and tracking of new media campaigns in different media and countries. MediaCampaign focused mainly on advertisement campaigns and their impact on attitudes and opinions. Thus, the publishing services provided by Ontotext will be enriched with sentiment analysis in addition to the knowledge-based analysis, giving Ontotext a socially aware service.
The NoTube project concentrated on personalized semantic news, a personalized TV guide with adaptive advertising, and Internet TV in the Social Web. It relied on the key role of the semantic technologies, took the community aspects into account and was built on multilinguality. The results strengthened the personalized component in the retrieved information in commercial publishing platforms.
The PHEME project (starting in October 2013) has as its main goal the development of scalable methods for Social Semantic Intelligence, across media and languages. It also aims at modelling not only facts and opinions, but also the parameters of reliability of the information sources. Additionally, PHEME focuses on more concrete and socially crucial cases of recent years, such as crowdsourcing, citizen journalism and bioinformatics. PHEME goes beyond official media campaigns to social network dimensions, and beyond opinions to rumour and misinformation detection. This project will lead to large-scale, social media bound OWLIM and KIM platforms. It will also add the identification of misinformation to their services, which would be a very valuable facility in the personalized component for end users.
In all productizing areas, listed below, the following underlying NLP technology is assumed:
With the globalization processes and the harmonization of large groups of documents in the EU, the requirements for particular data management systems are growing rapidly. Additionally, virtual space has become more populated, shared, explored and multilingual. Examples include virtual tours in famous museums, virtual storage of and access to EU legislation, interactive online digests, electronic government, digital preservation storage and social networks.
For these reasons, it is not surprising that some of the most active business domains at the moment are the Cultural heritage stakeholders (DARIAH, CLARIN, EUROPEANA); Pharma (Astra Zeneca); Media Publishing (BBC, NDP, Press Association) and Social Media (Pheme project).
Ontotext is involved in all of the aforementioned domains through research projects and commercial projects.
Cross-media analytics is a typical case of business intelligence developed at Ontotext. Ontotext’s technology primarily (but not only) serves publishing agencies (such as Press Association, NDP, Oxford, etc.) and government data management (US government). From a language point of view, the company has been working systematically on commercial projects for Dutch and English. Lately, however, it has started to expand the multilingual set to Bulgarian, German, Chinese, etc. With these facts in mind, the GF formalism and the RDF-GF interoperability from MOLTO would be a natural extension of the information extractors, facilitating the interaction between the users’ queries and their machine processing. More precisely, the following extensions are envisaged: an embedded translator service, tuned to the domain (sports, finance, politics, etc.); and an embedded converter from RDF representation to GF and then to language, and vice versa.
Also, internally, the semantic annotation tool will be augmented with language localization modules that would support the annotators.
Related markets: publishing; electronic government
SWOT Analysis:
Strengths: Improvement of the existing multilingual modules; creation of new functionalities to the customers, such as viewing the same result in various languages; improving the annotation process and text analytics; better communication between the ontology and user queries.
Weaknesses: Domain adaptation of the MOLTO modules might be needed, when addressing a new domain or even a subdomain of a specific domain.
Opportunities: It might be possible to create a new-generation publishing platform which provides a typological core for many languages and is thus easily adaptable to new languages. Additionally, being able to see, for example, Dutch news in English within the publishing system itself would greatly help customers.
Threats: The online real time applications might be unstable initially due to the complex architecture.
In Ontotext projects the existing LOD resources (such as, Linked Life Data) are applied for different socially aware domains and across languages. For example, the entity extraction tool LUPedia as well as the linked data concept store FactForge will be used in enhancing the socially marked knowledge. These modules will be extended by the language generation tool from MOLTO in order to improve the accuracy of the extracted information. This step is manageable, since the MOLTO rule-based translation technology is extended with the help of statistical approaches.
Related markets: education, tourism
SWOT Analysis:
Strengths: MOLTO gives the possibility of applying a structured approach to unstructured data for the purposes of good understanding of big amounts of data.
Weaknesses: MOLTO might support some forms of social media (publicly available) better than others (restricted).
Opportunities: The social media might be viewed as a network of subdomains and addressed by MOLTO technology in a step-by-step way.
Threats: No visible possibility is foreseen at the moment for using MOLTO modules directly in sentiment and opinion analysis.
Ontotext regularly participates in projects that consider health care and life sciences data management (a Life Science project is currently running). Here the available domain ontologies are explored together with NLP processing. GF will be extremely useful since both the prescriptive and diagnosis languages are controlled. There is an additional level of translation here, namely from the specialized prescription and anamnesis language of doctors into the common natural language of the users.
Related markets: Medical product sales, health care
SWOT Analysis:
Strengths: MOLTO performs best in controlled and structured domains. Pharma is a good example of such a combination. In addition, there is already a working prototype on Patents in this domain.
Weaknesses: Pharma would be more manageable from the perspective of text produced by doctors than from the patient perspective, since professional language is better controlled.
Opportunities: Improvement of multilingual search and relevance of the search results.
Threats: Pharma is one of the well elaborated domains from a processing point of view. Thus, the real added value of MOLTO is to be tested in the future.
Part of the commercial projects, carried out within Ontotext, are connected to pharmacies. In this respect, the developed prototype in bio-medical and pharmaceutical domains will be employed directly in the workflow processes.
Related markets: Administration, government, science, businesses
SWOT Analysis:
Strengths: The use of patents is a common and necessary activity in industry. Thus, the structure model created for handling patents in one specific domain would be applicable to patents in other domains, too. Retrieval services in a strongly cross-lingual context would also be very attractive features for exploitation.
Weaknesses: If the MOLTO modules are used in another domain of patents (for example science), some adaptation will be needed, although the patent structure itself would be stable across domains.
Opportunities: The use of the patent service would facilitate and speed up the process of managing Pharma policies with respect to new developments in healthcare.
Threats: The patent service might not cover all the query requirements of the users due to the limitations of the controlled language or the incompleteness of the corpus.
There are many stakeholders in this area, since lately the related initiatives have grown considerably. Here we have in mind the specific ones: Europeana, British Museum, ConservationSpace and CLARIN.
Europeana already provides search facilities. However, they cover only metadata and are not connected to ontologies. Also, the translation from one language to another is done via machine translation (MT) only, without any grammatical formalism behind it.
The British Museum is a partner which would try the MOLTO services, profiled specially for museum objects. They already use Ontotext’s semantic repository OWLIM. This service supports semantic search, semantic RDF data sources and Web publication. MOLTO will contribute better search functionality as well as multilingual information extraction. Similar projects are: Gothenburg City Museum (Sweden); Polish Digital National Museum; Yale Center for British Art (USA): Linked Open Data publishing of museum collection.
The ConservationSpace project is managed by the National Gallery of Art (USA) and 7 other institutional partners from the USA, UK and Denmark. It handles the data management. MOLTO services might also contribute to better preservation of the documents by adopting the GF formalism as a mediator between the users' queries and SPARQL queries. Similar projects are: FP7 CHARISMA: Synergy for a Multidisciplinary Approach to Conservation/Restoration; FP7 3D-COFORM: 3D documentation and collection formation of tangible cultural heritage.
CLARIN is a pan-European initiative which aims, among other services, at elaborating a globally shared service for the exploration of cultural artefacts. Ontotext is a participant in this initiative. It might provide the same facility to the consortium as in the above opportunity. Similar projects: FP7 V-MUST: Virtual Museum Transnational Network, a Network of Excellence.
Related markets: tourism, education
SWOT Analysis:
Strengths: coverage of many languages with language specific mediated filtering (through GF formalism); high precision of the retrieved content due to the controlled language; easy adaptability to other areas of cultural artefacts and languages.
Weaknesses: Since one of the use cases in MOLTO considers museums, the application to other subdomains of Cultural heritage might need adaptation of the grammars and ontologies.
Opportunities: The service can be adopted by various virtual cultural databases and adapted to them.
Threats: Resources might not be available for certain languages or language variants; language generation might not be efficient enough for all languages.
The project dissemination activity has focused on three major stakeholder groups: NLP researchers, the public sector, and semantic web technologists. These have been reached by organizing events at international meetings, on online social platforms, and face to face. A major outcome of the project is the ongoing discussion between the academic developers of the MOLTO technologies and the commercial partners, based on the work carried out. This discussion concerns the future mechanisms that should be created so that the MOLTO results can be successfully adopted for exploitation. It has become clear during the case studies and the evaluation of the project that the fast developing technologies used in MOLTO need to mature before they can be used commercially. Moreover, it would be desirable to be able to offer professional support, consultancy services and training in order to promote the uptake of the project's translation services.
| Contract No.: | FP7-ICT-247914 | 
|---|---|
| Project full title: | MOLTO - EEU Multilingual Online Translation | 
| Deliverable: | D 12.2 User studies for BI's explanation engine | 
| Security (distribution level): | Public | 
| Contractual date of delivery: | 31 May 2013 (M39) | 
| Actual date of delivery: | 31 May 2013 (M39) | 
| Type: | Report | 
| Status & version: | Draft 0.9 | 
| Author(s): | Joris van Aart, Jouri Fledderman, Jeroen van Grondelle | 
| Task responsible: | Be Informed | 
| Other contributors: | Jeroen Daanen, Menno Gulpers, Emiel van Haandel, Herko ter Horst, Frank Smit, Xander Uiterlinden | 
| Attachment | Size | 
|---|---|
| D12.2 User studies for BI explanation engine.pdf | 755.43 KB |