Harvesting big text data for under-resourced languages

Project facts

Project promoter:
Masaryk University, Brno
Project Number:
CZ09-0002
Target groups:
Researchers or scientists
Status:
Completed
Initial project cost:
€965,800
Final project cost:
€913,500
From Norway Grants:
€776,475
The project is carried out in:
Jihomoravský kraj

Description

The project aims to develop tools for the preservation and advancement of linguistic studies and to create an opportunity to advance knowledge of languages. In addition to new corpora, long-term outcomes also include further dissemination of the project results via a dedicated website. The outcomes include annotated corpora, some large and some smaller in scale but equally important. The existing software tools and results will be improved. Shallow processing applications will be built for investigating and separating the multiple senses of words in the corpora. The results will make it possible for a less-developed country to acquire information technologies and will contribute to its cultural development. The Norwegian University of Science and Technology team will carry out dissemination and exploitation of the results through many channels. The partners expect to benefit academically and economically, in particular by developing novel models that leverage machine learning and advanced linguistic representations.

Summary of project results

The main objective of the HaBiT project was to gather large-scale text data (corpora) from the web and to process them so that they can be used in language applications such as information extraction or machine translation. The project focused on Norwegian, Czech, and four major Ethiopian languages: Amharic, Afaan Oromo, Tigrinya and Somali. Large annotated corpora were created for all of these languages, together with software modules such as taggers, parsers and sketch grammars. The results were presented to the scientific community in conference and journal papers as well as on the project web page. The outputs are freely accessible for further research via the HaBiT system, created in cooperation between the project partners, Masaryk University and the Norwegian University of Science and Technology (NTNU), together with the University of Oslo and two Ethiopian universities that cooperate with NTNU on the NORHED project. Intelligent language processing applications (such as information extraction or machine translation) need large text data that have been processed (annotated and parsed). Major languages (English, German, ...) have enough resources available; less-covered languages, such as the Ethiopian languages and hundreds of others, are in urgent need of methodologies and techniques for creating linguistic resources. The HaBiT project developed such methodologies and techniques and demonstrated them on the four main Ethiopian languages. The expected outcomes were large annotated corpora for the participating under-resourced languages and a part-of-speech annotation framework, as well as taggers, parsers and sketch grammars for the languages involved. All the outcomes were achieved as planned and are accessible via the publicly available interface of the HaBiT system.
The HaBiT project built on the NORHED project, thoroughly testing the technologies and thus addressing the call topics on technology assessment, verification and testing, as well as on ICT meeting societal challenges. Cooperation with a less-developed country also gave the project added value in political terms. The two international workshops organized within the HaBiT project attracted a high proportion of attendees from outside the project partners, coming from all over the world. The interim results of the project have been published in 40 scientific conference papers and 2 international journal articles. The application results are publicly available in the form of software tools and linguistic databases. Project website: www.habit-project.eu

Summary of bilateral results

The objective of the project was to create annotated corpora and an annotation framework, and to help a less-developed country acquire information technology. This objective was fulfilled in both respects: a) general methodologies for obtaining linguistic resources (corpora, taggers, sketch grammars) for a new language were developed and published, and b) the methodologies were applied to produce the respective linguistic resources for the four Ethiopian languages, where they represent the state of the art for these languages. All planned outputs were delivered: annotated corpora and sketch grammars were created for all the languages involved. A semantic search interface was developed, as well as dynamic concept matching. The PoS annotation framework was developed. The HaBiT system was completed. Scientists from the University of Addis Ababa participated in the evaluation of the newly built resources and rated the quality of these project outcomes very highly. Masaryk University contributed world-leading expertise in corpus preparation and natural language processing. The Norges teknisk-naturvitenskapelige universitet (NTNU) introduced new results in the areas of word-level semantic matching and disambiguation and of multilingual word spaces. The cooperating institutions of the University of Oslo oversaw the linguistic evaluation of the project results. Within the partnership, three research meetings were organized in Oslo and Brno, and an information meeting about the Czech-Norwegian project was held for students of the Faculty of Informatics and the Faculty of Arts. Two international workshops were also supported; they were devoted to building new language resources for languages with few or no existing resources and to the practical evaluation of the efficient annotation framework.
The cooperation between MU and NTNU continued from the previous EU project PRESEMT and was institutionalized within the HaBiT project, leading to new state-of-the-art results in the area of linguistic resources for under-resourced languages, which the beneficiaries continue to exploit after the end of the project.