Creating and Annotating a Parallel Corpus for the Recognition of Translation Equivalents

Project No. 12

Project Leader: Professor Anna Sågvall Hein, Dept. of Linguistics, Uppsala University, Box 513, S-751 20 Uppsala, Sweden.

E-mail: Anna.Sagvall_Hein@ling.uu.se

Fax: +46 18 18 14 16

Collaborators:

Lars Borin

Erik Tjong Kim Sang

Per Starbäck

Bengt Dahlqvist

Klas Prytz, Ph.D. student

Students enrolled in the Language Engineering Master's Programme

Funding 1996-1997: 810,000 SEK.

Languages: Swedish, Dutch, English, Finnish, French, German, Italian and Spanish.

The basic aim of the project is to develop a computerized multilingual corpus that can be used in contrastive lexicographic work and in methodological studies directed towards the automatic recognition and extraction of translation equivalents from text. The corpus will comprise Swedish source texts representing different styles and domains, together with their translations into several languages. A basic requirement is that the corpus be tagged for word class and aligned, primarily sentence by sentence.

As of October 1996, the project had resulted in two parallel, aligned subcorpora, the Scania Corpus and the Swedish Statement of Government Policy Corpus.

The Scania Corpus, Scania 9606, is a collection of truck maintenance manuals from the Swedish truck manufacturer Scania CV AB in Södertälje. It consists of 80 documents in each of eight languages: Swedish (source), Dutch, English, Finnish, French, German, Italian and Spanish. The total size of the corpus is about 1.67 million words (64 MB).

The Swedish Statement of Government Policy Corpus, Regeringsförklaringen 9607, is a collection of Government Statements made in 1988 (Swedish, English, German, and French), 1994 (Swedish), 1995 (Swedish), and 1996 (Swedish, English, German, French, and Spanish). The total size of the corpus is 26,709 tokens (371 kB). It is available at http://strindberg.ling.uu.se/~corpora/rf/.

The text structure of the documents in the two corpora has been automatically marked up with TEI Lite-conformant SGML, by means of software developed in the project. The sentences (or sentence fragments) in the different language versions have been aligned with each other. Software has also been developed for accessing the parallel corpora; a demo can be found at the location for the Regeringsförklaringen 9607 corpus.
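To illustrate what sentence-level alignment makes possible, here is a minimal Python sketch of a lookup over aligned units. It is not the project's access software: the function name `find_translations`, the dictionary layout keyed by language code, and the sample sentences are all assumptions made for illustration.

```python
def find_translations(aligned_units, source_lang, word, target_lang):
    """Return (source sentence, target sentence) pairs whose source
    sentence contains the given word form (case-insensitive).

    aligned_units is a hypothetical representation: one dict per
    aligned unit, mapping a language code to that unit's sentence.
    """
    word = word.lower()
    hits = []
    for unit in aligned_units:
        source = unit.get(source_lang, "")
        # Naive whitespace tokenisation; a real corpus would use its markup.
        if word in source.lower().split():
            hits.append((source, unit.get(target_lang)))
    return hits


# Invented sample data, not taken from either corpus.
aligned_units = [
    {"sv": "Lossa skruvarna", "en": "Undo the screws"},
    {"sv": "Dra åt muttern", "en": "Tighten the nut"},
]

hits = find_translations(aligned_units, "sv", "skruvarna", "en")
for source, target in hits:
    print(source, "|", target)
```

Keeping each aligned unit as one record means that any word found in the source text immediately yields its candidate translation context in every other language.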

A first step in the tagging of the Swedish part of the Scania Corpus has been taken with the morphological analysis of its word forms; word forms that are ambiguous with regard to part of speech have been tentatively disambiguated by heuristic means. According to this analysis, the Swedish part contains 178,355 tokens (single words and lexicalised phrases), 19,360 types (word forms), and 9,549 lemmas. A frequency distribution of sentence lengths has also been generated.
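The distinction between tokens and types, and the sentence-length frequencies, can be illustrated with a short Python sketch. It assumes the text is already split into sentences and tokens, and that word forms are compared after lowercasing; both are assumptions, as are the function name `corpus_counts` and the invented sample sentences. Lemma counts are left out because they require a morphological analyser.

```python
from collections import Counter


def corpus_counts(sentences):
    """Count tokens, types and sentence lengths.

    sentences is a list of sentences, each a list of token strings.
    Case folding before counting types is an assumption of this sketch.
    """
    tokens = [tok.lower() for sent in sentences for tok in sent]
    types = set(tokens)
    # Maps sentence length (in tokens) to the number of such sentences.
    length_freq = Counter(len(sent) for sent in sentences)
    return len(tokens), len(types), length_freq


# Invented sample data, not taken from the Scania Corpus.
sentences = [
    ["Lossa", "skruvarna"],
    ["Dra", "åt", "skruvarna"],
    ["Lossa", "muttern"],
]

n_tokens, n_types, length_freq = corpus_counts(sentences)
print(n_tokens, n_types, dict(length_freq))
```

In this sample there are seven running tokens but only five distinct word forms, since "lossa" and "skruvarna" each occur twice.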

The morphological analysis was carried out by means of Sve.Ucp, a morphological analyser developed at the Department of Linguistics. Sve.Ucp uses a stem dictionary, and this dictionary was extended to cover the vocabulary of the Scania Corpus. As regards the Regeringsförklaringen Corpus, the words were analysed once, and the stem dictionary is currently being updated to account for missing words.

Current methodological work is concentrated on the second step in the tagging process, specifically the implementation, exploration and evaluation of different methods for disambiguating the alternative analyses produced by the morphological analyser.

Another methodological issue in focus is the design and implementation of an adequate corpus format for structuring and searching a multilingual, parallel corpus for use in bilingual lexical acquisition.

Detailed counts for the two corpora:

SCANIA 9606

language   files      words       bytes
German        80     186293     8004331
English       80     220827     7886082
Spanish       80     250730     8090916
Finnish       80     148348     7833990
French        80     244239     8156457
Italian       80     228631     8127121
Dutch         80     216424     8072128
Swedish       80     172259     7792597
total        640    1667751    63963622

REGERINGSFÖRKLARINGEN 9607

language   files      words       bytes
German         2       4259       67650
English        2       4492       63522
Spanish        1       2318       63522
French         2       5221       67769
Swedish        4      10419      140924
total         11      26709      370795

