E-mail: Anna.Sagvall_Hein@ling.uu.se
Fax: +46 18 18 14 16
Collaborators:
Lars Borin
Erik Tjong Kim Sang
Per Starbäck
Bengt Dahlqvist
Klas Prytz, Ph.D. student
Students enrolled in the Language Engineering Master's Programme
Funding 1996-1997: 810,000 SEK.
Languages: Swedish, Dutch, English, Finnish, French, German, Italian and Spanish.
The basic aim of the project is to develop a computerized multilingual corpus that can be used in contrastive lexicographic work and in methodological studies directed towards the automatic recognition and extraction of translation equivalents from text. The corpus will comprise Swedish source texts representing different styles and domains, together with their translations into several languages. A basic requirement is that the corpus be tagged for word class and aligned, primarily sentence by sentence.
As of October 1996, the project had resulted in two parallel, aligned subcorpora, the Scania Corpus and the Swedish Statement of Government Policy Corpus.
The Scania Corpus, Scania 9606, is a collection of truck maintenance manuals from the Swedish truck manufacturer Scania CV AB in Södertälje. It consists of 80 documents in eight languages: Swedish (source), Dutch, English, Finnish, French, German, Italian and Spanish. The total size of the corpus is 1.6 million words (63 MB).
The Swedish Statement of Government Policy Corpus, Regeringsförklaringen 9607, is a collection of Government Statements made in 1988 (Swedish, English, German, and French), 1994 (Swedish), 1995 (Swedish), and 1996 (Swedish, English, German, French, and Spanish). The total size of the corpus is 26,709 tokens (371 kB). It is available at http://strindberg.ling.uu.se/~corpora/rf/.
The text structure of the documents in the two corpora has been marked up automatically in TEI Lite-conformant SGML, using software developed in the project. The sentences (or sentence fragments) in the different language versions have been aligned with each other. Software has also been developed for accessing the parallel corpora; a demo can be found at the location for the Regeringsförklaringen 9607 corpus.
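To illustrate what access to a sentence-aligned corpus can look like, here is a minimal Python sketch. It is not the project's software: the `AlignedUnit` class, the `find_units` function, the language codes and the sample sentences are all assumptions invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlignedUnit:
    """One alignment unit: a sentence (or fragment) in each available language."""
    unit_id: str
    versions: Dict[str, str] = field(default_factory=dict)  # language -> text


def find_units(corpus: List[AlignedUnit], lang: str, word: str) -> List[AlignedUnit]:
    """Return the units whose `lang` version contains `word` as a whole token."""
    needle = word.lower()
    return [u for u in corpus if needle in u.versions.get(lang, "").lower().split()]


# Invented sample data, not taken from either corpus.
corpus = [
    AlignedUnit("s1", {"sv": "Kontrollera oljenivån", "en": "Check the oil level"}),
    AlignedUnit("s2", {"sv": "Kontrollera bromsarna", "en": "Check the brakes"}),
]

# Look up an English word and show the aligned Swedish source sentence.
for unit in find_units(corpus, "en", "oil"):
    print(unit.unit_id, unit.versions["sv"])
```

Keeping all language versions under one unit identifier means that a hit in any language gives direct access to its counterparts in the others, which is the property that sentence alignment is meant to provide.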
A first step in tagging the Swedish part of the Scania Corpus has been taken: its word forms have been morphologically analysed, and word forms that are ambiguous with regard to part of speech have been tentatively disambiguated by heuristic means. The resulting counts are 178,355 tokens (single words and lexicalised phrases), 19,360 types (word forms), and 9,549 lemmas. A frequency distribution of sentence lengths has also been generated.
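Counts of this kind can be derived from tokenised text along the following lines. This is a hedged sketch only: the function name `corpus_statistics`, the decision to lowercase before counting types, and the sample sentences are assumptions, and lemma counts are left out because they require the output of the morphological analysis.

```python
from collections import Counter


def corpus_statistics(sentences):
    """Compute token count, type count and sentence-length frequencies.

    `sentences` is a list of sentences, each a list of token strings.
    Types are counted case-insensitively (an assumption of this sketch).
    """
    tokens = [tok.lower() for sent in sentences for tok in sent]
    length_freq = Counter(len(sent) for sent in sentences)
    return len(tokens), len(set(tokens)), length_freq


# Invented sample sentences, already tokenised.
sample = [
    ["kontrollera", "oljenivån"],
    ["kontrollera", "bromsarna"],
    ["byt", "olja", "och", "filter"],
]

n_tokens, n_types, length_freq = corpus_statistics(sample)
print(n_tokens, n_types, dict(length_freq))
```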
The morphological analysis was carried out by means of Sve.Ucp, a morphological analyser developed at the Department of Linguistics. Sve.Ucp uses a stem dictionary, and this dictionary was extended to cover the vocabulary of the Scania Corpus. As regards the Regeringsförklaringen Corpus, the words were analysed once, and the stem dictionary is currently being updated to account for missing words.
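The role of a stem dictionary can be illustrated with a toy analyser. This sketch does not reproduce Sve.Ucp, whose internals are not described here; the `STEMS` and `SUFFIXES` tables, the `analyse` function and the feature labels are hypothetical and cover only a handful of Swedish forms.

```python
# Hypothetical stem dictionary: stem -> (lemma, part of speech).
STEMS = {
    "bil": ("bil", "noun"),
    "kör": ("köra", "verb"),
}

# Hypothetical suffix tables per part of speech: suffix -> feature label.
SUFFIXES = {
    "noun": {"": "sg indef", "en": "sg def", "ar": "pl indef"},
    "verb": {"a": "inf", "de": "past"},
}


def analyse(word):
    """Return every (lemma, pos, features) reading licensed by the tables.

    Each split of the word into stem + suffix is tried; an empty result
    means that the word is not covered by the stem dictionary.
    """
    analyses = []
    for i in range(1, len(word) + 1):
        stem, suffix = word[:i], word[i:]
        if stem in STEMS:
            lemma, pos = STEMS[stem]
            if suffix in SUFFIXES[pos]:
                analyses.append((lemma, pos, SUFFIXES[pos][suffix]))
    return analyses


print(analyse("bilen"))
print(analyse("körde"))
print(analyse("lastbil"))  # no reading: a candidate for dictionary extension
```

In a setup like this, words that receive no analysis are exactly the ones that signal a gap in the stem dictionary.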
Current methodological work is concentrated on the second step in the tagging process, specifically the implementation, exploration and evaluation of different methods for disambiguating the alternative analyses produced by the morphological analyser.
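As one example of what such a method might look like, the sketch below picks, for each token, the candidate tag that most often followed the previously chosen tag in a small hand-tagged sample, falling back on overall tag frequency. This is a common baseline swapped in for illustration, not a method the project is known to use; the function names, the tag set and the training data are assumptions.

```python
from collections import Counter


def train(tagged_sentences):
    """Count tag unigrams and tag bigrams in hand-tagged sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for _word, tag in sent:
            unigrams[tag] += 1
            bigrams[(prev, tag)] += 1
            prev = tag
    return unigrams, bigrams


def disambiguate(candidates, unigrams, bigrams):
    """Choose one tag per token from the analyser's alternatives.

    `candidates` holds one list of possible tags per token. The tag with
    the highest bigram count after the previous choice wins; ties are
    broken by unigram frequency.
    """
    chosen, prev = [], "<s>"
    for tags in candidates:
        best = max(tags, key=lambda t: (bigrams[(prev, t)], unigrams[t]))
        chosen.append(best)
        prev = best
    return chosen


# Invented training sample: "kör" is a verb ("drives") or a noun ("choir").
training = [
    [("bilen", "noun"), ("kör", "verb")],
    [("en", "det"), ("kör", "noun")],
]
unigrams, bigrams = train(training)

print(disambiguate([["det"], ["noun", "verb"]], unigrams, bigrams))
print(disambiguate([["noun"], ["noun", "verb"]], unigrams, bigrams))
```

The greedy left-to-right choice keeps the sketch short; evaluating alternatives of this kind against each other would require a hand-checked reference tagging.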
Another methodological issue in focus is the design and implementation of an adequate corpus format for structuring and searching a multilingual, parallel corpus in bilingual lexical acquisition.
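One kind of search that such a format has to support is counting how often a source word and a target word occur in aligned sentences. The sketch below scores word pairs with the Dice coefficient, a standard association measure chosen here purely for illustration; the function name `dice_scores` and the sample sentence pairs are assumptions, not project material.

```python
from collections import Counter
from itertools import product


def dice_scores(aligned_pairs):
    """Score source/target word pairs by co-occurrence in aligned sentences.

    Dice(s, t) = 2 * pairs(s, t) / (freq(s) + freq(t)), where all counts
    are numbers of aligned sentence pairs containing the word(s).
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src, tgt in aligned_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * count / (src_freq[s] + tgt_freq[t])
        for (s, t), count in pair_freq.items()
    }


# Invented Swedish-English sentence pairs.
pairs = [
    ("kontrollera oljenivån", "check the oil level"),
    ("kontrollera bromsarna", "check the brakes"),
    ("byt olja", "change oil"),
]
scores = dice_scores(pairs)
print(scores[("kontrollera", "check")], scores[("kontrollera", "oil")])
```

With so little data, function words such as "the" score as highly as true equivalents, which is one reason why word class tags and larger aligned corpora matter for this task.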
Detailed counts for the two corpora:
SCANIA 9606
| language | files | words | bytes |
|---|---|---|---|
| German | 80 | 186293 | 8004331 |
| English | 80 | 220827 | 7886082 |
| Spanish | 80 | 250730 | 8090916 |
| Finnish | 80 | 148348 | 7833990 |
| French | 80 | 244239 | 8156457 |
| Italian | 80 | 228631 | 8127121 |
| Dutch | 80 | 216424 | 8072128 |
| Swedish | 80 | 172259 | 7792597 |
| total | 640 | 1667751 | 63963622 |
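The totals for Scania 9606 can be cross-checked by summing the per-language figures. The short sketch below does this with the counts listed above; the variable names are chosen for this illustration only.

```python
# Per-language figures for Scania 9606: (files, words, bytes).
scania = {
    "German": (80, 186293, 8004331),
    "English": (80, 220827, 7886082),
    "Spanish": (80, 250730, 8090916),
    "Finnish": (80, 148348, 7833990),
    "French": (80, 244239, 8156457),
    "Italian": (80, 228631, 8127121),
    "Dutch": (80, 216424, 8072128),
    "Swedish": (80, 172259, 7792597),
}

# Sum each column across the eight languages.
total_files, total_words, total_bytes = (sum(col) for col in zip(*scania.values()))
print(total_files, total_words, total_bytes)
```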
REGERINGSFÖRKLARINGEN 9607
| language | files | words | bytes |
|---|---|---|---|
| German | 2 | 4259 | 67650 |
| English | 2 | 4492 | 63522 |
| Spanish | 1 | 2318 | 63522 |
| French | 2 | 5221 | 67769 |
| Swedish | 4 | 10419 | 140924 |
| total | 11 | 26709 | 370795 |