\documentclass[a4paper,11pt]{article}

\usepackage[utf8]{inputenc}
% unicode input encoding is also allowed
% use the following lines instead of the previous:
%\usepackage{ucs}
%\usepackage[utf8x]{inputenc}

\usepackage{times} 
\usepackage[T1]{fontenc}         
\usepackage{pslatex}
\pagestyle{empty}

\bibliographystyle{plain}

\title{What kinds of trees grow in Swedish soil?}

\author{John Doe and John Smith
\\[0.5cm]Department of Linguistics
\\University of Franeker
\\E-mail: \texttt{email@email}}

\date{}

\begin{document}
\maketitle

\begin{abstract}
\noindent
This workshop concerns the relationship 
between the syntactic properties of a given language and the choice of 
linguistic theory for annotation purposes. In this paper, I will discuss
and compare four different annotation schemes that have been proposed for 
Swedish in terms of their suitability for Swedish syntax as well as their 
relationship to linguistic theory and annotation schemes proposed for other 
languages.
\end{abstract} 

\thispagestyle{empty}

\section{Introduction}

One of the issues brought up in this workshop concerns the relationship 
between the syntactic properties of a given language and the choice of 
linguistic theory for annotation purposes. Our Swedish treebank consortium,
consisting of researchers from Växjö University, KTH and Stockholm University, 
is currently facing a specific instance of
this issue in trying to define an annotation standard for a large-scale 
treebank of Swedish written and spoken language. 

In this paper, I will discuss
and compare four different annotation schemes that have been proposed for 
Swedish in terms of their suitability for Swedish syntax as well as their 
relationship to linguistic theory and annotation schemes proposed for other 
languages. Other aspects that will be touched upon are 
the availability of parsers and/or annotated training data for developing 
parsers, the different requirements for annotation of spoken and written 
language, and the different needs of different user groups.

By way of background, I will start by reviewing some basic facts about the 
syntax of Swedish, a Germanic verb-second language with moderately fixed word 
order. In doing this I will also introduce the Scandinavian tradition of 
descriptive grammar, in particular the influential field model
due to Diderichsen \cite{did46}. The background section also contains
a brief discussion of existing annotation schemes for other languages 
and their relation to current linguistic theory.

The main part of the paper will be devoted to a discussion and comparison
of the following four annotation schemes for Swedish: 
\begin{itemize}
\item MAMBA (Teleman \cite{tel74})
\item SynTag (Järborg \cite{jar86})
\item SWECG (Birn \cite{bir98})
\item S-CLE (Gambäck and Rayner \cite{gam92})
\end{itemize}
The four schemes fall naturally into two groups, MAMBA and SynTag being 
standards designed for manual annotation of corpus material, while 
SWECG and S-CLE are primarily general-purpose parsing systems which have corpus 
annotation as one of their (potential) applications.

\section{Treebanks and Linguistic Theory}
\label{tlt}

The number of treebanks available for different languages is growing steadily,
and with it the number of different annotation schemes. This makes it very
difficult to say anything general about the relation between annotation schemes
and linguistic theory, but broadly speaking I think we may distinguish three 
main kinds of annotation in current practice:
\begin{itemize}
\item Annotation of constituent structure
\item Annotation of functional structure
\item Theory-specific annotation
\end{itemize}
This is obviously not a proper taxonomy, since theory-specific annotation may
concern both constituent structure and functional structure. Rather, the first
two categories are meant to cover more or less theory-neutral annotation schemes,
focusing on constituent structure or functional structure, respectively. It should
also be pointed out immediately that the annotation found in many if not most of 
the existing treebanks actually combines two or even all three of these categories.
Still, I believe that the categories may be useful in discussing existing annotation
schemes and their relation to linguistic theory. I will treat the categories in the
order in which they are listed above, which I think roughly corresponds to the 
historical development of treebank annotation schemes.

The annotation of \emph{constituent structure}, often referred to as 
\emph{bracketing}, is the main kind of annotation found in pioneering projects
such as the Lancaster Parsed Corpus (Garside et al.\ \cite{gar92}) and the 
original Penn Treebank (Marcus et al.\ \cite{mar93}). Normally, this kind
of annotation consists of part-of-speech tagging for individual word tokens
and annotation of major phrase structure categories such as NP, VP, etc.
Figure \ref{ibm} shows a representative example, taken from the IBM Paris 
Treebank, which uses a variant of the Lancaster annotation scheme.

\begin{figure}[htbp]
\vspace*{0.3cm}
\begin{verbatim}
              [N Vous_PPSA5MS N] 
              [V accedez_VINIP5 
                 [P a_PREPA 
                    [N cette_DDEMFS session_NCOFS N] 
                 P] 
                 [Pv a_PREP31 partir_PREP32 de_PREP33 
                    [N la_DARDFS fenetre_NCOFS 
                       [A Gestionnaire_AJQFS 
                          [P de_PREPD 
                             [N taches_NCOFP N] 
                          P] 
                       A] 
                    N] 
                 Pv] 
              V] 
\end{verbatim}
\caption{Constituency annotation in the IBM Paris Treebank}
\label{ibm}
\end{figure}

Annotation schemes of this kind are usually intended to be theory-neutral and 
therefore try to use mostly uncontroversial categories that are recognized in 
all or most syntactic theories that assume some notion of constituent structure.
Moreover, the structures produced tend to be rather flat, 
since intermediate phrase level categories are usually avoided, as well as 
complex structures such as Chomsky adjunction. The drawback of this is that the
number of distinct expansions of the same phrase category can become very high.
For example, Charniak \cite{cha96} was able to extract 10,605 distinct 
context-free rules from a 300,000 word sample of the Penn Treebank. 
Of these, only 3,943 occurred more than once in the sample.
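
As a rough check on what these figures imply, the number of rules that occur
exactly once, and their share of all extracted rules, can be worked out
directly from the two counts reported above (the rounding is mine):
\[
10{,}605 - 3{,}943 = 6{,}662,
\qquad
\frac{6{,}662}{10{,}605} \approx 0.63 .
\]
In other words, roughly 63\% of the distinct rules in the sample are attested
only once.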

The status of grammatical functions and their relation to constituent structure 
has long been a controversial issue in linguistic theory. Thus, whereas the 
standard view in transformational syntax since Chomsky \cite{cho65} has been 
that grammatical functions are derivable from constituent structure, proponents
of dependency syntax such as Mel'\v{c}uk \cite{mel88} have argued that functional
structure is more fundamental than constituent structure. Other theories, such as
LFG, steer a middle course by assuming both notions as primitive. 

When it comes to treebank annotation, the annotation of \emph{functional structure}
has become increasingly important in recent years. The most radical examples are 
perhaps the annotation schemes based on dependency syntax, exemplified by the
Prague Dependency Treebank of Czech (Haji\v{c} \cite{haj98}) and the METU Treebank 
of Turkish (Oflazer et al.\ \cite{ofl00}), where the annotation of dependency 
structure is added directly on top of the morphological annotation without any 
layer of constituent structure. Figure \ref{pdt} shows a simple example of 
dependency annotation from the Prague Dependency Treebank.

\begin{figure}[htbp]
\vspace*{0.3cm}
\begin{center}
  \begin{picture}(200,170)
\put(20,160){\circle*{5}}
\put(40,40){\circle*{5}}
\put(80,100){\circle*{5}}
\put(120,40){\circle*{5}}
\put(140,100){\circle*{5}}
\put(20,150){\makebox(0,0){\#}}
\put(20,140){\makebox(0,0){AuxS}}
\put(45,30){\makebox(0,0){Komin\'{i}k}}
\put(45,20){\makebox(0,0){Sb}}
\put(55,105){\makebox(0,0){vymet\'{a}}}
\put(55,95){\makebox(0,0){Pred}}
\put(125,30){\makebox(0,0){kom\'{i}ny}}
\put(125,20){\makebox(0,0){Obj}}
\put(145,90){\makebox(0,0){.}}
\put(145,80){\makebox(0,0){AuxK}}
\put(20,160){\line(1,-1){60}}
\put(20,160){\line(2,-1){120}}
\put(80,100){\line(-2,-3){40}}
\put(80,100){\line(2,-3){40}}
  \end{picture}\\
\begin{tabular}{llll}
Komin\'{i}k&vymet\'{a}&kom\'{i}ny&.\\
Chimney-sweep&sweeps&chimneys&.\\
\end{tabular}
\caption{Functional annotation in the Prague Dependency Treebank}
\label{pdt}
\end{center}
\end{figure}
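
The tree in Figure \ref{pdt} can also be written down as a flat table in which
every token points to its head and carries the function label of that link.
The following is only an illustrative sketch: the token numbering and the
column layout are my own assumptions and do not reproduce the actual file
format of the Prague Dependency Treebank.
\begin{center}
\begin{tabular}{rlrl}
\hline
Id & Token & Head & Function\\
\hline
0 & \# & -- & AuxS\\
1 & Komin\'{i}k & 2 & Sb\\
2 & vymet\'{a} & 0 & Pred\\
3 & kom\'{i}ny & 2 & Obj\\
4 & . & 0 & AuxK\\
\hline
\end{tabular}
\end{center}
Since each token has exactly one head, the whole structure is recoverable from
the Head column alone, without any layer of constituent structure.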

The trend towards more functionally oriented annotation schemes is also reflected
in the extension of constituency-based schemes with annotation of grammatical 
functions. Cases in point are SUSANNE (Sampson \cite{sam95}), which is a development
of the Lancaster annotation scheme mentioned above, and Penn Treebank II 
(Marcus et al.\ \cite{mar94}), which adds functional tags to the original 
phrase structure annotation. One of the most interesting examples in this 
respect is the annotation scheme adopted in the TIGER Treebank of German
(Brants and Hansen \cite{bra02}), developed from the earlier NEGRA 
treebank and annotation scheme, which integrates the annotation of 
constituency and dependency in a graph where node labels represent phrasal 
categories while edge labels represent syntactic functions. 

The third kind of annotation scheme that is found in available treebanks 
is the kind that adheres to a specific linguistic theory and uses representations
from that theory to annotate sentences. Thus, HPSG has been used as the basis for
treebanks of Bulgarian (Simov et al.\ \cite{sim02}) and Polish (Marciniak et 
al.\ \cite{mar00}), and the Prague Dependency Treebank mentioned earlier is 
based on the theory of Functional Generative Description (Sgall et al.\ \cite{sga86}).
There has also been work on automatic f-structure annotation in the theoretical
framework of LFG (see, e.g., Sadler et al.\ \cite{sad00}).

To summarize, recent years have seen a trend towards more functionally oriented
annotation schemes, and theory-specific annotation schemes have become more
common. Nevertheless, the dominant paradigm in treebank annotation is probably
still the theory-neutral annotation of constituent structure with added
functional tags, as represented by schemes such as the Penn Treebank II standard. 

\section{Conclusion}

In conclusion, MAMBA and SWECG emerge as the strongest candidates for use
in the annotation of a Swedish treebank. The other two schemes considered,
SynTag and S-CLE, are interesting in their own right but are on the whole
less suitable for adoption in a large-scale treebank project.

MAMBA and SWECG have the advantage of being firmly based in the 
Swedish tradition of descriptive grammar and can therefore be expected to have
good descriptive adequacy and coverage. This is true especially for MAMBA, 
which was designed specifically to handle spoken language as well as written 
language. Moreover, the fact that these schemes are based on notions of 
traditional grammar means that they provide an annotation which may be more 
accessible to non-expert treebank users.

The main weakness of SWECG is that the annotation contains little or no information
about phrase structure and is therefore difficult to relate to many current linguistic 
theories. However, this situation has clearly improved
with the development of Functional Dependency Grammar (FDG), which establishes
a more direct connection to
dependency-based theories of syntax and also provides a better basis for the
reconstruction of phrase structure from dependency structure if this is required.

For MAMBA the biggest problem is instead the lack of resources for automatic 
annotation, although it may be possible to improve the situation by
using the available annotated corpora for bootstrapping a parsing system.
 
\begin{thebibliography}{99}

\bibitem {bir98}
Birn, Juhani (1998) Swedish Constraint Grammar. Lingsoft Inc. 
(URL: http://www.lingsoft.fi/doc/swecg/intro/).

\bibitem {bra02}
Brants, Sabine and Hansen, Silvia (2002) 
Developments in the TIGER Annotation Scheme and their Realization in the Corpus.
In \emph{Proceedings of the Third Conference on Language Resources and Evaluation 
(LREC 2002)}, pp.\ 1643--1649, Las Palmas. 

\bibitem {cha96}
Charniak, Eugene (1996) Tree-Bank Grammars. In \emph{AAAI/IAAI}, Vol.\ 2,
pp.\ 1031--1036.

\bibitem {cho65}
Chomsky, Noam (1965) \emph{Aspects of the Theory of Syntax.} MIT Press.

\bibitem {did46}
Diderichsen, Paul (1946) \emph{Elementær dansk grammatik.} Copenhagen: Gyldendal.

\bibitem {jar86}
Järborg, Jerker (1986) Manual för syntaggning [Manual for syntagging]. 
Göteborgs universitet: Institutionen för språkvetenskaplig databehandling.

\bibitem {gam92}
Gambäck, Björn and Rayner, Manny (1992) The Swedish Core Language Engine.
In \emph{Papers from the 3rd Nordic Conference on Text Comprehension in 
Man and Machine}, Linköping University, Linköping, Sweden, pp.\ 71--85.

\bibitem {gar92}
Garside, R., Leech, G. and V\'{a}radi, T. (compilers) (1992) 
\emph{Lancaster Parsed Corpus}. 
A machine-readable syntactically-analysed corpus of 144,000
words, available for distribution through ICAME, The Norwegian Computing 
Centre for the Humanities, Bergen. 

\bibitem {haj98}
Haji\v{c}, Jan (1998) Building a Syntactically Annotated Corpus: The Prague 
Dependency Treebank. In \emph{Issues of Valency and Meaning}, pp.\ 106--132.
Prague: Karolinum.

\bibitem {mar93}
Marcus, Mitchell P., Santorini, Beatrice and Marcinkiewicz, Mary Ann (1993) 
Building a Large Annotated Corpus of English: The Penn Treebank. 
\emph{Computational Linguistics} 19, 313--330. 
[Reprinted in Armstrong, Susan (ed.) (1994)
\emph{Using large corpora}, pp.\ 273--290. 
Cambridge, MA: MIT Press.]

\bibitem {mar94}
Marcus, Mitchell P., Kim, Grace, Marcinkiewicz, Mary Ann,
MacIntyre, Robert, Bies, Ann, Ferguson, Mark, Katz, Karen and Schasberger,
Britta (1994) The Penn Treebank: Annotating Predicate Argument Structure.
In \emph{ARPA Human Language Technology Workshop}.

\bibitem {mar00}
Marciniak, Małgorzata, Mykowiecka, Agnieszka, Kup\'{s}\'{c}, Anna and
Przepi\'{o}rkowski, Adam (2000) An HPSG-Annotated Test Suite for Polish.
In \emph{Proceedings of the Second International
Conference on Language Resources and Evaluation (LREC 2000)}.

\bibitem {mel88}
Mel'\v{c}uk, Igor (1988) \emph{Dependency Syntax: Theory and Practice}.
State University of New York Press.

\bibitem {ofl00}
Oflazer, Kemal, Say, Bilge and Hakkani-Tür, Dilek (2000)
A Syntactic Annotation Scheme for Turkish. 
In \emph{Proceedings of the 10th International
Conference on Turkish Linguistics (ICTL-2000)}.

\bibitem {sad00}
Sadler, Louisa, von Genabith, Josef and Way, Andy (2000) 
Automatic F-Structure Annotation from the AP Treebank.
In Butt, Miriam and Holloway King, Tracy (eds.) 
\emph{Proceedings of the Fifth International Conference on
Lexical-Functional Grammar}, The University of California at Berkeley, 
19 July -- 20 July 2000. Stanford, CA: CSLI Publications.

\bibitem {sga86}
Sgall, Petr, Haji\v{c}ov\'{a}, Eva and Panevov\'{a}, Jarmila (1986)
\emph{The Meaning of the Sentence in Its Pragmatic Aspects}. Reidel.

\bibitem {sam95}
Sampson, Geoffrey (1995) \emph{English for the Computer}. 
Oxford University Press.

\bibitem {sim02}
Simov, Kiril, Popova, Gergana and Osenova, Petya (forthcoming) HPSG-Based Syntactic
Treebank of Bulgarian (BulTreeBank). In Wilson, Andrew, Rayson, Paul and McEnery, Tony
(eds.) \emph{A Rainbow of Corpora: Corpus Linguistics and the Languages of 
the World}, pp.\ 135--142. Munich: Lincom-Europa.

\bibitem {tel74}
Teleman, Ulf (1974) \emph{Manual för grammatisk beskrivning av talad och skriven 
svenska [Manual for grammatical description of spoken and written
Swedish].} Lund: Studentlitteratur.


\end{thebibliography}
\end{document}
