This book presents a unique opportunity for constructing a consistent image of collaborative manual annotation for Natural Language Processing (NLP). NLP has witnessed two major evolutions in the past 25 years: firstly, the extraordinary success of machine learning, which is now, for better or for worse, overwhelmingly dominant in the field, and secondly, the multiplication of evaluation campaigns or shared tasks. Both involve manually annotated corpora, for the training and evaluation of the systems. These corpora have progressively become the hidden pillars of our domain, providing food for our hungry machine learning algorithms and reference for evaluation. Annotation is now the place where linguistics hides in NLP. However, manual annotation has largely been ignored for some time, and it has taken a while even for annotation guidelines to be recognized as essential. Although some efforts have been made lately to address some of the issues presented by manual annotation, there has still been little research done on the subject. This book aims to provide some useful insights into the subject. Manual corpus annotation is now at the heart of NLP, and is still largely unexplored. There is a need for manual annotation engineering (in the sense of a precisely formalized process), and this book aims to provide a first step towards a holistic methodology, with a global view on annotation.
List of Acronyms
I.1. Natural Language Processing and manual annotation: Dr Jekyll and Mr Hyde?
I.2. Rediscovering annotation
1 Annotating Collaboratively
1.1. The annotation process (re)visited
1.2. Annotation complexity
1.3. Annotation tools
1.4. Evaluating the annotation quality
2 Crowdsourcing Annotation
2.1. What is crowdsourcing and why should we be interested in it?
2.2. Deconstructing the myths
2.3. Playing with a purpose
2.4. Acknowledging crowdsourcing specifics
2.5. Ethical issues
Appendix: (Some) Annotation Tools
A.1. Generic tools
A.2. Task-oriented tools
A.3. NLP annotation platforms
A.4. Annotation management tools
A.5. (Many) Other tools
End User License Agreement
1 Annotating Collaboratively
(Imaginary) contingency table for a toy example of POS annotation
Contingency table for the gene renaming annotation campaign [FOR 12c]
Contingency table for the gene renaming annotation campaign with the gene names as markables
2 Crowdsourcing Annotation
Accuracy of the most productive players in
Manually annotated corpora and machine learning process
Anchoring of notes in the source signal
Indention square brackets in a text by Virgil. Bibliothèque municipale de Lyon (Courtesy of Town Library of Lyon), France, Res. 104 950
Anchoring of auctoritates in De sancta Trinitate, Basel, UB B.IX.5, extracted from [FRU 12], by courtesy of the authors
1 Annotating Collaboratively
Traditional annotation phases (on the left) and cycles of agile annotation (on the right). Reproduction of Figure 2 from [VOO 08], by courtesy of the authors
Generic annotation pipeline (Figure 1 from [HOV 10], by courtesy of the authors)
Learning curve for the POS annotation of the Penn Treebank [FOR 10]. For a color version of the figure, see www.iste.co.uk/fort/nlp.zip
The annotation process, revisited (simplified representation)
Visualization of the complexity dimensions of an annotation task. For a color version of the figure, see www.iste.co.uk/fort/nlp.zip
POS annotation in the Penn Treebank [MAR 93]. For a color version of the figure, see www.iste.co.uk/fort/nlp.zip
Gene renaming annotation [JOU 11]. For a color version of the figure, see www.iste.co.uk/fort/nlp.zip
Structured named entity annotation [GRO 11]
The tagset dimension: taking the structure into account in the structured named entity annotation task [GRO 11]
Example of typed trace left by the annotator when annotating gene and protein names [FOR 09]
Example of annotation of a goal in football annotation [FOR 12b]: a context of more than the sentence is needed
The context as a complexity dimension: two sub-dimensions to take into account
Instantiated visualization: the delimited surface represents the complexity profile of the annotation task, here, gene renaming
Instantiated visualization: POS annotation in the Penn Treebank
Synthesis of the complexity of the gene names renaming campaign (new scale x2)
Case of impossible disagreement, with 3 annotators and 2 categories
Phenomena to take into account when computing inter-annotator agreements (Figure 1 from [MAT 15], by courtesy of the authors)
Scales of interpretation of kappas (from the ESSLI 2009 course given by Gemma Boleda and Stefan Evert on inter-annotator agreement, by courtesy of the authors)
Comparison of the behaviors of the metrics on categorization. For a color version of the figure, see www.iste.co.uk/fort/nlp.zip
Comparison of the behaviors of the metrics on categorization on the TCOF-POS corpus (no prevalence, but structure of the tagset taken into account). For a color version of the figure, see www.iste.co.uk/fort/nlp.zip
2 Crowdsourcing Annotation
Instructions for the travelers and employees of the colonies (French National Museum, 1860). The first edition dates from 1824
Number of players on
according to their scores (February 2011 – February 2012)
Number of active, registered editors (>= 5 edits/month) in
(Figure 1 from [HAL 13], courtesy of the author (CC-BY-SA))
Annotation of a dependency relation with
in the pharmacology domain: “For the Acute Coronary Syndroms (ACS), the duration of the IV depends on the way the ACS should be treated: it can last a maximum of 72 hours for patients who need to take drugs”
Interface of the game
: give ideas associated with the following term: “Quality”
: annotation interface
: peer validation interface
: game interface
: progress of the team that found the solution (pseudos of the players are shown in colors). For a color version of the figure, see www.iste.co.uk/fort/nlp.zip
: a fully-fledged game for dependency syntax annotation (player’s page)
: tutorial interface (correction of an error)
Amazon Mechanical Turk: remuneration is at the heart of the platform
Bartle's Interest Graph [MAR 15]
Marczewski’s Player and User Types and Motivations Hexad, courtesy of the author [MAR 15]
Gamification elements according to player types, courtesy of the author [MAR 15]
GWAP virtuous circle: the players should gain points only when they produce quality data
Amazon Mechanical Turk: “Pay only when you’re satisfied with the results”
Table of Contents
Series Editor Patrick Paroubek
First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd27-37 St George’s RoadLondon SW19 4EUUK
John Wiley & Sons, Inc.111 River StreetHoboken, NJ 07030USA
© ISTE Ltd 2016
The rights of Karën Fort to be identified as the author of this work have been asserted by her in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2016936602
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISSN 2051-2481 (Print)
ISSN 2051-249X (Online)
This book presents a unique opportunity for me to construct what I hope to be a consistent image of collaborative manual annotation for Natural Language Processing (NLP). I partly rely on work that has already been published elsewhere, with some of it only in French, most of it in reduced versions and all of it available on my personal website.1 Whenever possible, the original article should be cited in preference to this book.
I also refer to publications in French. I retained these because there was no equivalent in English, in the hope that at least some readers will be able to understand them.
This work owes a lot to my interactions with Adeline Nazarenko (LIPN/University of Paris 13) both during and after my PhD thesis. In addition, it would not have been conducted to its end without (a lot of) support and help from Benoît Habert (ICAR/ENS of Lyon).
Finally, I would like to thank all the friends who supported me in writing this book and proofread parts of it, as well as the colleagues who kindly accepted that their figures be part of it.
Automatic Content Extraction
Annotation Collection Toolkit
Association for Computational Linguistics
Annotation Graph Toolkit
Application Programming Interface
Association pour le Traitement Automatique des LAngues
(French Computational Linguistics Society)
Amazon Mechanical Turk
Human Intelligence Task
Linguistic Data Consortium
Natural Language Processing
Natural Language Processing (NLP) has witnessed two major evolutions in the past 25 years: first, the extraordinary success of machine learning, which is now, for better or for worse (for an enlightening analysis of the phenomenon see [CHU 11]), overwhelmingly dominant in the field, and second, the multiplication of evaluation campaigns or shared tasks. Both involve manually annotated corpora, for the training and evaluation of the systems (see Figure I.1).
These corpora progressively became the hidden pillars of our domain, providing food for our hungry machine learning algorithms and reference for evaluation. Annotation is now the place where linguistics hides in NLP.
However, manual annotation has largely been ignored for quite a while, and it took some time even for annotation guidelines to be recognized as essential [NÉD 06]. When the performance of the systems began to stall, manual annotation finally started to generate some interest in the community, as a potential leverage for improving the obtained results [HOV 10, PUS 12].
This is all the more important as it has been shown that systems trained on badly annotated corpora underperform. In particular, they tend to reproduce annotation errors when these errors follow a regular pattern and do not correspond to simple noise [REI 08]. Furthermore, the quality of manual annotation is crucial when it is used to evaluate NLP systems. For example, an inconsistently annotated reference corpus would undoubtedly favor machine learning systems, thereby penalizing rule-based systems in evaluation campaigns. Finally, the quality of linguistic analyses would suffer from an unreliable annotated corpus.
Figure I.1.Manually annotated corpora and machine learning process
Although some efforts have been made lately to address some of the issues presented by manual annotation, there is still little research done on the subject. This book aims at providing some (hopefully useful) insights into the subject. It is partly based on a PhD thesis [FOR 12a] and on some published articles, most of them written in French.
The renowned British corpus linguist Geoffrey Leech [LEE 97] defines corpus annotation as: “The practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language data. ‘Annotation’ can also refer to the end-product of this process”. This definition highlights the interpretative dimension of annotation but limits it to “linguistic information” and to some specific sources, without mentioning its goal.
In [HAB 05], Benoît Habert extends Leech’s definition, first, by not restricting the type of added information: “annotation consists of adding information (a stabilized interpretation) to language data: sounds, characters and gestures”.1 He adds that “it associates two or three steps: (i) segmentation to delimit fragments of data and/or add specific points; (ii) grouping of segments or points to assign them a category; (iii) (potentially) creating relations between fragments or points”.2
We build on these and provide a wider definition of annotation:
DEFINITION (Annotation).– Annotation covers both the process of adding a note on a source signal and the whole set of notes or each note that results from this process, without a priori presuming what the nature of the source (text, video, images, etc.), the semantic content of the note (numbered note, value chosen in a reference list or free text), its position (global or local) or its objective (evaluation, characterization and simple comment) are.
Basically, annotating is adding a note to a source signal. The annotation is therefore the note, anchored in one point or in a segment of the source signal (see Figure I.2). In some cases, the span can be the whole document (for example, in indexing).
Figure I.2.Anchoring of notes in the source signal
In the case of relations, two or more segments of the source signal are connected and a note is added to the connection. Often, a note is added to the segments too.
This definition of annotation includes many NLP applications, from transcription (the annotation of speech with its written interpretation) to machine translation (the annotation of one language with its translation in another language). However, the analysis we conduct here is mostly centered on categorization (adding a category taken from a list to a segment of signal or between segments of signal). This does not mean that it cannot apply to transcription, for example, but we have not yet covered such applications thoroughly enough to claim that the research detailed in this book directly applies to them.
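The definition above can be made concrete as a small data model: notes anchored to a span of the source signal, plus relations whose arguments are annotated segments. This is only an illustrative sketch (the class names, fields and example text are ours, not taken from any existing annotation tool):

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """A note anchored to a segment [start, end) of the source signal."""
    start: int   # anchor: offset where the segment begins
    end: int     # anchor: offset where the segment ends
    note: str    # the note itself (category, value from a list, or free text)


@dataclass
class Relation:
    """A note attached to a connection between two or more segments."""
    arguments: list   # the connected Annotation objects
    note: str         # the note on the connection itself


# A global note (e.g. indexing) simply spans the whole document:
source = "BRCA1 was formerly known as RNF53."
whole_doc = Annotation(0, len(source), "topic:genetics")

# Two local notes and a relation between them (gene renaming):
new = Annotation(0, 5, "gene-new")     # anchors "BRCA1"
old = Annotation(28, 33, "gene-old")   # anchors "RNF53"
renaming = Relation([old, new], "renames")
```

Keeping the notes separate from the source, with explicit anchors, mirrors today's standoff annotation practice mentioned above.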
In NLP, annotations can either be added manually by a human interpreter or automatically by an analysis tool. In the first case, the interpretation can reflect parts of the subjectivity of its authors. In the second case, the interpretation is entirely determined by the knowledge and the algorithm integrated in the tool. We are focusing here on manual annotation as a task executed by human agents whom we call annotators.
Identifying the first evidence of annotation in history is impossible, but it seems likely that it appeared in the first writings on a physical support allowing for a text to be easily commented upon.
Annotations were used for private purposes (comments from readers) or public usage (explanations from professional readers). They were also used for communicating between writers (authors or copyists, i.e. professional readers) [BAK 10]. In these latter senses, the annotations had a collaborative dimension.
Early manuscripts contained glosses, i.e. according to the online Merriam-Webster dictionary:3 “a brief explanation (as in the margin or between the lines of a text) of a difficult or obscure word or expression”. Glosses were used to inform and train the reader. Other types of annotations were used, for example to update the text (apostils). The form of glosses could vary considerably and [MUZ 85, p. 134] distinguishes between nine different types. Interlinear glosses appear between the lines of a manuscript, marginal glosses in the margin, surrounding glosses in the circumference and separating glosses between the explained paragraphs. They are more or less merged into the text, from the organic gloss, which can be considered as part of the text, to the formal gloss, which constitutes a text in itself, transmitted from copy to copy (as today’s standoff annotations).4
Physical marks, like indention square brackets, could also be added to the text (see an example in a text by Virgil5 in Figure I.3), indicating the first commented words. Interestingly, this primitive anchoring did not indicate the end of the commented part.
Figure I.3.Indention square brackets in a text by Virgil. Bibliothèque municipale de Lyon (Courtesy of Town Library of Lyon), France, Res. 104 950
The same limitation applies to the auctoritates, which appeared in the 8th Century to cite the authors (considered as authorities) of a citation. The anchoring of the annotation is noted by two dots above the first word of the citation, without indicating the end of it (see Figure I.4).
Figure I.4.Anchoring of auctoritates in De sancta Trinitate, Basel, UB B.IX.5, extracted from [FRU 12], by courtesy of the authors
This delimitation problem was accentuated by the errors made by the copyists, who moved the auctoritates and their anchors. To solve this issue, and in the absence of other markers like quotation marks (which would only appear much later), introductory texts (pre- and peri-annotation) were invented.
From the content point of view, the evolution went from the explanatory gloss (free text) to the citation of authors (name of the author, from a limited list of authorities), precisely identified in the text. As for the anchoring, it improved progressively to look like our present text markup.
This rapid overview shows that many of today’s preoccupations – frontiers to delimit, annotations chosen freely or from a limited reference, anchoring, metadata to transmit – have been around for a while. It also illustrates the fact that adding an annotation is not a spontaneous gesture, but one which is reflected upon, studied, a gesture which is learned.
A good indicator of the rise in annotation diversity and complexity is the annotation language. The annotation language is the vocabulary used to annotate the flow of data. In a lot of annotation cases in NLP, this language is constrained.6 It can be of various types.
The simplest is the Boolean type. It covers annotation cases in which only one category is needed: a segment is either annotated with this category (which can remain implicit) or not annotated at all. Experiments like the identification of obsolescent segments in encyclopaedias [LAI 09] use this type of language.
Then come first-order languages. Typed languages are, for example, used for morpho-syntactic annotation without features (part-of-speech) or with features (morpho-syntax). The first case is in fact rather rare: even when a tagset seems to have little structure, as in the Penn Treebank [SAN 90], features can almost always be deduced from it (for example, NNP, proper noun singular, and NNPS, proper noun plural, could be translated into NNP + Sg and NNP + Pl).
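The deduction of features from a seemingly flat tagset can be sketched as a simple mapping. The decomposition below is our illustration of the idea, not an official Penn Treebank resource:

```python
# Decomposing atomic Penn Treebank tags into a type plus features
decomposition = {
    "NNP":  ("NNP", {"number": "Sg"}),   # proper noun, singular
    "NNPS": ("NNP", {"number": "Pl"}),   # proper noun, plural
    "NN":   ("NN",  {"number": "Sg"}),   # common noun, singular
    "NNS":  ("NN",  {"number": "Pl"}),   # common noun, plural
}

# NNPS is thus read as NNP + Pl:
pos, features = decomposition["NNPS"]
```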
As for relations, a large variety are annotated in NLP today, from binary-oriented relations (for example, gene renaming relations [JOU 11]) to unoriented n-ary relations (for example, co-reference chains as presented in [POE 05]).
Finally, second-order languages could be used, for example, to annotate relations on relations. In the soccer domain, for example, intercept(pass(p1, p2), p3) represents a pass (relation) between two players (p1 and p2), which is intercepted by another player (p3). In practice, we simplify the annotation by adapting it to a first-order language by reifying the first relation [FOR 12b]. This is so commonly done that we are aware of no example of annotation using a second-order language.
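The reification trick can be sketched as follows: instead of a relation taking another relation as an argument (second-order), the pass is turned into an object, so that the interception becomes an ordinary first-order relation over it. The class and variable names are illustrative:

```python
from dataclasses import dataclass


@dataclass
class Pass:
    """The reified pass: a relation turned into an annotatable object."""
    sender: str
    receiver: str


@dataclass
class Interception:
    """First-order relation: its arguments are objects, not relations."""
    intercepted: Pass
    player: str


# intercept(pass(p1, p2), p3) becomes:
p = Pass("p1", "p2")
i = Interception(p, "p3")
```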
Jean Véronis concluded his 2000 state-of-the-art survey of automatic annotation technology with a figure summarizing the situation [VÉR 00]. In this figure, only part-of-speech annotation and the multilingual alignment of sentences are considered “operational”. Most applications are considered prototypes (prosody, partial syntax, multilingual word alignment), and the rest either did not yet allow for “applications which are useful in real situations” (full syntax, discourse semantics) or were close to prototypes (phonetic transcription, lexical semantics). The domain has evolved quickly, and today much more complex annotations can be performed on different media and related to a large variety of phenomena.
In the past few years, we have witnessed the multiplication of annotation projects involving video sources, in particular sign language videos. A workshop on the subject (DEGELS) took place during the French NLP conference (TALN) in 2011 and 2012,7 and a training session on video corpus annotation was organized by the Association pour le Traitement Automatique des LAngues (ATALA) in 2011.8
Moreover, more and more complex semantic annotations are now carried out on a regular basis, like opinion or sentiment annotation. In the biomedical domain, protein and gene name annotation is now complemented by the annotation of relations, like gene renaming [JOU 11] or relations between entities, in particular within the framework of the BioNLP shared tasks.9 Semantic annotations are also performed using a formal model (i.e. an ontology) [CIM 03], and linked data are now used to annotate corpora, as during the Biomedical Linked Annotation Hackathon (BLAH).10
Finally, annotations that are now considered as traditional, like named entities or anaphora, are getting significantly more complex, for example with added structuring [GRO 11].
However, there are still few corpora freely available with different levels of annotation, including annotations from different linguistic theories. MASC (Manually Annotated Sub-Corpus) [IDE 08]11 is an interesting exception, as it includes, among others, annotations of frames à la FrameNet [BAK 98] and senses à la WordNet [FEL 98]. Besides, we are not aware of any freely available multimedia-annotated corpus with each level of annotation aligned to the source, but it should not be long until one is developed.
The ever-growing complexity of annotation is taken into account in new annotation formats, like GrAF [IDE 07]; however, it still has to be integrated in the methodology and in the preparation of an annotation campaign.
The exact cost of an annotation campaign is rarely mentioned in research papers. One noteworthy exception is the Prague Dependency TreeBank, for which the authors of [BÖH 01] announce a cost of US$600,000. Other articles detail the number of people involved in the project they present: GENIA, for example, involved 5 part-time annotators, a senior coordinator and a junior coordinator for 1.5 years [KIM 08]. Anyone who has participated in such a project knows that manual annotation is very costly.
However, the resulting annotated corpora, when they are well-documented and available in a suitable format, as shown in [COH 05], are used well beyond and long after the training of the original model or the original research purpose. A typical example is the Penn TreeBank corpus, created in the early 1990s [MAR 93] and still used more than 20 years later (it is easy to find recent research, like [BOH 13], relying on it). By contrast, the tools trained on these corpora usually become outdated quickly as research progresses. An interesting example is that of the once successful PARTS tagger, created using the Brown corpus [CHU 88] and used to pre-annotate the Penn TreeBank. However, when the technology matures and generates results that users consider satisfactory, the lifespan of such tools gets longer. This is the case in part-of-speech tagging for the TreeTagger [SCH 97], which, with nearly 96% accuracy for French [ALL 08], is still widely used, despite the fact that it is now less efficient than state-of-the-art systems (MElt [DEN 09], for example, obtains 98% accuracy on French). Such domains are still rare.
This trivial remark concerning the lifetime of corpora leads to important consequences with regard to the way we build manually annotated corpora.
First, it puts the cost of the manual work into perspective: a manually annotated corpus costing US$600,000, like the Prague Dependency TreeBank, and used for more than 20 years, like the Penn TreeBank, is not that expensive (US$30,000 per year). It is even cheaper if you consider the total number of projects which used it: a quick search in the Association for Computational Linguistics (ACL) anthology12 with the keyword “Penn TreeBank” reveals that more than 30 research articles directly use the corpus (including the Penn Discourse TreeBank, but excluding the Penn treebanks created for other languages, like Chinese), which corresponds to US$20,000 per project. If we consider that many research projects used it without putting its name in the title of the article, like the paper we wrote on the effects of pre-annotation [FOR 10], we can assume that many more than 30 projects were based on the Penn TreeBank, lowering its cost to probably less than that of a long internship.
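The back-of-the-envelope arithmetic behind these figures can be written out explicitly (all numbers are those given in the text):

```python
total_cost = 600_000   # US$, Prague Dependency TreeBank cost [BÖH 01]
years_in_use = 20      # lifespan comparable to the Penn TreeBank's
projects = 30          # ACL anthology articles directly using the corpus

cost_per_year = total_cost / years_in_use   # amortized over its lifespan
cost_per_project = total_cost / projects    # amortized over reusing projects
```

Each additional project that reuses the corpus lowers the per-project cost further, which is the point of the argument.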
Second, it is a strong argument against building manually annotated corpora according to the possibilities of the system(s) that will use them, as those systems will be long forgotten while the annotated corpus is still in use. If the corpus is too dependent on the systems’ (limited) capabilities, it will no longer be useful once the algorithms become more efficient.
Third, this implies that manual annotation should be of high quality, i.e. well-prepared, well-documented and regularly evaluated with adequate metrics. Manual annotation campaign preparation is often rushed and overlooked, because people want to get it over with as quickly as possible.13 This has been particularly emphasized in [SAM 00], where the author notes (on p. 7) that: “[…] it seems to me that natural language computing has yet to take on board the software-engineering lesson of the primacy of problem analysis and documentation over coding”.
There is, in fact, a need for annotation engineering procedures and tools and this is what this book aims at providing, at least partly.
In French, the original version is: “l’annotation consiste à ajouter de l’information (une interprétation stabilisée) aux données langagières : sons, caractères et gestes”.