IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, MANUSCRIPT ID


On the Influence of an Iterative Affect Annotation Approach on Inter-Observer and Self-Observer Reliability

Sidney K. D'Mello

Abstract—Affect detection systems require reliable methods to annotate affective data. Typically, two or more observers independently annotate audio-visual affective data. This approach results in inter-observer reliabilities that can be categorized as fair (Cohen's kappas of approximately .40). In an alternative iterative approach, observers independently annotate small amounts of data, discuss their annotations, and annotate a different sample of data. After a pre-determined reliability threshold is reached, the observers independently annotate the remainder of the data. The effectiveness of the iterative approach was tested in an annotation study where pairs of observers annotated affective video data in nine annotate-discuss iterations. Self-annotations were previously collected on the same data. Mixed effects linear regression models indicated that inter-observer agreement increased (unstandardized coefficient B = .031) across iterations, with agreement in the final iteration reflecting a 64% improvement over the first iteration. Follow-up analyses indicated that the improvement was nonlinear in that most of the improvement occurred after the first three iterations (B = .043), after which agreement plateaued (B ≈ 0). There was no notable complementary improvement (B ≈ 0) in self-observer agreement, which was considerably lower than observer-observer agreement. Strengths, limitations, and applications of the iterative affective annotation approach are discussed.

Index Terms—affect annotation; ground truth; self; observers; affect detection


1 INTRODUCTION

Affect detection is one of the most prominent areas in affective computing, as evidenced by an undercurrent of affect detection research since the inception of the field. This is likely because developing affect-sensitive interfaces is one of the core goals of affective computing, and such interfaces must first detect affect in order to respond to it. Recent reviews of the affect detection literature (see [1-3]) generally arrive at two similar conclusions (amongst others). The first is that the field has progressed tremendously over the last 15 years, to the point where affect-sensitive interfaces are becoming a reality (e.g., [4, 5]). The second is that several core challenges, some apparent since the inception of the field, still persist today. This paper addresses a core methodological challenge pertaining to the task of annotating affective data.

It is easy to make the case for the importance of affect annotation for affect detection. Most affect detection systems rely on some form of supervised learning to identify relationships between machine-readable signals (facial expressions, physiology, linguistic and paralinguistic cues) and latent (or not directly observable) affective states. Supervised learning requires "ground truth" in the form of annotated (or labeled) affective data. This is a nontrivial problem in affect detection because, unlike supervised learning in domains such as biometrics or economic forecasting, affective ground truth can never be absolutely obtained. It must be approximated because affect is a psychological construct (i.e., a conceptual variable) rather than a physical property (e.g., height or weight) or an identity property (e.g., person X in a biometric system).

How do researchers currently obtain affective ground truth for their affect detection systems? One strategy is to forgo the ground truth problem entirely by asking actors to portray target affective states (the acted state is the ground truth). Another approach is to induce specific affective states using a variety of affective elicitation methods (the induced state is the ground truth). These two approaches have their respective merits, but ultimately rely on the assumption that the affective displays elicited by acting or induction are adequate representations of naturally arising affect. This might be a tenuous assumption since naturalistic affective displays can be more subtle, ambiguous, and contextually driven (as detailed in Section 1.2). Further, neither the acting nor the induction approach addresses the ground truth problem when the goal is to collect naturalistic affective data. This type of data is collected when an individual spontaneously experiences and expresses affect while interacting with another agent (either human or artificial). Data in the form of video, audio, and text are collected during these interactions and need to be annotated for affect for supervised learning to proceed.

The present paper is concerned with this form of affect annotation. It addresses the critical issue of whether reliability between two external annotators can be improved by changing the annotation procedure itself. It also considers agreement between external annotators and the person experiencing the affective state (henceforth called the participant or self). This is done by considering an iterative affect annotation approach and studying its influence on inter-annotator agreement obtained by annotating two existing data sets.

S.K. D'Mello is with the University of Notre Dame, Notre Dame, IN 46556, USA. E-mail: [email protected].



In the remainder of this section, we discuss the various dimensions of affect annotation and the basic assumptions underlying affect annotation in general, review published inter-observer agreement scores, introduce the iterative approach, discuss self-observer agreement, and describe the goals and novelty of the current study.

1.1 Dimensions of Affect Annotation

A single affect annotation study represents just one instance in the universe of possible annotation tasks. Therefore, it is prudent to consider some of the dimensions of affect annotation and highlight the ones emphasized in this paper. We organize our discussion around the following five dimensions: sources, timing, temporal resolution, data modality, and level of abstraction.

The first dimension (sources) pertains to the types and number of individuals performing the annotations. The most common approach involves observer-based¹ annotations. One method involves a small number of skilled (or expert) observers at a relatively high cost per observer [6]. For example, a psychiatrist may be asked to provide depression annotations, a marriage counselor might annotate sessions for marital conflict, or an FBI agent might provide annotations of deception. The second method involves a large number of unskilled (or novice) observers at a relatively low cost. Novice observers include undergraduate research participants [7] or individuals recruited from crowdsourcing platforms like Mechanical Turk [8]. Rosenthal [9] suggests that selecting the type (expert, novice, intermediate) and number of observers involves a tradeoff between available resources (the observers themselves, cost per observation, and available annotation budget) and the difficulty of the annotation task (estimated by the average reliability between pairs of observers, as detailed in Section 1.4). In the present study, the annotations were performed by individuals familiar with the basics of qualitative research. These individuals were selected as being typical of the undergraduate/graduate research assistants who commonly serve as affect annotators and who lie along the novice-expert continuum.

¹ We use the term observers (instead of annotators) to refer to situations where the annotations are performed by individuals other than the person experiencing/expressing the affective states (i.e., the self). This is because annotations can also be performed by the self, i.e., when the person engaging in the interaction annotates his/her own data.

The second dimension (timing) concerns when the annotations are performed. Annotations can be performed in the moment, as when the participant self-reports their affective states verbally following an "emote-aloud" protocol [10], or by periodically filling out affect questionnaires [11]. Observers can also provide "live" annotations using field observation methods [12]. Alternatively, annotations can be performed offline, or after the interaction, typically from videos (with or without audio) of the participant recorded during the interaction. The present focus is on this form of offline annotation, because it is the dominant method used in affective computing research.

Temporal resolution pertains to whether data is annotated at the frame level (e.g., individual frames in a video stream), word level (e.g., individual spoken words), segment level (e.g., sentences or short actions), or session level (e.g., an entire session of interactions). The temporal resolution of choice has been shown to impact the annotations obtained [13].


The present paper focuses on segment-level annotations, where a segment includes one or more actions and can incorporate contextual information within which these actions are situated. Segment-level annotation represents a suitable intermediate level between very fine-grained frame-level annotations and coarse-grained session-level annotations.

Data modality refers to the information on which the annotations are based and influences the annotation process and outcomes [7, 14]. Common modalities in affect annotation include a video of the participant's face, audio of speech, a screen shot of the computer interface (to re-create context), and videos capturing body movements and gestures. The most common modalities are the face and speech, with the face being emphasized in this paper.

Finally, with respect to level of abstraction, affect is multicomponential and can be annotated at multiple levels. These could include behavioral expressions, such as facial expressions or gestures (e.g., [15]), or higher-level constructs, such as basic emotions (e.g., fear, anger [16]) or nonbasic complex blends of affect and cognition, such as confusion and frustration. The present emphasis is on annotation of affective states at the higher levels of abstraction, including both nonbasic and basic affect.

1.2 Assumptions of Affect Annotation

Given the multidimensional nature of affect annotation, decisions regarding the sources, timing, temporal resolution, data modality, and level of abstraction need to be made prior to conducting an annotation study. Many of these decisions are driven by the nature of the application. However, it is important to emphasize that the entire affect annotation endeavor rests on fundamental assumptions about the very nature of affect, so it is prudent to consider some of these assumptions and their implications.

The assumptions arise due to the conceptual rather than physical status of affect, as alluded to earlier. Understanding the nature of affect, its function, and its properties has been a contentious issue in the affective sciences and is sometimes referred to as the "hundred year emotion war" [17, 18]. For example, there has been an ongoing debate in both the affective sciences and in affective computing as to whether affect is best represented via discrete categories (e.g., angry, fearful) or by fundamental dimensions (e.g., valence, arousal, power) [19, 20]. Other open issues pertain to whether emotions are innate or learned, whether they arise via appraisals/reappraisals or are products of socio-constructivism, and whether emotions are universally expressed or whether context and culture shape emotion expression [16, 21-26]. Selecting appropriate methods to annotate affect necessitates choosing sides in these ongoing debates, a decision that introduces several fundamental assumptions.

Of utmost relevance to affect annotation is the assumption of a link between experienced and expressed affect [27, 28]. Based on this assumption, it should theoretically be possible to "decode" an affective state based on visible expressions.


Though most would concur with this position to some degree, an extreme view is that the affect annotation problem merely involves matching a fixed set of expressions (e.g., facial features, gestures, speech patterns) with a set of affective states. Available evidence suggests that it is rarely this straightforward. For example, although facial expressions are considered to be strongly associated with affective states, meta-analyses of correlations between facial expressions and affect have yielded small to medium effects for naturalistic expressions [29-32]. Thus, affect annotation is influenced by additional factors that moderate the experience-expression link. Some of these factors include contextual and social influences [33], individual differences in affect perception abilities [34], and individual and group (or cultural) differences in expression [35, 36].

In line with this, Scherer [37] recently proposed the dynamic Tripartite Emotional Expression and Perception Model (TEEP), which is an inference-based model. According to TEEP, externalized emotional expressions, which may not be intentional, unfold dynamically as the emotional event is sequentially appraised by the expresser (participant). These expressions are perceived as emotional symptoms by the observer, who interprets these signals via inference and attribution mechanisms that are strongly constrained by the socio-cultural context of the interaction. Thus, affect annotation involves inference, and inference involving a complex psychological construct inherently introduces a certain degree of error. The error is reflected by the degree of (mis)alignment when two observers annotate the same affective stimuli or when annotations of external observers are compared to annotations provided by the self. This issue is at the heart of this paper.

1.3 Inter-observer (OO) Agreement

In order to obtain a general sense of the inter-observer (OO) agreement reported in affective computing studies, we considered published papers with which we were familiar and also conducted a more formal search of the literature. The formal search was done by identifying potential papers across all issues of IEEE Transactions on Affective Computing, the flagship journal of the field, and by examining the proceedings of the most recent edition of the Affective Computing and Intelligent Interaction (ACII 2013) conference, the flagship conference of the field. No study was discarded due to a high or low OO value. The search was not comprehensive, nor did it intend to be, because the goal was not to perform a meta-analysis of OO reliabilities, but to obtain a general sense of OO reliabilities as reported in affective computing studies.

There are many possible metrics to compute OO reliability, each with different strengths and weaknesses [38]. In this paper, we use Cohen's kappa [39] as the standard measure of inter-rater reliability. Kappa is computed from the proportion of observed agreement (Ao) and the proportion of "chance" agreement (Ac): kappa = (Ao − Ac) / (1 − Ac). A kappa of 1 reflects perfect agreement, a kappa of 0 reflects chance agreement, and kappas less than 0 represent agreement below chance. Kappa is superior to simple percent agreement because it corrects for chance; however, it is limited in some other respects (as extensively discussed in [38]).


Key limitations include the fact that kappa does not allow observers to be permuted or interchanged, does not have well-defined end points, and can allow systematic disagreements to influence the computations in unexpected ways. Our decision to focus on kappa is motivated by its prevalent use in affect annotation studies in the field of affective computing, thereby affording meaningful comparisons of the current results with existing research.

Table 1 summarizes the data sources, number of affective states annotated, and inter-observer reliability (OO reliability) obtained from 14 offline observer-annotation studies. The kappas reported in the table are reproduced from each paper, with the exception of [14], where the kappa was estimated from reported confusion matrices. When ranges in kappas are reported, this is due to multiple annotations being performed, or to kappas for multiple states or multiple annotators being reported separately. For example, Litman and Forbes-Riley [40] report kappas of 0.3 for Emotion vs. Non-Emotion, 0.4 for Negative/Neutral/Positive, and 0.5 for Negative vs. Non-Negative annotations, respectively. In Schuller, et al. [41], the initial annotation yielded a kappa of 0.09, but kappa increased to .66 after eliminating difficult cases and cases with low confidence (computed based on variance across observers).

We computed the mean of the OO kappa scores in Table 1 to get a general sense of OO reliability. For simplicity, we took the midpoint when a range in kappas was reported. This resulted in an average OO kappa of 0.39 (median of .37). Kappas were generally higher for studies that annotated only one affective state. In fact, there was a negative correlation between the number of affective states annotated and OO kappa, Pearson's r = -.47. Therefore, the average .39 kappa might be somewhat of an overestimation when multiple states are considered.

TABLE 1
KAPPAS FROM PREVIOUS WORK ON AFFECT ANNOTATION

Study                          No. States   Data Sources               Kappa
Schuller, et al. [41]          1            Spoken utterances          .09 - .66
Pon-Barry, et al. [42]         1            Spoken utterances          .45
Lehman, et al. [43]            5            Face and speech            .66 - .80
Whitehill, et al. [13]         1            Face                       .39 - .68
Afzal and Robinson [44]        7            Face and screen            .20 - .35
Graesser, et al. [45]          7            Face and screen            .36
Litman and Forbes-Riley [40]   2-3          Dialog transcript          .30 - .50
Ang, et al. [46]               6            Spoken utterances          .47
Shafran, et al. [47]           2-6          Spoken utterances          .32 - .42
Alhothali [48]                 5            Face and screen            .32
D'Mello, et al. [49]           7            Face and screen            .12
D'Mello, et al. [50]           5            Face, speech, transcript   .12 - .18
Forbes-Riley and Litman [4]    1            Dialog transcript          .62
Janssen, et al. [14]           5            Face and speech            .11 - .37
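To make the kappa computation in Section 1.3 concrete, the following is a minimal sketch in R (the language used for the analyses reported later in the paper). The two label vectors are made-up placeholders rather than data from any study in Table 1, and the manual computation is cross-checked against the kappa2 function from the irr package.

```r
# Minimal sketch of Cohen's kappa for two observers' categorical annotations.
# The labels below are illustrative placeholders, not study data.
library(irr)

obs1 <- c("boredom", "confusion", "neutral", "confusion", "frustration", "neutral")
obs2 <- c("boredom", "neutral",   "neutral", "confusion", "frustration", "delight")

lev <- sort(union(obs1, obs2))                       # common category set
tab <- table(factor(obs1, lev), factor(obs2, lev))   # confusion matrix

Ao <- sum(diag(tab)) / sum(tab)                          # observed agreement
Ac <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2      # chance agreement
kappa_manual <- (Ao - Ac) / (1 - Ac)                     # kappa = (Ao - Ac) / (1 - Ac)

kappa_irr <- kappa2(cbind(obs1, obs2))$value             # should match kappa_manual
```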

1.4 Iterative Annotation Approach

According to the widely popular categorization of Landis and Koch [51], kappas in the range of 0 to 0.2 are considered to be slight or poor, 0.2 to 0.4 fair, 0.4 to 0.6 moderate, 0.6 to 0.8 substantial, and 0.8 to 1.0 near perfect.


On the basis of this categorization, the aforementioned average 0.39 kappa would be categorized as fair and is below the recommended 0.6 kappa for research studies in psychology. Potential reasons for the lower agreement might have to do with differences in the nature of the interactions (i.e., human-human in psychology studies and mainly human-computer in affective computing studies) and the types of affective states annotated (mainly basic emotions in psychology studies and both basic and nonbasic affect in affective computing studies). Nevertheless, this raises the question of whether we are resigned to accepting fair OO kappas or whether they can be improved.

Rosenthal [9] discusses the notion of "effective reliability," a metric that takes into account the number of observers and the mean reliability across all pairs of observers. The basic intuition is that the easiest way to increase reliability is to simply increase the number of observers so that observer-specific variance gets cancelled out. McKeown and Sneddon [52] performed a simulation analysis and concluded that 20-30 observers were needed to confidently detect an emotional signal from time-continuous annotations. Although they are careful to make no claims about the generalizability of this result to other situations,² requiring such a large number of observers might not always be feasible. This is especially true when the goal is to annotate affective data for affect detection, because the supervised classification techniques applied to build affect detectors require massive volumes of annotated data [53]. Recent advances in wearable sensors and the ubiquity of web-cams afford the collection of large amounts of unannotated data over the Internet [54] and in the wild [55]. However, affect annotation is still a bottleneck, and there is usually more data to be annotated than available annotation resources. Thus, it might be advantageous if OO agreement could be improved using the minimum number of observers (k = 2 observers are required to compute reliability). We investigate this possibility by considering a modification to the common affect annotation approach.

The typical approach in affective computing research is the independent observer approach, in which two (or more) observers independently provide annotations without any attempt to achieve consensus. Instances where they agree are taken to be ground truth. Instances where they disagree are discarded, or the majority or average is taken as ground truth (where applicable). With the exception of Lehman, et al. [43], all of the studies reported in Table 1 adopted the independent approach. The independent approach eliminates concerns pertaining to inter-observer bias, which is an important advantage. A disadvantage is that affective labels and visible cues are highly subjective, so the lower OO kappa scores might be a result of different observers interpreting the labels and cues differently.

Can inter-observer annotations be made more consistent without introducing bias? To answer this question, we propose an iterative approach motivated by the qualitative research methods literature [6]. Here, observers independently annotate a small random subset of the data and discuss their agreements/disagreements in an attempt to arrive at a consensus.

This process is iteratively repeated on another random subset until some pre-specified threshold of agreement is met. The observers then annotate the remaining data independently (i.e., with no further discussion). The idea is that the consensus-building discussions at the initial stages of annotation make assumptions explicit, expose differences in interpretation, and form a mutually shared conceptual understanding of the stimuli, akin to frame-of-reference training [6]. Systematically testing the influence of the iterative approach on OO agreement is one key goal of this work.
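The control flow of the iterative approach can be summarized with the R sketch below. The functions annotate_segment() and discuss() are hypothetical placeholders for the human annotation and consensus-building steps (they are not real functions), and the 0.6 threshold and nine-iteration cap are only examples rather than prescribed values.

```r
# Illustrative sketch of the iterative annotate-discuss protocol (Section 1.4).
# annotate_segment() and discuss() stand in for human activities and are not
# real functions; the threshold and iteration cap are examples only.
library(irr)   # kappa2() for inter-observer agreement

iterative_annotation <- function(segments, threshold = 0.6, max_iter = 9) {
  kappas <- numeric(0)
  for (i in seq_len(max_iter)) {
    a1 <- annotate_segment(segments[[i]], observer = 1)  # independent annotation
    a2 <- annotate_segment(segments[[i]], observer = 2)  # independent annotation
    k  <- kappa2(cbind(a1, a2))$value                    # agreement on this segment
    kappas <- c(kappas, k)
    if (k >= threshold) break       # reliability threshold reached; stop discussing
    discuss(segments[[i]], a1, a2)  # consensus-building discussion (labels unchanged)
  }
  kappas  # per-iteration agreement; remaining data are then annotated independently
}
```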

² They specifically state, "We make no claim that these results will generalize to other situations. The appropriate sample size for such analyses remains an empirical question and is likely to vary with context and participant group" (pp. 164-165).

1.5 Self-Observer (SO) Agreement

There is yet another important aspect to consider beyond OO reliabilities. Even if the iterative approach results in improved OO reliability, how do the resulting observer affect annotations align with affect as experienced by the participants themselves (i.e., self-observer or SO reliability)? There is sufficient theoretical justification to expect that the two sources of affect annotations might not align. In particular, the self and observers have access to different sources of information and are subject to different biases, thereby arriving at different approximations of the objective reality (i.e., the true affective state), assuming one exists.

Specifically, affective states are multifaceted in that they encompass conscious feelings ("I feel afraid"), overt actions ("I freeze"), covert neurophysiological responses ("My muscles clench"), and metacognitive reflections ("I am a coward"). Importantly, access to these components varies by source (self vs. observer). The self has access to some conscious feelings, some overt actions, memories of the experience, and metacognitive reflections, but usually not to some of the unconscious affective components. The self must also engage in some form of reconstructive process when providing offline annotations of their own affective states [56]. They are also more likely to distort or misrepresent their affective states due to biases, such as reference bias [57] or social desirability bias [58]. In contrast, observers only have access to overt actions and behaviors that can be visibly perceived (e.g., facial features, postures, gestures) and must rely more heavily on inference [37]. However, the links between affective expressions and affective states are not as clear-cut as was previously assumed [33, 59-61], so observers must grapple with considerable ambiguity in making these inferences. Observers are also less likely to succumb to the same biases that befall self-reports, but they introduce their own set of biases, such as the halo effect [62]. Thus, there are strengths and pitfalls to reliance on either the self or observers for affective annotations.

It is sometimes argued that self-annotations are less important if affect detection systems aim to model affective states as perceived by external observers (e.g., a teacher perceiving a student as being bored) [13, 20]. However, this line of argument might not hold when an affect detector trained on observer-annotations is used to respond to affect experienced by the self. For example, how useful is an observer-based detector that intervenes when it senses frustration, but the participant is actually confused?


Therefore, perhaps the most defensible position is to consider both self- and observer-annotations as multiple imperfect estimates of the ground truth, thereby capitalizing on their merits while minimizing their flaws. However, this would require a modicum of SO reliability, and the influence of the iterative approach on SO reliability therefore needs to be examined more carefully.

1.6 Current Study: Overview, Scope, and Novelty

Overview. If the iterative approach results in improved kappas, then it might be advantageous for affective computing researchers to consider adopting it for affect annotation tasks. We test this possibility in the present study, where pairs of observers independently annotated a short video segment of affective data, discussed their annotations to resolve disagreements, and then independently annotated a different video segment. This annotate-discuss-annotate cycle was repeated for nine iterations. Unbeknown to the observers, self-annotations had previously been collected on the same data using a similar annotation protocol. This allowed us to ascertain whether any improvements in OO agreement obtained by the iterative approach were associated with corresponding improvements in SO agreement. Thus, the data were used to address three key questions pertaining to: (1) the extent and rate of change in OO agreement across iterations; (2) the extent and rate of change in SO agreement across iterations; and (3) comparisons between overall OO and SO agreement. We predict a positive effect of iteration on OO agreement and expect OO agreement to be higher than SO agreement, but are unsure of the effect of iteration on SO agreement.

Scope. As noted in Section 1.1, affect annotation is a broad field with multiple dimensions (see reviews [20, 63]). The present emphasis is on two external observers providing annotations of naturally-occurring discrete affective states (e.g., bored, confused) of a set of participants during human-computer interaction contexts. The annotations were based on prerecorded videos of the participants' faces and computer screens (to provide contextual information), and annotations were performed every 10-20 seconds (i.e., segment-level annotations). This annotation task was motivated by a particular affective computing application that involves the design of affect-sensitive educational technologies that automatically detect and respond to a learner's affective state [4, 64, 65]. We focused on categorical or discrete affect representations given their suitability for triggering theory-based affect-sensitive responses (e.g., if the participant is confused, then give a hint). This decision, however, should not be taken as an endorsement of discrete over dimensional representations of affect, and the reader is directed to other resources for expanded discussions of the issues associated with continuous annotation of affect dimensions [20, 66]. Although the focus is on annotation of data obtained in the context of a specific affective computing application, the basic annotation task is sufficiently broad that the findings are applicable to a range of annotation tasks that are routinely encountered in affective computing research (see Section 4.3 for details).


Novelty. The basic idea of using an iterative approach for annotation tasks is not new in the fields of nonverbal behavior analysis and qualitative research methods [6, 9]. It is also sometimes employed in the field of affective computing for coding of low-level constructs, such as annotating videos for facial expressions [15]. However, as noted above, affective computing researchers rarely use an iterative approach in their affect annotation tasks. Even in cases where it has been used for affect annotation [43], its impact on OO agreement has yet to be systematically studied. Thus, researchers only have anecdotal evidence on the viability of this approach for improving OO agreement. Hence, systematically studying the effect of the iterative approach on OO agreement in the context of affective computing applications is one novel component of this research. The second novel component arises from our comparison of changes in OO agreement produced by the iterative approach to corresponding changes (if any) in SO agreement. By comparing OO and SO kappas on the same stimuli, we can meaningfully study the differential effects of the iterative approach on one vs. the other. To the best of our knowledge, this has never been explored before, but it is highly relevant to closed-loop affective computing systems that aim to intervene in response to detected affect.

2 METHOD

2.1 Observers

The observers were eight undergraduate students from a private college in the United States. Observers were enrolled in a course on qualitative research methods. This was done to ensure that the observers closely matched the individuals who perform annotations in many affective computing studies (i.e., student research assistants). The observers were familiar with the basics of qualitative research methods, but had no previous experience with affect annotation.

2.2 Stimuli

Observers provided annotations on affective data (videos of faces and computer screens) collected in two previous studies. One study involved interactions with a dialog-based intelligent tutoring system called AutoTutor (called the AutoTutor study), while the second study consisted of solving analytical reasoning problems with a tablet PC (called the Analytical Reasoning study). These studies were selected as stimuli because they both collected self-annotations. Specifically, the individuals participating in these studies annotated their own videos immediately after completing the tutoring or problem solving sessions. The observers in the present study provided observer-annotations using the same videos and at the same time points as the self-annotations, thereby affording meaningful comparisons between the two annotations. Detailed descriptions of the two studies are available in previously published articles ([45] for AutoTutor and [67] for Analytical Reasoning); here we focus on the salient aspects of relevance to the present study.

AutoTutor. Participants were 28 undergraduate students who interacted with AutoTutor, a dialog-based intelligent tutoring system for topics in Computer Literacy (see Figure 1).


Aspects of the interaction included: (1) the tutor posing a main question via synthesized speech; (2) the participant providing a typed response; (3) the tutor evaluating the response and providing feedback; (4) the tutor breaking down the main question into subcomponents and posing hints and prompts; (5) repeating steps (2) and (3) until each subcomponent is adequately answered; and (6) the tutor providing a summary of the answer. These steps were repeated until all main questions were answered or a 32-minute time limit had elapsed.

Figure 1. AutoTutor interface

Greyscale videos of the participants' faces were recorded during the tutoring session using an infrared camera (see Figure 2 for sample video frames). Videos of their computer screens (including the synthesized audio) were also recorded using screen capture software. Participants self-annotated their data after completing the tutoring session using a video-based retrospective affect annotation protocol [63]. Specifically, the two video streams were synchronized and displayed to the participants using a dual-monitor setup. Participants identified the affective state they were experiencing from a list of states (with definitions) at fixed 20-second intervals at which the videos automatically paused. Participants could also pause the video at any time and spontaneously indicate their affective state. The list of affective states was: boredom, engagement/flow, confusion, frustration, delight, surprise, and neutral (no affect); these are the states that were found to be prominent during interactions with AutoTutor and other similar learning technologies [68]. A total of 2,967 self-reports were collected (approximately 100 per participant), of which 2,537 were fixed and 430 were spontaneous.

Figure 2. Facial displays of affect from the AutoTutor study: (A) boredom; (B) confusion

Analytical Reasoning. Participants were 41 undergraduate students who were enrolled in a program that offered practice testing for graduate school standardized tests. They were asked to solve 28 analytical reasoning problems (or until 35 minutes had elapsed) taken from preparatory materials for the LSAT (a required test for admission into law school in the U.S.). Participants interacted with two software programs that were displayed on a Tablet PC, as shown in Figure 3. The left half of the screen displayed a customized program that presented each problem, recorded participants' answers, and provided feedback on the accuracy of their answers. The right half of the screen displayed Windows Journal™, a software application they could use to take notes and draw with a stylus.

Figure 3. Interface for the Analytical Reasoning study

Figure 4. Facial displays of affect from the Analytical Reasoning study: (A) confusion; (B) contempt

A commercial web-cam recorded videos of participants' faces (see Figure 4 for examples), while videos of their computer screens were recorded with screen capture software. Participants self-annotated their affective states immediately after the problem solving session using a similar video-based retrospective affect annotation procedure as in the AutoTutor study, but with two important exceptions. First, this study included a somewhat larger set of affective states than the AutoTutor study: anger, anxiety, boredom, confusion, contempt, curiosity, disgust, eureka, fear, frustration, happiness, sadness, surprise, and neutral. Participants had access to a list of states with definitions while providing their annotations. Second, rather than asking participants to annotate affect at periodic 20-second intervals, the fixed annotation time points were tied to specific problem solving events: seven seconds after a new problem was displayed, halfway between the presentation of the problem and the submission of the response, and three seconds after feedback (to the response) was provided. Participants also made spontaneous annotations by manually pausing the videos at any time and indicating their affective states.


There were a total of 2,792 affect annotations: 2,306 were fixed and 486 were spontaneous.

2.3 Design

It was necessary to strike a balance among the number of available observers (N = 8), the maximum length of an observation session without risking fatigue, the number of iterations needed for each observer pair to afford improvements in inter-observer reliability, and maximizing stimulus variability while ensuring there was no overlap across observations (i.e., each observer annotated each video exactly once). The design discussed below reflects our attempt to achieve such a balance while minimizing confounds, although we fully acknowledge that alternate designs are possible.

The eight observers were divided into two sets of four. Observers A, B, C, and D were randomly assigned to the AutoTutor videos, while observers P, Q, R, and S were assigned to the Analytical Reasoning videos. Each observer was paired with all the other observers from his or her respective set, thereby yielding six pairings per set (AB, AC, AD, BC, BD, and CD for AutoTutor and PQ, PR, PS, QR, QS, and RS for Analytical Reasoning). Thus, each observer participated in three separate annotation sessions, but each time with a different partner.

Each session consisted of annotating nine 5-minute video clips from nine different participants. To minimize practice effects, each observer viewed a participant's video exactly once, both within and across sessions. This required nine videos per session and 27 videos in all, since each observer participated in three sessions. Thus, videos from 27 of the 28 participants were selected from the AutoTutor study. A separate set of 27 videos was selected from the 41 videos in the Analytical Reasoning study. Video selection was random, with the only requirement being that the participants' faces were clearly visible in the video.

It was necessary to select a 5-minute segment from each video for affect annotation. This was done by selecting a 5-minute segment from approximately the mid-point of the videos: between minutes 12 and 17 for the 32-minute AutoTutor sessions and between minutes 15 and 20 for the 35-minute Analytical Reasoning sessions. Observer-annotations were collected at the exact same time points as the self-annotations. There were approximately 16 annotation points with an average 18-sec inter-annotation interval for the AutoTutor study and approximately 12 annotation points with an average 25-sec inter-annotation interval for the Analytical Reasoning study.

2.4 Procedure

Observers came to the laboratory in pairs. The task and the annotation software were explained to the observers. Specifically, observers were informed of the basics of the task and that the goal was to attempt to improve inter-rater reliability. They were encouraged to discuss their ratings with each other, but were not given any specific instructions on how those discussions should unfold. Observers were given the same list of affective states and definitions as was provided for the self-annotations. They were, however, not given any specific instructions on how to annotate each affective state.


The observers were then instructed on the use of the annotation software and could ask any clarification questions at this time.

Following the instruction phase, each observer was asked to occupy one of two computer setups, each positioned in a different part of the lab. In annotate mode, the software displayed videos of participants' faces and computer screens and paused at annotation points using the exact same procedure as the self-annotations (see Section 2.2). Observers independently provided annotations for each 5-minute video clip. Their annotations were stored on the individual computers.

After the annotations for a single 5-minute clip were completed, the annotation files were transferred to one of the two computers. Observers sat together in front of this computer to discuss their annotations in order to achieve consensus. The annotation software merged the annotation files and displayed them to the observers in review mode. In this mode, the observers could jointly review their annotations as indexed in the videos, identify points of agreement/disagreement, and play the video segments corresponding to each annotation. However, they could not alter their annotations. Conclusion of the discussion period triggered the end of Iteration 1.

The aforementioned annotate-discuss process was repeated for nine iterations, with a different participant video in each iteration. Completion of iteration 9 marked the end of the annotation session. The average annotation time for each iteration was 6.68 mins (SD = 1.66 mins). There was an average delay of 4.86 mins (SD = 2.54 mins) between consecutive iterations, which indicates considerable discussion between the annotators.

3 RESULTS

A total of 1,648 annotations (972 from AutoTutor and 676 from Analytical Reasoning) were obtained in the present study. These observer-annotations were aligned with the self-annotations. Inter-observer (OO) agreement was computed as the Cohen's kappa between the two observers' annotations for each iteration. Self-observer (SO) agreement was computed by taking the average Cohen's kappa between the self and each of the observers, i.e., Average(Self-Observer 1 kappa, Self-Observer 2 kappa). Krippendorff's alpha [38] was also considered as an alternate reliability metric; however, it was strongly correlated with Cohen's kappa for both OO (r = .992) and SO (r = .830) agreement. Consequently, we proceeded with Cohen's kappa in order to facilitate comparisons with the published studies reported in Table 1. All analyses use two-tailed significance testing with an alpha of .05.
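As a concrete illustration of these computations, the R sketch below assumes a hypothetical data frame ann with one row per annotation point and columns pair, iteration, self, obs1, and obs2 (all of these names are assumptions for illustration, not the study's actual data structure). OO kappa is computed between the two observers, and SO kappa as the average of the two self-observer kappas.

```r
# Sketch of the OO and SO agreement computations described above. The data
# frame `ann` and its column names are assumptions made for illustration.
library(irr)

agreement_for_subset <- function(d) {
  oo <- kappa2(d[, c("obs1", "obs2")])$value           # inter-observer kappa
  so <- mean(c(kappa2(d[, c("self", "obs1")])$value,   # self vs. observer 1
               kappa2(d[, c("self", "obs2")])$value))  # self vs. observer 2
  c(OO = oo, SO = so)
}

# One OO/SO estimate per observer pair and iteration, e.g.:
# by(ann, list(ann$pair, ann$iteration), agreement_for_subset)
```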

3.1 Overall Agreement

Table 2 displays overall agreement scores, averaged across the nine iterations. One-sample t-tests comparing mean agreement for the combined stimuli to chance (kappa = 0) were significant for both OO, t(11) = 12.0, p < .001, d = 3.48, and SO, t(11) = 7.16, p < .001, d = 2.07, kappas. A paired-samples t-test indicated that OO agreement was significantly higher than SO agreement, t(11) = 8.48, p < .001, d = 3.56.


These analyses, performed on the combined stimuli, replicated when the individual AutoTutor and Analytical Reasoning stimuli were examined separately (not reported here).

TABLE 2
DESCRIPTIVE STATISTICS FOR KAPPA ACROSS ITERATIONS

Stimulus               k    Inter-observer (OO) M (Stdev)   Self-observer (SO) M (Stdev)
AutoTutor              6    .373 (.204)                     .053 (.123)
Analytical Reasoning   6    .301 (.272)                     .099 (.123)
Combined              12    .337 (.243)                     .076 (.125)

Note. k is the number of observer pairs.
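The comparisons in this subsection can be reproduced with base R. The sketch below uses randomly generated placeholder vectors (oo and so, one kappa per observer pair averaged over iterations) purely so that the calls run; these are not the study's data, and the vector names are assumptions for illustration.

```r
# Sketch of the Section 3.1 comparisons. The vectors below are random
# placeholders standing in for the 12 per-pair kappas; they are not study data.
set.seed(1)
oo <- runif(12, 0.10, 0.60)   # per-pair OO kappas (placeholder)
so <- runif(12, -0.05, 0.20)  # per-pair SO kappas (placeholder)

t.test(oo, mu = 0)              # one-sample t-test: OO agreement vs. chance (kappa = 0)
t.test(so, mu = 0)              # one-sample t-test: SO agreement vs. chance
t.test(oo, so, paired = TRUE)   # paired t-test: OO vs. SO agreement

d_oo     <- mean(oo) / sd(oo)             # Cohen's d, one-sample comparison
d_paired <- mean(oo - so) / sd(oo - so)   # Cohen's d, paired comparison
```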

3.2 Linear Iteration Effect

Figure 5 depicts mean (across pairs) OO and SO kappas across iterations for the combined stimuli. We note an increase in OO agreement across iterations, while SO agreement appears to be relatively stable. The data were analyzed more systematically with mixed-effects linear regression models [69] using the lme4 package in R. The models predicted either OO or SO kappa (dependent variables) from iteration (a continuous fixed effect ranging from 1 to 9) with observer pair as a categorical random effect. The iteration × stimulus (AutoTutor vs. Analytical Reasoning) interaction term was included to ascertain whether the iteration effect varied by stimulus.

Figure 5. Inter-observer (OO) and self-observer (SO) kappas across iterations for combined stimuli. The lines in the plot are from linear models.
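A sketch of this mixed-effects specification is given below using lme4, which the paper reports using. The data frame is synthetic placeholder data so that the call runs; the column names mirror the description in the text but are otherwise assumptions, and lmerTest is mentioned only as one common way of obtaining F-tests of the kind reported next.

```r
# Sketch of the mixed-effects models in Section 3.2, using the lme4 package.
# The data frame is synthetic placeholder data (12 pairs x 9 iterations); the
# column names mirror the text but are assumptions for illustration.
library(lme4)

set.seed(42)
df <- expand.grid(pair = factor(1:12), iteration = 1:9)
df$stimulus <- ifelse(as.integer(df$pair) <= 6, "AutoTutor", "AnalyticalReasoning")
df$oo_kappa <- 0.25 + 0.02 * pmin(df$iteration, 3) + rnorm(nrow(df), sd = 0.10)

# OO kappa from iteration (continuous fixed effect), the iteration x stimulus
# interaction, and a random intercept for observer pair:
m_oo <- lmer(oo_kappa ~ iteration * stimulus + (1 | pair), data = df)
summary(m_oo)
# F-tests of the fixed effects (as reported in the text) can be obtained, for
# example, by refitting with lmerTest::lmer() and calling anova().
```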

The results indicated a marginally significant effect of iteration on OO kappa, F(2, 104) = 3.00, stdev_pair³ = .053, p = .088, B = .031, SE = .012. The iteration effect was not significant for the SO model (p = .77), thereby suggesting no relationship (B = -.003, SE = .007) between iteration and SO kappa. The iteration × stimulus interaction was also marginally significant for OO kappa, F(1, 104) = 3.34, p = .07, but not for SO kappa (p = .802). Thus, there appears to be a linear trend across iterations for OO but not SO kappa.

³ Standard deviation of the variance of the observer pair random effect.

3.3 Nonlinear Iteration Effect

The iteration effects are slopes of linear regression models, so it appears that each iteration resulted in a .031 (marginally significant) increase in OO kappa, while SO kappas were virtually unchanged (-.002 effect). However, a closer look at Figure 5 reveals that the per-iteration OO kappa improvement of .031 might not be entirely accurate due to the assumption of linear growth. According to Figure 5, OO kappas increased from iterations 1-3, after which they remained relatively stable. To test this more formally, we repeated the mixed-effects modeling after dividing the OO data into three separate subsets: iterations 1-3, 4-6, and 7-9. We note a significant iteration effect (B = .043, p = .024) for iterations 1-3, but no notable effect for iterations 4-6 (B = .0002, p = .405) or iterations 7-9 (B = .008, p = .402). The iteration × stimulus interaction terms were not significant (ps > .556) for any of these models, suggesting similar results across stimuli. Thus, although the average OO kappa increased (albeit with marginal statistical significance) from 0.23 (iteration 1) to 0.38 (iteration 9), the significant improvement occurred across iterations 1-3 (mean kappa of 0.36 at iteration 3). The effective improvement in kappas (compared to iteration 1) was 0.15 (64%) after all nine iterations and 0.12 (53%) after the first three iterations.

Figure 6 provides a visualization of the nonlinear relationship between OO kappa and iteration. It was produced using a generalized additive mixed model (GAMM) approach [52], which consists of modeling a response variable via an additive combination of parametric and nonparametric smooth functions of predictor variables. GAMMs extend standard generalized additive models (GAMs) by modeling autocorrelations among residuals (since the data are time series). The response variable in our model was OO kappa, there was no parametric predictor, and iteration was the nonparametric smooth predictor. The random effect was observer pair, and a first-order autoregressive function was used. The smooth function was significant for the OO model (p = .022) shown in Figure 6, but not for the SO model (p = .941; not shown here). As Figure 6 indicates, we note a steady improvement in OO kappa over the early iterations, after which it tapers off.

3.4 Bootstrapping Analysis

The small sample size of 12 observer pairs raises issues of overfitting. We addressed this question by performing a bootstrapping analysis, which is the recommended technique to assess overfitting for models constructed on small data sets [70]. The analysis proceeded by recomputing the linear regression models across 1,000 samples, with a subset of the data sampled at the observer-pair level (with replacement). The boot package in R was used for the requisite computation. The results yielded a mean (across the 1,000 bootstrap samples) OO iteration effect of .030, which was very similar to the .031 effect on the entire data set. When restricted to data from iterations 1-3, the bootstrapped OO effect of .042 was virtually identical to the .043 effect obtained on the entire data set. Thus, the bootstrapping analysis confirms that the iteration effect on OO kappas was not merely due to overfitting.



Figure 6. Nonlinear relationship between iteration (X-axis) and inter-observer (OO) kappa (Y-axis) from a generalized additive mixed model after z-score standardizing kappa by observer pair.
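The GAMM and a pair-level bootstrap can be sketched as follows. This uses mgcv::gamm with an AR(1) residual structure and a hand-rolled cluster bootstrap (the paper reports using the boot package), again on synthetic placeholder data rather than the study data, so it is an illustration of the modeling approach and not the authors' code.

```r
# Sketch of the Section 3.3 GAMM and a Section 3.4-style pair-level bootstrap.
# Synthetic placeholder data; not the study data.
library(mgcv)   # gamm()
library(nlme)   # corAR1()
library(lme4)   # lmer(), fixef() for the bootstrap refits

set.seed(42)
df <- expand.grid(pair = factor(1:12), iteration = 1:9)
df$oo_kappa <- 0.25 + 0.02 * pmin(df$iteration, 3) + rnorm(nrow(df), sd = 0.10)

# Smooth of iteration, random intercept per observer pair, AR(1) residuals:
g_oo <- gamm(oo_kappa ~ s(iteration, k = 5),
             random = list(pair = ~ 1),
             correlation = corAR1(form = ~ iteration | pair),
             data = df)
summary(g_oo$gam)   # significance of the smooth term

# Cluster bootstrap: resample observer pairs with replacement, refit the linear
# iteration model, and collect the fixed-effect slope each time.
boot_slopes <- replicate(200, {
  ids <- sample(levels(df$pair), replace = TRUE)
  d_b <- do.call(rbind, lapply(seq_along(ids), function(i) {
    d_i <- df[df$pair == ids[i], ]
    d_i$pair <- factor(i)   # relabel so repeated pairs remain distinct clusters
    d_i
  }))
  fixef(lmer(oo_kappa ~ iteration + (1 | pair), data = d_b))["iteration"]
})
mean(boot_slopes)   # bootstrapped mean of the iteration effect
```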

4 DISCUSSION

The present study focused on inter-observer (OO) reliability in affect annotation, a critical issue for affect detection systems that rely on supervised learning. We hypothesized that OO reliability could be improved via an iterative annotation process in which observers attempt to achieve consensus on a small subset of the data before independently annotating the remaining data. Whether the iterative procedure results in improved OO reliability was unknown, as was its influence on self-observer (SO) reliability. Therefore, we conducted a study to investigate the potential of the iterative approach for improving both OO and SO reliability. In this section, we consider our major findings, their implications, limitations, and future work.

4.1 Main Findings and Implications

Observer-observer agreement. Our first research question focused on the extent and rate of change in OO agreement across iterations. The results indicated that (on average) OO reliabilities increased by 64% across nine iterations. Most of the improvement (53%) occurred over the first three iterations, which suggests that the iterative approach can be effectively implemented without being overly cumbersome. Furthermore, the final OO kappa of 0.38 (from iteration 9) is within the range reported in the studies reviewed in Table 1; this indicates that the present results are consistent with previous work despite starting with much lower initial OO kappas (0.23 after iteration 1). Taken together, these findings suggest that the iterative approach holds promise as one method to improve OO reliabilities.

An astute reader might lament the fact that, despite demonstrating improvement, the final OO kappa was not in the substantial range of 0.6 to 0.8 [51]. We acknowledge this point, but emphasize that, as the name implies, the iterative approach is fundamentally incremental. It can lead to incremental improvements over an initial point, not radical improvements. It is akin to a hill climbing procedure that can incrementally reach the peak of a hill or a local maximum, but one needs to climb an entirely different hill for dramatic improvements.


Taking this analogy a bit further, the "hill" represents the basic affordances of the affective data, including the quality of the stimuli, the level of intensity and expressivity of affect, and the inherent discriminability of the affective states. Different annotation tasks represent different hills, each presumably affording different levels of achievable kappa.

This raises the question of whether there should be a recommended minimum kappa for all affect annotation tasks in the field of affective computing and, if so, what it should be. One extreme position is that anything short of the recommended 0.6 kappa [51] is unacceptable. That being said, it should be noted that this recommended minimum kappa is mainly applicable when the decisions are clear-cut and decidable, such as coding pronounced facial expressions, basic gestures, and other directly perceptible behavior. This is rarely the case when annotating naturalistic affective experiences: affective expressions are not mere "readouts" of a person's affective state, because a variety of factors (e.g., intensity, motivations, context) introduce considerable ambiguity into the link between felt and expressed affect [33, 59-61] (also see Section 1.2). Lower kappa scores are to be expected when annotating affect, since it is a latent state that is ill-defined and possibly indeterminate. Therefore, it might be less meaningful to strictly rely on a one-size-fits-all agreement criterion that is completely oblivious to the nuances of the specific annotation task, both in terms of the nature of the stimuli (e.g., acted vs. naturalistic affective displays) and the annotation procedure itself (e.g., number of states annotated, number of annotators, unimodal vs. multimodal annotation). Indeed, the reliance on context-free benchmarks to categorize "acceptable" agreement values has been criticized as being implausible ("Our computations suggest no one value of kappa can be accepted as adequate, as convenient as this might be" – p. 368 [71]) or even harmful [72]. Further, Hallgren [73] notes that Krippendorff [74] points out that "acceptable IRR estimates will vary depending on the study methods and the research question."

Thus, rather than relying on a one-size-fits-all threshold for acceptable agreement, in our view it is more meaningful to demonstrate incremental improvement over some starting point along with a moderate ending point. The iterative approach meets these two basic requirements (for OO agreement) because it resulted in an ending kappa in the fair range (kappa around 0.4) compared to a starting kappa in the poor range (kappa around 0.2). That being said, the ending kappa of 0.4 might not be adequate for accurate affect detection. This is because traditional supervised classifiers and the corresponding metrics used to evaluate classification accuracy are ill equipped to handle ambiguity in class labels. Low reliability in annotations could result in less reliable affect detectors and low confidence in the evaluation of such detectors, because the ground truth is ambiguous. One solution is to compensate for the inherent ambiguity in affect annotation using multilabel classification techniques (not to be confused with multi-class classification [75]), where multiple viewpoints on the same affective state can co-exist. Another is to attempt to increase reliability, and the simplest way to do this is to increase the number of observers [9, 52].


Or perhaps it might be advisable to consider both options.

Self-observer and observer-observer agreement. Our second research question was concerned with the extent and rate of change in self-observer (SO) agreement across iterations, while the third question involved comparisons between OO and SO agreement. The results indicated that the measurable improvements in OO kappas were not accompanied by corresponding improvements in SO kappas, which were substantially lower. This raises the question of whether the lower SO kappas were an artifact of the iterative approach. Is it the case that observers achieve improved reliabilities because they are simply attuning to each other, implicitly or explicitly co-constructing a set of facial display rules that are then used to guide affect judgments in subsequent phases (a similar phenomenon has been shown in studies demonstrating the evolution of mini-languages in lab settings [76])? Problems occur when the display rules created by the observer pairs resemble a sort of mini-culture that does not align with broader cultural interpretations of affective displays.

At first blush, the present data appear to support observer attunement, but a closer look at the literature indicates that this might not necessarily be the case. The few studies that have compared OO and SO reliabilities without implementing an iterative approach have all concluded that observer-annotations rarely align with self-annotations. Available examples include SO kappas of .08 to .16 [45], .19 to .29 [48], .03 to .28 [49], and .03 to .11 [7]. The present study achieved an SO kappa of .076, which is within the range of these previous studies that did not adopt an iterative approach. This suggests that the low SO kappas obtained here might not be an artifact of the iterative process itself, but more a symptom of the underlying discrepancy between observer- and self-annotations. As noted above, almost all affect annotation studies focus on either observer- or self-annotation, but rarely both.

Thus, the results of this study suggest that an improvement in OO agreement achieved by the iterative approach, or by some other means, does not guarantee a similar improvement in SO agreement. This finding suggests that considering both self- and observer-based annotation might be the most defensible approach. Extrapolating further, it might also be fruitful to reconsider the notion of an "absolute" ground truth, which assumes the existence of a specific affective state in all situations. Rather, in some cases, it might be more prudent to settle for a "fuzzy" ground truth composed of multiple viewpoints reflected by annotations provided by the self as well as multiple observers. This would require the use of multilabel classification techniques (see above) when engineering the affect detectors.

4.2 Limitations

There are a number of limitations to the present study. First, the study utilized a small number of observers, who were students enrolled in a qualitative research methods course. The individual iterations were also quite short (5 minutes), and the affective data emphasized visual cues in the form of facial expressions and upper body movements.


Additionally, the annotation subsections were selected from the middle 5 minutes of each session, akin to a "thin slice" sampling approach [77], as opposed to annotating the entire session. These were all design decisions based on what could be accomplished in a single study, but they raise the question of the generalizability of the findings more broadly. Therefore, replication with a larger number of observers with different backgrounds (e.g., Mechanical Turk workers vs. clinical psychologists), across longer annotation sessions, and with more diverse affective cues (specifically audio-visual cues) is warranted.

A second limitation is that the study failed to rule out practice effects as one reason for the observed improvement in OO agreement. In other words, perhaps OO agreement improved across iterations simply because the observers became more familiar with the annotation task. At this point, we can only argue against practice effects instead of ruling them out empirically. The argument goes as follows. Each observer annotated 27 different videos in three sessions of nine iterations each. If practice effects were in play, then OO agreement for the early iterations in Session 2 would be similar to that of the late iterations in Session 1. Similarly, early Session 3 agreement should be similar to late Session 2 agreement. This was not the case, as agreement in early iterations was lower than agreement in late iterations. Furthermore, practice effects would result in continued improvement across all nine iterations, when in fact improvement was only noted for the first three iterations. Therefore, although practice effects might have had some effect, they likely did not play a major role. That being said, the issue of practice effects does need to be ruled out more conclusively, as one can find weaknesses in our argument. For instance, practice effects within a given session might saturate after three iterations rather than increasing across nine iterations. Hence, replicating the present study with a control condition that engages in iterative annotation, but without discussion and consensus-building with another observer, is an important future work item needed to empirically address the issue of practice effects.

A third limitation has to do with the fact that the observer-annotations were limited to 5-minute intervals from the middle portion of each video, while the self-annotations were done on the entire video. This might explain the overall low SO reliabilities. However, we do not suspect this to be the case, because the present SO kappas are similar to those from other studies where observers annotated entire sessions, similar to the self-annotations. In particular, Graesser, et al. [45] report an SO kappa of 0.08 when undergraduate students annotated videos of entire AutoTutor sessions. Similarly, in D'Mello, et al. [49], two teachers annotated the entire first or second half of AutoTutor videos and achieved SO kappas of .03 and .11. As discussed in Sections 1.2 and 1.5, the low SO kappas are more likely to be a byproduct of the assumptions underlying the basic task of affect annotation, regardless of whether an iterative approach is applied (but see limitation 5 below).

Fourth, we did not record the discussion period following the annotations, thereby precluding a detailed analysis of the content of the discussions. This also did not allow us to study the impact of observer personality on the discussions.

DMELLO: THE ITERATIVE APPROACH TO AFFECT ANNOTATION

us to study the impact of observer personality on the discussions. For example, do extroverted and highly charismatic observers dominate and sway the discussion in one direction? How do hierarchical relationships among observers (e.g., 4th vs. 3rd year student; professor vs. graduate student) influence the tone of the discussions? In general, observers can vary across a number of dimensions, such as age, gender, ethnicity, personality, prior experience, and so on. This raises the critical issue of studying how observer match or mismatch influences the discussion and subsequently inter-observer agreement. These questions can be addressed in future studies by collecting individual difference measures on the observers and by recording, transcribing, and coding the content of the discussions. One can then study how OO agreement is mediated by the nature of the discussion and if inter-observer personality differences moderates these effects. Fifth, the fair OO reliability of roughly 0.4 kappa might raise questions pertaining to the validity of comparisons between observer- and self- annotations. We provided multiple reasons for the low SO reliability (see Sections 1.2, 1.5, and 4.1), but it would be prudent to increase OO reliability in order to more confidently address the issue of low SO reliability. As noted above, both Rosenthal [9] and McKeown and Sneddon [52] provide guidelines for selection of an appropriate number of observers needed to achieve a desired level of effective reliability. Therefore, only after increasing OO reliability (by increasing the number of observers) can we make more definitive statements on the status of SO reliability.
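To make the last point concrete, the following minimal sketch (in Python) illustrates one common way such guidelines are operationalized. It assumes the standard Spearman-Brown prophecy formula for the effective reliability of a panel of observers; the function names and the numbers in the example are our own illustrative choices rather than values taken from Rosenthal [9] or McKeown and Sneddon [52].

# Illustrative sketch (an assumption, not the cited guidelines themselves):
# effective reliability of the mean judgment of n observers via the
# Spearman-Brown prophecy formula, and the panel size needed for a target.

def effective_reliability(mean_r: float, n_observers: int) -> float:
    """Spearman-Brown estimate of the reliability of the mean of n observers."""
    return (n_observers * mean_r) / (1 + (n_observers - 1) * mean_r)

def observers_needed(mean_r: float, target_r: float) -> int:
    """Smallest panel size whose pooled judgment reaches the target reliability."""
    assert 0 < mean_r < 1 and 0 < target_r < 1
    n = 1
    while effective_reliability(mean_r, n) < target_r:
        n += 1
    return n

# Example: if pairwise agreement hovers around .40, roughly six observers
# would be needed for an effective reliability of .80.
print(observers_needed(0.40, 0.80))  # prints 6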

4.3 Future Work
Future work is needed to address the aforementioned limitations and to expand the scope of the iterative approach by applying it to a broader landscape of affect annotation tasks. In particular, the present study only considered two observers simultaneously engaging in the iterative annotation task. However, more observers might be needed to improve reliability beyond what can be accomplished by two observers alone. This raises several intriguing questions. For example, what is the best way to adapt the basic approach to multiple observers, and what is the anticipated impact on OO and SO agreement when multiple observers are involved? Adapting the iterative approach to more than two observers and studying the effects on OO and SO reliability is an important avenue for further research.
There is also the potential of adapting the present methodology to study observer attunement: the phenomenon whereby two observers create a set of display rules, or a mini-culture, specific to the stimuli. This could be studied by replicating the same study with the exact same stimuli but with a different set of observer pairs (e.g., A', B', C', D'). One can then compare agreement between the current observer pairs and the new observer pairs (i.e., A-B vs. A'-B'). Similar ratings between the two sets of OO pairs would suggest that the observers are approximating a broader cultural interpretation of the affective displays, whereas highly dissimilar ratings would indicate that the observers formed a mini-culture specific to the stimuli.

Annotations in many affective computing applications are mainly used to provide labeled data for supervised classification (as discussed in the Introduction). This raises the issue of whether the improved OO agreement obtained with an iterative approach actually results in stronger correlations between diagnostic features (e.g., facial expressions, acoustic-prosodic cues) and affective states, thereby ultimately yielding more accurate affect detectors. Systematic experiments are needed to study how affect detection accuracies are influenced by different annotation schemes (e.g., single observers, two observers with/without the iterative approach, multiple observers with/without the iterative approach), both when inter-observer agreement is low and when it is high, and when affective expressions are muted vs. highly expressive.
In general, affect annotation is a multidimensional problem encompassing the sources, timing, temporal resolution, data modality, and level of abstraction of the annotations (see Section 1.1). It is difficult (if not impossible) for a single study to simultaneously consider all of these dimensions, so it is judicious to consider the generalizability of the present results to the broader landscape of affect annotation. The present study primarily varied the source (self vs. observer) of the annotations, while keeping the number of observers (two), timing (offline), representation (categorical), modality (face + screen), temporal resolution (tens of seconds), and level of abstraction (affective states) constant. It also considered two different affective stimuli that varied with respect to the populations sampled, the human-computer interaction contexts, and the affective states annotated. The fact that the main findings replicated across these two stimuli despite considerable differences provides some evidence of the generalizability of the iterative approach.
Although we have some confidence that the basic methodology of the iterative approach can be applied to a variety of annotation tasks, future research is needed to ascertain whether the present pattern of findings replicates across different annotation tasks. Fortunately, the simplicity of the iterative approach lends itself to a variety of affect annotation tasks (e.g., different affective states, different temporal resolutions, different modalities, dimensional representations). Application to some of these tasks will require only trivial changes (e.g., swapping out one list of affective states for another), whereas others will require a redesign of the annotation software or protocol. For example, if a time-continuous annotation scheme is used (e.g., frame-level annotations over time), then the discussion period might be best facilitated by merging and displaying the time series generated by the two annotators. It might also be beneficial to perform preliminary analyses on the two time series (e.g., cross-correlations) to identify and display periods of agreement/disagreement prior to the discussion; a possible sketch follows.
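As a minimal sketch of this suggestion (our own illustration in Python, not part of the present study's software or protocol), the snippet below computes a normalized cross-correlation between two time-continuous annotation traces and flags low-agreement windows that could be surfaced first during the discussion. The window size, thresholds, and synthetic traces are assumptions.

# Hypothetical sketch (not from the study): pre-discussion analysis of two
# time-continuous annotation traces (equal-length numpy arrays sampled at the
# same rate). Window size, thresholds, and the synthetic example are illustrative.

import numpy as np

def cross_correlation(a, b, max_lag=50):
    """Normalized cross-correlation of the two traces for lags -max_lag..max_lag."""
    # Requires max_lag < len(b); at lag 0 this equals the Pearson correlation.
    a0 = (a - a.mean()) / (a.std() * len(a))
    b0 = (b - b.mean()) / b.std()
    full = np.correlate(a0, b0, mode="full")
    zero = len(b) - 1  # index of the zero-lag term in the 'full' output
    return full[zero - max_lag: zero + max_lag + 1]

def disagreement_windows(a, b, window=30, r_floor=0.3, mad_ceiling=1.0):
    """Windows with low within-window correlation or large mean absolute difference."""
    flagged = []
    for start in range(0, len(a) - window + 1, window):
        seg_a, seg_b = a[start:start + window], b[start:start + window]
        r = np.corrcoef(seg_a, seg_b)[0, 1]
        mad = np.mean(np.abs(seg_a - seg_b))
        if np.isnan(r) or r < r_floor or mad > mad_ceiling:
            flagged.append((start, start + window))
    return flagged

# Synthetic valence-like traces from two hypothetical observers:
rng = np.random.default_rng(1)
obs_a = np.cumsum(rng.normal(0, 1, 600)) / 10
obs_b = obs_a + rng.normal(0, 0.5, 600)
print(disagreement_windows(obs_a, obs_b))              # windows to surface first
print(cross_correlation(obs_a, obs_b).max().round(2))  # peak alignment strength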

4.4 Concluding Remarks
This paper sought to study the influence of an iterative affect annotation approach on OO and SO agreement in the context of annotating complex affective data. The key finding was that the iterative annotation approach yielded improvements in OO agreement up to a certain point, but had no effect on SO agreement, which was also notably lower. This raises the question of whether affective computing researchers should adopt the iterative approach for their affect annotation tasks. Our answer depends on the goals (improving OO agreement, SO agreement, or both) and the complexity of the annotation task (as reflected in baseline agreement scores without the iterative approach). In our view, the present study demonstrated that OO agreement can improve from the poor range (kappas around 0.2) to the fair range (kappas around 0.4) with just three annotate-discuss iterations, so the iterative approach might be best suited for similarly difficult (but not impossible) annotation tasks. With a bit of extrapolation, we might suggest that the iterative approach would also boost fair OO agreement toward the moderate range (kappas around 0.6). The iterative approach is unlikely to be helpful when baseline OO agreement is negligible (kappas near 0) and is likely not needed when OO agreement is already sufficiently high (kappas > .8). As noted from the selected survey in Table 1, a large number of studies have OO agreement baselines within the poor to fair range, suggesting considerable potential for the iterative approach.
It is imperative to point out that the iterative approach by itself did not dramatically improve OO reliability. The ending kappas were almost double the starting kappas, but were still only fair. This suggests an upper bound to the iterative approach when only two observers attempt a difficult annotation task. Hence, the best way forward would be to combine the iterative approach with other methods to improve reliability, such as increasing the number of observers or improving the annotation schemes themselves.
The iterative approach did not result in any improvement in SO agreement, so we would not advocate its use when the goal is to increase SO agreement. The low SO agreement might be partly attributable to low OO agreement, so SO agreement needs to be reconsidered after first increasing OO agreement (e.g., by using the iterative approach, a larger number of observers, and improved annotation methods). The low SO agreement might also be a more intrinsic issue pertaining to the unique perspectives of the self vs. observers when performing affect annotations (discussed in Sections 1.5 and 4.1). This raises the question of whether it might be possible to align the two perspectives, while being mindful of the affordances of the affective data being annotated. One intriguing option is to attune the observers to the self-annotations by modifying the iterative approach, as sketched below. If self-annotations are collected prior to the observer annotations, then each observer can be shown his or her alignment with the self-annotations rather than with the annotations of another observer. This process would be iteratively repeated for small segments of affective data, and there would be no opportunity for discussion. Whether this approach will be effective in bridging the gap between the self and observers, or whether never the twain shall meet, remains an empirical question.
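A minimal sketch of this modified procedure is given below (in Python). It is purely illustrative: it assumes categorical labels, a hypothetical segment size, and Cohen's kappa [39] as the alignment metric shown to the observer after each segment.

# Hypothetical sketch (not the study's annotation software): after each small
# segment, show an observer how well his/her categorical labels align with the
# participant's self-annotations, using Cohen's kappa as the agreement metric.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)

def attunement_feedback(self_labels, observer_labels, segment_size=10):
    """Per-segment kappa between observer and self labels, shown after each segment."""
    for start in range(0, len(self_labels), segment_size):
        seg_self = self_labels[start:start + segment_size]
        seg_obs = observer_labels[start:start + segment_size]
        kappa = cohens_kappa(seg_self, seg_obs)
        print(f"Segment {start // segment_size + 1}: kappa with self-annotations = {kappa:.2f}")

# Example with made-up labels:
attunement_feedback(
    ["boredom", "engagement", "confusion", "engagement"] * 5,
    ["boredom", "engagement", "frustration", "engagement"] * 5,
    segment_size=10,
)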

ACKNOWLEDGMENT
We thank John Nichols and Natalie Person for providing resources for data collection. This study would not have been possible without their support. This research was supported by the National Science Foundation (NSF) (ITR 0325428, HCC 0834847, and DRL 1235958). Any opinions, findings, conclusions, or recommendations expressed are those of the author and do not reflect the views of the NSF.

REFERENCES
[1] R.A. Calvo and S.K. D'Mello, "Affect detection: An interdisciplinary review of models, methods, and their applications," IEEE Transactions on Affective Computing, vol. 1, no. 1, 2010, pp. 18-37; DOI 10.1109/T-AFFC.2010.1.
[2] Z. Zeng, M. Pantic, G. Roisman and T. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, 2009, pp. 39-58.
[3] S.K. D'Mello and J. Kory, "Consistent but modest: Comparing multimodal and unimodal affect detection accuracies from 30 studies," Proceedings of the 14th ACM International Conference on Multimodal Interaction, L.-P. Morency, et al., eds., ACM, 2012, pp. 31-38.
[4] K. Forbes-Riley and D.J. Litman, "Benefits and challenges of real-time uncertainty detection and adaptation in a spoken dialogue computer tutor," Speech Commun., vol. 53, no. 9-10, 2011, pp. 1115-1136; DOI 10.1016/j.specom.2011.02.006.
[5] S. D'Mello, B. Lehman, J. Sullins, R. Daigle, R. Combs, K. Vogt, L. Perkins and A. Graesser, "A time for emoting: When affect-sensitivity is and isn't effective at promoting deep learning," Proceedings of the 10th International Conference on Intelligent Tutoring Systems, J. Kay and V. Aleven, eds., Springer, 2010, pp. 245-254.
[6] H.J. Bernardin and M.R. Buckley, "Strategies in rater training," Academy of Management Review, vol. 6, no. 2, 1981, pp. 205-212.
[7] S.K. D'Mello, N. Dowell and A.C. Graesser, "Unimodal and multimodal human perception of naturalistic non-basic affective states during human-computer interactions," IEEE Transactions on Affective Computing, vol. 4, no. 4, 2013, pp. 452-465.
[8] R. Morris and D. McDuff, "Crowdsourcing techniques for affective computing," The Oxford Handbook of Affective Computing, R. Calvo, et al., eds., Oxford University Press, 2014, pp. 384-394.
[9] R. Rosenthal, "Conducting judgment studies: Some methodological issues," Handbook of Nonverbal Behavior Research Methods in the Affective Sciences, J.A. Harrigan, et al., eds., Oxford University Press, 2005, pp. 199-236.
[10] S. Craig, S. D'Mello, A. Witherspoon and A. Graesser, "Emote aloud during learning with AutoTutor: Applying the facial action coding system to cognitive-affective states during learning," Cognition & Emotion, vol. 22, no. 5, 2008, pp. 777-788.
[11] J. Sabourin, B. Mott and J. Lester, "Modeling learner affect with theoretically grounded dynamic Bayesian networks," Proceedings of the Fourth International Conference on Affective Computing and Intelligent Interaction, S. D'Mello, et al., eds., Springer-Verlag, 2011, pp. 286-295.
[12] R. Baker and J. Ocumpaugh, "Interaction-based affect detection in educational software," The Oxford Handbook of Affective Computing, R. Calvo, et al., eds., Oxford University Press, 2015, pp. 233-245.
[13] J. Whitehill, Z. Serpell, Y.-C. Lin, A. Foster and J. Movellan, "The faces of engagement: Automatic recognition of student engagement from facial expressions," IEEE Transactions on Affective Computing, vol. 5, no. 1, 2014, pp. 86-98.
[14] J. Janssen, P. Tacken, J. de Vries, E. van den Broek, J. Westerink, P. Haselager and W. IJsselsteijn, "Machines outperform laypersons in recognizing emotions elicited by autobiographical recollection," Human-Computer Interaction, vol. 28, no. 6, 2013, pp. 479-517.
[15] B. McDaniel, S. D'Mello, B. King, P. Chipman, K. Tapp and A. Graesser, "Facial features for affective state detection in learning environments," Proceedings of the 29th Annual Meeting of the Cognitive Science Society, D. McNamara and G. Trafton, eds., Cognitive Science Society, 2007, pp. 467-472.
[16] P. Ekman, "An argument for basic emotions," Cognition & Emotion, vol. 6, no. 3-4, 1992, pp. 169-200.
[17] K.A. Lindquist, E.H. Siegel, K.S. Quigley and L.F. Barrett, "The hundred-year emotion war: Are emotions natural kinds or psychological constructions? Comment on Lench, Flores, and Bench (2011)," Psychological Bulletin, vol. 139, no. 1, 2013, pp. 264-268.
[18] H.C. Lench, S.W. Bench and S.A. Flores, "Searching for evidence, not a war: Reply to Lindquist, Siegel, Quigley, and Barrett (2013)," Psychological Bulletin, vol. 113, no. 1, 2013, pp. 264-268.
[19] J. Russell, "Core affect and the psychological construction of emotion," Psychological Review, vol. 110, 2003, pp. 145-172.
[20] R. Cowie, G. McKeown and E. Douglas-Cowie, "Tracing emotion: An overview," International Journal of Synthetic Emotions (IJSE), vol. 3, no. 1, 2012, pp. 1-17.
[21] C. Izard, "Innate and universal facial expressions: Evidence from developmental and cross-cultural research," Psychological Bulletin, vol. 115, 1994, pp. 288-299.
[22] C. Izard, "The many meanings/aspects of emotion: Definitions, functions, activation, and regulation," Emotion Review, vol. 2, no. 4, 2010, pp. 363-370; DOI 10.1177/1754073910374661.
[23] L. Barrett, "Are emotions natural kinds?," Perspectives on Psychological Science, vol. 1, 2006, pp. 28-58.
[24] L. Barrett, B. Mesquita, K. Ochsner and J. Gross, "The experience of emotion," Annu. Rev. Psychol., vol. 58, 2007, pp. 373-403; DOI 10.1146/annurev.psych.58.110405.085709.
[25] J.J. Gross and L.F. Barrett, "Emotion generation and emotion regulation: One or two depends on your point of view," Emotion Review, vol. 3, no. 1, 2011, pp. 8-16.
[26] P. Ekman, "Strong evidence for universals in facial expressions: A reply to Russell's mistaken critique," Psychological Bulletin, vol. 115, no. 2, 1994, pp. 268-287.
[27] P. Ekman, "Expression and the nature of emotion," Approaches to Emotion, K. Scherer and P. Ekman, eds., Erlbaum, 1984, pp. 319-344.
[28] D. Keltner and P. Ekman, "Facial expression of emotion," Handbook of Emotions, 2nd ed., R. Lewis and J.M. Haviland-Jones, eds., Guilford, 2000, pp. 236-264.
[29] L. Camras and J. Shutter, "Emotional facial expressions in infancy," Emotion Review, vol. 2, no. 2, 2010, pp. 120-129.
[30] A.J. Fridlund, P. Ekman and H. Oster, "Facial expressions of emotion," Nonverbal Behavior and Communication, 2nd ed., A.W. Siegman and S. Feldstein, eds., Erlbaum, 1987, pp. 143-223.
[31] W. Ruch, "Will the real relationship between facial expression and affective experience please stand up: The case of exhilaration," Cognition & Emotion, vol. 9, no. 1, 1995, pp. 33-58.
[32] J.A. Russell, J.A. Bachorowski and J.M. Fernandez-Dols, "Facial and vocal expressions of emotion," Annu. Rev. Psychol., vol. 54, 2003, pp. 329-349.
[33] B. Parkinson, A.H. Fischer and A.S. Manstead, Emotion in Social Relations: Cultural, Group, and Interpersonal Processes, Psychology Press, 2004.
[34] D. Goleman, Emotional Intelligence, Bantam Books, 1995.
[35] H. Elfenbein and N. Ambady, "On the universality and cultural specificity of emotion recognition: A meta-analysis," Psychological Bulletin, vol. 128, no. 2, 2002, pp. 203-235; DOI 10.1037//0033-2909.128.2.203.
[36] H. Elfenbein and N. Ambady, "Is there an ingroup advantage in emotion recognition?," Psychological Bulletin, vol. 128, 2002, pp. 243-249.
[37] M. Mehu and K. Scherer, "A psycho-ethological approach to social signal processing," Cognitive Processing, vol. 13, no. 2, 2012, pp. 397-414.
[38] A.F. Hayes and K. Krippendorff, "Answering the call for a standard reliability measure for coding data," Communication Methods and Measures, vol. 1, no. 1, 2007, pp. 77-89.
[39] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, 1960, pp. 37-46.
[40] D. Litman and K. Forbes-Riley, "Predicting student emotions in computer-human tutoring dialogues," Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2004, pp. 352-359.
[41] B. Schuller, R. Müller, F. Eyben, J. Gast, B. Hörnler, M. Wöllmer, G. Rigoll, A. Höthker and H. Konosu, "Being bored? Recognising natural interest by extensive audiovisual integration for real-life application," Image and Vision Computing, vol. 27, no. 12, 2009, pp. 1760-1774.
[42] H. Pon-Barry, S.M. Shieber and R. Cowie, "Recognizing uncertainty in speech," EURASIP Journal on Advances in Signal Processing, vol. 2011, 2010, pp. 18.
[43] B. Lehman, M. Matthews, S. D'Mello and N. Person, "What are you feeling? Investigating student affective states during expert human tutoring sessions," Proceedings of the 9th International Conference on Intelligent Tutoring Systems, B. Woolf, et al., eds., Springer, 2008, pp. 50-59.
[44] S. Afzal and P. Robinson, "Natural affect data - Collection & annotation in a learning context," Proceedings of the 2009 International Conference on Affective Computing & Intelligent Interaction, 2009.
[45] A. Graesser, B. McDaniel, P. Chipman, A. Witherspoon, S. D'Mello and B. Gholson, "Detection of emotions during learning with AutoTutor," Proceedings of the 28th Annual Conference of the Cognitive Science Society, R. Sun and N. Miyake, eds., Cognitive Science Society, 2006, pp. 285-290.
[46] J. Ang, R. Dhillon, A. Krupski, E. Shriberg and A. Stolcke, "Prosody-based automatic detection of annoyance and frustration in human-computer dialog," Proceedings of the International Conference on Spoken Language Processing, 2002, pp. 2037-2039.
[47] I. Shafran, M. Riley and M. Mohri, "Voice signatures," Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 2003, pp. 31-36.
[48] A. Alhothali, "Modeling user affect using interaction events," Computer Science, University of Waterloo, Waterloo, Canada, 2011.
[49] S.K. D'Mello, R. Taylor, K. Davidson and A. Graesser, "Self versus teacher judgments of learner emotions during a tutoring session with AutoTutor," Proceedings of the 9th International Conference on Intelligent Tutoring Systems, B. Woolf, et al., eds., Springer-Verlag, 2008, pp. 9-18.
[50] S.K. D'Mello, N. Dowell and A. Graesser, "Unimodal and multimodal human perception of naturalistic non-basic affective states during human-computer interactions," IEEE Transactions on Affective Computing, vol. 4, no. 4, 2013, pp. 452-465.
[51] J.R. Landis and G.G. Koch, "The measurement of observer agreement for categorical data," Biometrics, vol. 33, no. 1, 1977, pp. 159-174.
[52] G.J. McKeown and I. Sneddon, "Modeling continuous self-report measures of perceived emotion using generalized additive mixed models," Psychological Methods, vol. 19, no. 1, 2014, pp. 155-174.
[53] P. Domingos, "A few useful things to know about machine learning," Communications of the ACM, vol. 55, no. 10, 2012, pp. 78-87.
[54] D. McDuff, R. Kaliouby and R.W. Picard, "Crowdsourcing facial responses to online videos," IEEE Transactions on Affective Computing, vol. 3, no. 4, 2012, pp. 456-468.
[55] J. Hernandez, E. Hoque, W. Drevo and R. Picard, "Mood meter: Counting smiles in the wild," Proceedings of the 2012 ACM Conference on Ubiquitous Computing, ACM, 2012, pp. 301-310.
[56] R.E. Nisbett and T.D. Wilson, "Telling more than we can know: Verbal reports on mental processes," Psychological Review, vol. 84, no. 3, 1977, pp. 231-259.
[57] S.J. Heine, D.R. Lehman, K. Peng and J. Greenholtz, "What's wrong with cross-cultural comparisons of subjective Likert scales?: The reference-group effect," Journal of Personality and Social Psychology, vol. 82, no. 6, 2002, pp. 903-918.
[58] J.A. Krosnick, "Survey research," Annu. Rev. Psychol., vol. 50, no. 1, 1999, pp. 537-567.
[59] J.M. Carroll and J.A. Russell, "Do facial expressions signal specific emotions? Judging emotion from the face in context," Journal of Personality and Social Psychology, vol. 70, no. 2, 1996, pp. 205-218.
[60] J. Russell, "Is there universal recognition of emotion from facial expression - A review of the cross-cultural studies," Psychological Bulletin, vol. 115, no. 1, 1994, pp. 102-141.
[61] A.J. Fridlund, Human Facial Expression: An Evolutionary View, Academic Press, 1994.
[62] P.M. Podsakoff, S.B. MacKenzie, J.Y. Lee and N.P. Podsakoff, "Common method biases in behavioral research: A critical review of the literature and recommended remedies," Journal of Applied Psychology, vol. 88, no. 5, 2003, pp. 879-903.
[63] K. Porayska-Pomsta, M. Mavrikis, S.K. D'Mello, C. Conati and R. Baker, "Knowledge elicitation methods for affect modelling in education," International Journal of Artificial Intelligence in Education, vol. 22, 2013, pp. 107-140.
[64] S. D'Mello and A. Graesser, "Feeling, thinking, and computing with affect-aware learning technologies," The Oxford Handbook of Affective Computing, R. Calvo, et al., eds., Oxford University Press, 2015.
[65] K. VanLehn, W. Burleson, S. Girard, M.E. Chavez-Echeagaray, J. Gonzalez-Sanchez, Y. Hidalgo-Pontet and L. Zhang, "The Affective Meta-Tutoring project: Lessons learned," Proceedings of Intelligent Tutoring Systems, Springer, 2014, pp. 84-93.
[66] A. Metallinou and S. Narayanan, "Annotation and processing of continuous emotional attributes: Challenges and opportunities," 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG 2013), IEEE, 2013, pp. 1-8.
[67] S. D'Mello, B. Lehman and N. Person, "Monitoring affect states during effortful problem solving activities," International Journal of Artificial Intelligence in Education, vol. 20, no. 4, 2010, pp. 361-389.
[68] S. D'Mello, "A selective meta-analysis on the relative incidence of discrete affective states during learning with technology," Journal of Educational Psychology, vol. 105, no. 4, 2013, pp. 1082-1099.
[69] J.C. Pinheiro and D.M. Bates, Mixed-Effects Models in S and S-PLUS, Springer Verlag, 2000.
[70] A.C. Davison and D.V. Hinkley, Bootstrap Methods and Their Application, Cambridge University Press, 1997.
[71] R. Bakeman, D. McArthur, V. Quera and B.F. Robinson, "Detecting sequential patterns and determining their reliability with fallible observers," Psychological Methods, vol. 2, no. 4, 1997, pp. 357-370.
[72] K.L. Gwet, Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters, Advanced Analytics, LLC, 2012.
[73] K.A. Hallgren, "Computing inter-rater reliability for observational data: An overview and tutorial," Tutorials in Quantitative Methods for Psychology, vol. 8, no. 1, 2012, pp. 23.
[74] K. Krippendorff, Content Analysis: An Introduction to Its Methodology, Sage, 1980.
[75] G. Tsoumakas and I. Katakis, "Multi-label classification: An overview," International Journal of Data Warehousing and Mining, vol. 3, no. 3, 2006, pp. 1-13.
[76] T.C. Scott-Phillips and S. Kirby, "Language evolution in the laboratory," Trends in Cognitive Sciences, vol. 14, no. 9, 2010, pp. 411-417.
[77] N. Ambady and R. Rosenthal, "Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis," Psychological Bulletin, vol. 111, no. 2, 1992, pp. 256.

Author Bio

Sidney D’Mello is an Assistant Professor in the departments of Computer Science and Psychology at the University of Notre Dame. His interests include affective computing, artificial intelligence, human-computer interaction, speech recognition, and natural language understanding. He has published over 180 journal papers, book chapters, and conference proceedings. He is an associate editor for IEEE Transactions on Affective Computing, IEEE Transactions on Learning Technologies, and IEEE Access. D’Mello received his Ph.D. in Computer Science from the University of Memphis in 2009.
