HOW CHILDREN LEARN TO PRONOUNCE: NOT BY IMITATION BUT BY THEIR MOTHERS’ VOCAL MIRRORING

Piers Messum

Department of Phonetics and Linguistics, University College London
[email protected]
ABSTRACT

It is generally assumed that children learn to pronounce speech sounds by imitation from adult models. This requires that a child creates some form of representation of a speech sound in a single modality, which he uses for both perception and production. It is usually imagined that this underlying representation is auditory/acoustic, but arguments can also be made for motor/gestural alternatives, and recent neuroimaging evidence has added strength to the alternative views. But whether auditory or motor, it is supposed that the representation is abstracted from examples of the speech sound that the child hears, and then used to guide his own production of the corresponding speech sound. The debate on the underlying nature of speech representation has (rightly) occupied phonetics for many years. The presumption that speech must be encoded in a single modality is never challenged, because learning by direct imitation requires it; but this presumption should be challenged. It is not attested that children learn to pronounce speech sounds by imitation: this has only ever been an assumption. In fact, on closer examination we find good reasons to believe firstly, that children cannot learn the qualities of speech sounds this way; and secondly, that vocal mirroring (as explained below) is a plausible alternative. This implies that the inner representation of a speech sound is both auditory and motor at the same time, thus resolving the longstanding debate on the 'true' nature of speech and explaining some otherwise puzzling adult and developmental data.

Keywords: Speech development, pronunciation, imitation, underlying representation, shadowing.

1. INTRODUCTION

How is the mature skill of word pronunciation acquired? Although early words may be pronounced by mimicry of whole word shapes, it is uncontroversial that at some stage in development a particulate principle emerges. A child then starts to conceive of words as being made up of speech sounds (sub-word units of production forming part of a mental syllabary). If a particulate principle operates, then a distinction must be drawn between two activities required for word adoption: learning to pronounce words and learning to pronounce. The former is relatively simple once a child has solved the so-called ‘correspondence problem’ between the speech sounds he hears and those he makes (speech sounds that listeners must take to be equivalent to their own): the challenge is then one of recognising and sequencing the elements correctly. However, the correspondence problem must have been solved before this is possible; solving it is the latter challenge, that of learning to pronounce. Heyes represents this graphically as in figure 1, where the sequencing problem is represented by horizontal associations and the correspondence problem by the vertical associations between sensory input and motor output.

Figure 1: Parsing the input creates a sequencing (horizontal) specification, but the motor equivalents to the sensory elements identified (the vertical specification) must also have been established before word reproduction is possible.

Thus to learn the pronunciation of a word like “gruffalo”, a speaker may parse and then reproduce the word shape as three speech sounds: perhaps corresponding to “gru”, “ffa” and “lo”. To understand how mature word pronunciation operates, we have to ask how the pronunciation of each of these speech sounds has been learnt prior to this. The general assumption about this process is that sound qualities are copied: “Infants learn to produce sounds by imitating those produced by another and imitation depends upon the ability to equate the sounds produced by others with ones infants themselves produce.”

This assumption is based on the belief that speech sounds spoken within words are perceptually transparent; that is, that the experience of hearing a speech sound said by someone else as part of a word and the experience of hearing oneself saying a speech sound in a similar circumstance are similar and comparable. As a general example, the sight of another person making a circle with thumb and first finger and the sight of oneself doing the same thing would be perceptually transparent. However, ‘pouting’ would be perceptually opaque, because it can be seen on the face of another person but only felt on one’s own face, not visibly observed (in the absence of a device such as a mirror). There is a well-known problem with the supposed perceptual transparency of speech, arising from the objectively dissimilar output of the different-sized vocal tracts of adults and children. However, since young children appear able to normalise speech across a very wide range of speakers, it is assumed that this problem can be solved in a way that also contributes to the learning of the qualities of speech sounds. If this is so from the start, then it seems straightforward that a child should be able to learn speech sounds by mimicry, i.e. by a self-supervised matching-to-target process in which the child captures the output of others, captures his own output, compares the two, and moves his own output towards the adult norm as necessary.

2. WHY THE QUALITY OF A SPEECH SOUND CANNOT BE LEARNT BY MIMICRY

However, there are reasons why this may not be the case, particularly for mimicking speech sounds appearing within words as compared to mimicking the overall form of the speech signal.

Firstly, any particular instance of a speech sound is inherently meaningless, but as a member of the category of tokens that form the speech sound it has some meaningful characteristics. Under the mature paradigm, speech sounds have to be identified in order for a word to be reproduced. The ‘meaning-making’ mode of perception that a listener uses to achieve this is distinct from the sensory awareness required for mimicry. For this reason, adult learners of foreign languages engineer situations where they can attend to words said to them as sensory experiences, knowing in advance the ‘meaning’ (or speech sound categories) of what they expect to hear. Typically they ask for a word to be repeated to them slowly, setting themselves to hear it as a sound image rather than as a word. They then attend to the sensory experience that their own production of the word produces, and modify their production if they judge that it does not resemble the model sufficiently. Young children learning L1, however, have no similar control over the words presented to them. They have to identify these words (and any speech sounds within them, if appropriate), and having done so, the ephemerality of the input means that it is too late for them to attend to it for its sensory characteristics. (Imagine our perceptual abilities as two occupants of a car. The occupant who can recognise events outside has to alert the one who can analyse them; but when the car is moving fast, an event has passed by before the analyser can attend to it. Contrast this with non-ephemeral input – like the marks on a page that make up handwritten letters – where the ‘car’ is stationary and attention can move between recognition and analysis at leisure.)

A second problem for the learning of the qualities of speech sounds by copying may be bone-conducted sound. This distorts the perception of vowels, in particular, in adults, and may be even more destructive in children.

Thirdly, it may be that a speaker’s experience of his own output is not generated by hearing the sidetone, but by a forward model of what he intends to produce. He perceives what he intended to say.

One or more of these (and other) problems may be fatal to the ‘mimicry’ account of speech sound acquisition. Speech sounds would be perceptually opaque rather than perceptually transparent (i.e. it would not be possible for the child to compare the qualities of his mother’s speech sounds and his own), particularly when they appear embedded in words in a stream of communicative speech rather than in non-meaningful isolation. The former, of course, is the context in which children learn pronunciation.

3. HOW SPEECH SOUNDS CAN BE LEARNT BY MIRRORING

However, perceptual opacity does not preclude learning. An alternative mechanism can be employed, in the form of a literal or metaphorical ‘mirror’. By its use, the ‘imitator’/‘observer’ can inform himself of what he performs. For speech, such a mirror is provided by the many episodes of speech sound reformulation that occur in mother–child interaction. Pawlby reported that over 90% of ‘imitative’ exchanges for infants between 17 and 43 weeks were actually of this type, where a mother ‘imitates’ her child rather than vice versa. (Reformulation of a child’s vocal output by his mother continues until at least age 4.) Such reformulation transforms the child’s output into his mother’s interpretation of it within L1. It is therefore analogous to so-called ‘affect attunement’ rather than to simple mimicry. As the mother’s response comes within the context of an imitative exchange, it provides the child with the evidence he needs to deduce a correspondence between his output and the speech sound equivalent within L1 that she produces. He understands that his mother regards the two as equivalent, and relies on her judgment in this. In practice, suitable evidence of equivalence will appear as a response to a child’s vocal motor schemes (VMSs), that is, to articulatory gestures that have become reliable and repeatable. Yoshikawa et al. demonstrate the plausibility of such an approach, which is illustrated in figure 2.

Figure 2: The infant’s VMS α (lower left) creates a speech sound whose exact quality is not important. His mother interprets this as equivalent to her /x/ and produces a speech sound [x] in response. The child categorises this as /X/ and, knowing that his mother was imitating him, infers a correspondence between α and /X/.
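The mirroring exchange can be sketched as a toy simulation. All names, categories and the F1-like values below are invented for illustration; this is not Yoshikawa et al.'s model, only a minimal demonstration that a VMS-to-category map can be learnt from the mother's reformulations alone, without any acoustic comparison by the child.

```python
import random

# Crude stand-ins for the mother's L1 vowel categories (invented F1-like values, Hz)
ADULT_CATEGORIES = {"/a/": 800, "/i/": 300, "/u/": 350}

def mother_reformulates(infant_f1):
    """The mother interprets the infant's sound as her nearest L1 category."""
    return min(ADULT_CATEGORIES, key=lambda c: abs(ADULT_CATEGORIES[c] - infant_f1))

def learn_correspondences(vms_inventory, trials=50):
    """Each trial: the infant produces a VMS; the mother 'imitates' it;
    the infant records which category she produced in response."""
    counts = {vms: {} for vms in vms_inventory}
    for _ in range(trials):
        vms = random.choice(list(vms_inventory))
        heard = mother_reformulates(vms_inventory[vms])
        counts[vms][heard] = counts[vms].get(heard, 0) + 1
    # The correspondence is the mother's modal response to each VMS
    return {vms: max(c, key=c.get) for vms, c in counts.items() if c}

# The infant's acoustics differ from the adult's (smaller vocal tract),
# yet the mapping is still learnt, because it rests on her judgment alone.
vms_inventory = {"alpha": 1000, "beta": 400}
print(learn_correspondences(vms_inventory))
```

Note that the child never compares his own output with his mother's: the correspondence is carried entirely by her response within the imitative exchange.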
For the adoption of words from L1, the correspondences the child has learnt between his VMSs and adult speech sounds (some of them during the babbling phase) can be deployed in the opposite direction, to produce speech sounds equivalent to those he recognises. Thus the vocal mirror that his mother holds up to his output, rather than direct mimicry of her output on his part, is the bootstrap into his L1 pronunciation.
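This reverse deployment can be sketched minimally (the VMS names and sound labels here are invented placeholders): the learnt VMS-to-sound map is simply inverted, so that sounds recognised in an adult word select the child's own articulatory routines.

```python
# Invented VMS -> adult speech sound correspondences, as learnt from mirroring
correspondences = {"alpha": "/gru/", "beta": "/ffa/", "gamma": "/lo/"}

# Inverted for production: recognised sound -> the child's own VMS
production_map = {sound: vms for vms, sound in correspondences.items()}

def pronounce(word_as_sounds):
    """Sequence the VMSs that the mother's mirroring tied to each recognised sound."""
    return [production_map[s] for s in word_as_sounds]

print(pronounce(["/gru/", "/ffa/", "/lo/"]))  # ['alpha', 'beta', 'gamma']
```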
4. THE UNDERLYING REPRESENTATION OF SPEECH SOUNDS

Notice that the child does not deduce correspondences on the basis of sound similarity. His mother may (or may not) regard his output and her response as acoustically similar, but whatever the basis for her judgment of equivalence, he simply relies upon that judgment as conclusive. Thus the primary association he forms is between his articulation (a VMS) and the speech sound he hears in response. His inner representation of a speech sound is neither purely acoustic nor purely gestural, but a hybrid that we can call a pushmi-pullyu representation (PPR), following Millikan. This resolves some longstanding problems in speech. For example, Porter and Lubker reported insignificant differences between simple and choice reaction times in speech shadowing (the subjects shifting to either a predefined vowel or to the vowel being modelled when they detected a change in the signal). This unexpected result was confirmed and extended by Fowler et al., who pointed out that an explanation was possible if a theory allowed for “… articulatory properties as well as acoustic ones to be associated with phonological categories. From this perspective, in the choice task, listeners perceive the disyllables acoustically, but the consequence of mapping the cues onto a phonological category is that articulatory properties are made available.”
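A minimal sketch of this idea (the category and gesture labels are invented; the PPR itself is the paper's proposal, the code only illustrative): if the auditory category and its articulatory routine are stored in one object, then recognising the category retrieves the gesture in the same step, so the choice task requires no extra perception-to-production mapping beyond the simple task.

```python
from collections import namedtuple

# A pushmi-pullyu representation pairs an auditory category with its
# articulatory routine in a single, bidirectional unit.
PPR = namedtuple("PPR", ["category", "gesture"])

lexicon = [PPR("/a/", "open-jaw"), PPR("/i/", "spread-lips")]

def perceive(signal):
    """Recognising the category retrieves the whole PPR, so the gesture
    arrives with it -- there is no separate lookup stage to add latency."""
    return next(p for p in lexicon if p.category == signal)

# Choice shadowing: reproduce whichever vowel is heard.
heard = perceive("/i/")
print(heard.gesture)  # spread-lips
```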
Such a result would be expected from a PPR being the underlying representation for speech sounds.
5. SUMMARY

This account is developed in full elsewhere, together with an updated account of the replication of temporal phenomena. The overall picture argued for here is one where children are not ‘junior phoneticians’ but more pragmatic learners of how to speak. Instead of acquiring their pronunciation of L1 through close examination and modelling of the input signal, a combination of physiology, speech aerodynamics, articulatory experimentation and general social learning skills leads children to produce output that is similar to that of the speakers around them. But this is not the result of qualities or timing patterns of the speech signal itself being copied.

The inner representation created by a mirroring process is a PPR, an association between the sound category heard and the gestures needed to reproduce it. This reconciles the evidence presented on each side of the longstanding debate between those who characterise speech as an acoustic code and those who view it as gestures made audible (e.g. Stetson, Motor Theory). It further explains how perception may precede production, yet how the particular form of perception that supports production develops in tandem with it. This yields an integrated picture of production and perception in child speech development.¹

Finally, it should be noted that there is no evidence whatsoever for the conventional idea that speech sounds are learnt by imitation. This has simply been assumed, from the fact that children do come to produce speech that resembles that of the speakers around them, and in the absence of an account offering an alternative mechanism. In fact, all the data and circumstantial evidence argue in favour of a non-imitative account of the type described here.

¹ This account may also inform our understanding of mirror neurons. A MN would be a possible neurological structure to instantiate a PPR, since it responds both when an event is heard and when it is performed. One puzzle has been why MNs are found only in higher primates. If, however, MNs are created by the type of mirroring interaction I have described, rather than by simple imitation, then we would expect to find them only in cognitively highly sophisticated animals, since this interaction requires the observer to make a deduction of correspondence, based on a recognition that he is being imitated.

6. REFERENCES

Blakemore, S-J., Wolpert, D.M., Frith, C.D. 2002. Abnormalities in the awareness of action. Trends in Cognitive Sciences 6 (6), 237-242.
Chouinard, M.M., Clark, E.V. 2003. Adult reformulations of child errors as negative evidence. Journal of Child Language 30, 637-669.
Fowler, C.A., Brown, J.M., Sabadini, L., Weihing, J. 2003. Rapid access to speech gestures in perception: evidence from choice and simple response time tasks. Journal of Memory and Language 49, 396-413.
Guenther, F.H., Ghosh, S.S., Tourville, J.A. 2006. Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language 96, 280-301.
Heyes, C. 2001. Causes and consequences of imitation. Trends in Cognitive Sciences 5 (6), 253-261.
Howell, P. 1985. Acoustic feedback of the voice in singing. In: Howell, P., Cross, I., West, R. (eds), Musical Structure and Cognition, 259-286. New York: Academic Press.
Kohler, E., et al. 2002. Hearing sounds, understanding actions. Science 297, 846-848.
Kuhl, P.K. 1987. Perception of speech and sound in early infancy. In: Salapatek, P., Cohen, L. (eds), Handbook of Infant Perception, Vol. 2, 275-382. New York: Academic Press.
Kuhl, P.K. 1991. Perception, cognition, and the ontogenetic and phylogenetic emergence of human speech. In: Brauth, S.E., Hall, W.S., Dooling, R.J. (eds), Plasticity of Development, 79. Cambridge, MA: MIT Press.
McCune, L., Vihman, M.M. 1987. Vocal Motor Schemes. Papers and Reports in Child Language Development, Stanford University Department of Linguistics 26, 72-79.
Messum, P.R. 2003. Invariance of effort in child speech breathing as a 'fast and frugal' heuristic for the acquisition of durational phenomena in stress-accent languages. In: Solé, M.J., Recasens, D., Romero, J. (eds), 15th ICPhS, 2007-2010. Barcelona: Causal Productions.
Messum, P.R. 2005. Learning to talk: a non-imitative account of the replication of phonetics by child learners. In: Chatzidamianos, G. (ed), CamLing 2005: Proceedings of the University of Cambridge Third Postgraduate Conference in Language Research, 99-109.
Messum, P.R. 2007. On the role of imitation in learning to pronounce. PhD thesis, London University.
Millikan, R.G. 2005. Pushmi-pullyu representations. In: Language: A Biological Model. Oxford: OUP.
Pawlby, S.J. 1977. Imitative interaction. In: Schaffer, H.R. (ed), Studies in Mother-Infant Interaction, 203-223. London: Academic Press.
Porter, R.J., Lubker, J.F. 1980. Rapid reproduction of vowel-vowel sequences: evidence for a fast and direct acoustic-motoric linkage in speech. Journal of Speech and Hearing Research 23 (3), 593-602.
Stern, D.N. 1985. The Interpersonal World of the Infant. London: Karnac Books.
Yoshikawa, Y., Asada, M., Hosoda, K., Koga, J. 2003. A constructivist approach to infants' vowel acquisition through mother-infant interaction. Connection Science 14 (4), 245-258.