Across Acoustics

Reconsidering Classic Ideas in Speech Communication

August 07, 2023 ASA Publications' Office

Most researchers know the seminal articles that have impacted their field. Sometimes, though, the research in those articles can get misinterpreted or exaggerated, and those misunderstandings can take hold and reappear year after year. In this episode, we talk to the editors of the Special Issue on Reconsidering Classic Ideas in Speech Communication, Matthew Winn (University of Minnesota), Richard Wright (University of Washington), and Benjamin Tucker (Northern Arizona University), about ideas in Speech Communication that were reexamined in the special issue.

Read the Special Issue on Reconsidering Classic Ideas in Speech Communication.

Read more from The Journal of the Acoustical Society of America (JASA).

Learn more about Acoustical Society of America Publications.

Music Credit: Min 2019 by minwbu from Pixabay. https://pixabay.com/?utm_source=link-attribution&utm_medium=referral&utm_campaign=music&utm_content=1022

Kat Setzer  00:06

Welcome to Across Acoustics, the official podcast of the Acoustical Society of America's publications office. On this podcast, we will highlight research from our four publications. I'm your host, Kat Setzer, Editorial Associate for the ASA. Today, we're going to highlight another recent JASA special issue, the Special Issue on Reconsidering Classic Ideas in Speech Communication. Joining me today are the three editors of the special issue, Matthew Winn of the University of Minnesota, Richard Wright of the University of Washington, and Benjamin Tucker of Northern Arizona University. Thank you guys for taking the time to speak with me today. How are you doing?

 

Richard Wright  00:47

Great, doing great. It's a beautiful day and school just ended this week. So now I have some free time to do research.

 

Matthew Winn  00:55

Yeah, thanks. Good. Happy to join you.

 

Kat Setzer  00:58

So tell us a bit about yourselves and your research backgrounds.

 

Matthew Winn  01:01

This is Matt. I'm a professor of audiology at the University of Minnesota, as you said, and my work focuses on speech communication and people who have hearing loss. I worked for a short time as a clinical audiologist, but was originally drawn to this field because of my interest in speech acoustics. And in my lab, we work often with people who use cochlear implants, which are devices that help give a sense of hearing when a person can no longer benefit from hearing aids. Cochlear implants present a special challenge to speech perception because they distort the speech signal in ways that force a person to generally exert some more effort when they're listening to try to figure out what they're hearing. And most people in my lab work on measuring that effort and how to explain it. Particularly, we're trying to understand when a person has a moment of uncertainty when they have to figure out the last thing they heard and process that, but still pay attention to the next thing that they're about to hear.

 

Kat Setzer  01:57

Very interesting. That sounds like that probably impacts quite a few people.

 

Richard Wright  02:01

I'll go next. My name is Richard Wright, and I'm a professor of linguistics and phonetics at the University of Washington. My research focuses on the perception and production of speech, and I'm especially interested in how speech perception and speech production vary in context and how that shapes spoken language. So in addition to the basic acoustics of English, I'm also interested in language documentation from an acoustic point of view, and in perceptual strategies under distortion in listeners with hearing loss. I also do a little bit of speech technology, although that's been put on the back burner for now.

 

Kat Setzer  02:41

Okay, cool.

 

Benjamin Tucker  02:44

And then, I'm Benjamin Tucker. I'm a professor of Speech Science at Northern Arizona University. My research focuses on understanding, production, and perception of speech, but specifically, I'm interested in what happens when we're having an actual conversation. So often we bring people into the lab, and we have them record a list of words. And then we try to do a perception study on that. But I want to know what happens when we're actually using real speech and having conversations and how that's different than what we normally would record in the lab. I also, like Richard, dabble in doing work on endangered languages and doing stuff in speech technology as well.

 

Kat Setzer  03:27

Nice. So how did this special issue come about?

 

Matthew Winn  03:31

Well, there were two main forces that set this issue into motion. First, the three of us and, you know, most of our friends at the Acoustical Society are often chatting about how big ideas are sometimes misinterpreted or exaggerated, or how they diverge from the actual original paper that they came from. And you see the same misunderstandings year after year. And there's frustration with this. And then that turns into mostly just side-room chatter, hallway chatter at conferences, when all of us are saying, "You know, we really should set the record straight on these things." I mean, not to, you know, chastise people, but to prevent people from wasting their time on things that might be misunderstood. So the energy and the ideas were already floating in our heads. And I get along with people who really want to do this, because that shows their passion for getting things right. So the energy was there. But then the second major thing that set this into motion was a special session put together by Chris Stecker, who's a scientist at Boys Town National Research Hospital. He organized a session several years ago now with a title like "Pruning ideas from the garden of psychoacoustics"-- that's the poetic title, but it was generally the same theme. He invited speakers to present on ideas that should be reevaluated or maybe dismissed altogether. His technical committee, Psychological and Physiological Acoustics, has a lot of overlap with our Speech Communication group. So it was no surprise that we took inspiration from his effort. We wanted to respect his main message that our scientific findings and our assumptions should be reexamined and reassessed. So Richard and I organized a similar session in the Speech Communication section of the meeting. And there was a lot of enthusiasm, and that enthusiasm grew into this special issue. Everyone said, "You know, this was really good. We all got together, we shared ideas, and gave each other a lot of advice." And people wanted to put those ideas on paper for this special issue.

 

Richard Wright  05:24

And I'll add that, at that point, we got Ben involved, because he was passionate about this as well. And he's an excellent editor, so I knew he'd be great to have along.

 

Kat Setzer  05:44

Yeah. I mean, it is such a fascinating idea. And you can think about this idea of, like, if things were being misinterpreted from the original research, then it could snowball and cause future research problems.

 

Matthew Winn  05:57

Yeah, definitely. 

 

Kat Setzer  05:58

So in your introduction to the special issue, you mentioned that the issue really focuses on two categories of topics, the first being common ideas from speech research. Were there any commonly held beliefs in speech research that you were surprised to see reconsidered? Or any that you were surprised that didn't get called into question? 

 

Richard Wright  06:15

This is Richard, I'll answer that one. There are many ideas that we didn't get to; there are many, many ideas. Because in science, you know, ideas have to be reconsidered regularly, to make sure they still make sense given what we know now, and to make sure that they're being interpreted correctly. So the only surprising thing is, we didn't get as many as we could. But if we had, we'd probably still be editing the issue. So it's probably good that we didn't get everything. But we weren't really surprised by anything we got, because we sort of knew what people had presented at the conference and what people were passionate about. One thing we were watching for, though: this could have turned into something negative, and we really wanted to avoid that. And we were really happy to see that none of the papers were, you know, attacking ideas and spending time criticizing people. Rather, they spent time embracing the value of the original work and offering some new considerations on how to expand and appreciate what we know. So that was-- it's not surprising, but it was a happy result of this.

 

Kat Setzer  07:24

Yeah, that sounds very constructive. That's great.

 

Benjamin Tucker  07:26

Well, I mean, you mentioned that they fell into two categories. It wasn't something that we had planned to happen; we looked at all the submissions afterwards and noticed that there seemed to be two categories, right? Those two categories were methodological issues, kind of how we do the research, and then those actual ideas and how they get passed along from generation to generation of researchers in the research papers. So you talked about the snowball effect, and that's it: those ideas kind of snowball and just carry on from generation to generation.

 

Kat Setzer  08:08

So some articles called certain foundational experimental findings into question. Can you talk about what the research in the special issue showed with regards to these findings?

 

Richard Wright  08:17

Yeah, this is Richard again. I'll start with a very well-known finding called the McGurk Effect. It's an auditory illusion that was first described in McGurk and MacDonald's 1976 paper in Nature, "Hearing lips and seeing voices." It wasn't the first time people had found that visual signals were important to speech perception. That work goes back to some of the work that was done in the military during World War II and the work of Sumby and Pollack in the 1950s. But what McGurk and MacDonald found was that if you combine two channels of information, cues from the visual channel and cues from the auditory channel, in ways that disagree, the result is often (but not always, and not in everybody) this illusion that you hear a third and different percept that's neither the visual nor the auditory one. So for example, in their work, they combined the auditory stimulus "ba" with the visual stimulus "ga," so the lips weren't closing, the mouth was open, and many of the listeners perceived that as "da." So they had a bilabial auditory stimulus and a velar visual stimulus, and you ended up with an alveolar percept. It's very striking, and it's really fun, and it can be done under lots of conditions. And so part of its popularity is that it is really easy to use; it's very easy to create the stimulus. You just need a recording of somebody saying something with a sound that disagrees with the visual sound. You need a visual channel, but they don't have to be tightly timed; they can be off by hundreds of milliseconds. So it doesn't require a lot of precision in editing and creating the stimuli. So one reason it's popular is anybody can do it, right? Anybody with a camera, basically, can do that. So it has been done a lot. It has been done in lots of languages, and it has been done under lots of different conditions. And it replicates most of the time.
It's also striking; one of the things that people love about it is that it just really shocks you the first time. You perceive, let's say, an alveolar, and then you close your eyes and you hear that it's a bilabial. And it's robust enough that, for example, you can have a woman's face saying "ga" and a man's voice saying "ba," and you hear a man's voice saying "da." So there are all sorts of cool add-ons to it. It even has a verb, right? The verb is "McGurking." And because people McGurk at different rates, it was thought for quite a while to be a measure of auditory and visual integration in speech perception. There are a couple of problems with this; recent research has highlighted them, but some of the problems have been known for a long time. One problem is that it's detached from anything that would happen naturally in speech perception. You'd never get a face saying one sound and that same face producing a different sound auditorily. And so it's an illusion that doesn't connect with what we sort of think of as, you know, everyday processes, the kinds that Ben and Matt and I are interested in, and a lot of people are interested in. And because people do it at different rates-- they have different susceptibilities to the illusion-- it was thought that this could be a measure of integration. And recent research has highlighted that it is not at all correlated with a person's ability, or with a person's performance, in auditory and visual speech perception. And the paper by Van Engen et al. really highlights this in a great way. And there's a great quote from them, if I may.
They say, "Although the McGurk Effect is a fascinating illusion, truly understanding the combined use of auditory and visual information during speech perception requires tasks that more closely resemble everyday communication, namely, words, sentences, narratives with congruent auditory and visual cues." And I thought it was a great paper; it highlights what's interesting about the illusion, but it also says we need to go beyond it as scientists to truly understand what's going on.

 

Kat Setzer  12:40

Okay, so basically, it just doesn't really apply to real life, or, like, how we actually speak and interpret speech.

 

Richard Wright  12:46

Yeah, and because susceptibility to it is variable: some people don't do it as much, it turns out kids are different from adults, and context has an influence on it. So if the word makes more sense in a particular context, it's more likely to happen. You know, if the word begins with a D, and it's highly predictable in a sentence context, you're more likely to hear a D there. So it's not something going on at a low level, like originally thought. It's being integrated much higher up in the system. And for that reason, it's a very complex process that doesn't really make contact with the things people are trying to use it for. It's cool, and I still use it to teach. And it's still fun to teach, and that makes the effect seem really strong, but it isn't useful in the way that it has become used. And that was Van Engen et al.'s point in their paper.

 

Matthew Winn  13:45

This is Matt again. Yeah, there were some other really foundational things called into question in the special issue. And there's a concept in speech perception that's had a very firm grip on our field for at least 60 years or so that's called categorical perception. And this is probably the concept from our field that spread the farthest outside the field into things like psychology and neuroscience. And you can find this concept in classrooms and in published papers even today. And there are a few different ways of explaining what this is, but you can think of it through different analogies, right? So if you take a range of sounds that change very gradually, but then the perception of those sounds does not change gradually, we seem to ignore the little gradual changes until they cross a special threshold. And then boom, we notice it finally, as a full-on category change. And by category, I mean, for example, like a D sound or a T sound. So we can always tell the difference between a D and a T, but we ignore differences between different Ds and we ignore the differences between different Ts, or so we've been told.  So there's this analogy I like to use about citizenship. So even though I'm in Minnesota and I live very close to the border with Canada, my status or my category is the same as someone from Arizona. So, you know, we know that Minnesota and Arizona are different places, right. But from the perspective of the border agent, there's no difference between us. It's just the US either way, that's the category. So it doesn't matter how far you are from the border, you're perceiving someone's citizenship categorically, ignoring those variations.  So bringing that back to speech, the idea of categorical perception is that we care about whether a sound is a D or a T. But we don't care about the differences between all the different instances of D. 
And this idea has a lot of attraction; it seems to solve this issue of how everyone has a different voice, and how you ignore all the little differences between, you know, when you hear me over the phone versus in one room or a reverberant room or outside. It seems to be a very efficient system for getting around all of that complication, because you wouldn't want to be confused by all those slight changes. And you know, our lived experience is consistent with this. We don't go around thinking, "Well, that sound was halfway between the D and the T." We seem to be decisive about what we hear. And you can set up an experiment that strongly suggests this is how we perceive sounds: categorically. However, this idea does turn out to have a lot of weaknesses, which show up twice in this special issue, with one large commentary by Bob McMurray, and then another paper, about how we can test in a better way, by Keith Apfelbaum. And the main evidence that they're bringing to the table is that we know we don't ignore those tiny differences in sounds, between talkers especially. If we did, then we wouldn't notice any difference in dialects, which of course we do. And also, people who work as speech language pathologists, for example, perceive and keep track of very subtle changes in a child's articulation as they develop. And this requires an ability to perceive those small differences, which is not what you'd get if categorical perception were true. So Bob points out, as does Ben Munson very passionately, that the contributions of those fields like sociophonetics, dialects, and speech language pathology-- those fields have a lot of wisdom to offer. But they've traditionally been undervalued and overshadowed by other fields that have maybe more power and more clout.
But we don't have to look very far: there's evidence right here in the so-called mainstream field of speech perception that supports the idea of perception being continuous rather than categorical. We do notice that Minnesota is different from Arizona. And we do notice that "thu" is different from "tha." These are qualities about a talker that we remember, and we use them to understand what they're saying and predict what they're about to say. And so Bob's paper reviews that evidence really thoroughly, and shows that we gain a lot by, you know, dispelling the myth of categorical perception and embracing all the complexity that's really there.
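To make the classical account concrete for readers: in the strict categorical-perception view, listeners only have access to labels, so discrimination of two continuum steps can be predicted entirely from how those steps are labeled. A minimal Python sketch of that prediction, with a hypothetical nine-step /d/-/t/ continuum and made-up boundary and slope values (the continuous-perception evidence discussed above is exactly what this simple model fails to capture):

```python
import math

def identification_prob(step, boundary=5.0, slope=4.0):
    """P(labeling a continuum step as /t/), modeled as a steep logistic.
    The boundary and slope here are hypothetical illustration values."""
    return 1.0 / (1.0 + math.exp(-slope * (step - boundary)))

def predicted_discrimination(step_a, step_b):
    """Strictly categorical prediction: two steps are discriminable
    only insofar as they receive different labels."""
    pa, pb = identification_prob(step_a), identification_prob(step_b)
    return pa * (1 - pb) + pb * (1 - pa)

# Within-category pairs are predicted to be nearly indiscriminable...
print(round(predicted_discrimination(1, 2), 3))  # → 0.0
# ...while pairs straddling the boundary are predicted to be easy.
print(round(predicted_discrimination(4, 6), 3))  # → 0.965
```

Listeners' actual above-chance discrimination of within-category pairs (different Ds, different talkers) is the kind of result that argues against this model.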

 

Richard Wright  18:09

This is Richard. Another point I like to add, when I'm talking about this with my students, is that if you know who's talking, you perceive things slightly differently than if you don't know who's talking. So if you pick the phone up and your brother says, "Hey," you know it's your brother from the subtle differences in his pronunciation from everyone else's. And if, in perception and perceptual processes, we ignored subtle, tiny, really small differences, then we wouldn't be able to tell it was our brother on the phone. We wouldn't be able to tell that it was a man on the phone. We wouldn't be able to tell that it was a friendly "Hi," you know. There's a whole bunch of information there. And it's not because you can see him, and it's not because you know the context, because you just picked the phone up-- there isn't a context yet. But there's a friendly voice, a man who sounds just like your brother, and odds are you're going to perceive that in a different way than if a stranger says "Hey" when you pick up the phone. That would be a very strange thing, and you wouldn't perceive it in exactly the same way.

 

Kat Setzer  19:19

Right. That makes sense.

 

Benjamin Tucker  19:20

There's another example from kind of the word recognition literature in the special issue, where there's this idea that when we identify or comprehend words as we're listening to them, there's this kind of competition going on. So you hear part of a word, and all the words that are like it are competing, and you have to pick, right, as you're listening, which word you think it is. So if you hear the word cat-- if you've just heard the "ca" part-- it could be cat, or it could be cad, or it could be something else. So there's this competition happening, and we as listeners have to figure that out. And one of the ways researchers have tried to quantify this is with a notion called neighborhood density. What they do is calculate how many words are like our word "cat," using a notion called edit distance, which just means that you can change one sound, add a sound, or delete a sound to get different possible words. So with cat, you could have cats, or at, or bat, or kit. Those would all be neighbors of the word cat. And when you have words that have lots of neighbors, there's more competition, and maybe it's harder to recognize that word. But if you have a word that has very few neighbors, then there's probably less competition. The problem with this way of quantifying competition is that it's really just based on how we write the sounds; it's not based on what we hear or the acoustic characteristics of the sounds. Now, this idea of neighborhood density was developed in the 80s. And the original person who wrote about it, in his dissertation, pointed out that this was not the best way to do it, but it was what he had to go with at the time. But that notion has kind of perseverated. Everybody who's doing these kinds of word recognition studies uses neighborhood density and really doesn't question it.
One of the papers in the special issue actually questions neighborhood density and says, "Well, can we do this differently? Can we do this the way that was originally intended?" So what they do in that paper is attempt to calculate acoustic distance. You take a word like cat, and you take a word like cad, and you look at the acoustic characteristics, using computational techniques, to measure the distance acoustically between those two words. And then you can check to see if we get that same kind of competition. So if a word's competitors are all really distant from it acoustically, then maybe I should recognize it faster; but if it has lots of things that are close to it, then that can lead to more competition and slower identification or recognition of the word.
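The string-based neighborhood computation Ben describes can be sketched in a few lines of Python. The word list here is a toy example; real studies use a full phonemically transcribed lexicon:

```python
def one_edit_neighbors(word, lexicon):
    """Words in `lexicon` reachable from `word` by one substitution,
    insertion, or deletion of a symbol (letter or phoneme)."""
    candidates = set()
    alphabet = {ch for w in lexicon for ch in w}
    for i in range(len(word)):
        candidates.add(word[:i] + word[i+1:])            # deletion
        for ch in alphabet:
            candidates.add(word[:i] + ch + word[i+1:])   # substitution
    for i in range(len(word) + 1):
        for ch in alphabet:
            candidates.add(word[:i] + ch + word[i:])     # insertion
    candidates.discard(word)
    return candidates & set(lexicon)

lexicon = {"cat", "cats", "cad", "bat", "kit", "at", "dog"}
print(sorted(one_edit_neighbors("cat", lexicon)))  # → ['at', 'bat', 'cad', 'cats']
```

Note that "kit" is missing from the output: in spelling it is two edits from "cat," even though phonemically (/kæt/ to /kɪt/) it is one substitution. That sensitivity to how symbols are written, rather than to what the sounds are, is exactly the limitation the acoustic-distance approach targets.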

 

Richard Wright  22:19

Yeah, I may jump in here. This is Richard. There are some nice examples I use in talking about this. If you're using phoneme distance, that means all phonemes are equidistant from each other, and we know that they're not. So for example, with Ben's example "cat": a neighbor of cat is sat, and a neighbor of cat is gnat, and a neighbor of cat is Matt. And gnat and Matt are enough like each other that on a recording they're hard to tell apart, whereas sat and gnat, even though they're very similar in sort of what the tongue is doing, are wildly different. You will never confuse a "sa" with a "na" unless you have very severe hearing loss, or there's some kind of distortion going on that blocks out the acoustics. So that highlights, you know, why edit distance based on letters or phonemes is a very poor substitute for acoustic distance. Why were they using this, I guess, spelling-based version of the acoustic distance up until now?

 

Benjamin Tucker  23:29

Well, I think originally there just wasn't the computational means to do it. Right? And then it became habit, right? That's just what everybody did. We knew this thing was important, so as I do my next research project: oh, I can't forget to include this, I need to include this. And at that point, we were doing the best that we could with the abilities and technology that we had at the time. But then that kind of solidifies-- it concretizes, right, it becomes kind of concrete in the field-- and that's just what we do.

 

Kat Setzer  24:06

Okay, so it just becomes, the shortcut becomes habit, even though now you have all the means to figure this out more accurately. Okay. Got it.

 

Richard Wright  24:15

So I mean, the same, let's say, culprits involved in people using this string edit distance... they're a recurring theme in why things get used: it's convenient and easy. Previous research that's important has used it. If you don't use it, maybe people will ding you for trying something new, and you might not get your paper published, especially if you're a student just starting off. But the convenience issue, I think, is one of the driving forces. I'm an author who's guilty of using this string edit distance at the very same lab where it was developed, many years later, when I had resources available to me to do an acoustic distance. And in that paper I said this would be better done with acoustics, but it would have been a lot of work. And so the authors in this special issue did that. They did a lot of work to calculate these distances in a way that's much more plausible for human perception and much more valid from an acoustic point of view.

 

Kat Setzer  25:21

Some research looked at commonly used stimuli. What did these papers show?

 

Richard Wright  25:25

This is Richard again. One of the things that has happened in creating speech stimuli is that a paper does a really good job of measuring something, and then it becomes kind of foundational, a touch point, even if the author of that paper says, "These are values that are very specific to this group of speakers that I'm studying, and should not be used as a sort of gold standard for what vowels are." So one of these cases is this group of vowel values called the Hillenbrand vowels. They're based on a beautiful study in which Hillenbrand measured a group of speakers in southwestern Michigan. And he was actually replicating a seminal paper by Peterson and Barney, which was a methodological paper showing that you could measure formants, which are the first few resonances of the vocal tract. Those change over time, and that's what creates the vowel percept. So as our tongue moves around in our mouth, it changes the resonating qualities of our vocal tract, and as those resonances change, we perceive different vowels. And so Peterson and Barney did this beautiful study where they recorded lots and lots of people and measured different things, and it turned out that the formants, these resonances, were important for classifying vowels. And they wrote it as a methods paper: how can you do this? And it kind of ended up becoming a gold standard for what vowels are in North American English. And Hillenbrand came along many years later and said, "Well, it's been a while," and there were several issues with the way Peterson and Barney did it. So this was reconsidering Peterson and Barney, in a way that was in the spirit of our special issue. And he added some methodologies to it-- again, it was a methodological paper, and it was meant to be, "Here's how you should go about measuring vowels," not "Here's what vowels are in North America."
And he says that in the paper. One of the things he added was vowel dynamics. So he didn't just measure the midpoint of the vowel the way Peterson and Barney did; he measured sort of how the vowel resonances, or formants, change over time, and how that's important to what a vowel is. And, to be comparable to the Peterson and Barney paper, he published a set of vowel values from his group of speakers from southwestern Michigan, measured the way Peterson and Barney did, at the midpoint of the vowel. And what happens is that people, especially people who are not right in acoustic phonetics but in neighboring fields-- in speech technology, and sometimes in hearing sciences, especially-- will see that table, and they will view it as "This is what North American vowels are." Even though if you ask anybody on the street, "Do people sound different in different parts of the country?" almost everyone will say, "Yeah, more in some places than others." So if you're in, you know, southern Louisiana, and someone else is in northern Michigan, you are not expected to sound like each other. And southwestern Michigan is a very specific place, and it was a very specific time; but it was a paper about a good way to go about measuring vowels, with a set of vowels that people can use. And so what's happened-- in linguistics as well, but outside of the detailed acoustic-phonetic vowel community-- is people just grab this table of vowel values and plug it into speech synthesizers, or they measure a bunch of vowels and then compare them. Let's say they're studying speech pathology; they compare the vowels of a speaker to this vowel set. Or they use these vowels as the basis for stimuli that they create. And again, I've done this, but it was in a paper where I was trying to say you can't do this.
So what happened is they became values for stimuli, sort of a gold standard of stimuli for "North American English." And you wouldn't use Japanese for English. And you probably wouldn't use speakers from a more pronounced dialect, sort of Glaswegian English, for people from Michigan. And you probably wouldn't even use people from southern Louisiana as stimuli for people from Michigan, because they'll behave differently than the native speakers of that particular dialect. And the other thing to remember is that language changes over time, and that study was done a while ago. And what constitutes, you know, the vowel that's in a particular word in one place and at one time is not the same as that same word later; it's written as the same vowel, but it'll have subtle or not-so-subtle differences. And so what we sort of think is that the Hillenbrand vowels are great if you're making stimuli for people from southwestern Michigan in the early 80s, but they're not great for people from California in the 2010s, 2020s, and in the future. So you should measure the vowels locally-- base your stimuli on the vowels that your speakers actually use-- if you're trying to measure things like accuracy in identification, and not measure how good people are at adapting to a different dialect.

 

Matthew Winn  30:52

Yeah, I mean, to add something and just to reiterate: the issue that we're taking up in this paper is not with the study that Hillenbrand did, right? To echo what Richard was saying, the issue is not with the study; it's with what people have done with the work, you know, maybe oversimplifying it or twisting it to serve their purpose, specifically treating it as the gold standard description of what American English vowels are, because the values are, in fact, from a particular dialect at a particular point in time. And to say that that's the gold standard should sound absurd to anyone. But it was a gold standard of how to measure the vowels. It actually didn't just replicate the Peterson and Barney study; it added a lot of rigor, and a lot of descriptions of methods that people really should use. But, you know, this idea that we raised in our commentary was that a huge point of that study has been ignored and should be more prominent. There's this beautiful table of vowel formant measurements, sort of single-number values, and that table is very attractive; it looks like, "Oh, here's the values," right? But when you read the paper, it cautions very strongly against using that table as the end product of the study. Instead, they wrote that if we really want to understand vowels, we have to look beyond a single measurement and see how the formants change over the course of the vowel. And that's something that, you know, if you want to make the vowel sound natural and English-like, you have to include those dynamic changes. And so much of the literature has, you know, read only up until that table, and then ignored the warning that comes after it, ignored the dynamics being a big piece of the original message of the paper.
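The difference between the two readings of the paper can be made concrete: the "table" approach keeps one number per formant (the midpoint), while the dynamic approach samples the formant track at several proportional time points. A minimal Python sketch with made-up F2 values for a diphthongized vowel (these are illustrative numbers, not Hillenbrand's actual measurements):

```python
def sample_formant_track(track, proportions=(0.2, 0.5, 0.8)):
    """Sample a formant track (a list of Hz values across the vowel's
    duration) at several proportional time points, not just the midpoint."""
    n = len(track)
    return [track[min(int(p * n), n - 1)] for p in proportions]

# Hypothetical F2 track (Hz) rising over the course of the vowel.
f2_track = [1800, 1850, 1950, 2100, 2250, 2350, 2400]

print(sample_formant_track(f2_track, proportions=(0.5,)))  # → [2100]
print(sample_formant_track(f2_track))  # → [1850, 2100, 2350]
```

The midpoint-only value hides the rising trajectory entirely; the multi-point sample retains the dynamics that the paper argued are a big part of what makes a vowel sound natural.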

 

Richard Wright  32:44

And again, it's convenient. And because other people did it this way, now everybody's doing it this way. And that's the sort of thing that we were watching out for in this issue. That's the thing we wanted to highlight.

 

Kat Setzer  32:58

Right? That makes sense. So others investigated assumptions surrounding speech and how we quantify it. Can you talk about this a bit?

 

Benjamin Tucker  33:07

Sure. This is Ben. And I think this will kind of dwell on the theme of, you know, how we use things. And sometimes we use things out of convenience, right? So one common measure is acoustic duration. So we measure how long a sound is, a speech sound, right? So do I produce my "m" sound as a short "m" or a long "mmmm," right? One is longer in duration than the other. We use this measure for a lot of things. And we use it all the time. And one of the reasons we use it is because it's convenient. But the problem is that there's a lot of other acoustic information in the signal that isn't just duration. And so we make assumptions about speech based on this one acoustic measure because it's convenient. And then we kind of don't get into all the other kinds of acoustic measures, right? The Hillenbrand data was all about resonance frequencies for vowels, and those are really important as well. And there are other acoustic measures. And one example of this comes from one of the papers in the special issue, right? Using certain acoustic measures can become habitual, right? We measure certain sounds in a particular way, and this is the way we do it. And that momentum can be really difficult to break. And part of that is just because it's what we know, and it's what we're familiar with, and/or we're trying to replicate previous studies, right? This is what they did, and they probably know more than I do, so I'm going to do what they did. And, you know, maybe a better way to think of it is: I want what I do to be relatable to what people have done previously, so I'm going to use that same measure or method. But they made choices and assumptions about their data, and I'm adopting those whether or not I actually know what those assumptions are, just because I'm making this decision to do what they did.
So one of the articles in the special issue, by Christine Shadle, actually talks about something that we use to measure fricatives, so, sounds like "sss" and "sh." These measures are called spectral moments.

 

Kat Setzer  35:18

Can you explain what a fricative is? 

 

Benjamin Tucker  35:20

Yeah, a fricative is like a "sss" sound or a "sh" sound. So when we produce a fricative sound, we're creating noise in our vocal tract. So we make kind of a tight constriction, and we force air through it... it's like the sound when you poke a hole in your tire, and you can hear the air rushing out. We do that with speech, so the "ssss" or the "shhh" sound, right, you have both of those. It's kind of that noisy sound. And so the term we use for that in speech is fricative.

 

Kat Setzer  35:53

And then what's a spectral moment?

 

Benjamin Tucker  35:55

A spectral moment... I think that the easiest way to kind of conceptualize a spectral moment is, if you listen to "ssss" and "shhhh," you'll notice that one sounds like it has a higher pitch than the other. And that's one of the spectral moments, right? Basically, the spectral moments are just ways in which we can analyze the spectrum of the sound. And we can take different information from that spectrum. And for me, the one that I would teach most often is this kind of idea that when you hear a "sss" and a "sh," one sounds like it has a higher pitch than the other one.
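(For the curious, here is a minimal illustrative sketch, in Python with NumPy, of how the first two spectral moments Ben describes might be computed from a recorded fricative. The function name and the signal setup are our own example for the show notes, not code from Shadle's paper.)

```python
import numpy as np

def spectral_moments(signal, sample_rate):
    """First two spectral moments of a sound: the spectral centroid
    ("center of gravity," in Hz) and the variance around it."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    power = spectrum ** 2
    weights = power / power.sum()  # treat the power spectrum as a distribution
    centroid = np.sum(freqs * weights)                    # 1st moment
    variance = np.sum(weights * (freqs - centroid) ** 2)  # 2nd moment: spread
    return centroid, variance
```

A fricative like "sss," with more of its energy at high frequencies, yields a higher centroid than "shhh," which is the "higher pitch" impression described above.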

 

Kat Setzer  36:40

Got it. 

 

Benjamin Tucker  36:41

So Christine Shadle, she's been using spectral moments for most of her career, in the research she's done on fricatives. But she suggests some alternatives to this kind of standard measure of fricatives, right? When I think about fricatives, when I think about how I'm going to measure them, this idea of spectral moments comes up. That's just the first thing I think of when I think about measuring them.

 

Matthew Winn  37:07

Yeah, I think that we're noticing, like, that theme is coming up a lot, where there's a lot of complexity in measuring the acoustics of speech, and it can be easy to say, "Well, you know, this is an easy measurement that the computer can give me," and that might not be what I want, it might not actually be capturing everything I need, but because it's so easy to measure, it gets used very often. And then people sort of think of it as the standard. And so I think calling out that pattern, not in a moralistic way, but just saying, "Oh, as it turns out, this doesn't have to be... You know, we can do more. There are other..." You know, I think we've touched on a couple of different, like, as you mentioned, the two main sections of the issue, the theoretical concepts and also the methodology. Most of the papers have some implications for both of these things, for how we think about what speech communication is, but also how we actually conduct the research. And Tim Beechey wrote this excellent paper that pointed out that the way we test speech recognition really ignores a lot of the things that are perhaps hard to quantify, but are very meaningful. So for example, when you take a pause right before you're about to say something very important, you know, perception of that pause is very important. And it's not enough just to know what words someone said, you know; when you realize that someone's trying to speak between the lines, so to speak, those are important aspects of communication. And he points out that to really understand how communication works, we have to pay attention to that. And of course, everyone uses this differently. And one of the hallmarks of our field is how the acoustics of speech are so variable in so many ways. And so we want to learn how a listener handles those many layers of variability.
Every talker sounds different than the last, and even a single person sounds different depending on the time of day, the room they're in, the mood they're in, and so on. So Alexandra Kapadia dug into this, and in her paper in this issue, she tried to separate the effect of variability between talkers from the variability within a single talker's voice, right? So I sound different than you, but also I can make my voice sound different, for example, when I'm teaching versus when I'm chatting with my friends. So the study found that when you're dealing with variability across different voices, you are both less accurate and slower at recognizing words, but when the variability is just within one talker's voice, you're slower but you don't lose accuracy. Right? So this is adding something new; this wasn't just a commentary or an opinion piece. This was new original work that took this old established idea of variability but showed it has more details to appreciate.

 

Kat Setzer  40:04

So there were-- you kind of mentioned this already-- there were also some methodological assumptions that were challenged. What were these?

 

Matthew Winn  40:11

Well, I think we touched on some of the methodology with Richard's description of the vowels and some of our commentary on categorical perception. I wonder if we want to move to some of the, you know, some of the things that we accept as facts that might be reconsidered here.

 

Richard Wright  40:30

So one of the assumptions, one of the pieces of received wisdom in the linguistic literature especially, is about how many sounds a language has that are contrastive, where a contrast of sounds changes meanings. So for example, English has quite a few vowels; it has e, eh, a, ah, aa... that's a lot already. And then it's got all the back vowels, whereas many languages have five: they have some kind of an E, some kind of an A, some kind of an AH, some kind of an O, and some kind of an OO. And that makes the vowel space sparser, right? And so they're not as crowded. And there's this sort of received wisdom in the linguistic phonetic literature, in the phonological literature, that languages with larger inventories, with more sounds in them that change meanings of words, should show less variation than languages with small inventories. So for example, the five-vowel systems can vary more, because they're not going to run into each other acoustically or perceptually. And one of the papers in this issue by Hauser demonstrated really nicely that you can't just make this assumption about a language. You can't just say, "Oh, this language has five vowels, so that's going to vary more than a language like Swedish, which has many, many more vowels." And she shows that languages with large phonemic inventories still have-- this touches on the point that Matt was making and that Ben touched on about variability-- that it doesn't matter if you have lots of sounds or a few sounds, you still have variability. And the variability has to be considered in the context of what makes something variable. So if you only record words in isolation, and you don't take into account how they change in conversations, the way Matt was talking about, when you're teaching versus when you're having a conversation with a friend versus when you're having a conversation with a stranger... you're not really capturing variation.
And Matt touched on this from a perceptual point of view, but this notion that somehow languages will be less variable if they have more crowded systems has not been established, but it's been accepted. And so Hauser did a really, really nice job of talking about it and highlighting that languages have variability, even if they have large inventories.

 

Benjamin Tucker  42:53

So another example comes from another study, which looks at this thing called the iambic-trochaic law, okay? And what this law... I'll explain the law in just a minute, but what it's trying to do is figure out how we parse out words from connected speech, right? I'm producing a stream of connected speech, and if you were to look at a recording of it, you wouldn't see pauses in there. Speech. Doesn't. Sound. Like. This. When. We... Right? We don't put pauses between each word; we use information in the signal to figure out where those word boundaries are, and to figure out where the words are. So this iambic-trochaic law is really all about how we use the signal and kind of the prominence, right, the rise and the fall of certain syllables, to figure out where those word boundaries are. And so this study takes this idea of the iambic-trochaic law and where the prominence is in the signal. And they actually test this with how listeners kind of parse it, and they provide evidence that in a way supports this idea of the iambic-trochaic law, but it also challenges it as well, right? And this is one of the fun things about this approach and this paper, and really, I think this is true of all of the papers, right? There are kernels of very important truth in all of the research that we're reconsidering, and it's figuring out what parts of that are important and useful, and how we can expand on that truth to better understand the way the world works around us, in this case speech, or the iambic-trochaic law if we're more specific.

 

Kat Setzer  44:48

So why do these ideas stick around if they're problematic?

 

Richard Wright  44:52

This is Richard. I guess I'll take that, although this is sort of summarizing some points that we've been making throughout the interview. One reason is that humans are attracted to simple stories. The easier the story is to tell, in scientific writing, the easier it is to write down, the more attractive it is. This is sometimes falsely attributed to Occam's Razor in science. But Occam's Razor, you know, is not just "tell the simplest story regardless of how many free parameters you're ignoring." So we're attracted to the simple story, you know, and then that becomes sort of something our students can understand. Sometimes it's related to teaching. And there are a few basic factors that we know are at play, and maybe some other mysterious ones. So we're sort of ignoring lots of the parameters that might be at work here. And that last bit that Ben was talking about, the iambic-trochaic law, highlights this really nicely. And the compelling stories... if it's compelling to us, we're more likely to believe it. And in some ways this is, you know, confirmation bias, but in other ways, it's just, if it makes sense, if it's sort of intuitive, then it's easier for people to grab it, and they kind of forget some of the other details in there. And if you start actually reviewing all the caveats and exceptions of what it is that you're telling in your story, you kind of lose the reader or listener. And it's also less likely to be introduced in a class. So the example that Matt was talking about with categorical perception, it's really easy to tell that story, to say, "There's this sound, and then there's this sound, and you ignore all the variation within and you just perceive these two sides categorically." It's ignoring all of the subtleties.
And if you start drilling down, especially if you're trying to teach this as one unit in, like, a 10-week class, and you've got to cover a lot of stuff, it's easier to tell the simple story. So it can sometimes find its way in through education. But sometimes it's in a paper. If you question these sorts of stories that are intuitive and easy to tell, reviewers will sometimes push back, and they'll be like, "Oh, well, you know, we all know this is this thing." And a lot of us have a habit of letting time go by, you know, since the last time we read a paper, and I'm as guilty of this as the next person. And so when we remember what was in that paper that we're going to cite and blame for an idea, let's say in a paper we're writing, we sometimes don't remember all of the subtle details. The Hillenbrand paper, or rather the Hillenbrand section of that paper, was a nice illustration of this: people remember the takeaway that's easy to tell, but they forget all of the detailed bits that Hillenbrand was referring to. The other thing is that when something confirms something that we are researching, it's easy for us to ignore things that don't agree with what we're finding. And that's the confirmation bias. And finally, sometimes great ideas just get lost. There was a really nice paper. This was, it was Doug Whalen, right? Am I remembering that correctly? Yeah, by Doug Whalen, about a paper by Dennis Klatt, who is sort of the father of modern synthesis and a very influential figure in speech acoustics and speech synthesis. Klatt wrote a really nice paper, and Doug basically said, "This paper has been forgotten." And I think that, you know, sometimes ideas don't get reconsidered because we don't remember that they're out there. So that's another source: we forget. Or it just was published in a time when people weren't ready for it, or it got lost in obscurity.

 

Kat Setzer  48:59

Interesting. Yeah. 

 

Richard Wright  49:00

So that's my two cents on that.

 

Kat Setzer  49:03

I like it. So after editing the special issue, do you have any takeaways for researchers when it comes to thinking about seminal ideas?

 

Benjamin Tucker  49:10

I'm going to kind of take an education perspective, or teaching perspective, since all three of us are teachers, and hopefully this will influence the way we teach. But you know, many of the things that have come up are really things that we teach in class often, right? So I've talked about spectral moments; that's something that I talk to my students about when I'm teaching about acoustics. Or Matt talked about categorical perception; I spend a substantial amount of time in a speech perception class talking about categorical perception because a lot of research used that method. What's nice about this special issue is, right, it gives us an opportunity to integrate it into our teaching, and the opportunity for students to critically think about some of these things instead of just me saying, "Well, here's the thing that you just need to know. So, you know, learn it and memorize these facts." Well, you know, it's not just fact; there are things that we can question, or things that we can try and study and better understand why these things are happening within the context in which they happened, or, you know, question: okay, categorical perception was a thing, but what are some of the problems with it? And why would we want to be careful with that? And it's possible that one of the reasons that some of these, you know, ideas have endured for so many years is because they keep getting taught, right? I'm going to teach the same things that I was taught when I was a student. And that becomes cyclical, right, we just kind of pass on that information. And we want to be careful with that. And encouraging kind of this critical thinking, with articles from this special issue or articles from other places, I think is really important, because then the students learn to do what Matt and Richard set out to do from the very beginning, which is to really reconsider and critically think about those things.
I think it also happens that occasionally, or maybe more often than we'd like, we teach things that are out of our comfort zone. Right? That's not the little domain of my niche area of expertise. It's related, and I have expertise, and I can teach about that. But I rely on other people's material for that, or just rely on kind of what is common knowledge. And so then I perpetuate that in my teaching, and so hopefully the special issue can be a resource for that and kind of expanding that. And really, it's an opportunity for us to teach our students that the scientific process is a process; it's an ongoing, living, continuing process. It's not something that is stagnant. There isn't a set of facts that you have to memorize. That's why we do research, right? We're trying to discover things, and it's an ongoing process when we do that.

 

Matthew Winn  52:03

Yeah, and to reinforce that too, just even sticking with the idea of being a teacher, I think you gain a lot of rapport with students by being a living example of that. And by showing, like, you know, you will run into this idea over and over again. But here's, you know, the secret about it, right? And I think you can gain a lot of respect from someone who says, "Oh, like, you're really thinking about this, I can trust what you're saying." And sharing that secret is not a mean-spirited thing, right, to point out that something is wrong. And just to put a button on that issue, to, you know, circle back to how we opened the discussion, just revisiting the idea that the spirit of this issue is not to vilify or shame anybody who published the work that we're reconsidering. The idea is to recognize, you know, we're all humans, and humans have attachments to personalities, and to ideas and stories, especially if they're central to their academic training, even though, you know, those stories might not always be true. And each of us has no doubt published some things that will be critiqued in the future, and will be shown to be an incomplete picture of whatever it is we're trying to understand. And the goal of this issue is to show that we don't have to be threatened by that, because, you know, we can embrace it, we can be open-minded and be able to change our minds. And that will lead to progress, right? So we hope that this is going to be the first of several issues of this kind, because there are going to be more ideas, right? We didn't capture all of them. And there are going to be more ideas that need to be examined and reconsidered. And I hope that we would welcome that kind of criticism and not, you know, be so attached to our output as our value as people or as scientists.
So you know, hopefully, the special issue will be a good example of that, how to do it collegially and respectfully, and maybe even with a little sense of humor.

 

Kat Setzer  53:56

It's kind of like workshopping the common knowledge, in a way.

 

Matthew Winn  53:59

Yeah. 

 

Kat Setzer  54:00

You're working together. Well, thank you again for taking the time to speak with me today. And I have to say, when this special issue was in the works, I thought it would make such an interesting discussion for the podcast. So this idea of reviewing ideas and methods to make sure everything is scientifically sound is a practice I'm sure many researchers can stand by. And it's interesting to hear how our understanding of speech continues to evolve. And I do hope we get more special issues on this because it is super interesting.

 

Matthew Winn  54:25

Yeah. And I mean, maybe the thing to add is that we've mentioned a few of the papers in this issue, but there were a lot more that we didn't even have time to mention. And that's not to say that they weren't as good. But there's a lot of value in this special issue. So I hope that people check out all the different papers in it.

 

Kat Setzer  54:41

Yeah, we'll have a link in the show notes for our listeners, so they can go straight to the page and, you know, look at all the articles that are available. 

 

Matthew Winn  54:50

Great. 

 

Kat Setzer  54:51

Yeah. Thank you all. I really appreciate it. Have a good day.

 

Matthew Winn  54:55

Thanks. Yeah.

 

Benjamin Tucker  54:56

Thank you, Kat.

 

Kat Setzer  55:00

Thank you for tuning into Across Acoustics. If you'd like to hear more interviews from our authors about their research, please subscribe and find us on your preferred podcast platform.