Robert Nelson
  • Minato, Tokyo
Why Adult Language Learning is Harder: A Computational Model of the Consequences of Cultural Selection for Learnability. Robert N. Nelson Jr. (rnnelson@purdue.edu), Department of English, Purdue University, West Lafayette, IN 47906. Abstract: This paper reports on a limited model of language evolution that incorporates transmission noise and errorful learning as sources of variation. The model illustrates how the adaptation of language to the statistical learning mechanisms of infants may be a factor in the apparent ceiling on adult second language achievement. The model is limited in its focus to phonotactics because the probabilistic imbalances found in phonotactics have been shown to be effective cues in the very first language learning task, speech segmentation (Saffran & Thiessen, 2003; Mattys & Jusczyk, 2001), and in the organization of lexical memory (Vitevitch, Luce, Pisoni & Auer, 1999). The argument that this model supports is that these probabilistic imba...
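The segmentation cue the abstract refers to can be sketched in a few lines. In a Saffran-style artificial stream, transitional probabilities between syllables are high inside words and dip at word boundaries, so thresholding the forward TP recovers the lexicon. This is an illustrative sketch of the statistical-learning cue only, not the paper's evolutionary model; the syllable inventory, the 0.5 threshold, and the stream length are arbitrary choices.

```python
import random
from collections import Counter

def transitional_probs(stream):
    """TP(b|a) = count(a followed by b) / count(a) over a syllable stream."""
    pairs = Counter(zip(stream, stream[1:]))
    firsts = Counter(stream[:-1])
    return {(a, b): c / firsts[a] for (a, b), c in pairs.items()}

def segment(stream, tps, threshold=0.5):
    """Posit a word boundary wherever the forward TP dips below threshold."""
    words, current = [], [stream[0]]
    for a, b in zip(stream, stream[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Artificial stream in the style of Saffran et al.: three made-up 'words'
# concatenated at random, so within-word TPs are 1.0 and boundary TPs ~1/3.
random.seed(0)
lexicon = [("tu", "pi", "ro"), ("go", "la", "bu"), ("da", "ko", "ti")]
stream = [syl for w in random.choices(lexicon, k=60) for syl in w]
tps = transitional_probs(stream)
print(Counter(segment(stream, tps)).most_common(3))
```

With uniform word selection, the three lexical items dominate the recovered segments; merged chunks appear only when a boundary TP happens to exceed the threshold by chance.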
This short paper discusses shortcomings of the capture-recapture (CR) method of estimating vocabulary size (Meara & Olmos Alcoy, 2010; Williams, Segalowitz & Leclair, 2014). When sampling from a population generated by a power-law process (e.g., a Zipf distribution), the probability that any given member is selected depends on its frequency rank, such that high-rank members (e.g., the 1st, 2nd, or 3rd most frequent) are much more likely to be selected than low-rank members (e.g., the 100th or 1000th). Because of this, repeated samples tend to draw from the same limited group of high-frequency words. The CR measure, however, assumes a uniform distribution, and so drastically underestimates vocabulary size when applied to power-law data. Work with simulated data shows ways that the degree of underestimation may be lessened, and applying these methods to real data shows effects parallel to those in the simulations.
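The underestimation is easy to reproduce in simulation. A minimal sketch, using the classic Lincoln-Petersen estimator as the CR measure (the papers cited may implement CR differently) and arbitrary vocabulary and sample sizes: two samples are drawn from the same vocabulary, once with uniform selection probabilities and once with Zipfian (p proportional to 1/rank) probabilities.

```python
import random

def lincoln_petersen(sample1, sample2):
    """Capture-recapture estimate of population size from two samples:
    N-hat = (types in 1st) * (types in 2nd) / (types in both)."""
    s1, s2 = set(sample1), set(sample2)
    overlap = len(s1 & s2)
    if overlap == 0:
        return float("inf")
    return len(s1) * len(s2) / overlap

def draw(vocab_size, n_tokens, zipf=False):
    """Draw n_tokens word tokens; uniformly or Zipf-like (p ~ 1/rank)."""
    if zipf:
        weights = [1.0 / r for r in range(1, vocab_size + 1)]
    else:
        weights = [1.0] * vocab_size
    return random.choices(range(vocab_size), weights=weights, k=n_tokens)

random.seed(1)
TRUE_VOCAB = 1000  # arbitrary 'true' vocabulary size for the demo
uni = lincoln_petersen(draw(TRUE_VOCAB, 300), draw(TRUE_VOCAB, 300))
zpf = lincoln_petersen(draw(TRUE_VOCAB, 300, zipf=True),
                       draw(TRUE_VOCAB, 300, zipf=True))
print(f"true size: {TRUE_VOCAB}")
print(f"uniform sampling estimate: {uni:.0f}")  # typically near the true size
print(f"Zipf sampling estimate:    {zpf:.0f}")  # typically far below it
```

Under uniform sampling the estimator is roughly unbiased; under Zipfian sampling the overlap is inflated by the same few high-frequency types, which drags the estimate well below the true vocabulary size.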
Gries (2008, 2021) defined two dispersion measures intended to alert corpus analysts to words that have a problematically limited distribution. Gries (2010, 2022) posited that these measures may additionally be relevant to language development research, as the learnability of a pattern may be predicted by the evenness of its distribution in corpora. However, both measures work by comparing vectors of observed and expected frequencies in partitioned corpora, and this method cannot determine that a word is evenly distributed because it cannot distinguish the random noise inherent to an unbiased process from substantial non-random bias. An additional concern is raised about the 2008 measure: it is Manhattan distance scaled to the unit interval and, as such, is extremely sensitive to the number of corpus parts, because that choice sets the dimensionality of the measure space. In sum, this short analysis presents evidence that these measures should not be used to declare a pattern evenly distributed, as neither can tell the difference between statistical noise and systematic bias.
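Both concerns can be demonstrated with a small simulation. A sketch of the 2008 measure (DP: half the Manhattan distance between observed and expected proportion vectors) applied to a word scattered uniformly at random, i.e. by a perfectly unbiased process; the frequencies and part sizes below are arbitrary demo values.

```python
import random

def dp(counts_per_part, part_sizes):
    """Gries's (2008) DP: half the Manhattan distance between the observed
    and expected proportion vectors (0 would mean perfectly even)."""
    total = sum(counts_per_part)
    corpus = sum(part_sizes)
    return 0.5 * sum(abs(c / total - s / corpus)
                     for c, s in zip(counts_per_part, part_sizes))

def random_scatter(freq, n_parts, rng):
    """Drop `freq` tokens of a word into n equal parts uniformly at
    random -- an unbiased generating process with no distributional bias."""
    counts = [0] * n_parts
    for _ in range(freq):
        counts[rng.randrange(n_parts)] += 1
    return counts

rng = random.Random(7)
freq = 200
means = {}
for n_parts in (2, 10, 100):
    sizes = [10_000] * n_parts
    vals = [dp(random_scatter(freq, n_parts, rng), sizes) for _ in range(500)]
    means[n_parts] = sum(vals) / len(vals)
    print(f"{n_parts:>3} parts: mean DP = {means[n_parts]:.3f}")
```

Even though the generating process is unbiased, DP never averages zero, and its typical value climbs as the corpus is cut into more parts, which is the dimensionality sensitivity the abstract describes.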
To some extent, we seem to use language in chunks: multiple words that are co-selected and used as gestalt units. By some estimates, these chunks constitute more than 50% of a given text (Erman & Warren, 2000). The extent to which our communication is composed of these units has broad implications for linguistic theory, psycholinguistics, and applied linguistics, and so is the focus of this study, which shows that claims made regarding the nature of formulaic language (Sinclair, 1991) lead to a method for the automatic detection of holistically used multiword patterns in text corpora, which in turn allows the 'chunkiness' of linguistic corpora to be estimated. These estimates may be useful for materials development in language teaching, as well as for corpus linguistic and psycholinguistic studies.
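As a toy illustration of what frequency-based chunk detection looks like (this is a crude baseline, not the detection method developed in the study), a bigram whose forward transitional probability P(w2 | w1) is high and whose count clears a minimum is a minimal 'co-selection' signal; the thresholds below are arbitrary.

```python
from collections import Counter

def chunk_candidates(tokens, min_count=3, min_tp=0.75):
    """Flag bigrams whose forward transitional probability is high --
    a crude stand-in for 'co-selected' word pairs (illustrative only)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(tokens[:-1])
    return sorted(
        (pair for pair, c in bigrams.items()
         if c >= min_count and c / firsts[pair[0]] >= min_tp),
        key=lambda p: -bigrams[p])

text = ("of course he said of course she said of course "
        "the dog ran and the cat ran and of course they laughed").split()
print(chunk_candidates(text))  # → [('of', 'course')]
```

Here 'of' is always followed by 'course' (TP = 1.0, count 4), while pairs like 'the dog' are either too rare or too unpredictable to be flagged.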
It has long been recognized that developing measures of the internal structure of collocations is an important goal (Sinclair, 1991). Recently, Gries (2013) presented a measure that captures the asymmetric nature of conditional probabilities in collocations. This paper intends to contribute to the discussion by introducing measures of asymmetry and redundancy that may meet the needs of some researchers. Two asymmetry measures are described: the first captures only frequency asymmetry, while the second is an asymmetric version of the mutual information measure. A measure of semantic redundancy is also described; it takes a higher value when the fact that two words co-occur contains more information than the uncertainty introduced by the occurrence of the individual words.
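Without reproducing the paper's own definitions, the raw ingredients of such measures come from four corpus counts: the pair frequency, the two word frequencies, and the corpus size. Hypothetical counts for a pair like 'of course' (all numbers invented for illustration) show how the two conditional probabilities diverge while the symmetric pointwise mutual information hides that asymmetry.

```python
import math

def collocation_stats(f_xy, f_x, f_y, n):
    """Directional conditional probabilities and (symmetric) pointwise
    mutual information for a word pair, from raw corpus counts.
    Illustrative only; the paper's measures are defined in the paper."""
    p_xy, p_x, p_y = f_xy / n, f_x / n, f_y / n
    return {
        "P(y|x)": f_xy / f_x,  # how strongly x predicts y
        "P(x|y)": f_xy / f_y,  # how strongly y predicts x
        "PMI": math.log2(p_xy / (p_x * p_y)),
    }

# Hypothetical counts: x = 'of' (very frequent, unpredictive),
# y = 'course' (rarer, strongly predicts 'of' to its left).
stats = collocation_stats(f_xy=8_000, f_x=3_000_000, f_y=9_000, n=100_000_000)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these counts, P(course | of) is tiny while P(of | course) is near 0.9, yet PMI assigns the pair a single attraction score in either direction, which is exactly the limitation that motivates asymmetric measures.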
Please let me know if you find errors, see better ways to say or do the things described in the chapter, or have suggestions for content.