In Robert F. Port & T. van Gelder (Eds.)
Mind as Motion: Explorations in the Dynamics of Cognition. Cambridge, MA:
MIT Press, 1995. Pp. 195-223.
Language as a dynamical system
Jeffrey L. Elman
University of California, San Diego
Introduction
Despite considerable diversity among theories about
how humans process language, there are a number of fundamental assumptions
which are shared by most such theories. This consensus extends to the very
basic question about what counts as a cognitive process. So although many
cognitive scientists are fond of referring to the brain as a `mental organ'
(e.g., Chomsky, 1975)--implying a similarity to other organs such as the
liver or kidneys--it is also assumed that the brain is an organ with special
properties which set it apart. Brains `carry out computation' (it is argued);
they `entertain propositions'; and they `support representations'. Brains
may be organs, but they are very different than the other organs found in
the body.
Obviously, there are substantial differences between
brains and kidneys, just as there are between kidneys and hearts and the
skin. It would be silly to minimize these differences. On the other hand,
a cautionary note is also in order. The domains over which the various organs
operate are quite different, but their common biological substrate is quite
similar. The brain is indeed quite remarkable, and does some things which
are very similar to human-made symbol processors; but there are also profound
differences between the brain and digital symbol processors, and attempts
to ignore these on grounds of simplification or abstraction run the risk
of fundamentally misunderstanding the nature of neural computation (Churchland
& Sejnowski, 1992). In a larger sense, I raise the more general warning
that (as Ed Hutchins has suggested) ``cognition may not be what we think
it is''. Among other things, I will suggest in this chapter that language
(and cognition in general) may be more usefully understood as the behavior
of a dynamical system. I believe this is a view which both acknowledges
the similarity of the brain to other bodily organs and respects the evolutionary
history of the nervous system, while also acknowledging the very remarkable
properties possessed by the brain.
In the view I will outline, representations are not
abstract symbols but rather regions of state space. Rules are not operations
on symbols but rather embedded in the dynamics of the system, a dynamics
which permits movement from certain regions to others while making other
transitions difficult. Let me emphasize from the beginning that I am not
arguing that language behavior is not rule-governed. Instead, I suggest
that the rules may be different in nature than what we have conceived
them to be.
The remainder of this chapter is organized as follows.
In order to make clear how the dynamical approach (instantiated concretely
here as a connectionist network) differs from the standard approach, I begin
by summarizing some of the central characteristics of the traditional approach
to language processing. Then I shall describe a connectionist model which
embodies different operating principles from the classical approach to symbolic
computation. The results of several simulations using that architecture
are presented and discussed. Finally, I will consider some of the insights
which may be yielded by this perspective.
Grammar and the lexicon: The traditional approach
Language processing is traditionally assumed to involve
a lexicon, which is the repository of facts concerning individual words,
and a set of rules which constrain the ways those words can be combined
to form sentences. From the point of view of a listener attempting to process
spoken language, the initial problem involves taking acoustic input and
retrieving the relevant word from the lexicon. This process is often supposed
to involve separate stages of lexical access (in which contact is made with
candidate words based on partial information), and lexical recognition (or
retrieval, or selection; in which a commitment is made to a specific word),
although finer-grained distinctions may also be useful (e.g., Tyler &
Frauenfelder, 1987). Subsequent to recognition, the retrieved word must
be inserted into a data structure which will eventually correspond to a
sentence; this procedure is assumed to involve the application of rules.
As described, this scenario may seem simple, straightforward,
and not likely to be controversial. But in fact, there is considerable debate
about a number of important details. For instance:
Is the lexicon passive or active? In some models, the
lexicon is a passive data structure (Forster, 1976). In other models, lexical
items are active (Marslen-Wilson, 1980; McClelland & Elman, 1986; Morton,
1979) in the style of Selfridge's ``demons'' (Selfridge, 1958).
How is the lexicon organized and what are its entry
points? In active models, the internal organization of the lexicon is less
an issue, because the lexicon is also usually content addressable, so that
there is direct and simultaneous contact between an unknown input and all
relevant lexical representations. With passive models, an additional look-up
process is required and so the organization of the lexicon becomes more
important for efficient and rapid search. The lexicon may be organized along
dimensions which reflect phonological, or orthographic, or syntactic, or
semantic properties; or it may be organized along usage parameters, such
as frequency (Forster, 1976). Other problems include how to catalog morphologically
related elements (e.g., are ``telephone'' and ``telephonic'' separate entries?
``girl'' and ``girls''? ``ox'' and ``oxen''?); how to represent words with
multiple meanings (the various meanings of ``bank'' may be different enough
to warrant distinct entries, but what about the various meanings of ``run'',
some of which are only subtly different, and others which have more distant
but still clearly related meanings?); whether the lexicon includes information
about argument structure; and so on.
Is recognition all-or-nothing, or graded? In some theories,
recognition occurs at the point where a spoken word becomes uniquely distinguished
from its competitors (Marslen-Wilson, 1980). In other models, there may
be no consistent point where recognition occurs; rather, recognition is
a graded process subject to interactions which may hasten or slow down the
retrieval of a word in a given context. The recognition point is a strategically-controlled
threshold (McClelland & Elman, 1986).
How do lexical competitors interact? If the lexicon
is active, there is the potential for interactions between lexical competitors.
Some models build inhibitory interactions between words (McClelland &
Elman, 1986); others have suggested that the empirical evidence rules out
word-word inhibitions (Marslen-Wilson, 1987).
How are sentence structures constructed from words?
This single question has given rise to a vast and complex literature. The
nature of the sentence structures themselves is fiercely debated, reflecting
the diversity of current syntactic theories. There is in addition considerable
controversy around the sort of information which may play a role in the
construction process, or the degree to which at least a first-pass parse
is restricted to the purely syntactic information available to it (Frazier
& Rayner, 1982; Trueswell, Tanenhaus, & Kello, 1992).
There are thus a considerable number of questions which
remain open. Nonetheless, I believe it is accurate to say that there is
also considerable consensus regarding certain fundamental principles. I
take this consensus to include the following.
(a) A commitment to discrete and context-free symbols.
This is more readily obvious in the case of the classical approaches, but
many connectionist models utilize localist representations in which entities
are discrete and atomic (although graded activations may be used to reflect
uncertain hypotheses).
A central feature of all of these forms of representation--localist
connectionist as well as symbolic--is that they are intrinsically context-free.
The symbol for a word, for example, is the same regardless of its usage.
This gives such systems great combinatorial power, but it also limits their
ability to reflect idiosyncratic or contextually-specific behaviors.
This assumption also leads to a distinction between
types and tokens and motivates the need for variable binding. Types are
the canonical context-free versions of symbols; tokens are the versions
which are associated with specific contexts; and binding is the operation
which enforces the association (e.g., by means of indices, subscripts, or
other diacritics).
(b) The view of rules as operators and the lexicon
as operands. Words in most models are conceived of as the objects of processing.
Even in models in which lexical entries may be active, once a word is recognized
it becomes subject to grammatical rules which build up higher-level structures.
(c) The static nature of representations. Although
the processing of language clearly unfolds over time, the representations
which are produced by traditional models typically have a curiously static
quality. This is revealed in several ways. For instance, it is assumed that
the lexicon pre-exists as a data structure in much the same way that a dictionary
exists independently of its use. Similarly, the higher-level structures
created during sentence comprehension are built up through an accretive
process, and the successful product of comprehension will be a mental structure
in which all the constituent parts (words, categories, relational information)
are simultaneously present. (Presumably these become inputs to some subsequent
interpretive process which constructs discourse structures.) That is, although
processing models (``performance models'') often take seriously the temporal
dynamics involved in computing target structures, the target structures
themselves are inherited from theories which ignore temporal considerations
(``competence models'').
(d) The building metaphor. In the traditional view,
the act of constructing mental representations is similar to the act of
constructing a physical edifice. Indeed, this is precisely what is claimed
in the Physical Symbol System Hypothesis (Simon, 1980). In this view, words
and more abstract constituents are like the bricks in a building; rules
are the mortar which binds them together. As processing proceeds, the representation
grows much as does a building under construction. Successful processing
results in a mental edifice which is a complete and consistent structure,
again, much like a building.
I take these assumptions to be widely shared among
researchers in the field of language processing, although they are rarely
stated explicitly. Furthermore, these assumptions have formed the basis
for a large body of empirical literature; they have played a role in the
framing of the questions which are posed, and later in interpreting the
experimental results. Certainly it is incumbent on any theory which is offered
as replacement to at least provide the framework for describing the empirical
phenomena, as well as improving our understanding of the data.
Why might we be interested in another theory? One reason
is that this view of our mental life which I have just described, that is,
a view which relies on discrete, static, passive, and context-free representations,
appears to be sharply at variance with what is known about the computational
properties of the brain (Churchland & Sejnowski, 1992). It must also
be acknowledged that while the theories of language which subscribe to the
assumptions listed above do provide a great deal of coverage of data, that
coverage is often flawed, internally inconsistent and ad hoc, and highly
controversial. So it is not unreasonable to raise the question: Do the shortcomings
of the theories arise from assumptions which are basically flawed? Might
there be other, better ways of understanding the nature of the mental processes
and representations which underlie language? In the next section, I would
like to suggest an alternative view of computation, in which language processing
is seen as taking place in a dynamical system. The lexicon is viewed as
consisting of regions of state space within that system; the grammar consists
of the dynamics (attractors and repellers) which constrain movement in that
space. As we will see, this approach entails representations which are highly
context-sensitive, continuously varied and probabilistic (but of course
0.0 and 1.0 are also probabilities), and in which the objects of mental
representation are better thought of as trajectories through mental space
rather than things which are constructed.
An entry-point to describing this approach is the question
of how one deals with time and the problem of serial processing. Language,
like many other behaviors, unfolds and is processed over time. This simple
fact--so simple it seems trivial--turns out to be problematic when explored
in detail. Therefore, I turn now to the question of time. I describe a connectionist
approach to temporal processing and show how it can be applied to several
linguistic phenomena. In the final section I turn to the pay-off and attempt
to show how this approach leads to useful new views about the lexicon and
about grammar.
The problem of time
Time is the medium in which all our behaviors unfold;
it is the context within which we understand the world. We recognize causality
because causes precede effects; we learn that coherent motion over time
of points on the retinal array is a good indicator of objecthood; and it
is difficult to think about phenomena such as language, or goal-directed
behavior, or planning without some way of representing time. Time's arrow
is such a central feature of our world that it is easy to think that, having
acknowledged its pervasive presence, little more needs to be said.
But time has been the stumbling block of many theories.
An important issue in models of motor activity, for example, has been the
nature of the motor intention. Does the action plan consist of a literal
specification of output sequences (probably not), or does it represent serial
order in a more abstract manner (probably so, but how; e.g., Fowler, 1977;
Jordan & Rosenbaum, 1988; Kelso, Saltzman, & Tuller, 1986; MacNeilage,
1970)? Within the realm of natural language processing, there is considerable
controversy about how information accumulates over time and what information
is available when (e.g., Altmann & Steedman, 1988; Ferreira & Henderson,
1990; Trueswell, Tanenhaus, & Kello, in press).
Time has been a challenge for connectionist models
as well. Early models, perhaps reflecting the initial emphasis on the parallel
aspects of these models, typically adopted a spatial representation of time
(e.g., McClelland & Rumelhart, 1981). The basic approach is illustrated
in Figure 1. The temporal order of input events (first-to-last) is represented
by the spatial order (left-to-right) of the input vector. There are a number
of problems with this approach (see Elman, 1990, for discussion). One of
the most serious is that the left-to-right spatial ordering has no intrinsic
significance at the level of computation which is meaningful for the network.
All input dimensions are orthogonal to each other in the input vector space.
The human eye tends to see patterns such as 01110000 and 00001110 as having
undergone a spatial (or temporal, if we understand these as representing
an ordered sequence) translation, because the notation suggests a special
relationship may exist between adjacent bits. But this relationship is the
result of considerable processing by the human visual system, and is not
intrinsic to the vectors themselves. The first element in a vector is not
``closer'' in any useful sense to the second element than it is to the last
element. Most important, this adjacency relationship is not available to simple networks of the form
shown in Figure 1. A particularly unfortunate consequence is that there
is no basis in such architectures for generalizing what has been learned
about spatial or temporal stimuli to novel patterns.
- Figure 1. A feed-forward network which represents
time through space. Circles represent nodes; arrows between layers indicate
full connectivity between nodes in adjacent layers. The network is feed-forward
because activations at each level depend only on the input received from
below. At the conclusion of processing an input, all activations are thus
lost. A sequence of inputs can be represented in such an architecture by
associating the first node (on the left) with the first element in the sequence;
the second node with the second element; and so on.
More recent models have explored what is intuitively
a more appropriate idea: Let time be represented by the effects it has on
processing. If network connections include feedback loops, then this goal
is achieved naturally. The state of the network will be some function of
the current inputs plus the network's prior state. Various algorithms and
architectures have been developed which exploit this insight (e.g., Elman,
1990; Jordan, 1986; Mozer, 1989; Pearlmutter, 1989; Rumelhart, Hinton, &
Williams, 1986). Figure 2 shows one architecture, the Simple Recurrent Network,
which was used for the studies to be reported here.
- Figure 2. A simple recurrent network (SRN). Solid
lines indicate full connectivity between layers, with weights which are
trainable. The dotted line indicates a fixed one-to-one connection between
hidden and context layers. The context units are used to save the activations
of the hidden units on any time step. Then on the next time step, the hidden
units are activated not only by new input but by the information in the
context units--which is just the hidden units' own activations on the prior
time step. An input sequence is processed by presenting each element in
the sequence one at a time, allowing the network to be activated at each
step in time, and then proceeding to the next element. Note that although
hidden unit activations may depend on prior inputs, by virtue of prior inputs'
effects on the recycled hidden unit/context unit activations, the hidden
units do not record the input sequence in any veridical manner. Instead,
the task of the network is to learn to encode temporal events in some more
abstract manner which allows the network to perform the task at hand.
In the SRN architecture, at time t hidden units receive
external input, and also collateral input from themselves at time t-1 (the
context units are simply used to implement this delay). The activation function
for any given hidden unit i is the familiar logistic,

    a_i(t) = \frac{1}{1 + e^{-net_i(t)}}

but where the net input to the unit at time t, net_i(t), is now

    net_i(t) = \sum_j w_{ij}\, x_j(t) + \sum_k v_{ik}\, a_k(t-1) + b_i
That is, the net input on any given tick of the clock
t includes not only the weighted sum of inputs and the node's bias, but
the weighted sum of the hidden unit vector at the prior time step. (Henceforth,
when referring to the state space of this system, I shall be referring specifically
to the k-dimensional space defined by the k hidden units.)
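To make the update concrete, here is a minimal sketch of the SRN state update in Python/NumPy. It is an illustration only: the names (W_in, W_ctx, b), the dimensions, and the random weights are assumptions of the sketch, not the parameters of the simulations reported below.

```python
import numpy as np

def logistic(x):
    # The familiar logistic (sigmoid) activation function.
    return 1.0 / (1.0 + np.exp(-x))

def srn_step(x_t, h_prev, W_in, W_ctx, b):
    # One time step: hidden units receive external input x_t plus
    # collateral input from their own activations at t-1 (h_prev,
    # which the context units merely hold over for one step).
    net = W_in @ x_t + W_ctx @ h_prev + b
    return logistic(net)

# Illustrative dimensions and random weights (assumptions, not the
# reported simulations' parameters):
n_inputs, n_hidden = 29, 70
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)           # network starts at rest
for x in np.eye(n_inputs)[:5]:   # a short sequence of localist inputs
    h = srn_step(x, h, W_in, W_ctx, b)
```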
In the typical feedforward network, hidden units develop
representations which enable the network to perform the task at hand (Rumelhart,
Hinton, & Williams, 1986). These representations may be highly abstract
and are function-based. That is, the similarity structure of the internal
representations reflects the demands of the task being learned, rather than
the similarity of the inputs' form. When recurrence is added, the hidden
units assume an additional function. They now provide the network with memory.
But as is true in the feedforward network, the encoding of the temporal
history is task-relevant and may be highly abstract; it rarely is the case
that the encoding resembles a verbatim tape-recording.
One task for which the SRN has proven useful is prediction.
There are several reasons why it is attractive to train a network to predict
the future. One which arises with supervised learning algorithms such as
backpropagation of error is the question of where the teaching information
comes from. In many cases, there are plausible rationales which justify
the teacher. But the teacher also reflects important theoretical biases
which one might sometimes like to avoid (for example, if one were interested
in using the network to generate alternative theories). Since the teacher
in the prediction task is simply the time-lagged input, it represents information
which is directly observable from the environment and is relatively theory
neutral. Furthermore, there is good reason to believe that anticipating
the future plays an important role in learning about the world. Finally,
prediction is a powerful tool for learning about temporal structure. Insofar
as the order of events may reflect upon the past in complex and non-obvious
ways, the network will be required to develop relatively abstract encodings
of these dependencies in order to generate successful predictions.
The SRN architecture, as well as other forms of recurrent
networks, has been used in a variety of applications and has yielded promising
results. The SRN's ability to handle temporal sequences makes it a particularly
relevant architecture for modeling language behaviors. The deeper question
which then arises is whether the solutions found by such recurrent network
architectures differ in any substantial ways from more traditional models.
And if the solutions are different, are these differences positive or negative?
Rules and representations: A dynamical perspective
We begin with the observation that networks such as
that in Figure 2 are dynamical systems. This means that their state at any
given point in time is some function which reflects their prior state (see
Norton, this volume, for a detailed review of the definition and characteristics
of dynamical systems). The computational properties of such networks are
not yet fully known, but it is clear that they are considerable (Siegelmann
& Sontag, 1992). It also seems reasonable that the conceptual notions
which are associated with discrete automata theory and symbolic computation
may offer less insight into their functioning than the concepts from dynamical
systems theory (e.g., Pollack, 1990). How might such networks be applied
to problems relevant to language processing, and how might they suggest
a different view of the underlying mechanisms of language? One way to approach
this is to consider the problem of how the elements of language may be ordered.
Language is a domain in which the ordering of elements
is particularly complex. Word order, for instance, reflects the interaction
of multiple factors. These include syntactic constraints, semantic and pragmatic
goals, discourse considerations, and processing constraints (e.g., verb-particle
constructions such as ``run up'' may be split by a direct object, but not
when the noun phrase is long enough to disrupt the processing of the discontinuous
verb as a unit). Whether or not one subscribes to the view that these knowledge
sources exert their effects autonomously or interactively, there is no question
that the final output--the word stream--reflects their joint interplay.
We know also that the linear order of linguistic elements
provides a poor basis for characterizing the regularities which exist within
a sentence. A noun may agree for number with a verb which immediately follows
it (as in 1a) or which is separated by an arbitrarily great distance (as
in 1b):
1. (a) The children(pl) like(pl) ice cream.
(b) The girl(sg) who Emily baby-sits for every other Wednesday while her parents go to nightschool likes(sg) ice cream.
Such considerations led Miller and Chomsky (1963) to
argue that statistically-based algorithms are infeasible for language learning,
since the number of sentences which a listener would need to hear in order
to know precisely which of the 14 words which precede likes in (1b) determines
the correct number for likes would vastly outnumber the data available (in
fact, even conservative estimates suggest that more time would be needed
than is available in an entire individual's lifetime). On the other hand,
recognition that the dependencies respect an underlying hierarchical structure
vastly simplifies the problem: Subject nouns in English agree for number
with their verbs; embedded clauses may intervene but do not participate
in the agreement process.
One way to challenge a simple recurrent network with
a problem which has some relevance to language would therefore be to attempt
to train it to predict the successive words in sentences. We know that this
is a hard problem which cannot be solved in any general way by simple recourse
to linear order. We know also that this is a task which has some psychological
validity. Human listeners are able to predict word endings from beginnings;
listeners can predict grammaticality from partial sentence input; and sequences
of words which violate expectations--i.e., which are unpredictable--result
in distinctive electrical activity in the brain. An interesting question
is whether a network could be trained to predict successive words. In the
following two simulations we shall see how, in the course of solving this
task, the network develops novel representations of the lexicon and of grammatical
rules.
The lexicon as structured state space
Words may be categorized with respect to many factors.
These include such traditional notions as noun, verb, etc.; the argument
structures they are associated with; and semantic features. Many of these
characteristics are predictive of a word's syntagmatic properties. But is
the reverse true? Can distributional facts be used to infer something about
a word's semantic or categorial features? The goal of the first simulation
was to see if a network could work backwards in just this sense.
A small lexicon of 29 nouns and verbs was used to form
simple sentences (see Elman, 1990, for details). Each word was represented
as a localist vector in which a single randomly assigned bit was turned
on. This input representation ensured that there was nothing about the form
of the word which was correlated with its properties, and thus that any
classifications would have to be discovered by the network based solely
on distributional behavior.
A network similar to the one shown in Figure 2 was
trained on a set of 10,000 sentences, with each word presented in sequence
to the network and each sentence concatenated to the preceding sentence.
The task of the network was to predict the successive word. After each word
was input, the output (which was the prediction of the next input) was compared
with the actual next word and weights were adjusted by the backpropagation
of error learning algorithm.
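A minimal sketch of this training regime follows. Because the SRN treats the context activations as ordinary inputs, standard backpropagation suffices. For brevity the sketch uses a softmax output with cross-entropy error; the original simulations used logistic output units, so that detail, along with the sizes and learning rate, is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 29, 70                       # vocabulary and hidden sizes (illustrative)
W_in  = rng.normal(scale=0.1, size=(H, V))
W_ctx = rng.normal(scale=0.1, size=(H, H))
W_out = rng.normal(scale=0.1, size=(V, H))
b_h, b_o = np.zeros(H), np.zeros(V)
lr = 0.1

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, target, c):
    # One word of the prediction task: x and target are one-hot vectors
    # for the current and next word; c is the context (the hidden
    # activations saved from the previous step).
    global W_in, W_ctx, W_out, b_h, b_o
    h = logistic(W_in @ x + W_ctx @ c + b_h)
    z = W_out @ h + b_o
    y = np.exp(z - z.max()); y /= y.sum()    # softmax prediction (see lead-in)
    dz = y - target                          # output error signal
    dh = (W_out.T @ dz) * h * (1.0 - h)      # backpropagated to hidden layer
    W_out -= lr * np.outer(dz, h); b_o -= lr * dz
    W_in  -= lr * np.outer(dh, x); b_h -= lr * dh
    W_ctx -= lr * np.outer(dh, c)            # context treated as a plain input
    return h                                 # becomes the next context

eye = np.eye(V)
corpus = rng.integers(0, V, size=1000)       # stand-in for the real corpus
c = np.zeros(H)
for w, w_next in zip(corpus, corpus[1:]):    # sentences concatenated end to end
    c = train_step(eye[w], eye[w_next], c)
```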
At the conclusion of training, the network was tested
by comparing its predictions against the corpus. Since the corpus was non-deterministic,
it was not reasonable to expect that the network (short of memorizing the
sequence) would be able to make exact predictions. Instead, the network
predicted the cohort of potential word successors in each context. The activation
of each word in the cohort turned out to be highly correlated with its conditional
probability in that context (the mean cosine of the output vector with
the empirically derived probability distribution was 0.916).
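The evaluation can be sketched as follows: collect, for each context, the empirical distribution of successors in the corpus, then take the cosine of each output vector with that distribution. Representing a context as a hashable identifier (say, the tuple of words seen so far in the sentence) is an assumption of the sketch, not a detail of the original analysis.

```python
import numpy as np
from collections import defaultdict

def mean_cosine(contexts, next_words, outputs, V):
    # contexts: hashable context identifiers; next_words: the observed
    # successor of each context (as a vocabulary index); outputs: the
    # network's output vectors for those contexts.
    counts = defaultdict(lambda: np.zeros(V))
    for ctx, w in zip(contexts, next_words):
        counts[ctx][w] += 1.0
    cosines = []
    for ctx, y in zip(contexts, outputs):
        p = counts[ctx] / counts[ctx].sum()   # empirical conditional distribution
        cosines.append(y @ p / (np.linalg.norm(y) * np.linalg.norm(p)))
    return float(np.mean(cosines))            # 0.916 in the reported study
```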
This behavior suggests that in order to maximize performance
at prediction, the network identifies inputs as belonging to classes of
words based on distributional properties and co-occurrence information.
These classes were not represented in the overt forms of the words, since
those forms were all orthogonal to one another. However, the network is free to
learn internal representations at the hidden unit layer which might capture
this implicit information.
To test this possibility, the corpus of sentences was
run through the network a final time. As each word was input, the hidden
unit activation pattern which was produced by the word, plus the context
layer, was saved. For each of the 29 words, a mean vector was computed,
averaging across all instances of the word in all contexts. These mean vectors
were taken to be prototypes, and were subjected to hierarchical clustering.
The point of this was to see whether the inter-vector distances revealed
anything about similarity structure of the hidden unit representation space
(Euclidean distance being taken as a measure of similarity). The tree in
Figure 3 was then constructed from that hierarchical clustering.
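A sketch of this analysis is given below, using SciPy's hierarchical clustering (an assumption of the sketch; any agglomerative clustering over Euclidean distances would serve).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

def cluster_word_prototypes(hidden_states, word_ids, words):
    # hidden_states: one row of hidden activations per word token;
    # word_ids: which word produced each row; words: the word labels.
    protos = np.stack([hidden_states[word_ids == i].mean(axis=0)
                       for i in range(len(words))])   # one prototype per word
    Z = linkage(protos, method="average", metric="euclidean")
    dendrogram(Z, labels=words)                       # a tree like Figure 3
    plt.show()
    return Z
```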
- Figure 3. Hierarchical clustering diagram of hidden unit activations
in the simulation with simple sentences. After training, sentences are passed
through the network, and the hidden unit activation pattern for each word
is recorded. The clustering diagram indicates the similarity structure among
these patterns. This structure, which reflects the grammatical factors that
influence word position, is inferred by the network; the patterns which
represent the actual inputs are orthogonal and carry none of this information.
The similarity structure revealed in this tree indicates
that the network discovered several major categories of words. The two largest
categories correspond to the input vectors which are verbs and nouns. The
verb category is subdivided into those verbs which require a direct object,
those which are intransitive, and those for which (in this corpus) a direct
object was optional. The noun category is broken into animates and inanimates.
The animates contain two classes: human and nonhuman, with nonhumans
subdivided into large animals and small animals. The inanimates are divided
into breakables, edibles, and miscellaneous.
First, it must be said that the network obviously knows
nothing about the real semantic content of these categories. It has simply
inferred that such a category structure exists. The structure is inferred
because it provides the best basis for accounting for distributional properties.
Obviously, a full account of language would require an explanation of how
this structure is given content (grounded in the body and in the world).
But it is interesting that the evidence for the structure can be inferred
so easily on the basis only of form-internal evidence, and this result may
encourage caution about just how much information is implicit in the data
and how difficult it may be to use this information to construct a framework
for conceptual representation.
However, my main point is not to suggest that this
is the primary way in which grammatical categories are acquired by children,
although I believe that cooccurrence information may indeed play a role
in such learning. The primary thing I would like to focus on is what this
simulation suggests about the nature of representation in systems of this
sort. That is, I would like to consider the representational properties
of such networks, apart from the specific conditions which give rise to
those representations.
Where is the lexicon in this network? Recall the earlier
assumptions: The lexicon is typically conceived of as a passive data structure.
Words are objects of processing. They are first subject to acoustic/phonetic
analysis, and then their internal representations must be accessed, recognized,
and retrieved from permanent storage. Following this, the internal representations
have to be inserted into a grammatical structure.
The status of words in a system of the sort described
here is very different: Words are not the objects of processing as much
as they are inputs which drive the processor in a more direct manner. As
Wiles and Bloesch (1992) have suggested, it is more useful to understand inputs
to networks of this sort as operators rather than as operands. Inputs operate
on the network's internal state and move it to another position in state
space. What the network learns over time is what response it should make
to different words, taking context into account. Because words have reliable
and systematic effects on behavior, it is not surprising that all instances
of a given word should result in states which are tightly clustered, or
that grammatically or semantically related words should produce similar
effects on the network. We might choose to think of the internal state that
the network is in when it processes a word as representing that word (in
context), but it is more accurate to think of that state as the result of
processing the word, rather than as a representation of the word itself.
Note that there is an implicitly hierarchical organization
to the regions of state space associated with different words. This organization
is achieved through the spatial structure. Conceptual similarity is realized
through position in state space. Words which are conceptually distant produce
hidden unit activation patterns which are spatially far apart. Higher-level
categories correspond to large regions of space; lower-level categories
correspond to more restricted subregions. For example, dragon is a noun
and causes the network to move into the noun region of the state space.
It is also [+animate], which is reflected in the subregion of noun space
which results. Because large animals typically are described in different
terms and do different things than small animals, the general region of
space corresponding to dragon, monster and lion is distinct from that occupied
by mouse, cat, and dog. The boundaries between these regions may be thought
of as hard in some cases (e.g., nouns are very far from verbs) or soft in
others (e.g., sandwich, cookie, and bread are not very far from car, book,
and rock). One might even imagine cases where, in certain contexts, tokens
of one word might overlap with tokens of another. In such cases, one would
say that the system has generated highly similar construals of the different
words.
Rules as attractors
If the lexicon is represented as regions of state space,
what about rules? We have already seen that some aspects of grammar are
captured in the tokenization of words, but this is a fairly limited sense
of grammar. The well-formedness of sentences depends on relationships which
are not readily stated in terms of simple linear order. Thus the proper
generalization about why the main verb in (1b) is in the plural is that
the main subject is plural, and not that the word 14 words prior was a plural
noun. The ability to express such generalizations would seem to require
a mechanism for explicitly representing abstract grammatical structure,
including constituent relationships (e.g., the notion that some elements
are part of others). Notations such as phrase structure trees (among others)
provide precisely this capability. It is not obvious how complex grammatical
relations might be expressed using distributed representations. Indeed,
it has been argued that distributed representations (of the sort exemplified
by the hidden unit activation patterns in the previous simulation) cannot
have constituent structure in any systematic fashion (Fodor & Pylyshyn,
1988). (As a backup, Fodor and Pylyshyn suggest that if distributed representations
do have a systematic constituent structure, then they are merely implementations
of what they call the ``classical'' theory, in this case, the Language of
Thought, Fodor, 1976.)
The fact that the grammar of the first simulation was
extremely simple made it difficult to explore these issues. Sentences were
all declarative and monoclausal. This simulation sheds little light on the
grammatical potential of such networks.
A better test would be to train the network to predict
words in complex sentences which contain long-distance dependencies. This
was done in Elman (1991b) using a strategy which was similar to the one
outlined in the prior simulation, except that sentences had the following
characteristics:
(1) Nouns and verbs agreed for number. Singular nouns
required singular verbs; plural nouns selected plural verbs.
(2) Verbs differed with regard to their verb argument
structure. Some verbs were transitive; others were intransitive; and others
were optionally transitive.
(3) Nouns could be modified by relative clauses. Relative
clauses could either be object-relatives (the head had the object role in
the clause) or subject-relatives (the head was the subject of the clause),
and either subject or object nouns could be relativized.
As in the previous simulation, words were represented
in localist fashion so that information about neither the grammatical category
(noun or verb) nor the number (singular or plural) was contained in the
form of the word. The network also only saw positive instances; only grammatical
sentences were presented.
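To make these stimulus properties concrete, the following is a toy generator exhibiting all three. The vocabulary, probabilities, and depth limit are invented for illustration and are far smaller than the grammar actually used in Elman (1991b).

```python
import random

NOUNS = {"boy": "sg", "boys": "pl", "dog": "sg", "dogs": "pl"}
VERBS = {                       # number-marked forms by argument structure
    "trans":    {"sg": "chases", "pl": "chase"},
    "intrans":  {"sg": "lives",  "pl": "live"},
    "optional": {"sg": "sees",   "pl": "see"},
}

def noun_phrase(depth=0):
    noun = random.choice(list(NOUNS))
    words, num = [noun], NOUNS[noun]
    if depth < 2 and random.random() < 0.3:        # optional relative clause
        words += relative_clause(num, depth + 1)
    return words, num

def relative_clause(head_num, depth):
    kind = random.choice(["trans", "optional"])
    if random.random() < 0.5:                      # subject-relative: head is
        obj, _ = noun_phrase(depth)                # the subject of the clause
        return ["who", VERBS[kind][head_num]] + obj
    subj, num = noun_phrase(depth)                 # object-relative: head is the
    return ["who"] + subj + [VERBS[kind][num]]     # gapped object of the clause

def sentence():
    subj, num = noun_phrase()
    kind = random.choice(list(VERBS))
    words = subj + [VERBS[kind][num]]
    if kind == "trans" or (kind == "optional" and random.random() < 0.5):
        words += noun_phrase()[0]
    return " ".join(words)

for _ in range(3):
    print(sentence())   # e.g. "boys who dog chases live"
```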
The three properties interact in ways which were designed
to make the prediction task difficult. The prediction of number is easy
in a sentence such as (2a), but harder in (2b).
2. (a) The boys(pl) chase(pl) the dogs.
(b) The boys(pl) who the dog(sg) chases(sg) run(pl) away.
In the first case, the verb follows immediately. In
the second case, the first noun agrees with the second verb (run) and is
plural; the verb which is actually closest to it (chase) is in the singular
because it agrees with the intervening word (dog).
Relative clauses cause similar complications for verb
argument structure. In (3), it is not difficult for the network to learn
that chase requires a direct object, see permits (but does not require)
one, and lives is intransitive.
3. (a) The cats chase the dog.
(b) The girls see. The girls see the car.
(c) The patient lives.
On the other hand, consider (4):
4. The dog who the cats chase runs away.
The direct object of the verb chase in the relative
clause is dog. However, dog is also the head of the clause (as well as the
subject of the main clause). Chase in this grammar is obligatorily transitive,
but the network must learn that when it occurs in such structures the object
position is left empty (gapped) because the direct object has already been
mentioned (filled) as the clause head.
These data illustrate the sorts of phenomena which
have been used by linguists to argue for abstract representations with constituent
structure (Chomsky, 1957); they have also been used to motivate the claim
that language processing requires some form of pushdown store or stack mechanism.
They therefore impose a difficult set of demands on a recurrent network.
However, after training a network on such stimuli (Elman,
1991b) it appeared the network was able to make correct predictions (mean
cosine between outputs and empirically derived conditional probability distributions:
0.852; perfect performance would have been 1.0). These predictions honored
the grammatical constraints which were present in the training data. The
network was able to correctly predict the number of a main sentence verb
even in the presence of intervening clauses (which might have the same or
conflicting number agreement between nouns and verbs). The network also
not only learned about verb argument structure differences, but correctly
``remembered'' when an object-relative head had appeared, so that it would
not predict a noun following an embedded transitive verb. Figure 4 shows
the predictions made by the network during testing with a novel sentence.

- Figure 4. The predictions made by the network in
the simulation with complex sentences, as the network processes the sentence
``boys who Mary chases feed cats.'' Each panel displays the activations
of output units after successive words; outputs are summed across groups
for purposes of displaying the data. ``V'' and ``N'' refer to verbs and
nouns; ``sg'' and ``pl'' refer to singular and plural; ``prop'' refers to
proper nouns; and ``t'', ``i'', and ``t/i'' refer to transitive verbs, intransitive
verbs, and optionally transitive verbs.
How is this behavior achieved? What is the nature of
the underlying knowledge possessed by the network which allows it to perform
in a way which conforms with the grammar? It is not likely that the network
simply memorized the training data, because the network was able to generalize
its performance to novel sentences and structures it had never seen before.
But just how general was the solution, and just how systematic?
In the previous simulation, hierarchical clustering
was used to measure the similarity structure between internal representations
of words. This gives us an indirect means of determining the spatial structure
of the representation space. It does not let us actually determine what
that structure is. So one would like to be able to visualize the internal
state space more directly. This is also important because it would allow
us to study the ways in which the network's internal state changes over
time as it processes a sentence. These trajectories might tell us something
about how the grammar is encoded.
One difficulty which arises in trying to visualize
movement in the hidden unit activation space over time is that it is an
extremely high-dimensional space (70 dimensions, in the current simulation).
These representations are distributed, which typically has the consequence
that interpretable information cannot be obtained by examining activity
of single hidden units. Information is more often encoded along dimensions
which are represented across multiple hidden units.
This is not to say, however, that the information is
not there; simply that one needs to discover the proper viewing
perspective to get at it. One way of doing this is to carry out a principal
components analysis (PCA) over the hidden unit activation vectors. PCA allows
us to discover the dimensions along which there is variation in the vectors;
it also makes it possible to visualize the vectors in a coordinate system
which is aligned with this variation. This new coordinate system has the
effect of giving a somewhat more localized description to the hidden unit
activation patterns. Since the dimensions are ordered with respect to amount
of variance accounted for, we can now look at the trajectories of the hidden
unit patterns along selected dimensions of the state space.
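In outline (a sketch; the original analysis may have differed in detail): compute the principal axes over all hidden-state vectors collected from the corpus, then project the states visited during a single sentence onto the chosen components.

```python
import numpy as np

def pca_axes(all_states):
    # Principal axes of the hidden-state vectors (SVD of the centered
    # data), ordered by the amount of variance accounted for.
    mean = all_states.mean(axis=0)
    _, _, Vt = np.linalg.svd(all_states - mean, full_matrices=False)
    return mean, Vt

def trajectory(sentence_states, mean, Vt, components=(1, 2)):
    # Coordinates of one sentence's hidden states along the selected
    # components (1-indexed, as in the figures); plotting the rows in
    # word order traces the sentence's path through state space.
    idx = [c - 1 for c in components]
    return (sentence_states - mean) @ Vt[idx].T
```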
- Figure 5. Trajectories through hidden unit state
space as the network processes the sentences ``boy hears boy'' and ``boys
hear boy''. The number (singular vs. plural) of the subject is indicated
by the position in state space along the second principal component.

- Figure 6. Trajectories through hidden unit state space
as the network processes the sentences ``boy chases boy'', ``boy sees boy'',
and ``boy walks.'' Transitivity of the verb is encoded by its position along
an axis which cuts across the first and third principal components.

- Figure 7. Trajectories through hidden unit state space (principal components
1 and 11) as the network processes the sentences ``boy chases boy'', ``boy
chases boy who chases boy'', ``boy who chases boy chases boy'', and ``boy
chases boy who chases boy who chases boy'' (to assist in reading the plots,
the final word of each sentence is terminated with a ``]S'').
In Figures 5, 6, and 7 we see the movement over time
through various planes in the hidden unit state space as the trained network
processes various test sentences. Figure 5 compares the path through state
space (along the second principal component) as the network processes the
sentences boys hear boys and boy hears boy. PCA 2 encodes the number of
the main clause subject noun, and the difference in the position along this
dimension correlates with whether the subject is singular or plural. Figure
6 compares trajectories for sentences with verbs which have different argument
expectations; chases requires a direct object, sees permits one, and walks
precludes one. As can be seen, these differences in argument structure are
reflected in a displacement in state space from upper left to lower right.
Finally, Figure 7 illustrates the path through state space for various sentences
which differ in degree of embedding. The actual degree of embedding is captured
by the displacement in state space of the embedded clauses; sentences with
multiple embeddings appear somewhat as spirals.
These trajectories illustrate the general principle
at work in this network. The network has learned to represent differences
in lexical items as different regions in the hidden unit state space. The
sequential dependencies which exist among words in sentences are captured
by the movement over time through this space as the network processes successive
words in the sentence. These dependencies are actually encoded in the weights
which map inputs (i.e., the current state plus new word) to the next state.
The weights may be thought of as implementing the grammatical rules which
allow well-formed sequences to be processed and to yield valid expectations
about successive words. Furthermore, the rules are general. The network
weights create attractors in the state space, so that the network is able
to respond sensibly to novel inputs, as when unfamiliar words are encountered
in familiar contexts.
Discussion
The image of language processing just outlined does
not look very much like the traditional picture which we began with. Instead
of a dictionary-like lexicon, we have a state space partitioned into various
regions. Instead of symbolic rules and phrase structure trees, we have a
dynamical system in which grammatical constructions are represented by trajectories
through state space. Let me now consider what implications this approach
might have for understanding several aspects of language processing.
Beyond sentences
Although I have focused here on processing of sentences,
obviously language processing in real situations typically involves discourse
which extends over many sentences. It is not clear, in the traditional scheme,
how information which is represented in sentence structures might be kept
available for discourse purposes. The problem is just that on the one hand,
there are clearly limitations on how much information can be stored, so
obviously not everything can be preserved; but on the other hand, there
are many aspects of sentence-level processing which may be crucially affected
by prior sentences. These include not only anaphora, but also such things
as argument structure expectations (e.g., the verb to give normally requires
a direct object and an indirect object, but in certain contexts these need
not appear overtly if understood: Do you plan to give money to the United
Way? No, I gave last week.).
The network's approach to language processing handles
such requirements in a natural manner. The network is a system which might
be characterized as highly opportunistic. It learns to perform a task, in
this case prediction, doing just what it needs to do. Notice that in Figure
5, for example, the information about the number of the subject noun is
maintained only until the verb which agrees with the subject has been processed.
From that point on, the two sentences are identical. This happens because
once the verb is encountered, subject number is no longer relevant to any
aspect of the prediction task. (This emphasizes the importance of the task,
because presumably tasks other than prediction could easily require that
the subject number be maintained for longer.)
This approach to preserving information suggests that
such networks would readily adapt to processing multiple sentences in discourse,
since there is no particular reanalysis or re-representation of information
which is required at sentence boundaries and no reason why some information
cannot be preserved across sentences. Indeed, St. John (1992) and Harris
& Elman (1990) have demonstrated that networks of this kind readily
adapt to processing paragraphs and short stories. (The emphasis on functionality
is reminiscent of suggestions made by Agre & Chapman (1987) and Brooks
(1989). These authors argue that animals need not perfectly represent everything
which is in their environment, nor store it indefinitely. Instead, they
need merely be able to process that which is relevant to the task at hand.)
Types and tokens
Consider the first simulation, and the network's use
of state space to represent words. This is directly relevant to the way
in which the system addresses the type/token problem which arises in symbolic
systems.
In symbolic systems, because representations are abstract
and context-free, a binding mechanism is required to attach an instantiation
of a type to a particular token. In the network, on the other hand, tokens
are distinguished from one another by virtue of producing small but potentially
discriminable differences in the state space. John23, John43, and John192
(using subscripts to indicate different occurrences of the same lexical
item) will be physically different vectors. Their identity as tokens of
the same type is captured by the fact that they are all located in a region
which may be designated as the John space, and which contains no other vectors.
Thus, one can speak of this bounded region as corresponding to the lexical
type, John.
The differences in context, however, create differences
in the state. Furthermore, these differences are systematic. The clustering
tree in Figure 3 was built from the mean vector for each word, averaged
across contexts. If the actual hidden unit activation patterns are used,
the tree is of course quite large since there are hundreds of tokens of
each word. Inspection of the tree reveals two important facts. First, all
tokens of a type are more similar to one another than to any other type,
so the arborizations of tokens of boy and dog do not mix (although, as was
pointed out, such overlap is not impossible and may in some circumstances
be desirable). Second, there is a substructure to the spatial distribution
of tokens which is true of multiple types. Tokens of boy used as subject
occur more closely to one another than to the tokens of boy as object. This
is also true of the tokens of girl. Moreover, the spatial dimension along
which subject-tokens vary from object-tokens is the same for all nouns.
Subject-tokens of all nouns are positioned in the same region of this dimension,
and object-tokens are positioned in a different region. This means that
rather than proliferating an undesirable number of representations, this
tokenization of types actually encodes grammatically relevant information.
Note that the tokenization process does not involve creation of new syntactic
or semantic atoms. It is, instead, a systematic process. The state space
dimensions along which token variation occurs may be interpreted meaningfully.
The token's location in state space is thus at least functionally compositional
(in the sense described by van Gelder, 1990).
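One way to check this claim quantitatively (a sketch, not the analysis actually reported): compute, for each noun, the difference between its mean subject-token vector and its mean object-token vector; if tokenization is systematic in the way described, these difference vectors should be roughly parallel across nouns.

```python
import numpy as np

def role_axis_consistency(tokens_by_role):
    # tokens_by_role maps (word, role) -> array of token vectors, with
    # role one of "subj" or "obj". If subject- and object-tokens differ
    # along a dimension shared across nouns, the difference-of-means
    # vectors should be roughly parallel (pairwise cosines near 1).
    diffs = {}
    for word in sorted({w for (w, _) in tokens_by_role}):
        subj = tokens_by_role[(word, "subj")].mean(axis=0)
        obj = tokens_by_role[(word, "obj")].mean(axis=0)
        diffs[word] = subj - obj
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    words = list(diffs)
    return {(u, v): cos(diffs[u], diffs[v])
            for i, u in enumerate(words) for v in words[i + 1:]}
```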
Polysemy and accommodation
Polysemy refers to the case where a word has multiple
senses. Accommodation is used to describe the phenomenon in which word meanings
are contextually altered (Langacker, 1987). The network approach to language
processing provides an account for both phenomena, and shows how they may
be related.
Although there are clear instances where the same phonological
form has entirely different meanings (bank, for instance), in many cases
polysemy is a matter of degree. There may be senses which are different,
although metaphorically related, as in (5):
5. (a) Arturo Barrios runs very fast!
(b) This clock runs slow.
(c) My dad runs the grocery store down the block.
In other cases, the differences are far more subtle,
though just as real:
6. (a) Frank Shorter runs the marathon faster than I ever will.
(b) The rabbit runs across the road.
(c) The young toddler runs to her mother.
In (6), the construal of runs is slightly different,
depending on who is doing the running. But just as in (5), the way in which
the verb is interpreted depends on context. As Langacker (1987) has described
the process:
- It must be emphasized that syntagmatic combination involves more than
the simple addition of components. A composite structure is an integrated
system formed by coordinating its components in a specific, often elaborate
manner. In fact, it often has properties that go beyond what one might expect
from its components alone.... [O]ne component may need to be adjusted in
certain details when integrated to form a composite structure; I refer to
this as accommodation. For example, the meaning of run as applied to humans
must be adjusted in certain respects when extended to four legged animals
such as horses, dogs, and cats... in a technical sense, this extension creates
a new semantic variant of the lexical item. (pp. 76-77).
In Figure 8 we see that the network's representations
of words in context demonstrate just this sort of accommodation. Trajectories
are shown for various sentences, all of which contain the main verb burn.
The representation of the verb varies, depending on the subject noun. The
simulations shown here do not exploit the variants of the verb, but it is
clear that this is a basic property of such networks.
- Figure 8. Trajectories through hidden unit state
space (principal components 1 and 2) as the network processes the sentences
``{john, mary, lion, tiger, boy, girl} burns house'', as well as ``{museum,
house} burns'' (the final word of each sentence is terminated with ``]S'').
The internal representation of the word ``burns'' varies slightly as a
function of the verb's subject.
``Leaky recursion'' and processing complex sentences
The sensitivity to context which is illustrated in
Figure 8 also occurs across levels of organization. The network is able
to represent constituent structure (in the form of embedded sentences),
but it is also true that the representation of embedded elements may be
affected by words at other syntactic levels.
This means that the network does not implement a stack
or pushdown machine of the classical sort, and would seem not to implement
true recursion, in which information at each level of processing is encapsulated
and unaffected by information at other levels. Is this good or bad?
If one is designing a programming language, this sort
of ``leaky'' recursion is highly undesirable. It is important that the value
of variables local to one call of a procedure not be affected by their value
at other levels. True recursion provides this sort of encapsulation of information.
I would suggest that the appearance of a similar sort of recursion in natural
language is deceptive, however, and that while natural language may require
one aspect of what recursion provides (constituent structure and self-embedding)
it may not require the sort of informational firewalls between levels of
organization.
Indeed, embedded material typically has an elaborative
function. Relative clauses, for example, provide information about the head
of a noun phrase (which is at a higher level of organization). Adverbial
clauses perform a similar function for main clause verbs. In general, then,
subordination involves a conceptual dependence between clauses. Thus, it
may be important that a language processing mechanism facilitate rather
than impede interactions across levels of information.
There are specific consequences for processing which
may be observed in a system of this sort, which only loosely approximates
recursion. First, the finite bound on precision means that right-branching
sentences (such as 7a) will be processed better than center-embedded sentences
(such as 7b):
7. (a) The woman saw the boy that heard the man that left.
(b) The man the boy the woman saw heard left.
It has been known for many years that sentences of
the first sort are processed in humans more easily and accurately than sentences
of the second kind, and a number of reasons have been suggested (e.g., Miller
& Isard, 1964). In the case of the network, such an asymmetry arises
because right-branching structures do not require that information be carried
forward over embedded material, whereas in center-embedded sentences information
from the matrix sentence must be saved over intervening embedded clauses.
But it is also true that not all center-embedded sentences
are equally difficult to comprehend. Intelligibility may be improved in
the presence of semantic constraints. Compare the following, in (8):
8. (a) The man the woman the boy saw heard left.
(b) The claim the horse he entered in the race at the last minute was a ringer was absolutely false.
In (8b) the three subject nouns create strong--and
different--expectations about possible verbs and objects. This semantic
information might be expected to help the hearer more quickly resolve the
possible subject/verb/object associations and assist processing (Bever,
1970; King & Just, 1991). The verbs in (8a), on the other hand, provide
no such help. All three nouns might plausibly be the subject of all three
verbs.
In a series of simulations, Weckerly & Elman (1992)
demonstrated that a simple recurrent network exhibited similar performance
characteristics. It was better able to process right-branching structures,
compared to center-embedded sentences. And center-embedded sentences which
contained strong semantic constraints were processed better compared to
center-embedded sentences without such constraints. Essentially, the presence
of constraints meant that the internal state vectors generated during processing
were more distinct (further apart in state space) and therefore preserved
information better than the vectors in sentences in which nouns were more
similar.
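The role of distinctness can be checked with a few lines of arithmetic:
state vectors which begin farther apart in state space can tolerate more
noise or decay before becoming confusable. The vectors below are invented
stand-ins for network states, not data from the Weckerly and Elman
simulations.

    import numpy as np

    rng = np.random.default_rng(1)
    base = rng.normal(size=8)

    # Invented stand-ins: semantically distinct subjects (``claim'',
    # ``horse'', ``he'') evoke well-separated states; similar subjects
    # (``man'', ``woman'', ``boy'') evoke states that barely differ.
    distinct = [base + 1.0 * rng.normal(size=8) for _ in range(3)]
    similar = [base + 0.1 * rng.normal(size=8) for _ in range(3)]

    def min_pairwise_distance(vectors):
        return min(np.linalg.norm(a - b)
                   for i, a in enumerate(vectors)
                   for b in vectors[i + 1:])

    # States which start farther apart remain discriminable after more
    # decay or interference.
    print(min_pairwise_distance(distinct))  # large
    print(min_pairwise_distance(similar))   # small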
The immediate availability of lexically-specific information
One question which has generated considerable controversy
concerns the time-course of processing: when do various types of information
become available, and when are they used, during sentence comprehension? One proposal
is that there is a first-pass parse during which only category-general syntactic
information is available (Frazier & Rayner, 1982). The other major position
is that considerably more information, including lexically-specific constraints
on argument structure, is available and used in processing (Taraban &
McClelland, 1988). Trueswell, Tanenhaus, and Kello (in press) present empirical
evidence from a variety of experimental paradigms which strongly suggests
that listeners are able to use subcategorization information in resolving
the syntactic structure of a noun phrase which would otherwise be ambiguous.
For example, in (9), the verb ``forgot'' permits both a noun phrase complement
and a sentential complement; at the point when ``the solution''
has been read, either (9a) or (9b) is possible.
9. (a) The student forgot the solution was in the back of the book.
(b) The student forgot the solution.
In (10), on the other hand, ``hope'' is strongly biased
toward taking a sentential complement.
10. (a) The student hoped the solution was in the back of the book.
(b) *The student hoped the solution.
Trueswell and his colleagues found that subjects appeared
not only to be sensitive to the preferred complement for these verbs, but
that behavior was significantly correlated with the statistical patterns
of usage (determined through corpus analysis). That is, insofar as the actual
usage of a verb might be more or less biased in a particular direction,
subjects' expectations were more or less consistent with that usage. This
is exactly the pattern of behavior which would be expected given the model
of processing which has been described here, and we are currently attempting
to model these data.
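One way such verb-specific biases might be captured is to estimate each
verb's complement preferences directly from corpus counts and let the
resulting probabilities set the processor's expectations; the following
sketch uses invented counts, not the corpus statistics reported by
Trueswell and his colleagues.

    # Invented counts for illustration only: how often each verb occurs
    # with a direct-object noun phrase versus a sentential complement.
    counts = {
        "forgot": {"NP": 55, "S": 45},  # roughly equibiased, as in (9)
        "hoped": {"NP": 2, "S": 98},    # strongly S-biased, as in (10)
    }

    def complement_bias(verb):
        # Relative-frequency estimate of P(sentential complement | verb).
        c = counts[verb]
        return c["S"] / (c["NP"] + c["S"])

    for verb in counts:
        print(verb, round(complement_bias(verb), 2))
    # Having read ``hoped the solution'', a processor with these
    # statistics should strongly expect a complement clause, as in (10a).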
Conclusions
Over recent years, there has been considerable work
in attempting to understand various aspects of speech and language in terms
of dynamical systems. Some of the most elegant and well-developed work has
focused on motor control, particularly within the domain of speech (e.g.,
Fowler, 1980; Kelso, Saltzman, & Tuller, 1986). Some of this work makes
explicit reference to consequences for theories of phonology (e.g., Browman
& Goldstein, 1985; Pierrehumbert & Pierrehumbert, 1990).
More recently, attention has been turned to systems
which might operate at so-called higher levels of language processing. One
of the principal challenges has been whether or not these dynamical systems
can deal in a satisfactory way with the apparently recursive nature of grammatical
structure.
I have attempted to show in this chapter that networks
which possess dynamical characteristics do indeed have a number of properties
which capture important aspects of language, including its embedded nature.
The framework appears to differ from the traditional view of language processors
in the way in which it represents lexical and grammatical information. Nonetheless,
these networks exhibit behaviors which are highly relevant for language.
They are able to induce lexical category structure from statistical regularities
in usage; and they are able to represent constituent structure to a certain
degree. They are not perfect, but their imperfections strongly resemble
those observed in human language users.
Let me close, however, with an obvious caveat. None
of the work described here qualifies as a full model of language use. The
range of phenomena illustrated is suggestive, but limited. As any linguist
will note, there are many, many questions which remain unanswered. The models
are also disembodied in a way which makes it difficult to capture any natural
semantic relationship with the world. These networks are exclusively
language processors, and their language use is unconnected with any ecologically
plausible activity. Finally, and related to the prior point, the view of language
use in these networks is deficient in that it is solely reactive. These
networks are input/output devices. Given an input, they produce the output
which their training has made appropriate. The networks are thus tightly
coupled with the world in a manner which leaves little room for endogenously
generated activity. There is no possibility here for either spontaneous
speech or for reflective internal language. Put most bluntly, these are
networks that do not think!
These same criticisms may be levelled, of course, at
many other current and more traditional models of language, so they should
not be taken as inherent deficiencies of the approach. Indeed, I suspect
that the view of linguistic behavior as deriving from a dynamical system
probably allows for greater opportunities for remedying these shortcomings.
One exciting approach involves embedding such networks in environments in
which their activity is subject to evolutionary pressure, and viewing them
as examples of artificial life (e.g., Nolfi, Elman, & Parisi, in press).
But in any event, it is obvious that much remains to be done.
Guidelines for Further Reading
A good collection of recent work in connectionist models
of language may be found in N. Sharkey (Ed.), Connectionist Natural Language
Processing: Readings from Connection Science. Oxford: Intellect. The initial
description of simple recurrent networks appears in Elman, J.L. (1990).
Finding structure in time. Cognitive Science, 14, 179-211. Additional studies
with SRNs are reported in Cleeremans, A., Servan-Schreiber, D., & McClelland,
J.L. (1989). Finite state automata and simple recurrent networks. Neural
Computation, 1, 372-381; and in Elman, J.L. (1991). Distributed representations,
simple recurrent networks, and grammatical structure. Machine Learning,
7, 195-225. A discussion of recurrent networks as dynamical systems is found
in Pollack, J.B. (1990). The induction of dynamical recognizers. Machine
Learning, 7, 227-252.
References
Agre, P.E., & Chapman, D. (1987). Pengi: An implementation
of a theory of activity. In Proceedings of the AAAI-87. Los Altos, CA: Morgan
Kaufmann.
Altmann, G.T.M. & Steedman, M.J. (1988). Interaction
with context during human sentence processing. Cognition, 30, 191-238.
Bever, T. (1970). The cognitive basis for linguistic
structure. In J.R. Hayes (Ed.), Cognition and the development of language.
New York: Wiley.
Blaubergs, M.S. & Braine, M.D.S. (1974). Short-term
memory limitations on decoding self-embedded sentences. Journal of Experimental
Psychology, 102, No.4, 745-748.
Brooks, R.A. (1989). A robot that walks: Emergent behaviors
from a carefully evolved network. Neural Computation, 1, 253-262.
Browman, C.P., & Goldstein, L. (1985). Dynamic
modeling of phonetic structure. In V. Fromken (Ed.), Phonetic linguistics.
New York: Academic Press.
Chomsky, N. (1975). Reflections on Language. New York:
Pantheon.
Churchland, P.S., & Sejnowski, T.J. (1992). The
computational brain. Cambridge, MA: MIT Press.
Elman, J.L. (1990). Finding structure in time. Cognitive
Science, 14, 179-211.
Elman, J.L., (1991a). Representation and structure
in connectionist models. In Gerald Altmann (Ed.), Computational and psycholinguistic
approaches to speech processing. New York: Academic Press.
Elman, J.L. (1991b). Distributed representations, simple
recurrent networks, and grammatical structure. Machine Learning, 7, 195-225.
Ferreira, F. & Henderson, J.M. (1990). The use
of verb information in syntactic parsing: A comparison of evidence from
eye movements and word-by-word self-paced reading. Journal of Experimental
Psychology: Learning, Memory and Cognition, 16, 555-568.
Fodor, J. (1976). The language of thought. Sussex:
Harvester Press.
Fodor, J., & Pylyshyn, Z. (1988). Connectionism
and cognitive architecture: A critical analysis. In S. Pinker & J. Mueller
(Eds.) Connections and symbols. Cambridge, MA: MIT Press.
Forster, K. (1976). Accessing the mental lexicon. In
R.J. Wales & E. Walker (Eds.), New approaches to language mechanisms.
Amsterdam: North-Holland.
Fowler, C. (1977). Timing and control in speech production.
Bloomington, IN: Indiana University Linguistics Club.
Fowler, C. (1980). Coarticulation and theories of extrinsic
timing control. Journal of Phonetics, 8, 113-133.
Frazier, L. (1987). Sentence processing: A tutorial
review. In M. Coltheart (Ed.), Attention and Performance XII: The psychology
of reading. Hillsdale, NJ: Erlbaum.
Frazier, L., & Rayner, K. (1982). Making and correcting
errors during sentence comprehension: Eye movements in the analysis of structurally
ambiguous sentences. Cognitive Psychology, 14, 178-210.
Harris, C. & Elman, J.L. (1989). Representing variable
information with simple recurrent networks. In Proceedings of the Tenth
Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Jordan, M. I. (1986). Serial order: A parallel distributed
processing approach. Institute for Cognitive Science Report 8604. University
of California, San Diego.
Jordan, M.I., & Rosenbaum, D.A. (1988). Action.
Technical Report 88-26. Department of Computer Science, University of Massachusetts
at Amherst.
Kelso, J.A.S., Saltzman, E., & Tuller, B. (1986).
The dynamical theory of speech production: Data and theory. Journal of Phonetics,
14, 29-60.
King, J & Just, M.A. (1991). Individual differences
in syntactic processing: the role of working memory. Journal of Memory and
Language, 30, 580-602.
Langacker, R.W. (1987). Foundations of cognitive grammar:
Theoretical perspectives. Volume 1. Stanford: Stanford University Press.
MacNeilage, P.F. (1970). Motor control of serial ordering
of speech. Psychological Review, 77, 182-196.
Marslen-Wilson, W.D. (1980). Speech understanding as
a psychological process. In J.C. Simon (Ed.), Spoken language understanding
and generation. Dordrecht: Reidel.
McClelland, J.L., & Elman, J.L. (1986). The TRACE
model of speech perception. Cognitive Psychology, 18, 1-86.
McClelland, J.L., & Rumelhart, D.E. (1981). An
interactive activation model of context effects in letter perception: Part
1. An account of basic findings. Psychological Review, 88, 365-407.
Miller, G.A., & Chomsky, N. (1963). Finitary models
of language users. In R.D. Luce, R.R. Bush, & E. Galanter (Eds.), Handbook
of mathematical psychology (Vol. II). New York: Wiley.
Miller, G. & Isard, S. (1964). Free recall of self-embedded
English sentences. Information and Control, 7, 292-303.
Morton, J. (1979). Word recognition. In J. Morton &
J.C. Marshall (Eds.), Psycholinguistics 2: Structures and processes. Cambridge,
MA: MIT Press.
Mozer, M.C. (1989). A focused back-propagation algorithm
for temporal pattern recognition. Complex Systems, 3, 49-81.
Nolfi, S., Elman, J.L., & Parisi, D. (in press).
Learning and evolution in neural networks. Adaptive Behavior.
Pearlmutter, B.A. (1989). Learning state space trajectories
in recurrent neural networks. Proceedings of the International Joint Conference
on Neural Networks, Washington, D.C., II-365.
Pierrehumbert, J.B., & Pierrehumbert, R.T. (1990).
On attributing grammars to dynamical systems. Journal of Phonetics, 18,
465-477.
Pollack, J.B. (1990). The induction of dynamical recognizers.
Machine Learning, 7, 227-252.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J.
(1986). Learning internal representations by error propagation. In D.E.
Rumelhart & J.L. McClelland (Eds.), Parallel distributed processing:
Explorations in the microstructure of cognition (Vol. 1). Cambridge, MA:
MIT Press.
Selfridge, O.G. (1958). Pandemonium: A paradigm for
learning. Mechanisation of thought processes: Proceedings of a symposium
held at the National Physical Laboratory, November 1958. London: HMSO.
Siegelmann, H.T., & Sontag, E.D. (1992). Neural
networks with real weights: Analog computational complexity. Report SYCON-92-05.
Rutgers Center for Systems and Control, Rutgers University.
Simon, H. (1980). Physical symbol systems. Cognitive
Science, 4, 135-183.
St. John, M. F. (1992). The story gestalt: A model
of knowledge-intensive processes in text comprehension. Cognitive Science,
16, 271-306.
St. John, M., & McClelland, J.L. (1990). Learning
and applying contextual constraints in sentence comprehension. Artificial
Intelligence, 46, 217-257.
Taraban, R., & McClelland, J.L. (1988). Constituent
attachment and thematic role expectations. Journal of Memory and Language,
27, 597-632.
Trueswell, J.C., Tanenhaus, M.K., & Kello, C. (in
press). Verb-specific constraints in sentence processing: Separating effects
of lexical preference from garden-paths. Journal of Experimental Psychology:
Learning, Memory and Cognition.
Van Gelder, T. (1990). A connectionist variation on
a classical theme. Cognitive Science, 14, 355-384.
Weckerly, J., & Elman, J.L. (1992). A PDP approach
to processing center-embedded sentences. Proceedings of the Fourteenth Annual
Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Wiles, J., & Bloesch, A. (1992). Operators and
curried functions: Training and analysis of simple recurrent networks. In
J.E. Moody, S.J. Hanson, & R.P. Lippman (Eds.), Advances in Neural Information
Processing Systems 4. San Mateo, CA: Morgan Kaufmann.