Linguistics 251: Probabilistic Methods in Linguistics (Fall 2012)
1 Instructor info
| Instructor | Roger Levy (rlevy@ucsd.edu) |
| Office | Applied Physics & Math (AP&M) 4220 |
| Office hours | Wednesdays and Thursdays 10-11am (subject to change) |
| Class time | TuTh 2:00-3:50pm (in general, Tuesdays 3:00-3:50pm will be practicum time for the first part of the quarter) |
| Class location | AP&M 4301 |
| Class webpage | http://grammar.ucsd.edu/courses/lign251/ |
2 Course Description
This course is about probabilistic approaches to language knowledge, acquisition, and use. Today, studying language from a probabilistic perspective requires mastery of the fundamentals of probability and statistics, as well as familiarity with more recent developments in probabilistic modeling. In this course we'll move quickly through basic probability theory, then cover fundamental ideas in statistics: parameter estimation and hypothesis testing. We'll then cover a fundamental class of probabilistic models, the linear model, which as a side effect will familiarize you with the most widely used tools in statistics: linear regression, analysis of variance (ANOVA), and generalized linear models (including logistic regression). We'll cover these topics using both frequentist methods (what you need in order to write publishable data analyses) and Bayesian methods (which are becoming increasingly popular in all sorts of settings, especially in cognitive modeling of language). We'll then move on to the more advanced topic of hierarchical (a.k.a. multilevel or mixed-effects) modeling, and perhaps even a bit of probabilistic grammars if time permits.
The course will involve a hands-on approach to data and modeling, and we'll be using the open-source R programming language (and a bit of JAGS, which interfaces nicely with R, for Bayesian modeling). You'll learn the basics of data visualization and statistical analysis in R, and the class will include periodic programming practica to ensure that your R programming questions are adequately addressed. Transcripts of the programming practica will also be put up online. I encourage you to download R here as soon as you can, get it running on your own computer, and go through the R tutorial found in Chapter 1 of Harald Baayen's book, or this hands-on introduction to R. You can also download JAGS here.
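To give you a taste of the kind of hands-on work the practica will involve, here is a minimal sketch of loading data, summarizing it, visualizing it, and running a simple frequentist hypothesis test in R. It uses R's built-in `sleep` dataset purely for illustration, not any course dataset:

```r
# Illustrative only: uses R's built-in `sleep` dataset (extra hours of
# sleep under two drug conditions), not data from this course.
data(sleep)

# Descriptive statistics for the response variable
summary(sleep$extra)

# A quick visualization of its distribution
hist(sleep$extra, main = "Extra sleep (hours)", xlab = "extra")

# A simple frequentist hypothesis test: do the two groups differ?
t.test(extra ~ group, data = sleep)
```

Once R is installed, you can paste these lines into the R console one at a time; we'll build up from simple commands like these to full regression analyses over the quarter.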
3 Target audience
The course assumes no expertise in linguistics, quantitative methods, or programming, but background in one or more of these areas will be useful. We'll start from elementary probability theory and build up briskly. We will make a fair amount of use of high-school algebra, and also a bit of calculus and linear algebra; there's an appendix in the book that provides you with what you need for the latter two.
4 Reading material
The main reading material will be draft chapters of a textbook-in-progress, Probabilistic Models in the Study of Language, that I am writing. These draft chapters can be found here. There are also a number of other reference texts that may be of use in the course, including:
- Harald Baayen's book: Analyzing Linguistic Data. A Practical Introduction to Statistics. Cambridge University Press. Available online here
- Shravan Vasishth's book draft: The foundations of statistics: A simulation-based approach (free download)
- Keith Johnson's book on quantitative methods in linguistics ($40 on Amazon; no longer available as a free download)
- David MacKay's Information Theory, Inference, and Learning Algorithms – a great text available freely online
- Manning & Schuetze's Foundations of Statistical Natural Language Processing, available online through UC Libraries here.
- Jurafsky & Martin's Speech and Language Processing
- Christopher Bishop's Pattern Recognition and Machine Learning
- Gelman, Carlin, Stern, and Rubin's Bayesian Data Analysis
- John Rice's Mathematical Statistics and Data Analysis – a good general book for introductory statistics (mostly classical).
- Brian Roark and Richard Sproat's Computational Approaches to Morphology and Syntax for the Probabilistic Grammar Formalisms section
Finally, we may supplement these with additional readings, both from statistics/NLP texts and pertinent linguistics articles.
5 Syllabus
| Week | Day | Topic | Reading | Materials | R practicum? | Homework Assignments |
|---|---|---|---|---|---|---|
| Week 0 | 27 Sep | Introduction and motivating material; fundamentals, conditional probability, Bayes' rule, discrete random variables | Chapter 2.1-2.5 | Intro/Motivation Slides; Lect. 1 slides | | Homework 1 |
| Week 1 | 2 Oct | Continuous random variables; the uniform distribution; expectation and variance; the normal distribution | Chapter 2.6-2.10 | Lecture 2 | Yes! Transcript | Homework 2 |
| | 4 Oct | Estimating probability densities | Chapter 2.11 | Lecture 3 | | Homework 3; Peterson & Barney dataset |
| Week 2 | 9 Oct | Joint probability distributions; marginalization; introduction to graphical models | Chapter 3.1-3.2, Appendix C.1-C.2, 3.3.1, 3.4 | Lecture 4 | Yes! Transcript | |
| | 11 Oct | Covariance, correlation, linearity of expectation; the binomial distribution | Chapter 4.1-4.3 | Lecture 5 | Ad-hoc practicum transcript | |
| Week 3 | 16 Oct | Intro to parameter estimation; consistency, bias, variance; maximum likelihood; Bayesian parameter & density estimation | Chapter 4.4-4.5 | Lecture 6 | | Homework 4 |
| | 18 Oct | Bayesian confidence intervals and hypothesis testing | Chapter 5.1-5.2 | Lecture 7 | Yes! Transcript; raw R code | |
| Week 4 | 23 Oct | Bayesian confidence intervals and hypothesis testing II | Chapter 5.2 | | | Homework 5; spillover word RTs file |
| | 25 Oct | Frequentist confidence intervals and hypothesis testing | Chapter 5.3-5.4 | | Yes! Transcript; raw R code | |
| Week 5 | 30 Oct | Intro to generalized linear models: linear models (incl. covariance, correlation, multivariate normal distribution) | Chapter 6.1-6.2 | Lecture 10 | | |
| | 1 Nov | Linear models II | Chapter 6.3-6.5 | Lecture 11 | Yes! Transcript; raw R code | |
| Week 6 | 6 Nov | Linear models III | Chapter 6.6 | Lecture 12 | Yes! Raw R code; norms dataset | Homework 6; elp.txt; ELP readme |
| | 8 Nov | Finish up linear models; logistic regression I | Chapter 6.7 | Lecture 13 | Yes! Raw R code | |
| Week 7 | 13 Nov | Roger out of town, no class | | | | |
| | 15 Nov | Logistic regression II | Chapter 6.8-6.9 | Lecture 14 | | |
| Week 8 | 20 Nov | Hierarchical models I | Chapter 8.1-8.2 | Lecture 15 | Yes! Raw R code | Homework 7 |
| | 22 Nov | Thanksgiving, no class | | | Yes! | |
| Week 9 | 27 Nov | Hierarchical models II | Chapter 8.3 | Lecture 16 | Yes! Files for practicum | |
| | 29 Nov | Hierarchical models III | Chapter 8.4 | Lecture 17 | Yes! Files for practicum | |
| Week 10 | 4 Dec | Estimating n-gram language models | Chapter 4.6 (to appear) | Lecture 18 | | |
| | 6 Dec | Probabilistic grammars | Chapter 10 | Lecture 19 | | |
| Finals | 11 Dec | Final projects due! | | | | |
6 Requirements
If you are taking the course for credit, there are four things expected of you:
- Regular attendance in class.
- Doing the assigned readings and coming ready to discuss them in class.
- Doing several homework assignments, to be assigned throughout the quarter. Email submission of the homework assignments is encouraged, but please send them to lign251-homework@ling.ucsd.edu instead of to me directly; if you send them to me directly I may lose track of them. You can find some guidelines on writing good homework assignments here; the source file for that PDF is here.
- A final project which will involve computational modeling and/or data analysis in some area relevant to the course. Final project guidelines are here.
7 Mailing List
There is a mailing list for this class, lign251-l@mailman.ucsd.edu. Please make sure you're subscribed to the mailing list by filling out the form at https://mailman.ucsd.edu/mailman/listinfo/lign251-l! We'll use it to communicate with each other.
8 Programming help
For this class I'll be maintaining an FAQ, which you can read here.
I also run the R-lang mailing list. I suggest that you subscribe to it; it's a low-traffic list and is a good clearinghouse for technical and conceptual issues that arise in the statistical analysis of language data.
In addition, the searchable R mailing lists are likely to be useful.