CS 11-737: Multilingual Natural Language Processing

Course Description

CS 11-737 is an advanced graduate-level course on natural language processing techniques applicable to many languages. Students who take this course should be able to develop linguistically motivated solutions to core and applied NLP tasks for any language. This includes understanding and mitigating the difficulties posed by lack of data in low-resourced languages or language varieties, and the necessity to model particular properties of the language of interest such as complex morphology or syntax. The course will introduce modeling solutions to these issues such as multilingual or cross-lingual methods, linguistically informed NLP models, and methods for effectively bootstrapping systems with limited data or human intervention. The project work will involve building an end-to-end NLP pipeline in a language you don’t know.

Instructor

Lei Li (Office Hour: GHC 6403, book a slot here)

Teaching Assistants

TA Mailing list: cs11-737-fa2023-tas@cs.cmu.edu

Time and Location

Tuesday and Thursday, 2-3:20pm, DH 1212

Prerequisites

You must have previously taken an NLP course (11-411, 11-611, or 11-711) and a Deep Learning course (11-685 or 11-785). The assignments for the class involve building neural network models, and examples will be provided in PyTorch. If you are not familiar with PyTorch, we suggest you work through online tutorials (for example, Deep Learning for NLP with PyTorch) before starting the class.

Class Format

For each class there will be:

Homework Submission & Grading

Please submit your homework on Canvas. Each assignment will be given a grade of A+ (100), A (96), A- (92), B+ (88), B (85), B- (82), or below. Final grades will be determined by the weighted average of discussion participation, assignments, and the project. Cutoffs for final grades will be approximately 97+ A+, 93+ A, 90+ A-, 87+ B+, 83+ B, 80+ B-, etc., although we reserve some flexibility to change these thresholds slightly. The details of the assignments are elaborated on the assignments page.

Discussion Forum

We will use the Ed platform for discussions (sign up here), but emailing the TA mailing list and coming to office hours are also encouraged.

Policy

Please read the following link carefully!

Syllabus

#    Date    Topic                                           Reading    Homework
1    8/29    Class Introduction
2    8/31    Sequence Labeling
3    9/5     Typology: The Space of Languages
4    9/7     Translation and Translation Data
5    9/12    Translation Models
6    9/14    Data-driven Strategies for NMT
7    9/19    Language Contact and Change
8    9/21    Multilingual Training and Transfer
9    9/26    Unsupervised Machine Translation
10   9/28    Code Switching, Pidgins, Creoles
11   10/3    Multilingual Question Answering
12   10/5    Speech
13   10/10   Automatic Speech Recognition
14   10/12   Sequence-to-sequence Speech Recognition
-    10/17   Fall Break
-    10/19   Fall Break
15   10/24   Text-to-speech
16   10/26   Multilingual ASR and TTS
17   10/31   Morphological Analysis and Inflection
18   11/2    Syntax and Parsing
-    11/7    Democracy Day Holiday
19   11/9    Data Annotation
20   11/14   Active Learning
21   11/16   The LORELEI Project
22   11/21   Guest Lecture - Alon Lavie and Craig Stewart
-    11/23   Thanksgiving, no classes
23   11/28   Guest Lecture - Heng Ji
24   11/30   Guest Lecture - Jonathan Amith
-    12/5    Poster Presentations
-    12/7    Poster Presentations