Spring 2025 AI Alignment Fellowship
AISA is excited to announce its Spring 2025 AI Alignment Fellowship, an introductory program open to all members of the BU community, including students, alumni, and staff. This 8-week fellowship will explore crucial technical aspects of AI alignment through weekly 90-minute meetings, held on Saturdays at 4:00 PM in CAS 310, beginning February 8, 2025.
Focus: The fellowship will cover key topics such as Reinforcement Learning from Human Feedback (RLHF), mechanistic interpretability, scalable oversight, and robustness.
Outcomes: Participants will gain a foundational understanding of AI alignment, its purpose, and promising approaches in the field.
Flexibility: All meetings are self-contained, with no mandatory preparation between sessions.
Accessibility: No prior experience is required, though a technical or STEM background may be beneficial.
This fellowship is adapted from BlueDot's well-regarded AI Alignment Course [1]. It offers an excellent opportunity for the BU community to engage with this critical area of research. The curriculum below is subject to change; content updates will be posted as soon as possible.
Curriculum Resources
Week 0: Introduction to Alignment
Miles, "Intro to AI Safety, Remastered" [1]
BlueDot, "What is AI Alignment?" [2]
Christiano, "What Failure Looks Like" [3]
Week 1: Reinforcement Learning from Human Feedback (RLHF) I
Rational Animations, "The True Story of How GPT-2 Became Maximally Lewd" [7]
Lambert, Castricato, von Werra, et al., "Illustrating Reinforcement Learning from Human Feedback (RLHF)" [8]
Optional:
Christiano, "Thoughts on the impact of RLHF research" [9]
Zou, Wang, Carlini, et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" [10]
Rando, Tramèr, "Universal Jailbreak Backdoors from Poisoned Human Feedback" [11]
Volkov, "Badllama 3: removing safety finetuning from Llama 3 in minutes" [12]
Week 2: Reinforcement Learning from Human Feedback (RLHF) II
Bai, Kadavath, Kundu, et al., "Constitutional AI: Harmlessness from AI Feedback" [13]
Casper, Davies, et al., "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback" [14]
Optional:
Steenbrugge, "An introduction to Policy Gradient methods" [15]
Jaques, "RLHF: How to Learn from Human Feedback with Reinforcement Learning" [16]
Stiennon, Ouyang, Wu, et al., "Learning to summarize from human feedback" [17]
Christiano, Leike, Brown, et al., "Deep Reinforcement Learning from Human Preferences" [18]
Ziegler, Stiennon, Wu, et al., "Fine-Tuning Language Models from Human Preferences" [19]
Kantrowitz, "The Horrific Content a Kenyan Worker Had to See While Training ChatGPT" [20]
Rafailov, Sharma, Mitchell, et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" [21]
Huyen, "RLHF: Reinforcement Learning from Human Feedback" [22]
Wolfe, "RLAIF: Reinforcement Learning from AI Feedback" [23]
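The RLHF readings above center on fitting a reward model to human preference comparisons. As a taste of the core idea before the sessions, here is a minimal, self-contained sketch (not taken from any listed paper; the numbers are illustrative) of the pairwise Bradley-Terry loss such reward models are typically trained with:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry) reward-model loss used in RLHF:
    -log sigmoid(r_chosen - r_rejected).  The loss shrinks as the
    model scores the human-preferred completion higher than the
    rejected one, and equals log(2) when it ranks them equally."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In a real pipeline the two rewards would come from a neural network scoring two completions of the same prompt; here they are plain numbers for illustration.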
Week 3: Scalable Oversight I
Jones, "Can we scale human feedback for complex AI tasks?" [24]
Christiano, Amodei, and Shlegeris, "Supervising strong learners by amplifying weak experts" [25]
Optional:
Bowman, "Measuring Progress on Scalable Oversight for Large Language Models" [26]
Hubinger, "An overview of 11 proposals for building safe advanced AI" [27]
Ought, "Factored cognition" [28]
Wu, Lowe, and Leike, "Summarizing Books with Human Feedback" [29]
Miles, "How to Keep Improving When You're Better Than Any Teacher – Iterated Distillation and Amplification" [30]
Leike, "Scalable agent alignment via reward modeling" [31]
Bai, Kadavath, Kundu, et al., "Constitutional AI: Harmlessness from AI Feedback" [32]
Armstrong, "Humans can be assigned any values whatsoever" [33]
Week 4: Scalable Oversight II
Irving, Christiano, and Amodei, "AI safety via debate" [34]
Burns, Izmailov, Kirchner, et al., "Weak-to-strong generalization: Eliciting Strong Capabilities With Weak Supervision" [35]
Optional:
Hadfield-Menell, Dragan, Abbeel, et al., "Cooperative inverse reinforcement learning" [36]
Wei and Zhou, "Language Models Perform Reasoning via Chain of Thought" (Google AI Blog) [37]
Zhou, Schärli, Hou, et al., "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models" [38]
Hubinger, "AI safety via market making" [39]
Ramé, Vieillard, Hussenot, et al., "WARM: On the Benefits of Weight Averaged Reward Models" [40]
Wei, Huang, Lu, et al., "Simple synthetic data reduces sycophancy in large language models" [41]
Go, Korbak, Kruszewski, et al., "Compositional preference models for aligning LMs" [42]
Khan, Hughes, Valentine, et al., "Debating with More Persuasive LLMs Leads to More Truthful Answers" [43]
Week 5: Robustness, Unlearning, and Control I
Pârcălăbescu, "Adversarial Machine Learning explained with examples" [44]
Zou, Wang, Carlini, et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" [45]
Casper, "Deep Forgetting & Unlearning for Safely-Scoped LLMs" [46]
Optional:
Carlini, "Adversarial Machine Learning Reading List" [47]
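Several of the robustness readings build on the fast-gradient-sign idea: nudge each input coordinate in the direction that most increases the model's output. Here is a toy sketch with made-up numbers, using a linear score so the input-gradient is simply the weight vector (real attacks on language models, like the universal adversarial suffixes paper listed above, operate on discrete tokens and are far more involved):

```python
def fgsm_step(x, w, eps):
    """One FGSM-style step on a linear score f(x) = sum(w_i * x_i).
    For a linear model the gradient of f with respect to x is just w,
    so each coordinate moves by eps in the sign of its weight,
    which increases the score."""
    sign = lambda v: (v > 0) - (v < 0)
    return [xi + eps * sign(wi) for xi, wi in zip(x, w)]

# Hypothetical weights and input, purely for illustration.
w = [0.5, -1.0, 2.0]
x = [1.0, 1.0, 1.0]
x_adv = fgsm_step(x, w, 0.1)  # perturbed input with a higher score than x
```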
Week 6: Robustness, Unlearning, and Control II
Li, Pan, et al., "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning" [48]
Greenblatt, Shlegeris, Sachan, et al., "AI Control: Improving Safety Despite Intentional Subversion" [49]
Optional:
Hendrycks, "CAIS: Adversarial Robustness Introduction" [50]
Week 7: Mechanistic Interpretability I
Hastings-Woodhouse, "Introduction to Mechanistic Interpretability" [51]
Olah, Cammarata, Schubert, et al., "Zoom In: An Introduction to Circuits" [52]
Alexander, "Let's Try To Understand AI Monosemanticity" [53]
Optional:
Nanda, "Concrete Steps to Get Started in Transformer Mechanistic Interpretability" [56]
McDougall, "ARENA Curriculum: Transformer Interpretability" [57]
Elhage, Hume, Olsson, et al., "Toy models of superposition" [58]
Greenblatt, Nanda, Shlegeris, et al., "How useful is mechanistic interpretability?" [59]
Bricken, Templeton, Batson, et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" [60]
Wang, Variengien, Conmy, et al., "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small" [61]
Goh, Cammarata, Voss, et al., "Multimodal Neurons in Artificial Neural Networks" [62]
Olsson, Elhage, Nanda, et al., "In-context learning and induction heads" [63]
Templeton, Conerly, et al., "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" [64]
Nanda, Doodles, Filan, "What is mechanistic interpretability?" [65]
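A recurring theme in these readings, especially the toy-models and monosemanticity papers, is superposition: a network packing more features than it has dimensions into non-orthogonal directions, so reading one feature off a layer picks up interference from others. A minimal illustration with hypothetical directions (the vectors below are invented for this sketch):

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Two hypothetical feature directions in a 2-dimensional activation
# space.  They are not orthogonal, which is the hallmark of superposition.
feature_1 = [1.0, 0.0]
feature_2 = [0.8, 0.6]

# Activate only feature 2, then try to read feature 1 back out with a
# linear probe (a dot product against feature 1's direction):
activation = feature_2
readout_f1 = dot(activation, feature_1)  # nonzero: interference from feature 2
```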
Week 8: Mechanistic Interpretability II
Nanda, "A Longlist of Theories of Impact for Interpretability" [66]
Segerie, "Against Almost Every Theory of Impact of Interpretability" [67]
Optional:
Meng, Bau, Andonian, et al., "ROME: Locating and Editing Factual Associations in GPT" [68]
Hoogland, Gietelink, Murfet, et al., "Towards Developmental Interpretability" [69]
Bolukbasi, Pearce, Yuan, et al., "An Interpretability Illusion for BERT" [70]
Garriga-Alonso, Goldowsky-Dill, Greenblatt, et al., "Causal Scrubbing: a method for rigorously testing interpretability hypotheses" [71]
Christiano, Cotra, Xu, "Eliciting latent knowledge" [72]
Christiano, "Mechanistic anomaly detection and ELK" [73]
Burns, Ye, Klein, et al., "Discovering Latent Knowledge in Language Models Without Supervision" [74]
Li, Patel, et al., "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model" [75]
Li, "Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task" [76]
Casper, "Robust Feature-Level Adversaries are Interpretability Tools" [77]
Olah, Wiblin, Harris, "Chris Olah on what the hell is going on inside neural networks" [78]