17-649: Artificial Intelligence for Software Engineering

Dates: Spring 2019, Tuesday/Thursday 1:30-2:50 PM
Instructor(s): Travis Breaux and Jaspreet Bhatia


Advances in artificial intelligence (AI) and machine learning (ML) offer new opportunities in software engineering to explore the design space and improve software quality. These include discovering interactions among natural language requirements, prioritizing feature requests, and finding and fixing bugs. Consequently, software engineers must take on the role of data scientist, which entails curating datasets, understanding the trade-offs among statistical models, and learning to evaluate those models. This course introduces students to advances in natural language processing (NLP), including symbolic and statistical NLP techniques, and in deep learning for analyzing software artifacts. The course emphasizes algorithm setup and configuration, data preparation, analytic workflow, and evaluation. Datasets will be drawn from industrial requirements, mobile app reviews, bug reports, and source code with documented vulnerabilities. By the end of the course, students will understand the terminology and have hands-on experience to guide their decisions in applying AI to contemporary engineering problems.





The final course grade comprises the following components:


Lecture | Description | Exercises
1 Course Overview
  • Course objectives, organization, and grading
  • Datasets: how to download, organization, and schema
2 Text Normalization
  • Stop lists, stemming, lemmatization
  • Morphology
  • Gensim and NLTK for text normalization
Readings: Jurafsky 3.1, 3.8-3.9
TF-IDF and bug report similarity, assigned
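As a minimal illustration of lecture 2's normalization steps, the sketch below uses only the Python standard library (the course itself uses Gensim and NLTK); the stop list and suffix-stripping rules are toy assumptions, not a real Porter stemmer:

```python
import re

# A toy stop list; the NLTK stopwords corpus would be used in practice.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}

def naive_stem(token):
    """Strip a few common suffixes -- a crude stand-in for Porter stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(normalize("The crashing bug is reported in the logs"))
```

NLTK's `PorterStemmer` and `WordNetLemmatizer` replace `naive_stem` in the actual exercises; lemmatization additionally uses morphology to map, e.g., "reported" to "report" via a dictionary rather than suffix rules.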
3 Information Retrieval
  • Document similarity with TF-IDF and BM25F
  • Precision, recall, F1-score, ROC curves
Readings: Jurafsky 23.1
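To make lecture 3's TF-IDF document similarity concrete, here is a stdlib-only sketch; the tokenized bug reports are invented examples, and the exercises would use a library implementation rather than this hand-rolled one:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """TF-IDF weights per tokenized document: (tf / doc length) * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (count / len(doc)) * math.log(n / df[t])
                        for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy bug reports: the first two describe the same crash, the third does not.
reports = [
    ["app", "crashes", "on", "login"],
    ["app", "crashes", "on", "startup"],
    ["add", "dark", "mode"],
]
vecs = tf_idf_vectors(reports)
```

Ranking candidate duplicates by `cosine` against a new report is the core of the bug-report-similarity exercise; BM25F replaces the TF-IDF weighting with a length-normalized, field-weighted score.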
4 Clustering
  • Topic models with latent Dirichlet allocation (LDA), and hierarchical agglomerative clustering
  • Gensim for LDA
Optional Readings: Blei et al., 2003
LDA and bug report similarity, assigned
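Lecture 4's hierarchical agglomerative clustering can be sketched in a few lines of stdlib Python; the 1-D points and distance function below are toy assumptions standing in for document vectors:

```python
def agglomerative(points, dist, k):
    """Single-linkage hierarchical agglomerative clustering down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Merge the pair of clusters with the smallest single-link distance.
        _, i, j = min(
            (min(dist(a, b) for a in ci for b in cj), i, j)
            for i, ci in enumerate(clusters)
            for j, cj in enumerate(clusters) if i < j)
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Two tight groups and one outlier collapse into three clusters.
result = agglomerative([1, 2, 10, 11, 50], lambda a, b: abs(a - b), 3)
```

LDA, by contrast, is a probabilistic model (each document a mixture of topics, each topic a distribution over words) and is run through Gensim in the exercise rather than implemented by hand.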
5 Topic models, revisited
  • In-class: Reading the Tea Leaves
  • Studies with negative results
Readings: Chang et al., 2009
6 Part-of-Speech Tagging
  • Tagsets and rule-based tagging
  • Probabilistic tagging
  • Transformation-based tagging
Readings: Jurafsky 5
Bug reports, due
Parsing, assigned
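A useful reference point for lecture 6 is the unigram most-frequent-tag baseline, which probabilistic (HMM) and transformation-based taggers must beat. This stdlib sketch uses invented training sentences with Penn-Treebank-style tags:

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sents):
    """Learn each word's single most frequent tag from labeled sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, tokens, default="NN"):
    """Tag each token with its most frequent training tag; unseen words
    fall back to a default tag (a crude rule-based backoff)."""
    return [(t, model.get(t.lower(), default)) for t in tokens]

train = [
    [("the", "DT"), ("build", "NN"), ("fails", "VBZ")],
    [("the", "DT"), ("tests", "NNS"), ("fail", "VBP")],
]
model = train_baseline_tagger(train)
```

NLTK's `pos_tag` provides a trained tagger out of the box; the baseline above only illustrates why context-sensitive methods are needed (e.g., "fail" as noun versus verb).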
7 Syntactic Parsing
  • Constituency parsers
  • Typed dependency parsers
  • Stanford CoreNLP and NLTK
Readings: Jurafsky 12 and 13
8 Corpora and Coding
  • Corpora and manually labeling datasets
  • Coding heuristics
  • In-class: Coding assignment
Coding, assigned
9 Coding Evaluation
  • Cohen's and Fleiss' Kappa, Van Belle's Statistic
  • Crowdsourcing labeling tasks
  • Review: Coding assignment
Coding, due
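Cohen's kappa from lecture 9 is short enough to compute directly; this sketch assumes two coders' label sequences as plain lists (the "bug"/"feature" labels are invented):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders:
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Expected agreement if both coders labeled at random with their own rates.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Fleiss' kappa generalizes this to more than two coders; both drop toward zero when apparent agreement is no better than chance, which is why raw percent agreement is not reported on its own.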
10 Foundations of Ontology
  • Categories and Resemblances
  • Meronymy
  • Ontology and Description Logic
  • Working with OWL and Protege
Ontology, assigned
11 Frame Semantics
  • Case Frames and Scripts
  • Semantic Roles
  • Semantic Similarity
  • Working with FrameNet
12 Sentiment Analysis
  • Affect, sentiment and ethics (Facebook)
  • Sentiment lexicons
  • N-grams, Naive Bayes
  • scikit-learn (sklearn) for Python
Ontology, due
App reviews, assigned
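Lecture 12's Naive Bayes classifier can be sketched without scikit-learn; this version uses unigram features (rather than general n-grams) with add-one smoothing, and the app-review tokens and labels are invented:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial Naive Bayes: per-label token counts, priors, vocabulary."""
    counts = defaultdict(Counter)
    priors = Counter(labels)
    vocab = set()
    for doc, lab in zip(docs, labels):
        counts[lab].update(doc)
        vocab.update(doc)
    return counts, priors, vocab

def classify(model, doc):
    """Pick the label maximizing log P(label) + sum of log P(token | label),
    with add-one (Laplace) smoothing for unseen tokens."""
    counts, priors, vocab = model
    n = sum(priors.values())
    def score(lab):
        total = sum(counts[lab].values())
        s = math.log(priors[lab] / n)
        for tok in doc:
            s += math.log((counts[lab][tok] + 1) / (total + len(vocab)))
        return s
    return max(priors, key=score)

docs = [["love", "this", "app"], ["great", "update"],
        ["crashes", "hate", "it"], ["bug", "crashes"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
```

scikit-learn's `MultinomialNB` with a `CountVectorizer(ngram_range=(1, 2))` is the library route used in the exercises; the log-space arithmetic above is what it computes under the hood.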
13 Supervised Machine Learning, 1
  • Training and test sets, cross-validation
  • Dataset imbalance
  • Confusion matrices
  • Logistic regression
  • Support vector machines
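To ground lecture 13, here is logistic regression fit by stochastic gradient descent in stdlib Python; the 1-D toy dataset, learning rate, and epoch count are illustrative assumptions (the course uses scikit-learn's `LogisticRegression`):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    """Threshold the predicted probability at 0.5."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5)

# Linearly separable toy data: class 0 near 0, class 1 near 10.
X = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(X, y)
```

In practice these predictions would be evaluated on a held-out test set or by cross-validation, never on the training data as in this toy check.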
14 Supervised Machine Learning, 2
  • Decision Trees
  • Random Forests
  • Bias, variance trade-off
  • Boosting and Bagging
  • Feature importance
15 NLP pipeline development
  • Defining a theory, e.g., feature migration and app value
  • Acquiring and preparing a dataset
  • Extracting and clustering data
App reviews, due
16 Reflections on AI evaluation and SE
  • Local versus global models
  • Metrics for model evaluation
  • Natural and programming languages
17 Introduction to Deep Learning
  • Bias, hyperparameters, input, output and hidden layers
  • Activation functions
  • Architectures (FNN, CNN, RNN, LSTM)
Codefix part 1, assigned
18 Deep Learning Pipeline
  • Vanishing and exploding gradient
  • Evaluating deep learning models
  • Overfitting and regularization
19 Word embeddings
  • Working with text data
  • Word2Vec
  • Evaluating word embeddings
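A common sanity check from lecture 10's "evaluating word embeddings" is nearest-neighbor lookup by cosine similarity. The 3-dimensional vectors below are toy stand-ins for learned Word2Vec embeddings, not real model output:

```python
import math

# Toy 3-d vectors standing in for learned Word2Vec embeddings.
vectors = {
    "bug":     [0.9, 0.1, 0.0],
    "defect":  [0.8, 0.2, 0.1],
    "feature": [0.1, 0.9, 0.2],
    "request": [0.2, 0.8, 0.1],
}

def cos(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: math.sqrt(sum(a * a for a in w))
    return dot / (norm(u) * norm(v))

def nearest(word):
    """The word whose vector is closest by cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cos(vectors[word], vectors[w]))
```

With real embeddings (e.g., Gensim's `Word2Vec`, whose `most_similar` does this lookup), near-synonyms like "bug"/"defect" landing as nearest neighbors is evidence the training captured distributional similarity.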
20 Recurrent Neural Networks (RNN)
  • Feed forward and recurrent networks
  • Language models
  • Unfolding computation graphs
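The "unfolded computation graph" view from lecture 20 is just the recurrence applied step by step; this scalar-friendly sketch of a vanilla RNN cell uses toy weight matrices (real models learn them by backpropagation through time):

```python
import math

def rnn_step(x, h, Wx, Wh, b):
    """One step of a vanilla RNN cell: h' = tanh(Wx x + Wh h + b)."""
    return [math.tanh(sum(Wx[i][j] * x[j] for j in range(len(x))) +
                      sum(Wh[i][k] * h[k] for k in range(len(h))) + b[i])
            for i in range(len(b))]

def unroll(xs, h0, Wx, Wh, b):
    """Unfold the recurrence over an input sequence: the same cell (shared
    weights) is applied at every time step, threading the hidden state."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(x, h, Wx, Wh, b)
        states.append(h)
    return states

# 1-dimensional toy weights: input weight 1, recurrent weight 0.5, no bias.
states = unroll([[1.0], [0.0]], [0.0], [[1.0]], [[0.5]], [0.0])
```

Because tanh squashes every hidden value into (-1, 1), repeated application is also where the vanishing-gradient problem of the next lectures comes from.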
21 RNN Encoder-Decoder
  • Sequence-to-sequence models
  • Encoder
  • Decoder
  • Attention
Codefix part 1, due
Codefix part 2, assigned
22 Long Short-Term Memory (LSTM), 1
  • Input, output, forget gates
  • Peephole connections
  • Intro to Anaconda, Jupyter, TensorFlow, Keras
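Before reaching for Keras, the gate arithmetic of lecture 22 can be seen in a scalar stdlib sketch; the parameter layout and all-zero toy parameters are assumptions for illustration only:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, p):
    """One scalar LSTM step. p maps each gate name to (w_x, w_h, bias)."""
    def gate(name, squash):
        wx, wh, b = p[name]
        return squash(wx * x + wh * h + b)
    i = gate("input", sigmoid)    # how much new information to write
    f = gate("forget", sigmoid)   # how much old memory to keep
    o = gate("output", sigmoid)   # how much memory to expose
    g = gate("cand", math.tanh)   # candidate value to write
    c_new = f * c + i * g         # gated update of the cell state
    h_new = o * math.tanh(c_new)  # hidden state exposes gated memory
    return h_new, c_new

# Toy all-zero parameters: every sigmoid gate opens exactly halfway.
params = {name: (0.0, 0.0, 0.0)
          for name in ("input", "forget", "output", "cand")}
h1, c1 = lstm_step(1.0, 0.0, 2.0, params)
```

Peephole connections (also in this lecture) additionally feed the cell state `c` into each gate; Keras's `LSTM` layer vectorizes exactly this arithmetic over whole sequences and batches.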
23 Long Short-Term Memory (LSTM), 2
  • Applications in software traceability, and story point estimation
24 Convolutional Neural Networks
  • Kernel, stride, feature maps
  • Convolutions, filters, and pooling layers
  • Examples from machine vision, NLP and SE
25 Generative Adversarial Networks
Codefix part 2, due
26 Practical advice for tuning deep nets
  • Data augmentation
  • Training
  • Visualization in Tensorflow
  • Ensemble learning
  • Transfer learning
27 Course summary and final exam
Final exam