Research Projects

Detecting & Fixing Bugs Using Code Similarity

It has been shown that developers not only write similar code but also make similar mistakes and fix them in similar ways. As a result, bugs and their corresponding fixes often show repetitive patterns. I design algorithms and build tools that automatically detect and repair code by exploiting such repetitive patterns. In particular, I have studied the following bug categories:

Error Handling Bugs in C. Low-level languages like C often do not provide any error handling primitives and leave it to developers to create their own mechanisms for error propagation and handling. However, developers often make mistakes while writing this repetitive and tedious error handling code and inadvertently introduce bugs. For example, a large number of security flaws in SSL/TLS implementations result from incorrect error handling (e.g., CVE-2014-0092, CVE-2015-0208, CVE-2015-0288, CVE-2015-0285, and CVE-2015-0292). These bugs are often hard to detect and localize with conventional automated bug-finding techniques because they do not exhibit obviously erroneous behaviors (e.g., crashes or assertion failures); instead, they cause subtle inaccuracies that violate the intended security and privacy guarantees of SSL/TLS. It is therefore extremely important to detect and fix these bugs early in the development cycle, to minimize their disastrous side effects when they are unknowingly released to production. To understand the nature of error handling bugs in widely used C programs, I conducted a comprehensive study of real-world post-production error handling bugs and their eventual fixes. Leveraging this knowledge, I then designed, implemented, and evaluated a toolchain [Usenix'16, FSE'17, ASE'16] based on under-constrained symbolic execution and AST analysis that not only detects and characterizes different types of error handling bugs but also automatically fixes them. Several of these bug fixes were adopted by OpenSSL developers and resulted in one CVE. This work received the Best Paper Award at FSE'17.
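To make the bug class concrete, here is a minimal, illustrative sketch. It is not the toolchain described above (which relies on under-constrained symbolic execution and AST analysis) but a much simpler pattern-based scan that flags calls to error-returning C functions whose return value is silently discarded. The list of error-returning APIs and the example snippet are hypothetical.

import re

# Hypothetical list of C functions assumed to return an error code on failure.
ERROR_RETURNING_APIS = {"EVP_VerifyFinal", "X509_verify_cert", "my_ssl_handshake"}

# A call used as a bare statement, e.g. "foo(a, b);", i.e. its result is discarded.
CALL_STMT = re.compile(r"^\s*(\w+)\s*\([^;]*\)\s*;")

def find_ignored_error_returns(c_source):
    """Return (line_number, statement) pairs where an error-returning call is ignored."""
    findings = []
    for lineno, line in enumerate(c_source.splitlines(), start=1):
        m = CALL_STMT.match(line)
        if m and m.group(1) in ERROR_RETURNING_APIS:
            findings.append((lineno, line.strip()))
    return findings

example = """
    ctx = setup();
    my_ssl_handshake(ctx);
    if (X509_verify_cert(store) <= 0)
        goto err;
"""
for lineno, stmt in find_ignored_error_returns(example):
    print(f"line {lineno}: error return possibly ignored -> {stmt}")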



Copy-paste Bugs. To maintain different variants of the same project, developers often port similar features or bug fixes from one system to another. This requires a lot of repeated work, which is tedious and error-prone. To automatically identify such repeated work, we designed Repertoire, a source code change analysis tool that compares the edit contents and the corresponding edit operations of program patches to identify similar changes. Repertoire identifies repetitive changes with 94% precision and 84% recall (see our FSE'12 tool-demo paper).
Using Repertoire, we showed that a significant amount of duplicate work takes place across the BSD product family. In each BSD release, on average, more than twelve thousand lines are ported from peer projects, and more than 25% of active developers participate in cross-system porting (see our FSE'12 paper). The two patches below illustrate the kind of similar changes Repertoire detects.
Patch1
**** Old ****
  for(i=0;i<MAX;i++){
!   x = array[i]+x;
!   y = foo(x);
-   x = x-y;
  }

**** New ****
  for(i=0;i<MAX;i++) {
+   y = x+y;
!   x = array[i]+x;
!   y = foo(x,y);
  }

Patch2
**** Old ****
  for(j=0;j<MAX;j++) {
    q = p + q;
!   q = array[j]+p;
!   p = foo1(q);
  }

**** New ****
  for(j=0;j<MAX;j++) {
    q = p + q;
!   q = array[j] + q;
!   p = foo1(p,q);
  }
These results confirm that the overhead of repetitive changes is non-trivial and suggest that automated tools could help facilitate the process.
Check out the code @ https://github.com/baishakhir/RepertoireTool
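For illustration, the following minimal sketch captures the core idea behind Repertoire: two patches are treated as similar when their edit operations and (normalized) edited contents largely overlap. The tokenization, identifier abstraction, and the 0.8 threshold are simplifications chosen for this sketch, not Repertoire's actual implementation.

import re
from difflib import SequenceMatcher

def edit_fingerprint(patch_lines):
    """Extract (operation, normalized content) pairs for the edited lines of a patch."""
    fingerprint = []
    for line in patch_lines:
        if line[:1] in ("+", "-", "!"):                # edited lines only
            op, content = line[0], line[1:]
            # Abstract identifier names so foo(x,y) / foo1(p,q) style edits still align.
            normalized = re.sub(r"[A-Za-z_]\w*", "ID", content.replace(" ", ""))
            fingerprint.append((op, normalized))
    return fingerprint

def patch_similarity(patch_a, patch_b):
    fa, fb = edit_fingerprint(patch_a), edit_fingerprint(patch_b)
    return SequenceMatcher(None, fa, fb).ratio()

# The edited lines of the two patches shown above.
patch1_new = ["+ y = x+y;", "! x = array[i]+x;", "! y = foo(x,y);"]
patch2_new = ["! q = array[j] + q;", "! p = foo1(p,q);"]

score = patch_similarity(patch1_new, patch2_new)
print(f"similarity = {score:.2f}")
if score >= 0.8:                                       # illustrative threshold
    print("the two patches look like repetitive changes")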
Identifying Repetitive Changes

To implement repetitive changes, developers usually copy code from an existing implementation and then paste and adapt it to fit the new context. An incorrect adaptation often leads to a copy-paste error. Such errors are quite common in practice: in the last three years, 182 copy-paste errors were reported in Linux.
To automatically detect copy-paste errors, we investigated: (1) What are the common types of copy-paste errors? (2) How can they be automatically detected? By analyzing the version histories of FreeBSD and Linux, we found five common types of copy-paste errors, and leveraging this categorization we designed a two-stage analysis technique to detect and characterize copy-paste errors. The first stage of the analysis, SPA, detects and categorizes inconsistencies in repetitive changes based on a static control and data dependence analysis. SPA identifies copy-paste errors with 65% to 73% precision, an improvement of 14 to 17 percentage points over previous tools (see our ASE'13 paper).
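As a simplified illustration of one inconsistency pattern SPA targets (the tool itself relies on static control and data dependence analysis, not the crude positional alignment used here), the sketch below flags an identifier that was not renamed consistently when an edit was ported, echoing the "+p" versus "+q" mismatch visible in the patch example above.

import re
from collections import Counter

IDENT = re.compile(r"[A-Za-z_]\w*")

def identifier_pairs(original_line, ported_line):
    """Pair up identifiers position-by-position between two aligned lines."""
    return list(zip(IDENT.findall(original_line), IDENT.findall(ported_line)))

def find_rename_inconsistencies(original, ported):
    pairs_per_line = [identifier_pairs(o, p) for o, p in zip(original, ported)]
    votes = Counter(pair for pairs in pairs_per_line for pair in pairs)
    # Majority mapping: for each original identifier, its most frequent counterpart.
    mapping = {}
    for (src, dst), _ in votes.most_common():
        mapping.setdefault(src, dst)
    findings = []
    for lineno, pairs in enumerate(pairs_per_line, start=1):
        for src, dst in pairs:
            if mapping[src] != dst:
                findings.append((lineno, src, dst, mapping[src]))
    return findings

original = ["x = array[i]+x;", "y = foo(x);"]
ported   = ["q = array[j]+p;", "p = foo1(q);"]   # '+p' should probably have been '+q'

for lineno, src, seen, expected in find_rename_inconsistencies(original, ported):
    print(f"line {lineno}: '{src}' maps to '{seen}' here but to '{expected}' elsewhere")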
The second stage of the analysis, SPA++, uses the inconsistencies computed by SPA to direct symbolic execution toward program behaviors that are impacted by those inconsistencies. SPA++ then compares these behaviors using logical equivalence checking (implemented with the Z3 theorem prover) and generates test inputs that exercise program paths containing the reported inconsistencies. A case study shows that SPA++ can refine the results reported by SPA and help developers analyze copy-paste inconsistencies. We collaborated with researchers from NASA on this work.
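The equivalence-checking step can be sketched with the Z3 Python bindings (the z3-solver package). The two hand-written expressions below loosely mirror the Patch1/Patch2 bodies; in SPA++ the formulas come from symbolic execution rather than being written by hand.

from z3 import Ints, Function, IntSort, Solver, sat

x, y, elem = Ints("x y elem")                            # elem stands in for array[i]
foo = Function("foo", IntSort(), IntSort(), IntSort())   # uninterpreted stand-in

# Patch1's new body:  x = elem + x;  y = foo(x, y)
x1 = elem + x
y1 = foo(x1, y)

# Patch2's new body, renamed back (q -> x, p -> y):  x = elem + x;  y = foo(y, x)
x2 = elem + x
y2 = foo(y, x2)                                          # note the swapped argument order

s = Solver()
s.add(y1 != y2)          # is there any input on which the two versions disagree?
if s.check() == sat:
    print("behaviors can differ, e.g. for:", s.model())
else:
    print("the two patched versions are equivalent")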




Generic Bugs (using NLP models). Researchers have demonstrated that the predictability of source code written by ordinary developers to solve real-world problems is similar to that of natural language. Such naturalness/predictability has been successfully captured by statistical language models and used to build suggestion engines, porting tools, coding standards checkers, and idiom miners. This suggests that code that appears improbable, or surprising, to a good statistical language model is "unnatural" in some sense, and thus possibly suspicious, i.e., buggy. I investigated this hypothesis in work published at ICSE'16: I considered a large corpus of bug-fix commits from 10 popular, widely used Java projects, built a language model of the source code, and evaluated the naturalness (i.e., entropy) of buggy code and the corresponding fixes as computed by the model. I found that code with bugs tends to be more entropic (i.e., unnatural), becoming less so as the bugs are fixed. This approach can serve as an effective code inspection aid by focusing reviewers' attention on highly entropic lines. I also showed that such entropy-based techniques can complement generic static bug finders such as PMD and FindBugs.
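As a toy illustration of the entropy idea, the sketch below trains a bigram language model with add-one smoothing on a tiny code corpus and ranks the lines of a snippet by bits per token, so the most "unnatural" lines surface first. The tokenizer, corpus, and smoothing are deliberate simplifications; the ICSE'16 work used far richer language models than this.

import math
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_]\w*|\S")

def tokens(line):
    return ["<s>"] + TOKEN.findall(line) + ["</s>"]

def train_bigram(corpus_lines):
    unigrams, bigrams = Counter(), Counter()
    for line in corpus_lines:
        toks = tokens(line)
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def bits_per_token(line, unigrams, bigrams, vocab_size):
    toks = tokens(line)
    logprob = 0.0
    for prev, cur in zip(toks, toks[1:]):
        # Add-one smoothing so unseen bigrams get a small, non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        logprob += math.log2(p)
    return -logprob / (len(toks) - 1)

# Tiny training "corpus"; a real model would be trained on an entire project history.
corpus = ["if (x == null) return;", "if (y == null) return;", "x = x + 1;"]
unigrams, bigrams = train_bigram(corpus)
vocab = len({t for line in corpus for t in tokens(line)})

# Rank the lines of a snippet: the most entropic (least natural) lines come first.
snippet = ["if (x == null) return;", "x = x + 1;", "x = + x null ;"]
for line in sorted(snippet, key=lambda l: -bits_per_token(l, unigrams, bigrams, vocab)):
    print(f"{bits_per_token(line, unigrams, bigrams, vocab):5.2f} bits/token  {line}")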


Differential Testing. In today's open software market, users can choose among multiple software products that provide similar functionality. For example, there is a pool of popular SSL/TLS libraries (e.g., OpenSSL, GnuTLS, NSS, CyaSSL, PolarSSL, and MatrixSSL) for securing network connections against man-in-the-middle attacks. Certificate validation is a crucial part of SSL/TLS connection setup. Although implemented differently, the certificate validation logic of these libraries should follow the SSL/TLS protocol and serve the same purpose: for a given certificate, all of the libraries should either accept it or reject it.
In collaboration with security researchers at the University of Texas at Austin, we designed the first large-scale framework for testing certificate validation logic in SSL/TLS implementations. First, we generated millions of synthetic certificates by randomly mutating parts of real certificates, thereby inducing unusual combinations of extensions and constraints. A correct SSL/TLS implementation should detect and reject such unusual mutants. Next, using a differential testing framework, we checked whether one SSL/TLS implementation accepts a certificate while another rejects it, and used such discrepancies as an oracle for finding flaws in individual implementations. We uncovered 208 discrepancies between popular SSL/TLS implementations, many of which are caused by serious security vulnerabilities (see our S&P'14 paper).
Check out the code @ https://github.com/sumanj/frankencert
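The oracle itself is easy to sketch. In the snippet below, the validate_* functions are hypothetical stand-ins for bindings to the real libraries, and the mutation step is a drastic simplification of the frankencert generator; the point is only to show how disagreement between implementations becomes a test signal.

import random

def mutate_certificate(cert):
    """Randomly perturb one field to induce unusual extension/constraint combinations."""
    mutant = dict(cert)
    field = random.choice(list(mutant))
    if isinstance(mutant[field], bool):
        mutant[field] = not mutant[field]
    elif isinstance(mutant[field], int):
        mutant[field] += random.choice([-10**9, 10**9])   # e.g. shift a validity date
    return mutant

def differential_test(cert, validators, n_mutants=1000):
    discrepancies = []
    for _ in range(n_mutants):
        mutant = mutate_certificate(cert)
        verdicts = {name: fn(mutant) for name, fn in validators.items()}
        if len(set(verdicts.values())) > 1:        # libraries disagree: potential flaw
            discrepancies.append((mutant, verdicts))
    return discrepancies

# Hypothetical validators; real ones would call into the SSL/TLS libraries.
def validate_a(cert): return cert["not_after"] > cert["now"] and cert["ca_flag_ok"]
def validate_b(cert): return cert["not_after"] > cert["now"]   # forgets the CA check

seed_cert = {"now": 1_500_000_000, "not_after": 1_600_000_000, "ca_flag_ok": True}
found = differential_test(seed_cert, {"libA": validate_a, "libB": validate_b})
print(f"{len(found)} discrepancies found")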


Testing Machine Learning-Based Systems

DNN-based autonomous cars, using sensors such as cameras and LiDAR, can drive without human intervention. Many major manufacturers, including Tesla, GM, Ford, BMW, and Waymo/Google, are working on building and testing different types of autonomous vehicles. However, despite their spectacular progress, DNNs, just like traditional software, often demonstrate incorrect or unexpected corner-case behaviors (i.e., bugs) that can lead to potentially fatal collisions. Several real-world accidents involving autonomous cars have already happened, including one that resulted in a fatality. Most existing testing techniques for DNN-driven vehicles depend heavily on the manual collection of test data under different driving conditions, which becomes prohibitively expensive as the number of test conditions increases. In collaboration with Suman Jana and Junfeng Yang from Columbia University and Yinzhi Cao from Lehigh University, we are designing, implementing, and evaluating a systematic testing tool for automatically detecting erroneous behaviors of DNN-driven vehicles. Our tool automatically generates test cases leveraging real-world changes in driving conditions such as rain, fog, and lighting. We systematically explore different parts of the DNN logic by generating test inputs that maximize the number of activated neurons. So far we have found thousands of erroneous behaviors under different realistic driving conditions, many of which could lead to potentially fatal crashes, in the three top-performing DNNs of the Udacity self-driving car challenge (see our ICSE'18 paper).
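The coverage signal that guides test generation can be illustrated in a few lines: count the neurons whose activations exceed a threshold, and keep a transformed input if it activates neurons not seen before. The tiny random network, the 0.25 threshold, and the brightness-only transformation below are purely illustrative stand-ins for real driving DNNs and realistic weather/lighting transformations.

import numpy as np

rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 32)), rng.standard_normal((32, 16))]

def activations(x):
    """Forward pass through a toy fully connected ReLU network."""
    acts = []
    for w in weights:
        x = np.maximum(0.0, x @ w)
        acts.append(x)
    return acts

def covered(x, threshold=0.25):
    """Set of (layer, neuron) pairs activated above the threshold."""
    return {(i, j) for i, layer in enumerate(activations(x))
                   for j in np.flatnonzero(layer > threshold)}

coverage = set()
image = rng.random(64)                         # stand-in for a camera frame
for brightness in np.linspace(0.5, 1.5, 11):   # simple stand-in for lighting changes
    new = covered(np.clip(image * brightness, 0.0, 1.0))
    if new - coverage:                         # this input exercised new neurons: keep it
        coverage |= new
        print(f"brightness {brightness:.1f}: coverage {len(coverage)}/{32 + 16}")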

Natural Language Models for Source Code

Source code often follows code templates, i.e., frequently repeating code structures that differ only in parameters and variable names. This observation is particularly true for app development platforms like Android, where apps depend heavily on the platform APIs. There are only a few valid ways of using these APIs, which results in repetitive code templates. I am investigating a novel language model for such API-driven programming in collaboration with NLP expert Prof. Kai-Wei Chang of UCLA. We are designing a template-based neural language model that predicts the structural properties of code by learning from the templates, while predicting local code properties by learning from the sequential code structure. Android applications also undergo many code transformations, e.g., API updates and obfuscation. Such transformations are likewise repetitive and show similar patterns due to heavy API usage; for example, when an API is updated, all of the applications using that API eventually need to update their code accordingly. I am therefore also leveraging machine translation models to learn from previous transformations; in this case, the models are trained on parallel corpora of original and transformed code versions. I believe that such models have significant potential for a plethora of software engineering applications, including automatic code generation, bug fixing, feature updates, deobfuscation, and malware detection. For instance, I am collaborating with Miltos Allamanis of Microsoft Research (an NLP expert) to develop NLP models that can automatically repair buggy Android applications. Further, I am working with Gail Kaiser and her students at Columbia University, and with Jonathan Bell from George Mason University, to automatically deobfuscate obfuscated APIs.
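As a toy illustration of learning transformations from parallel corpora, the sketch below builds a token-level "translation memory" from before/after snippet pairs and replays the learned rewrite on new code. The alignment is deliberately naive (it only handles single-token, same-length rewrites), and the snippets are made-up examples of an Android-style API migration; the models we are actually building are neural translation and template-based language models.

from collections import Counter

def token_rewrites(before, after):
    """Pairs of differing tokens at the same position (a deliberately naive alignment)."""
    return [(b, a) for b, a in zip(before.split(), after.split()) if b != a]

def learn_rewrite_table(parallel_corpus, min_support=2):
    counts = Counter(pair for b, a in parallel_corpus for pair in token_rewrites(b, a))
    return {src: dst for (src, dst), n in counts.items() if n >= min_support}

def apply_rewrites(code, table):
    return " ".join(table.get(tok, tok) for tok in code.split())

# Made-up before/after pairs, e.g. mined from app version histories.
parallel_corpus = [
    ("view . setBackgroundDrawable ( d )", "view . setBackground ( d )"),
    ("button . setBackgroundDrawable ( icon )", "button . setBackground ( icon )"),
    ("label . setTextColor ( c )", "label . setTextColor ( c )"),   # unchanged pair
]

table = learn_rewrite_table(parallel_corpus)
print(apply_rewrites("imageView . setBackgroundDrawable ( bg )", table))
# prints: imageView . setBackground ( bg )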

 

Analyzing Configuration Parameters of Big-Data Frameworks

Big-data frameworks such as MapReduce and Spark are often so complex that users do not know how to configure them to achieve their desired properties (safety, security, runtime performance, etc.). We are developing a meta-heuristic search technique that explores the configuration space to optimize such qualities, based on three main hypotheses: (1) scale-up: small big-data jobs can serve as low-cost proxies for evaluating objective functions, with significant performance benefits for much larger jobs; (2) scale-out: high-performing configurations found for one class of jobs provide significant benefits for related classes; and (3) centaur effectiveness: humans can profitably guide algorithmic searches for high-performing configurations. Initial results on the Hadoop platform show up to 70% performance improvement. We further plan to develop a domain-specific language and a static analysis framework to express the constraints of configuration spaces. We believe these additional techniques will help find bugs in a configuration space and will further reduce the search space of the meta-heuristic technique. I am collaborating with Prof. Kevin Sullivan of UVa on this project.
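A minimal sketch of the search component is shown below: hill climbing with random restarts over a small configuration grid, where the objective would, in the real setting, be measured by timing a scaled-down proxy job (the scale-up hypothesis). The parameter names and the synthetic run_proxy_job cost model are illustrative stand-ins, not our actual configuration space or measurement harness.

import random

# Hypothetical configuration space (a tiny slice of what Hadoop/Spark expose).
SPACE = {
    "mapreduce.task.io.sort.mb": [64, 128, 256, 512],
    "mapreduce.map.memory.mb": [1024, 2048, 4096],
    "mapreduce.reduce.tasks": [4, 8, 16, 32],
}

def run_proxy_job(cfg):
    """Stand-in for timing a scaled-down job; returns a synthetic runtime in seconds."""
    return (300 / cfg["mapreduce.reduce.tasks"]
            + abs(cfg["mapreduce.task.io.sort.mb"] - 256) * 0.1
            + cfg["mapreduce.map.memory.mb"] / 2048)

def neighbors(cfg):
    """All configurations that differ from cfg in exactly one parameter."""
    for key, values in SPACE.items():
        for v in values:
            if v != cfg[key]:
                yield {**cfg, key: v}

def hill_climb(restarts=5):
    best_cfg, best_cost = None, float("inf")
    for _ in range(restarts):
        cfg = {k: random.choice(v) for k, v in SPACE.items()}
        cost = run_proxy_job(cfg)
        improved = True
        while improved:
            improved = False
            for cand in neighbors(cfg):
                c = run_proxy_job(cand)
                if c < cost:
                    cfg, cost, improved = cand, c, True
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

cfg, cost = hill_climb()
print(f"best configuration {cfg} with proxy runtime {cost:.1f}s")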

 

Analytical Support for Improving Software Reliability

Thanks to the large number of diverse open source projects available in software forges such as GitHub, it has become possible to evaluate some long-standing questions about software engineering practices. Each of these project repositories hosts source code along with its entire evolution history, description, mailing lists, bug database, etc. I implemented a number of code analysis and text analysis tools to gather different metrics from GitHub project repositories. Then, applying a series of data analysis methods drawn from machine learning, random networks, visualization, and regression analysis, I shed light on how to improve software quality and developer productivity.

Effect of programming languages on software quality. To investigate whether a programming language is the right tool for the job, we gathered a very large data set from GitHub (728 projects, 63M lines of code, 29K authors, 1.5M commits, in 17 languages). Using a mixed-methods approach that combines multiple regression modeling with visualization and text analytics, we studied the effect of language features such as static vs. dynamic typing and strong vs. weak typing on software quality. By triangulating findings from different methods, and controlling for confounding effects such as code size, project age, and number of contributors, we observed that language design choices have a significant but modest effect on software quality (see our FSE'14 paper).
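The modeling setup can be sketched as follows with a synthetic data frame: a negative binomial regression of bug-fix counts on language, controlling for confounds such as code size, age, and number of contributors. This mirrors the kind of multiple regression used in the study, but the data below are randomly generated and the formula is not the paper's exact model.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "language": rng.choice(["C", "Java", "Python", "Haskell"], size=n),
    "kloc": rng.gamma(shape=2.0, scale=50.0, size=n),    # code size
    "age_years": rng.uniform(1, 10, size=n),
    "devs": rng.integers(1, 100, size=n),
})
# Synthetic defect counts that mostly depend on the confounds, not on language.
lam = 0.2 * df["kloc"] + 2.0 * df["age_years"] + 0.1 * df["devs"]
df["bug_commits"] = rng.poisson(lam)

model = smf.glm(
    "bug_commits ~ C(language) + np.log(kloc) + age_years + np.log(devs)",
    data=df,
    family=sm.families.NegativeBinomial(),
).fit()
# Language coefficients, holding the confounds fixed.
print(model.params.filter(like="language"))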

API stability and adoption in the Android ecosystem. In today's software ecosystem, which is primarily governed by web, cloud, and mobile technologies, APIs play a key role in connecting disparate software. Big players like Google, Facebook, and Microsoft aggressively publish new APIs to accommodate new feature requests, bug fixes, and performance improvements. We investigated how such fast-paced API evolution affects the overall software ecosystem. Our study of Android API evolution showed that developers are hesitant to adopt fast-evolving, unstable APIs. For instance, while Android updates 115 APIs per month on average, clients adopt the new APIs rather slowly, with a median lag of 16 months. Furthermore, client code that uses new APIs is typically more defect-prone than code without API adaptation. To the best of my knowledge, this is the first work to study API adoption in a large software ecosystem, and it suggests how to promote API adoption and facilitate the growth of the overall ecosystem (see our ICSM'13 paper).

Assert Use in GitHub Projects. Assertions in a program are believed to improve software quality. We conducted a large-scale study of how developers typically use assertions in C and C++ code and showed that assertions play a positive role in improving code quality. We further characterized assertion usage along different process and product metrics. Such a detailed characterization of assertions will help predict relevant locations for useful assertions and ultimately improve code quality (see our ICSE'15 paper).