The overfitting problem in automatic program repair

Those who are familiar with machine learning may have heard about overfitting, it is when the the machine learning model overfits the training data and does not generalize. But did you know that in automatic program repair, we have a different kind of overfitting problem. It is when the patch that the program repair tool generates does not generalize. In this blog post I will explain what it is and our current progress on solving this problem. In the text below, the word “overfitting” is only referring to the overfitting problem in automatic program repair.

Presidential election time series models from ckcd, https://xkcd.com/1122/
Presidential election time series models from ckcd, https://xkcd.com/1122/

What is overfitting in automatic program repair

I will use the following example to explain the overfitting problem.

A function that is supposed to return the sum of two integers:

def sum(a, b):
  if(a >= 10):
    return a
  else:
    return a+b

Two test cases, “test_1” respective “test_2”, for the “sum” function:

assert(sum(1, 2) == 3)
assert(sum(10, 20) == 30)

We can see that that the first test case, “assert(sum(1, 2) == 3)”, will pass. But not the second one, “assert(sum(10, 20) == 30)”. This bug can easily see fixed by applying the following patch:

def sum(a, b):
-  if(a >= 10):
-    return a
-  else:
    return a+b

However, let’s say our new automatic program repair tool modified the “sum” function to the following source code:

def sum(a, b):
  if(a == 1 and b ==2):
    return 3
  elif(a == 10 and b == 20):
    return 30

The two test cases will pass and the tool will say that we have fixed the bug. But we as human (developers), we can clearly see that it does generalize. This is the overfitting problem in automatic program repair. It is when the patch generated by automatic program tool is only correct according to the program specification, but does not generalize. This happens when the program specification is too weak to speicify the program. In this example, the program specification is the two test cases, and by ‘does not generalize’, we mean that for new input like “assert(sum(100, 200) == 300)”, the assertion will fail.

Impact of the overfitting problem

This paper from 2015, analyzed patches generated by GenProg, RSRepair and AE, three earlier automatic problem repair tools. They found that:

Our analysis indicated that only 5 of the 414 GenProg patches are correct… Only 4 of the 120 RSRepair patches are correct… Only 3 of the 54 AE patches are correct.

By correct they mean that the authors have manually looked at the patch and the human ground truth fix, and determined if they are correct.

Previously, we have assumed that all patches that are correct according to the program specification are correct. And we only report number of patches that are correct according to the program specification. But this is no longer true after we have discovered the overfitting problem. Nowadays, papers usually reports number of bug fixes, and number of bug fixes that are semantically equivalent to the human fix (For example in our paper, we have reported 61 bug fixes where 18 of them are semantically equivalent to the human fix). Claire Le Goues, one of GenProg’s authors, has said that it (patch quality) is the next big challenge for the program repair community.

Current progress in mitigating the overfitting problem

Currently, I can see that there are two research directions: 1) Program repair tools that can prioritize the correct patch before the overfitting one, or 2) Validate patches generated by any program repair tool. In the first research direction, the program repair tool is build with the overfitting problem in mind. While in the second research direction, the overfitting detection tool is separated from the program repair tool. In the both research direction we are trying to distinguish the overfitting patches and the correct patch, we refer the reader to the papers on how it is done in practice.

Pointer to papers that belong to the first research direction:

Pointer to papers that belong to the second research direction:

This is of course not an exhaustive list, but we can already see that the idea of ‘similarity’ (code similarity, structure similarity, behavioral similarity etc.) are used a lot in both research directions.

Last words

The overfitting problem is a hard problem, it requires that the tool (program repair or overfitting detection) to work with incomplete specification and ‘guess’ what the correct patch should look like. We have just started to work towards a solution for this problem. But it is also part of a bigger problem, the patch quality, where we have more problems such as explainable patch, minimal patch diff, code formatting, program performance and etc.

Written on October 1, 2019