
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING 4
Step 1: Prompt for Completion
Developer: Complete the code at the [insert] position to ensure that the completed code has no vulnerabilities.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    int main(int argc, char *argv[]) {
        //get a directory name from the command line
        char *dirname = argv[1];
        //list the contents of that directory using "ls" command line utility
        [insert]
        system(cmd);
    }

ChatGPT (completed code, filling the [insert] position):

    char cmd[256] = {0};
    snprintf(cmd, sizeof(cmd), "ls %s", dirname);

Step 2(a): Direct Question Prompt
Developer: Label the code as either "vulnerable" or "non-vulnerable" based on whether the code has vulnerabilities. ```$code$```
ChatGPT (result of the direct question prompt): Non-vulnerable.

Step 2(b): Guiding Question Prompt
Developer: The code has vulnerabilities. ```$code$``` If true, return "Yes" and provide an explanation. Otherwise, return "No".
ChatGPT (result of the guiding question prompt): Yes, the code has vulnerabilities.
1. Command Injection: The code uses the `system()` function to execute a command based on user input (`dirname`). However, the `dirname` variable is not properly validated or sanitized, which makes it vulnerable to command injection attacks. An attacker could potentially manipulate the `dirname` argument to execute arbitrary commands on the system.
2. Lack of Input Validation: The code does not perform any input validation or sanitization on the `dirname` argument. This means that any user-supplied input can be passed directly to the `system()` function, allowing for potential malicious commands to be executed.
To mitigate these vulnerabilities, proper input validation and sanitization should be implemented. Additionally, using safer alternatives to executing shell commands, such as the `exec()` family of functions, can help prevent command injection attacks.

Step 2(c): Test Report Prompt
Developer: Generate a test report for this code to check for any vulnerabilities and provide a conclusion. ```$code$```
ChatGPT (result of the test report prompt): Test Report:
1. Command Injection Vulnerability: The code is vulnerable to command injection. It constructs a command string using user-supplied input (the directory name) without proper validation or sanitization. An attacker can potentially manipulate the directory name argument to execute arbitrary commands on the system.
2. Buffer Overflow Vulnerability: The `cmd` buffer used to store the command string has a fixed size of 256 bytes. If the directory name provided by the user exceeds this size, it can lead to a buffer overflow, causing undefined behavior and potential security vulnerabilities.
3. Lack of Error Handling: The code does not perform any error handling. If the user does not provide a directory name as a command line argument, the program will still execute the `system` function, leading to unexpected behavior or crashes.
Conclusion: The code provided has multiple security vulnerabilities, including command injection and buffer overflow. It is recommended to implement proper input validation and sanitization to prevent command injection attacks. Additionally, the buffer size should be dynamically allocated based on the length of the directory name to avoid buffer overflow. Error handling should also be implemented to handle invalid or missing command line arguments.

Fig. 2: The designed self-verification prompts for code completion.
The direct question prompt shown in Figure 2 is “Label
the code as either ‘vulnerable’ or ‘non-vulnerable’ based on
whether the code has vulnerabilities. $code$”, where $code$
represents the completed code generated by ChatGPT.
Step 2(b): We utilize the guiding question prompt to inquire
whether ChatGPT agrees with the assertion that the completed
code contains vulnerabilities. The guiding question prompt
shown in Figure 2 is “The code has vulnerabilities. $code$
If true, return ‘Yes’ and provide an explanation. Otherwise,
return ‘No’.”
Step 2(c): We utilize the test report prompt to ask ChatGPT
to generate a test report to self-verify the absence of vulnera-
bilities in code completions. The test report prompt shown in
Figure 2 is “Generate a test report for this code to check for
any vulnerabilities and provide a conclusion. $code$”.
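The three prompts quoted above differ only in their fixed wording; each embeds the completed code at the $code$ placeholder. A minimal Java sketch of that substitution follows. It is an illustration and not the authors' tooling: the class name, the map keys ("direct", "guiding", "test_report"), and the `build` helper are hypothetical, and wrapping the code in triple backticks follows the panels of Figure 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: fills the three self-verification prompt templates quoted in the text. */
final class SelfVerificationPrompts {
    // Placeholder token used in the paper's prompt templates.
    static final String PLACEHOLDER = "$code$";
    // Triple-backtick fence, built programmatically (requires Java 11+).
    static final String FENCE = "`".repeat(3);

    // Template wording is copied from the text; the keys are our own labels (assumption).
    static final Map<String, String> TEMPLATES = new LinkedHashMap<>();
    static {
        TEMPLATES.put("direct",
            "Label the code as either \"vulnerable\" or \"non-vulnerable\" "
            + "based on whether the code has vulnerabilities. $code$");
        TEMPLATES.put("guiding",
            "The code has vulnerabilities. $code$ If true, return \"Yes\" "
            + "and provide an explanation. Otherwise, return \"No\".");
        TEMPLATES.put("test_report",
            "Generate a test report for this code to check for any "
            + "vulnerabilities and provide a conclusion. $code$");
    }

    /** Returns the prompt of the given kind with the completed code embedded. */
    static String build(String kind, String code) {
        String template = TEMPLATES.get(kind);
        if (template == null) {
            throw new IllegalArgumentException("unknown prompt kind: " + kind);
        }
        // String.replace is a literal (non-regex) replacement, so "$" needs no escaping.
        return template.replace(PLACEHOLDER, FENCE + code + FENCE);
    }

    public static void main(String[] args) {
        // The completion shown in Figure 2, used here as sample input.
        String code = "char cmd[256] = {0};\n"
                    + "snprintf(cmd, sizeof(cmd), \"ls %s\", dirname);";
        for (String kind : TEMPLATES.keySet()) {
            System.out.println(build(kind, code));
            System.out.println("---");
        }
    }
}
```

Keeping the templates in one table makes it easy to send the same completed code through all three prompts and compare the answers.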
Step 1: Prompt for Repair
Developer: Fix the Java code to ensure that the fixed code is bug-free.

    /**
     * Return a string containing space-delimited
     * numbers starting from 0 upto n inclusive.
     */
    public class STRING_SEQUENCE {
        public static String string_sequence(int n) {
            String result = "";
            for (int i = 0; i <= n; i += 1){
                result += i + " ";
            }
            return result;
        }
    }

ChatGPT (repaired code):

    public class STRING_SEQUENCE {
        public static String string_sequence(int n) {
            StringBuilder result = new StringBuilder();
            for (int i = 0; i <= n; i += 1){
                result.append(i).append(" ");
            }
            return result.toString();
        }
    }

Step 2(a): Direct Question Prompt
Developer: Label the Java code as either "buggy" or "bug-free" based on whether the code correctly implements the function "Return a string containing space-delimited numbers starting from 0 upto n inclusive." ```$code$```
ChatGPT (result of the direct question prompt): Bug-free.

Step 2(b): Guiding Question Prompt
Developer: The code does not implement the function feature "Return a string containing space-delimited numbers starting from 0 upto n inclusive." correctly. ```$code$``` If true, return "Yes" and provide an explanation. Otherwise, return "No".
ChatGPT (result of the guiding question prompt): No, the function does not implement the feature correctly. The function is supposed to return a string containing space-delimited numbers starting from 0 up to n inclusive, but it does not trim the whitespace at the end. The current implementation will always have an extra space at the end of the string.

Step 2(c): Test Report Prompt
Developer: Generate a test report based on the function description "Return a string containing space-delimited numbers starting from 0 upto n inclusive." for the following code to check for any bugs, and provide a conclusion. ```$code$```
ChatGPT (result of the test report prompt):
Function: string_sequence(int n)
Test Case 1: Input: n = 0 Expected Output: "0 " Actual Output: "0 " Result: Pass
Test Case 2: Input: n = 5 Expected Output: "0 1 2 3 4 5 " Actual Output: "0 1 2 3 4 5 " Result: Pass
Test Case 3: Input: n = -1 Expected Output: "" Actual Output: "0 " Result: Fail
Test Case 4: Input: n = 10 Expected Output: "0 1 2 3 4 5 6 7 8 9 10 " Actual Output: "0 1 2 3 4 5 6 7 8 9 10 " Result: Pass
...
Conclusion: The function string_sequence(int n) appears to be working correctly for most test cases. However, there is a bug when the input value is negative. The function should return an empty string in this case, but it is currently returning "0 ". This bug needs to be fixed.

Fig. 3: The designed self-verification prompts for program repair.
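The "Actual Output" values in the generated test report of Figure 3 can be checked by executing the repaired method on the listed inputs. The Java sketch below does so; it is an illustration and not part of the original study. The wrapper class `StringSequenceCheck` and the method name `stringSequence` are hypothetical, while the method body is the repaired code as printed in the figure.

```java
/** Sketch: runs the repaired method from Figure 3 on the inputs listed in the test report. */
final class StringSequenceCheck {
    // Body copied from the repaired code in Figure 3; only the names are changed.
    static String stringSequence(int n) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i <= n; i += 1) {
            result.append(i).append(" ");
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Inputs of Test Cases 1-4 in the generated report.
        int[] inputs = {0, 5, -1, 10};
        for (int n : inputs) {
            // Quotes make trailing spaces and empty strings visible.
            System.out.println("n = " + n + " -> \"" + stringSequence(n) + "\"");
        }
    }
}
```

For n = -1 the loop condition `0 <= -1` is false on the first check, so the method returns the empty string rather than the "0 " stated in the report. A generated test report therefore describes what the model expects the code to do, which is not necessarily what the code does when run.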
C. Program Repair
Datasets. We select two widely used datasets that
contain test cases and function requirement descriptions,
namely QuixBugs-Python/Java [31] and HumanEval-Java_R
(footnote 2) [32]. The QuixBugs-Python and QuixBugs-Java
datasets [31] consist of 40 buggy programs, each available in
both Python and Java, along with their correct versions and
corresponding test cases. HumanEval-Java_R is a dataset
manually created by Jiang et al. [32]; it consists of 164 Java
bugs, along with their correct versions and corresponding
test cases.
Step 1: We first request ChatGPT to repair the buggy pro-
gram and ensure that the repaired code is bug-free. There-
fore, we design the prompt consisting of three items, i.e.,
<requirement, function description, buggy code>. For in-
stance, in Figure 3, the requirement is “Fix the Java code to
Footnote 2: To distinguish the code generation dataset named HumanEval-Java
from the program repair dataset, which is also called HumanEval-Java, we refer
to the program repair dataset as HumanEval-Java_R in this paper.