A Study of ChatGPT's Self-Verification Capability and Its Limitations in Software Development Tasks

PDF | 1.4 MB | Updated 2025-03-20

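Although ChatGPT achieves some success on these tasks, it often misjudges its own errors, fails to correctly detect vulnerabilities in the code it generates or to tell whether a program repair succeeded, and contradicts itself. The results indicate that ChatGPT can serve as an assistive tool in important software development workflows, but that human review is still required to ensure quality and security. The study is valuable for engineers and technical managers who want to use AI assistants to improve development efficiency. It also gives developers a picture of the current state of LLM applications, which helps in devising more reasonable automated development and testing strategies. The study stresses the importance of treating machine-generated code with caution in engineering practice and proposes ways to address the current difficulties, such as improving self-verification by optimizing the form of the guiding question and the test report. It further points out that human professional judgment remains indispensable at this stage.

Key points:

1. Definition of ChatGPT: ChatGPT is a large language model in the generative pre-trained transformer (GPT) family, recently released by OpenAI as a chatbot service. A large language model is a form of artificial intelligence trained on large amounts of natural-language text and able to understand and generate human language.

2. Applications of ChatGPT in software development: ChatGPT can contribute to several software development tasks, including code generation, code completion, and program repair.

3. Self-verification capability: the ability of a model or system to check the quality of, and detect errors in, the content it generates or processes. In software development this means the model can examine the code it generated and assess its correctness, its vulnerabilities, and whether a repair succeeded.

4. Limitations of ChatGPT's self-verification: despite some success on software development tasks, ChatGPT misjudges its own errors, fails to correctly detect vulnerabilities or to tell whether a program repair succeeded, and contradicts itself. Critical quality and security assurance therefore still depends on human review.

5. Implications for engineers and technical managers: the results offer guidance to developers and technical managers who want to use AI assistants to improve development efficiency. The study stresses caution toward machine-generated code in real engineering practice and proposes collecting negative samples to improve system stability.

6. Devising automated development and testing strategies: understanding the current state of LLM applications helps developers devise more reasonable automated development and testing strategies, including optimizing the form of the guiding question and the test report to strengthen the model's self-verification.

7. Importance of human professional judgment: the study emphasizes that human professional judgment is indispensable at this stage. Even with today's advanced AI technology, ensuring software quality and security ultimately depends on the judgment and review of professionals.

8. Code security checks: large language models remain limited in code security checking. They can help surface potential problems, but detailed security analysis and vulnerability investigation by humans are still needed to ensure software security.

9. Multi-agent collaborative development: through multi-agent collaboration, ChatGPT can produce test reports and self-verify the generated code, which improves code reliability. Because self-verification is limited, however, the results of multi-agent collaborative development still require human review.

10. Reflection on the state of the technology: the study discusses specific technical challenges, reminds researchers of the limitations of large language models in practical applications, and encourages continued research into improving model stability and effectiveness.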


IEEE TRANSACTIONS ON SOFTWARE ENGINEERING 4

Step 1: Prompt for Completion

Developer: Complete the code at the [insert] position to ensure that the completed code has no vulnerabilities.

$code$:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(int argc, char *argv[]) {
        //get a directory name from the command line
        char *dirname = argv[1];
        //list the contents of that directory using "ls" command line utility
        [insert]
        system(cmd);
    }

ChatGPT (completed code):

    char cmd[256] = {0};
    snprintf(cmd, sizeof(cmd), "ls %s", dirname);

Step 2(a): Direct Question Prompt

Developer: Label the code as either "vulnerable" or "non-vulnerable" based on whether the code has vulnerabilities.

ChatGPT (result of the direct question prompt): Non-vulnerable.

Step 2(b): Guiding Question Prompt

Developer: The code has vulnerabilities. If true, return "Yes" and provide an explanation. Otherwise, return "No".

ChatGPT (result of the guiding question prompt): Yes, the code has vulnerabilities.

1. Command Injection: The code uses the `system()` function to execute a command based on user input (`dirname`). However, the `dirname` variable is not properly validated or sanitized, which makes it vulnerable to command injection attacks. An attacker could potentially manipulate the `dirname` argument to execute arbitrary commands on the system.

2. Lack of Input Validation: The code does not perform any input validation or sanitization on the `dirname` argument. This means that any user-supplied input can be passed directly to the `system()` function, allowing for potential malicious commands to be executed.

To mitigate these vulnerabilities, proper input validation and sanitization should be implemented. Additionally, using safer alternatives to executing shell commands, such as the `exec()` family of functions, can help prevent command injection attacks.

Step 2(c): Test Report Prompt

Developer: Generate a test report for this code to check for any vulnerabilities and provide a conclusion.

ChatGPT (result of the test report prompt):

Test Report:

1. Command Injection Vulnerability: The code is vulnerable to command injection. It constructs a command string using user-supplied input (the directory name) without proper validation or sanitization. An attacker can potentially manipulate the directory name argument to execute arbitrary commands on the system.

2. Buffer Overflow Vulnerability: The `cmd` buffer used to store the command string has a fixed size of 256 bytes. If the directory name provided by the user exceeds this size, it can lead to a buffer overflow, causing undefined behavior and potential security vulnerabilities.

3. Lack of Error Handling: The code does not perform any error handling. If the user does not provide a directory name as a command line argument, the program will still execute the `system` function, leading to unexpected behavior or crashes.

Conclusion: The code provided has multiple security vulnerabilities, including command injection and buffer overflow. It is recommended to implement proper input validation and sanitization to prevent command injection attacks. Additionally, the buffer size should be dynamically allocated based on the length of the directory name to avoid buffer overflow. Error handling should also be implemented to handle invalid or missing command line arguments.

Fig. 2: The designed self-verification prompts for code completion.
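The two short answers in Figure 2 disagree: the direct question yields "Non-vulnerable." while the guiding question yields "Yes, the code has vulnerabilities." for the same code. A minimal sketch of how such a self-contradiction could be flagged automatically is given below. The class name, the enum, and the parsing rules (matching on the leading label or on a leading "Yes"/"No") are assumptions made for illustration; they are not taken from the paper.

```java
import java.util.Locale;

/** Hypothetical helper that compares the verdicts of two self-verification prompts. */
final class SelfVerificationCheck {

    enum Verdict { VULNERABLE, NON_VULNERABLE, UNKNOWN }

    /** Direct question prompt: the reply is assumed to start with one of the two labels. */
    static Verdict parseDirect(String reply) {
        String r = reply.trim().toLowerCase(Locale.ROOT);
        if (r.startsWith("non-vulnerable")) return Verdict.NON_VULNERABLE;
        if (r.startsWith("vulnerable")) return Verdict.VULNERABLE;
        return Verdict.UNKNOWN;
    }

    /** Guiding question prompt: "Yes" agrees that the code has vulnerabilities, "No" disagrees. */
    static Verdict parseGuiding(String reply) {
        String r = reply.trim().toLowerCase(Locale.ROOT);
        // \b keeps words such as "not" or "yesterday" from being read as a verdict.
        if (r.matches("(?s)yes\\b.*")) return Verdict.VULNERABLE;
        if (r.matches("(?s)no\\b.*")) return Verdict.NON_VULNERABLE;
        return Verdict.UNKNOWN;
    }

    /** Two verdicts contradict each other only when both are known and they differ. */
    static boolean contradicts(Verdict a, Verdict b) {
        return a != Verdict.UNKNOWN && b != Verdict.UNKNOWN && a != b;
    }

    public static void main(String[] args) {
        // Replies quoted from Figure 2.
        Verdict direct = parseDirect("Non-vulnerable.");
        Verdict guiding = parseGuiding("Yes, the code has vulnerabilities.");
        System.out.println(direct + " vs " + guiding
                + " -> contradiction: " + contradicts(direct, guiding));
    }
}
```

Replies that match neither pattern are reported as UNKNOWN rather than guessed, so a free-form answer never counts as a contradiction by accident.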

The direct question prompt shown in Figure 2 is “Label the code as either ‘vulnerable’ or ‘non-vulnerable’ based on whether the code has vulnerabilities. $code$”, where $code$ represents the completed code generated by ChatGPT.

Step 2(b): We utilize the guiding question prompt to inquire whether ChatGPT agrees with the assertion that the completed code contains vulnerabilities. The guiding question prompt shown in Figure 2 is “The code has vulnerabilities. $code$ If true, return ‘Yes’ and provide an explanation. Otherwise, return ‘No’.”

Step 2(c): We utilize the test report prompt to ask ChatGPT to generate a test report to self-verify the absence of vulnerabilities in code completions. The test report prompt shown in Figure 2 is “Generate a test report for this code to check for any vulnerabilities and provide a conclusion. $code$”.
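The three self-verification prompts for code completion differ only in their fixed wording and in where $code$ is placed. The sketch below shows one way to assemble them as string templates. The prompt wording is quoted from the text above; the class and method names and the use of a newline to separate the wording from the code are assumptions, and sending the prompt to ChatGPT is deliberately left out.

```java
/** Hypothetical prompt builders for the three self-verification prompts (code completion). */
final class CompletionPrompts {

    /** Step 2(a): direct question prompt, followed by the completed code. */
    static String directQuestion(String code) {
        return "Label the code as either \"vulnerable\" or \"non-vulnerable\" "
                + "based on whether the code has vulnerabilities.\n" + code;
    }

    /** Step 2(b): guiding question prompt; the code sits between the assertion and the instruction. */
    static String guidingQuestion(String code) {
        return "The code has vulnerabilities.\n" + code + "\n"
                + "If true, return \"Yes\" and provide an explanation. Otherwise, return \"No\".";
    }

    /** Step 2(c): test report prompt, followed by the completed code. */
    static String testReport(String code) {
        return "Generate a test report for this code to check for any vulnerabilities "
                + "and provide a conclusion.\n" + code;
    }

    public static void main(String[] args) {
        // Completed code from Figure 2, used here as the value of $code$.
        String code = "char cmd[256] = {0};\n"
                + "snprintf(cmd, sizeof(cmd), \"ls %s\", dirname);";
        System.out.println(directQuestion(code));
        System.out.println();
        System.out.println(guidingQuestion(code));
        System.out.println();
        System.out.println(testReport(code));
    }
}
```

Keeping the three builders side by side makes it easy to issue all of them for the same completion and compare the answers.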

Step 1: Prompt for Repair

Developer: Fix the Java code to ensure that the fixed code is bug-free.

$code$:

    /**
     * Return a string containing space-delimited
     * numbers starting from 0 upto n inclusive.
     */
    public class STRING_SEQUENCE {
        public static String string_sequence(int n) {
            String result = "";
            for (int i = 0; i <= n; i += 1){
                result += i + " ";
            }
            return result;
        }
    }

ChatGPT (repaired code):

    public class STRING_SEQUENCE {
        public static String string_sequence(int n) {
            StringBuilder result = new StringBuilder();
            for (int i = 0; i <= n; i += 1){
                result.append(i).append(" ");
            }
            return result.toString();
        }
    }

Step 2(a): Direct Question Prompt

Developer: Label the Java code as either "buggy" or "bug-free" based on whether the code correctly implements the function "Return a string containing space-delimited numbers starting from 0 upto n inclusive."

ChatGPT (result of the direct question prompt): Bug-free.

Step 2(b): Guiding Question Prompt

Developer: The code does not implement the function feature "Return a string containing space-delimited numbers starting from 0 upto n inclusive." correctly. If true, return "Yes" and provide an explanation. Otherwise, return "No".

ChatGPT (result of the guiding question prompt): No, the function does not implement the feature correctly. The function is supposed to return a string containing space-delimited numbers starting from 0 up to n inclusive, but it does not trim the whitespace at the end. The current implementation will always have an extra space at the end of the string.

Step 2(c): Test Report Prompt

Developer: Generate a test report based on the function description "Return a string containing space-delimited numbers starting from 0 upto n inclusive." for the following code to check for any bugs, and provide a conclusion.

ChatGPT (result of the test report prompt):

Function: string_sequence(int n)

Test Case 1: Input: n = 0 Expected Output: "0 " Actual Output: "0 " Result: Pass

Test Case 2: Input: n = 5 Expected Output: "0 1 2 3 4 5 " Actual Output: "0 1 2 3 4 5 " Result: Pass

Test Case 3: Input: n = -1 Expected Output: "" Actual Output: "0 " Result: Fail

Test Case 4: Input: n = 10 Expected Output: "0 1 2 3 4 5 6 7 8 9 10 " Actual Output: "0 1 2 3 4 5 6 7 8 9 10 " Result: Pass

...

Conclusion: The function string_sequence(int n) appears to be working correctly for most test cases. However, there is a bug when the input value is negative. The function should return an empty string in this case, but it is currently returning "0 ". This bug needs to be fixed.

Fig. 3: The designed self-verification prompts for program repair.
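The "Actual Output" values in the test report of Figure 3 are written by ChatGPT, not obtained by running the code. The sketch below executes the repaired method on the four inputs listed in the report and compares the real results with the report's expected outputs. The method body is copied from Figure 3; the harness class and its name are assumptions added for illustration. Reading the loop condition `i <= n` shows that for n = -1 the body never runs, so the method returns an empty string rather than the "0 " claimed in Test Case 3.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical harness that runs the test cases listed in the Figure 3 test report. */
final class StringSequenceCheck {

    /** Repaired code from Figure 3, reproduced unchanged. */
    static String string_sequence(int n) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i <= n; i += 1) {
            result.append(i).append(" ");
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Inputs and expected outputs copied from the test report in Figure 3.
        Map<Integer, String> expected = new LinkedHashMap<>();
        expected.put(0, "0 ");
        expected.put(5, "0 1 2 3 4 5 ");
        expected.put(-1, "");
        expected.put(10, "0 1 2 3 4 5 6 7 8 9 10 ");

        for (Map.Entry<Integer, String> e : expected.entrySet()) {
            String actual = string_sequence(e.getKey());
            String verdict = actual.equals(e.getValue()) ? "Pass" : "Fail";
            System.out.printf("n = %d, actual = \"%s\", %s%n", e.getKey(), actual, verdict);
        }
    }
}
```

Executing the cases separates two questions that the generated report mixes together: whether the expected outputs are the right specification, and whether the code actually produces them.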

C. Program Repair

Datasets. We select two widely used datasets that contain test cases and function requirement descriptions, namely QuixBugs-Python/Java [31] and HumanEval-Java [32]. The QuixBugs-Python and QuixBugs-Java datasets [31] consist of 40 buggy programs available in both Python and Java, along with their correct versions and corresponding test cases. HumanEval-Java is a dataset manually created by Jiang et al. [32] that consists of 164 Java bugs, along with their correct versions and corresponding test cases.

Step 1: We first request ChatGPT to repair the buggy program and ensure that the repaired code is bug-free. Therefore, we design the prompt consisting of three items, i.e., <requirement, function description, buggy code>. For instance, in Figure 3, the requirement is “Fix the Java code to

To distinguish between the code generation dataset named HumanEval-Java and the program repair dataset, also called HumanEval-Java, we refer to the program repair dataset as HumanEval-Java in this paper.
