Queryloop

Research

Introduction to SWE Bench & Patch Centric Approach

Zain ul Abideen
January 17, 2025
12 min read

A comprehensive explanation of SWE-bench for evaluating AI coding agents and a patch-centric approach to solving SWE-bench issues.

The Software Engineering Benchmark (SWE-bench) was created to evaluate AI coding agents like Devin, which automate tasks such as bug fixes and code improvements. It provides a dataset of repositories with known issues to test how effectively these tools identify and fix bugs. Agentic workflows are submitted to SWE-bench, run against these repositories, and evaluated on the success of their fixes.

The Princeton Dataset

A big part of SWE-bench's usefulness lies in how it evaluates the tools and models designed to fix bugs automatically. That's where the Princeton SWE-bench dataset comes in.

There are many metadata fields in the dataset, but the most important ones are described below (a short example of loading the dataset follows the list):
  • Repository Name: Identifies which codebase the issue belongs to; most instances come from well-known repositories such as Astropy.
  • Instance ID: Every bug or issue gets a unique ID so we can track what is being tested and fixed.
  • Problem Statement: Think of this as the bug report. It explains what is wrong, sometimes pointing to a specific file and sometimes as general text.
  • Hint: Extra guidance for solving the issue; however, SWE-bench currently does not count submissions that rely on the hints.
  • Fail-to-Pass: The ultimate goal: turning failing tests into passing ones. This field lists the test file and the functions within it that were failing.
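For illustration, the dataset can be inspected directly from Hugging Face. This is a minimal sketch assuming the published princeton-nlp/SWE-bench dataset and its documented field names; verify both against the current dataset card.

from datasets import load_dataset

# Load the test split of the Princeton SWE-bench dataset (the Lite and Verified
# variants are published under similar names on Hugging Face).
dataset = load_dataset("princeton-nlp/SWE-bench", split="test")

example = dataset[0]
print(example["repo"])               # repository name, e.g. "astropy/astropy"
print(example["instance_id"])        # unique ID for this issue instance
print(example["problem_statement"])  # the bug report text
print(example["hints_text"])         # optional hints (not counted for leaderboard runs)
print(example["FAIL_TO_PASS"])       # tests that must go from failing to passing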

SWE-bench: Scoreboards

SWE-bench features a suite of benchmarks for different levels of evaluation and testing. Here's an overview:

SWE-bench Lite

  • A lightweight version with 300 curated instances for resource-efficient testing.
  • Focuses on functional bug fixes across 11 repositories.
  • Ideal for systems with limited computational capacity.

SWE-bench Verified

  • Developed in collaboration with OpenAI, featuring 500 human-validated samples.
  • Ensures clarity and solvability of test cases.
  • Offers a reliable benchmark for real-world software challenges.

SWE-bench Full

  • The most comprehensive version with 2,294 issue-commit pairs from 12 Python repositories.
  • Designed for in-depth evaluation of AI capabilities in handling diverse software tasks.

How a General Tool Works in SWE-bench

A typical tool submitted to SWE-bench follows these steps to evaluate and address software issues:

Clone the Repository

  • The tool begins by cloning the repository specified in the dataset. This provides the environment and context needed to understand and resolve the issues.

Fix the Issues

  • It analyzes the problem statement associated with the instance and applies its logic or algorithms to fix the identified issue in the codebase. The tool should also keep a log of the steps it took to resolve the issue.

Generate the Patch

  • Once the modifications are complete, the tool creates a patch file that contains the changes made to resolve the issue.

Store the Patch with Metadata

  • The patch, often referred to as the "prediction," is saved along with key metadata such as the instance ID and repository name. This allows for systematic evaluation and traceability of the fixes applied.
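As a rough sketch of the last two steps, the snippet below writes one prediction per line to a JSONL file. The three field names follow the prediction format commonly used by the SWE-bench evaluation harness; confirm the exact schema in the harness documentation.

import json

def save_prediction(path, instance_id, model_name, patch_text):
    """Append one prediction (instance ID, model name, generated patch) to a JSONL file."""
    record = {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch_text,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical usage with a placeholder instance ID and patch.
save_prediction(
    "all_preds.jsonl",
    instance_id="astropy__astropy-12907",
    model_name="my-agent",
    patch_text="diff --git a/foo.py b/foo.py\n...",
)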

Submitting Your Model to SWE-bench

To have your model featured on the SWE-bench leaderboard, follow these steps:

1: Fork and Clone the Repository:

Fork the SWE-bench/experiments repository and clone your fork locally.

2: Create a Submission Directory:

  • Navigate to the evaluation split you are targeting (evaluation/lite/ or evaluation/test/).
  • Create a new folder named with the submission date and your model's name (e.g., 20240415_sweagent_gpt4).

3: Prepare Submission Files:

  • Within your submission folder, include:
  • all_preds.jsonl: Your model's predictions.
  • logs/: Directory containing evaluation artifacts for each task instance. These files are generated automatically by SWE-bench during evaluation.
  • metadata.yaml: Metadata for your submission (a sketch of this file follows the list), including:
      • name: Your leaderboard entry name.
      • oss: true if your system is open source.
      • site: URL for more information about your system.
      • verified: false initially (see the verification process below).
  • trajs/: Directory containing reasoning traces for each task instance, detailing the steps your system took to solve the problem. Ensure each file is named to reflect the corresponding task instance ID.
  • README.md: Additional information about your model.
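For reference, a minimal sketch of generating the metadata.yaml described above. The field names come from the checklist; the values are placeholders, and the pyyaml dependency is our own choice.

import yaml  # requires the pyyaml package

# Placeholder values for the fields listed above.
metadata = {
    "name": "My SWE Agent (GPT-4)",      # leaderboard entry name
    "oss": True,                          # True if the system is open source
    "site": "https://example.com/agent",  # placeholder URL
    "verified": False,                    # flipped only after the verification process
}

with open("metadata.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(metadata, f, sort_keys=False)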

4: Run Evaluation Script:

  • Execute the following command to generate results:
python -m analysis.get_results evaluation/<split>/<date_model>

5: Submit via Pull Request:

Push your changes to your forked repository. Create a pull request to the original SWE-bench/experiments repository with your submission folder.

6: Verification Process:

To receive a "verified" checkmark (✓) on the leaderboard:

Create an issue in the SWE-bench/experiments repository. Provide instructions for running your model on SWE-bench. The SWE-bench team will run your model on a random subset to verify results.

Our First Approach to Solving Repository Issues

We've built an automated workflow, using tools like LangChain and LangGraph, designed to resolve repository issues from the SWE-bench dataset.

The process starts by cloning the repository, after which agents begin analyzing and resolving the issue. With an agent assigned to each role, a chain of steps and loops works toward a fix.

The Patch-Centric Approach: How It Worked

1. Entry Point

The workflow starts by initializing the dataset and cloning the repository to the local system. Fields from the Princeton SWE dataset, such as (1) instance_id, (2) problem_statement, and (3) example_patch, are passed to the workflow.
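A rough sketch of this entry point is shown below. It assumes the repository URL can be derived from the dataset's repo field and that the dataset's gold patch is used only as a formatting reference; the example_patch key is this workflow's own name for it.

import subprocess

def init_workflow(instance, workdir="repo"):
    """Clone the target repository and seed the workflow state from a dataset row."""
    repo_url = f"https://github.com/{instance['repo']}.git"
    subprocess.run(["git", "clone", repo_url, workdir], check=True)
    # Check out the commit the issue was reported against.
    subprocess.run(["git", "-C", workdir, "checkout", instance["base_commit"]], check=True)
    return {
        "instance_id": instance["instance_id"],
        "problem_statement": instance["problem_statement"],
        "example_patch": instance["patch"],  # used only as a format reference
        "repo_dir": workdir,
    }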

2. Search Issue

The crux of our workflow is this agent, which uses LangChain to integrate basic repository-browsing tools with the LLM; a sketch of such tools is shown below.
After identifying files and line numbers, the agent retrieves the required content and passes it to the next phase.
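The snippet below is a minimal sketch of how such repository-browsing tools can be wired up with LangChain's tool decorator and a tool-calling chat model. The tool names, the 100-line viewing window, and the choice of ChatOpenAI as the backend are illustrative, not the exact implementation.

import os
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

REPO_DIR = "repo"  # directory where the repository was cloned

@tool
def list_files(subdir: str = ".") -> str:
    """List files under a subdirectory of the cloned repository."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(os.path.join(REPO_DIR, subdir)):
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]  # skip .git etc.
        entries += [os.path.relpath(os.path.join(dirpath, f), REPO_DIR) for f in filenames]
    return "\n".join(entries[:200])  # cap the listing to keep prompts small

@tool
def view_file(path: str, start_line: int = 1) -> str:
    """Return up to 100 numbered lines of a file, starting at start_line (1-indexed)."""
    with open(os.path.join(REPO_DIR, path), encoding="utf-8") as f:
        lines = f.readlines()
    window = lines[start_line - 1 : start_line - 1 + 100]
    return "".join(f"{start_line + i}: {line}" for i, line in enumerate(window))

# Bind the tools to a tool-calling model so the agent can browse the repository.
llm_with_tools = ChatOpenAI(model="gpt-4o").bind_tools([list_files, view_file])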

3. Extracting Content and Text from Files

This tool records, in the workflow state, the content of files that may be useful for solving the error. Once retrieved, the data is provided to the patch-generation agent to generate the patch file.
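Conceptually, this can be modeled as a shared workflow state that accumulates retrieved file content for the patch-generation step; the sketch below uses a plain TypedDict in the style of a LangGraph state, with field names of our own choosing.

from typing import TypedDict

class WorkflowState(TypedDict):
    instance_id: str
    problem_statement: str
    example_patch: str
    retrieved_files: dict[str, str]  # file path -> content (or viewed window)
    patch: str
    errors: list[str]

def record_file_content(state: WorkflowState, path: str, content: str) -> WorkflowState:
    """Store retrieved file content so the patch-generation agent can use it."""
    state["retrieved_files"][path] = content
    return state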

4. Generating a Patch

This step uses an LLM to generate a patch from all the files provided by the previous agent, following the format of the example_patch.
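A sketch of this step, assuming the retrieved files and the example patch are simply formatted into a single prompt for a chat model; the prompt wording and the model choice are illustrative.

from langchain_openai import ChatOpenAI

def generate_patch(state) -> str:
    """Ask the LLM for a unified diff, using the example patch as a format reference."""
    files_block = "\n\n".join(
        f"### {path}\n{content}" for path, content in state["retrieved_files"].items()
    )
    prompt = (
        "You are fixing a bug in a repository.\n"
        f"Issue:\n{state['problem_statement']}\n\n"
        f"Relevant files:\n{files_block}\n\n"
        "Produce a unified diff (git patch) that fixes the issue, following the "
        f"same format as this example patch:\n{state['example_patch']}"
    )
    llm = ChatOpenAI(model="gpt-4o")
    return llm.invoke(prompt).content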

5. Applying the Patch

In this step, after generating the patch, we encountered several challenges that necessitated testing before applying it. The primary issue stemmed from the fact that the patch, generated by the LLM, often lacked precision. This was due to the tool used by the LLM to view files, which was limited to processing 100 lines of code at a time. This limitation resulted in header inaccuracies within the generated patch file.
To address these issues, we implemented multiple testing phases to ensure the patch was accurate and could be successfully applied to the repository. These tests were designed to validate and refine the patch, mitigating the shortcomings introduced during its generation.

Static Testing

[Image: Understanding Patches, from the Git Pocket Guide]

  • Header and Syntax Checks: During static testing, we validate the patch headers for syntax errors and ensure that end-of-line comments (commonly caused by file edits) are correct.
  • We used third-party text editors such as vim, alongside hand-written checks, to evaluate the patch's syntax effectively.
  • Patch Syntax Compliance: Patch files rely on specific syntax to represent changes in the codebase: lines beginning with a blank space " " provide unchanged context that frames the modifications, lines prefixed with "+" represent additions, and lines starting with "-" indicate deletions (a simplified sketch of these static checks follows this list).
  • Git Path Relevance: We checked that the file paths referenced in the patch's "a/" (before) and "b/" (after) headers align with the repository's local file structure.
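A simplified sketch of these static checks is shown below: it validates @@ hunk headers with a regular expression, checks the allowed line prefixes inside hunks, and confirms that the a/ and b/ paths exist in the local repository. The real checks in our workflow were more involved.

import os
import re

HUNK_RE = re.compile(r"^@@ -\d+(,\d+)? \+\d+(,\d+)? @@")

def static_check(patch_text: str, repo_dir: str) -> list[str]:
    """Return a list of problems found in the patch text (an empty list means it passed)."""
    errors, in_hunk = [], False
    for lineno, line in enumerate(patch_text.splitlines(), start=1):
        if line.startswith("--- a/") or line.startswith("+++ b/"):
            in_hunk = False
            rel_path = line.split("/", 1)[1]
            if not os.path.exists(os.path.join(repo_dir, rel_path)):
                errors.append(f"line {lineno}: path not found in repository: {rel_path}")
        elif line.startswith("@@"):
            in_hunk = True
            if not HUNK_RE.match(line):
                errors.append(f"line {lineno}: malformed hunk header: {line}")
        elif in_hunk and line and line[0] not in " +-\\":
            # Inside a hunk, every line must be context (" "), an addition ("+"),
            # a deletion ("-"), or the "\ No newline at end of file" marker.
            in_hunk = False
            if not line.startswith("diff "):
                errors.append(f"line {lineno}: unexpected line inside hunk: {line[:40]!r}")
    return errors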

Dynamic Testing

Dynamic testing involves loading the content of the relevant file and carefully comparing the lines mentioned in the patch header against the actual file content. This step is crucial to verify that the changes suggested in the patch correspond to the correct sections of the codebase.
To address line mismatches between the patch and the file, multiple approaches were considered.
One option was to prioritize the line numbers suggested by the LLM in the patch, while another focused on using the actual line numbers from the file content. Ultimately, we adopted a hybrid method to balance accuracy and efficiency. This approach begins by identifying the first line in the header, locating where it exists in the file content, and setting it as the starting point in the header. Using this as a pivot, all subsequent lines in the header are adjusted to maintain consistency and alignment with the file content.
Some other refined processes ensure that files generated by the LLM but not found in the repository are excluded. Corrupted headers or irrelevant paths are removed to maintain consistency.
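A simplified sketch of this hybrid re-anchoring idea: locate the hunk's first context (or deletion) line in the actual file content and rewrite the @@ header's line numbers around that pivot. The helper name and simplifications (for example, ignoring offsets introduced by earlier hunks) are ours.

import re

HEADER_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def reanchor_hunk(hunk_lines: list[str], file_lines: list[str]) -> list[str]:
    """Adjust one hunk's @@ header so its line numbers match the actual file content."""
    header, body = hunk_lines[0], hunk_lines[1:]
    if not HEADER_RE.match(header):
        return hunk_lines  # leave malformed headers for the debug loop

    # The first context or deletion line must exist somewhere in the current file.
    anchor = next((l[1:] for l in body if l.startswith((" ", "-"))), None)
    if anchor is None:
        return hunk_lines
    try:
        pivot = file_lines.index(anchor) + 1  # 1-indexed line number in the file
    except ValueError:
        return hunk_lines  # anchor text not found; dynamic testing will report it

    old_len = sum(1 for l in body if l.startswith((" ", "-")))
    new_len = sum(1 for l in body if l.startswith((" ", "+")))
    return [f"@@ -{pivot},{old_len} +{pivot},{new_len} @@"] + body

# file_lines is expected as open(path).read().splitlines(); hunk_lines as the
# patch hunk split the same way, so neither carries trailing newlines.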
If the patch passes both static and dynamic testing, it is applied to the repository:
  • If Successful: The workflow concludes successfully.
  • If Unsuccessful: The process enters a Debug Patch Loop, where the patch is further refined using search tools and additional debugging techniques before being tested again.

6. Debugging the Patch

In cases where the patch application fails, a debugging process is initiated to identify and resolve the issues. The errors encountered during the patch application, along with the patch itself and the content of the relevant files, are passed through a loop of LLM-backed agents designed to iteratively refine the patch.
During this process, search tools are employed to pinpoint specific errors within the patch, such as line mismatches, missing headers, or incorrect file paths. Once the issues are identified, the patch is adjusted accordingly and loops back to the Apply Patch step for another attempt.
This iterative debugging mechanism ensures that the patch evolves with each cycle, progressively improving its accuracy and compatibility until it can be successfully applied to the repository. By combining error analysis with patch refinement, this step significantly enhances the reliability of the workflow.
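The loop can be summarized as in the sketch below: attempt git apply, and on failure feed the error output together with the current patch back to the model for another attempt, up to a retry limit. The prompt wording, retry count, and model choice are illustrative.

import subprocess
from langchain_openai import ChatOpenAI

def apply_patch(repo_dir: str, patch_text: str) -> tuple[bool, str]:
    """Try to apply the patch from stdin; return (succeeded, git error output)."""
    proc = subprocess.run(
        ["git", "-C", repo_dir, "apply", "--whitespace=fix", "-"],
        input=patch_text, capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stderr

def debug_loop(repo_dir: str, patch_text: str, max_attempts: int = 3):
    """Iteratively refine a failing patch until it applies or attempts run out."""
    llm = ChatOpenAI(model="gpt-4o")
    for _ in range(max_attempts):
        ok, error = apply_patch(repo_dir, patch_text)
        if ok:
            return patch_text
        prompt = (
            "The following git patch failed to apply.\n"
            f"Error:\n{error}\n\nPatch:\n{patch_text}\n\n"
            "Fix the patch (line numbers, hunk headers, file paths) and return "
            "only the corrected unified diff."
        )
        # In practice the response would also be stripped of code fences and
        # re-run through the static and dynamic tests before the next attempt.
        patch_text = llm.invoke(prompt).content
    return None  # could not produce an applicable patch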

7. End Node

At the end, the patch is saved against its instance ID, along with the steps and logs recorded during the run.

Key Takeaways from Strategy 1

Efficiency vs. Accuracy
  • While the approach showed potential, it often struggled with edge cases like line-range mismatches and subtle syntax errors, requiring repeated manual intervention.
  • Because the editing, which is the actual fixing of issues through a patch file, is limited in context, our workflow could mostly handle up to two chunks and at most two files in one go.
  • This constraint highlighted the need for a better solution capable of handling larger and more complex changes.
Formatting Pitfalls
  • Git patch headers and formatting posed significant challenges, with minor mistakes (e.g., misplaced line endings) leading to patch rejection in up to 80% of cases.
Need for Iteration
  • The approach highlighted the necessity of a more granular solution, involving direct, line-by-line edits rather than relying solely on large, monolithic patches.

Summary

While the initial approach showed promise in resolving repository issues, it was far from efficient. The approach relied heavily on extensive manual checks to ensure patches were correctly formatted and aligned with the repository structure. Although it demonstrated the capability to solve issues, its dependency on manual validation made it resource-intensive and time-consuming. These limitations highlighted the need for a more streamlined and precise workflow that's discussed in our next blog.
AI, Software Engineering, SWE-bench, Patch-Centric, LLM, LangChain, Bug Fixing, Queryloop