Introduction to SWE-bench & the Patch-Centric Approach
A comprehensive explanation of SWE-bench for evaluating AI coding agents and a patch-centric approach to solving SWE-bench issues.
The Software Engineering Benchmark (SWE-bench) was created to evaluate AI coding agents like Devin, which automate tasks such as bug fixes and code improvements. It provides a dataset of repositories with known issues to test how effectively these tools identify and fix bugs. Agentic workflows are submitted to SWE-bench, tested on these repositories, and evaluated based on the success of their fixes.

The Princeton Dataset
A big part of the SWE-bench project's success lies in how we evaluate the tools and models designed to fix bugs automatically. That's where the Princeton SWE-bench dataset comes in. Each instance in the dataset includes:
- Repository Name: This tells us which codebase the issue belongs to; most are well-known repositories such as Astropy.
- Instance ID: Every bug or issue gets a unique ID so we can keep track of what's being tested and fixed.
- Problem Statement: Think of this as the bug report. It explains what's wrong, sometimes pointing to a specific file and sometimes just describing the behavior in general terms.
- Hint: Extra guidance attached to some issues. Note that, as of now, SWE-bench does not count submissions that solve issues with the help of these hints.
- Fail-to-Pass: This is the ultimate goal, turning a failing test into a passing one. The field lists the test file and the test functions within it that were failing. A short loading sketch for these fields follows this list.
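To make these fields concrete, here is a minimal sketch of loading an instance from the Hugging Face Hub and printing the fields described above. The column names (repo, instance_id, problem_statement, hints_text, FAIL_TO_PASS) follow the public dataset card and should be double-checked against the dataset version you actually load.

```python
from datasets import load_dataset

# Load the Lite variant: small enough to inspect quickly.
dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

example = dataset[0]
print(example["repo"])               # repository name the issue belongs to
print(example["instance_id"])        # unique ID for this bug/issue
print(example["problem_statement"])  # the bug report text
print(example["hints_text"])         # extra hints (solving with these is not counted)
print(example["FAIL_TO_PASS"])       # tests that must go from failing to passing
```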

SWE-bench: Scoreboards
SWE-bench Lite
- A lightweight version with 300 curated instances for resource-efficient testing.
- Focuses on functional bug fixes across 11 repositories.
- Ideal for systems with limited computational capacity.
SWE-bench Verified
- Developed in collaboration with OpenAI, featuring 500 human-validated samples.
- Ensures clarity and solvability of test cases.
- Offers a reliable benchmark for real-world software challenges.
SWE-bench Full
- The most comprehensive version with 2,294 issue-commit pairs from 12 Python repositories.
- Designed for in-depth evaluation of AI capabilities in handling diverse software tasks.
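All three variants are published as separate datasets on the Hugging Face Hub, so switching between them is a one-line change. The identifiers below reflect the Hub naming at the time of writing; verify them before relying on this sketch.

```python
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")          # 300 instances
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")  # 500 instances
full = load_dataset("princeton-nlp/SWE-bench", split="test")               # 2,294 instances

print(len(lite), len(verified), len(full))
```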

How a General Tool Works in SWE-bench
Clone the Repository
- The tool begins by cloning the repository specified in the dataset. This provides the environment and context needed to understand and resolve the issues.
Fix the Issues
- It analyzes the problem statement associated with the instance and applies its logic or algorithms to fix the identified issue in the codebase. You must also keep a record of the logs and the steps your model took to solve the issue.
Generate the Patch
- Once the modifications are complete, the tool creates a patch file that contains the changes made to resolve the issue.
Store the Patch with Metadata
- The patch, often referred to as the "prediction," is saved along with key metadata such as the instance ID and repository name. This allows for systematic evaluation and traceability of the fixes applied.
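As a rough sketch of these four steps, the snippet below clones a repository at the instance's base commit, captures whatever edits the agent made as a unified diff, and appends it as a prediction line. The JSONL keys (instance_id, model_name_or_path, model_patch) follow the commonly published SWE-bench prediction format, and the base_commit field name is taken from the dataset card; confirm both against the harness documentation before relying on this.

```python
import json
import subprocess

def make_prediction(instance: dict, workdir: str, model_name: str, preds_path: str) -> None:
    """Clone the repo, let the agent edit it, then store the resulting patch with metadata."""
    repo_url = f"https://github.com/{instance['repo']}.git"
    subprocess.run(["git", "clone", repo_url, workdir], check=True)
    subprocess.run(["git", "checkout", instance["base_commit"]], cwd=workdir, check=True)

    # ... the agent analyzes instance["problem_statement"] and edits files here ...

    # Capture the edits as a unified diff: this is the "patch".
    diff = subprocess.run(["git", "diff"], cwd=workdir,
                          capture_output=True, text=True).stdout

    # Store the patch alongside its metadata as one JSONL prediction line.
    prediction = {
        "instance_id": instance["instance_id"],
        "model_name_or_path": model_name,
        "model_patch": diff,
    }
    with open(preds_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(prediction) + "\n")
```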
Submitting Your Model to SWE-bench
1: Fork and Clone the Repository:
Fork the SWE-bench/experiments repository and clone your fork locally.
2: Create a Submission Directory:
- Navigate to the appropriate evaluation split (evaluation/lite/ or evaluation/test), depending on which split you evaluated on.
- Create a new folder named with the submission date and your model's name (e.g., 20240415_sweagent_gpt4).
3: Prepare Submission Files:
- Within your submission folder, include:
  - all_preds.jsonl: Your model's predictions.
  - logs/: Directory containing evaluation artifacts for each task instance. These files are generated automatically by SWE-bench during evaluation.
  - metadata.yaml: Metadata for your submission, including:
    - name: Your leaderboard entry name.
    - oss: true if your system is open source.
    - site: URL for more information about your system.
    - verified: false initially (see the verification process below).
  - trajs/: Directory containing reasoning traces for each task instance, detailing the steps your system took to solve the problem. Ensure each file is named to reflect the corresponding task instance ID.
  - README.md: Additional information about your model.
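For illustration, a metadata.yaml along these lines would cover the fields listed above. The values are placeholders, and the snippet simply writes them out with PyYAML; it is a sketch, not the official template.

```python
import yaml  # PyYAML

# Placeholder values; adjust to your own system.
metadata = {
    "name": "MyAgent (GPT-4)",               # leaderboard entry name
    "oss": True,                             # open-source system?
    "site": "https://example.com/my-agent",  # where to learn more
    "verified": False,                       # stays false until verification
}

with open("metadata.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(metadata, f, sort_keys=False)
```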
4: Run Evaluation Script:
- Execute the results-generation command documented in the SWE-bench/experiments repository to generate results for your submission.
5: Submit via Pull Request:
- Push your changes to your forked repository.
- Create a pull request to the original SWE-bench/experiments repository with your submission folder.
6: Verification Process:
- Create an issue in the SWE-bench/experiments repository and provide instructions for running your model on SWE-bench.
- The SWE-bench team will run your model on a random subset of instances to verify the results.
Our First Approach to Solving Repository Issues
We've built an automated workflow, using tools like LangChain and LangGraph, designed to resolve repository issues from the SWE dataset. A sketch of the resulting graph appears after the step-by-step walkthrough below.
The Patch-Centric Approach: How It Worked

1. Entry Point
Issues from the Princeton SWE dataset are given to the workflow.
2. Search Issue


3. Extracting Content and Text from Files
4. Generating a Patch
5. Applying the Patch
Static Testing
- Header and Syntax Checks: During static testing, we validate the patch headers for syntax errors and ensure that line endings (commonly broken by file edits) are correct.
- We used third-party text editors like vim alongside hand-written validation functions to check the patch syntax.

(Figure: patch format, from "Understanding Patches", Git Pocket Guide)
- Patch Syntax Compliance: Patch files rely on specific syntax to represent changes in the codebase: lines beginning with a blank space provide unchanged context that frames the modifications within their surrounding code, lines prefixed with "+" represent additions, and lines prefixed with "-" indicate deletions.
- Git Path Relevance: We checked that the file paths referenced in the patch's "a/" (before) and "b/" (after) headers align with the repository's local file structure.
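To make the static checks concrete, here is a small illustrative validator (not the exact checker we used). It walks a patch file line by line, flags hunk lines with unexpected prefixes, and verifies that paths referenced in the a/ and b/ headers exist in the local repository.

```python
import os

# Prefixes allowed inside a unified-diff hunk: context, addition, deletion,
# hunk header, and the "\ No newline at end of file" marker.
ALLOWED_PREFIXES = (" ", "+", "-", "@", "\\")

def static_check(patch_text: str, repo_root: str) -> list:
    """Return a list of problems found in a unified-diff patch (illustrative only)."""
    problems = []
    for lineno, line in enumerate(patch_text.splitlines(), start=1):
        if line.startswith("--- a/") or line.startswith("+++ b/"):
            # Git path relevance: the referenced file should exist locally.
            # (Newly created files are an exception; this simple check ignores that case.)
            path = line.split("/", 1)[1]
            if not os.path.exists(os.path.join(repo_root, path)):
                problems.append(f"line {lineno}: path not found in repo: {path}")
        elif line.startswith(("diff --git", "index ", "new file", "deleted file")):
            continue  # other git headers are left to git itself
        elif line and not line.startswith(ALLOWED_PREFIXES):
            problems.append(f"line {lineno}: unexpected prefix: {line[:20]!r}")
    return problems
```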

Dynamic Testing
- The candidate patch is applied to the repository and the instance's fail-to-pass tests are run (a sketch follows this list).
- If Successful: The workflow concludes successfully.
- If Unsuccessful: The process enters a Debug Patch Loop, where the patch is further refined using search tools and additional debugging techniques before being tested again.
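A minimal sketch of that dynamic step, assuming the repository is already checked out at the instance's base commit and that fail_to_pass holds the test identifiers from the dataset. Many SWE-bench repositories use pytest, but some instances need repo-specific test commands, so treat this as illustrative:

```python
import subprocess

def apply_and_run_tests(repo_dir: str, patch_file: str, fail_to_pass: list) -> bool:
    """Apply a candidate patch, then rerun the fail-to-pass tests; True means they now pass."""
    applied = subprocess.run(["git", "apply", "--whitespace=fix", patch_file],
                             cwd=repo_dir, capture_output=True, text=True)
    if applied.returncode != 0:
        # Rejected patch: hand the stderr to the debug loop for refinement.
        print("git apply failed:", applied.stderr)
        return False

    # Run only the tests the dataset marks as fail-to-pass for this instance.
    result = subprocess.run(["python", "-m", "pytest", "-x", *fail_to_pass],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0
```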
6. Debugging the Patch
- The refined patch is routed back to the Apply Patch step for another attempt.
7. End Node
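Putting the walkthrough together, here is a minimal sketch of how such a patch-centric loop can be wired up as a LangGraph state graph. The node bodies are placeholders (the real workflow plugs in LLM calls, issue search, file extraction, and patch generation at each step), and the three-attempt debug limit is an assumption for illustration.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class PatchState(TypedDict):
    problem_statement: str  # issue text from the dataset instance
    relevant_files: list    # files located by the search step
    patch: str              # candidate patch in unified diff format
    tests_passed: bool      # outcome of dynamic testing
    attempts: int           # debug-loop counter

# Placeholder nodes: each returns a partial state update.
def search_issue(state: PatchState) -> dict:
    return {"relevant_files": []}           # locate code related to the issue

def extract_content(state: PatchState) -> dict:
    return {}                               # pull text from the relevant files

def generate_patch(state: PatchState) -> dict:
    return {"patch": ""}                    # draft a unified diff

def apply_and_test(state: PatchState) -> dict:
    return {"tests_passed": False,          # apply the patch, run static + dynamic tests
            "attempts": state["attempts"] + 1}

def debug_patch(state: PatchState) -> dict:
    return {}                               # refine the patch using search tools

def route_after_test(state: PatchState) -> str:
    # Success, or too many attempts, ends the run; otherwise enter the debug loop.
    return "end" if state["tests_passed"] or state["attempts"] >= 3 else "debug"

graph = StateGraph(PatchState)
graph.add_node("search_issue", search_issue)
graph.add_node("extract_content", extract_content)
graph.add_node("generate_patch", generate_patch)
graph.add_node("apply_and_test", apply_and_test)
graph.add_node("debug_patch", debug_patch)

graph.set_entry_point("search_issue")
graph.add_edge("search_issue", "extract_content")
graph.add_edge("extract_content", "generate_patch")
graph.add_edge("generate_patch", "apply_and_test")
graph.add_conditional_edges("apply_and_test", route_after_test,
                            {"end": END, "debug": "debug_patch"})
graph.add_edge("debug_patch", "apply_and_test")

app = graph.compile()
```

Calling app.invoke with an initial state containing the problem statement then drives the loop until the router returns "end", mirroring the entry-to-end flow described above.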
Key Takeaways from Strategy 1
- While the approach showed potential, it often struggled with edge cases like line-range mismatches and subtle syntax errors, requiring repeated manual intervention.
- Because the editing step (the actual fixing of issues through a patch file) was limited by context, our workflow could mostly resolve at most two chunks across a maximum of two files in one go.
- This constraint highlighted the need for a better solution capable of handling larger and more complex changes.
- Git patch headers and formatting posed significant challenges, with minor mistakes (e.g., misplaced line endings) leading to patch rejection in up to 80% of cases.
- The approach highlighted the necessity of a more granular solution, involving direct, line-by-line edits rather than relying solely on large, monolithic patches.