
LLM Attack Challenge

It’s time to think a little differently about the capabilities of generative AI.

With the burgeoning popularity of large language models (LLMs), the capabilities of new models are rising, including the ability to pinpoint vulnerabilities in software and generate code to attack it. Capture the Flag (CTF) competitions are events in which people solve puzzles built on computer security scenarios to find the “flags” or other hidden pieces of information and win the competition.


In this competition you will use generative large language models (e.g. ChatGPT, WizardCoder, Bard, or similar) to solve CTF challenges. Challenges will be drawn from prior CTF competitions and will cover the usual CTF categories (pwn, web, rev, etc.).


A successful submission must include all the prompts and responses from the language model, a document describing all the methods used for exploiting and solving the puzzles, and a short recorded demonstration of solving the puzzle; see the rules section for details. You may instruct the AI to edit files directly, or perform the edits yourself as the AI describes them. Points will be awarded both for the number of challenges solved and for the creativity of the solutions.

competition timeline

1 November 2023: CTF challenges released (participants can, in principle, start working on the puzzles from that day)
November 2023: NYU AD competition date; an in-person session will be held. This is also the last day to complete the challenge at the NY site
10 November 2023: Final in-person presentation for both the NY and AD sessions


Database: You can select any of the CTF puzzles to complete, as long as they are provided in this competition.

Solution: Participants can analyze the challenges on their own, but the final solution must be provided by AI. Participants may use “prompt engineering” techniques to give the AI hints to solve the problem.

Report: For each CTF challenge attempted, the participants are required to produce a solution report with the following contents: all the prompts they used with the generative models, all the log output generated by the software, and an overview of their solution.

Tools: In addition to the generative AI tools, the participants are allowed to use other external tools to help them solve the puzzle (such as forensic tools, standard UNIX utilities, or specialized CTF tools), but these tools must be listed in the solution report. Examples of such tools include apk2jar, apktool, Ghidra, Hopper, Burp Suite, etc.

Generative AIs:  Participants can use any of the generative AI systems that are available to them (using multiple AI systems is even encouraged) as long as they can provide the name of the AI systems they used in the report.
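
The reporting rules above ask for every prompt, every response, the log output, the external tools, and the names of the AI systems used. One way to keep that material together while working is a small per-challenge record, sketched below in Python. This is only an illustration and not an official submission format: the class name `ChallengeReport`, its field names, and the JSON layout are all assumptions made for this example.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class ChallengeReport:
    """Hypothetical per-challenge record; field names are illustrative only."""
    challenge: str
    ai_systems: List[str] = field(default_factory=list)
    external_tools: List[str] = field(default_factory=list)
    exchanges: List[Dict[str, str]] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    overview: str = ""

    def record_exchange(self, ai_system: str, prompt: str, response: str) -> None:
        """Store one prompt/response pair and remember which AI system produced it."""
        if ai_system not in self.ai_systems:
            self.ai_systems.append(ai_system)
        self.exchanges.append(
            {"ai_system": ai_system, "prompt": prompt, "response": response}
        )

    def to_json(self) -> str:
        """Serialize the whole record so it can be attached to the report."""
        return json.dumps(asdict(self), indent=2)


# Usage with placeholder values (not taken from any real challenge).
report = ChallengeReport(challenge="example-challenge")
report.external_tools.append("Ghidra")
report.record_exchange("ExampleLLM", "What does this function do?", "It checks a password.")
report.logs.append("placeholder log line")
report.overview = "Short summary of how the LLM solved the puzzle."
report_json = report.to_json()
```

Recording each exchange as it happens, rather than reconstructing it afterwards, also makes it easy to count the prompts used per challenge, which the Speed criterion refers to.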


Judging criteria


Scale: The number of CTF challenges solved by the participants, weighted by the score of each puzzle.

Creativity: The method used to find the vulnerabilities and solve the puzzles. All prompts and logs must be kept and provided; the more of the work the LLM did, the better. You should also include a summary of how the puzzle was solved by the LLM.

Speed: How many prompts you used to solve the CTF challenges; achieving the same or better score with fewer prompts is encouraged, as this indicates the solution is more general.

Demonstration: The demonstration of the solution. It should use the same approach that was suggested by the generative large language model you used. The demonstration can be in the form of a recorded video.

Penalty items: The final solution must be provided by the generative model with prompt engineering techniques, even if the participants come up with the proper solutions by themselves. Penalty items will be applied if the final solution does not come from the generative AI.


There are 100 points in total. The final grade is the weighted sum of all the judging criteria.


50% Scale: The percentage of challenges solved by the participants, worth up to 50 points. The total is determined by the total score of the challenges in the database (the overall puzzle scores are rescaled to 50 points).

30% Creativity: The percentage of solutions provided by the participants that make sense to the judges, worth up to 30 points. Up to 5 extra points may be added if multiple solutions are provided for a single challenge.

10% Speed: How simple the solution provided by the participants is, worth up to 10 points.

10% Demonstration: How clear the demonstration is, worth up to 10 points.

Up to -10% Penalty items: If challenges were not solved through the generative AI system, a penalty of up to 10% may be applied according to the number of cases.
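
As a rough worked example of the weighted sum, the Python sketch below combines the criteria into one grade. It is not the judges' actual procedure. The function name `final_grade` and its parameters are hypothetical, and several details are assumptions: Creativity is taken as the fraction of submitted solutions the judges accept, the bonus of up to 5 points is added on top of the 30, Speed and Demonstration are passed in as judge scores from 0 to 10, and the penalty is treated as 0 to 10 points off the 100-point total.

```python
def final_grade(
    solved_points: float,      # sum of puzzle scores for the solved challenges
    total_points: float,       # sum of puzzle scores for all challenges in the database
    accepted_solutions: int,   # solutions that make sense to the judges
    submitted_solutions: int,  # solutions submitted
    bonus: float,              # extra creativity points for multiple solutions
    speed: float,              # judge score, 0 to 10
    demonstration: float,      # judge score, 0 to 10
    penalty: float,            # points deducted, 0 to 10
) -> float:
    """Combine the judging criteria into a grade (illustrative assumptions only)."""
    if total_points <= 0:
        raise ValueError("total_points must be positive")

    # Scale: puzzle scores rescaled to 50 points.
    scale = 50.0 * solved_points / total_points

    # Creativity: share of accepted solutions out of 30, plus a bonus capped at 5.
    if submitted_solutions > 0:
        creativity = 30.0 * accepted_solutions / submitted_solutions
    else:
        creativity = 0.0
    creativity += min(max(bonus, 0.0), 5.0)

    # Speed and Demonstration are each clamped to their 10-point share.
    speed = min(max(speed, 0.0), 10.0)
    demonstration = min(max(demonstration, 0.0), 10.0)

    # Penalty: at most 10 points.
    penalty = min(max(penalty, 0.0), 10.0)

    return scale + creativity + speed + demonstration - penalty


# Made-up numbers: 300 of 500 puzzle points, 4 of 5 solutions accepted,
# 2 bonus points, speed 7, demonstration 8, penalty 3.
example = final_grade(300, 500, 4, 5, 2, 7, 8, 3)
```

With these made-up inputs, Scale contributes 30 points and Creativity 26 (24 plus the 2-point bonus), so the grade comes to 30 + 26 + 7 + 8 - 3 = 68.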

Scoring Rubrics

2023 competition organizers



Participants will be ranked by the final points they achieve (see the scoring rubrics section):

a. First place $300
b. Second place $200
c. Third place $100


2023 winners
