Goda Nagakalyani, Saurav Chaudhary, et al.
SIGCSE 2025
In a typical introductory programming course, grading of student-submitted programs is often done manually by examining the source code before assigning the final grade, for reasons such as checking compliance with specific criteria (e.g., 'use iteration, not recursion', or 'do not use additional arrays') or allotting partial marks. Graders often use a rubric to grade according to such criteria. However, manual grading of source code is labor-intensive and impractical for large-scale online courses. In this paper, we therefore propose techniques based on Large Language Models (LLMs) for code to automatically grade student programs according to instructor-specified rubrics. Leveraging a dataset of 27,966 datapoints that we created, we study a total of 44 combinations of open-source LLMs and methodologies, including Zero-Shot prompting, Few-Shot prompting, Supervised Fine-Tuning, QLoRA, Direct Preference Optimization (DPO), code scrambling, and code augmentation. To our knowledge, we are the first to address the generalized source code grading problem and to propose a solution with promising results. Among the models we studied, Codestral 22B achieves a high micro-accuracy of 85% without any fine-tuning, while Qwen-2.5-Coder-7B-Instruct with DPO fine-tuning achieves the same micro-accuracy with only 35% of the GPU memory usage and 10% of the inference time of Codestral 22B.