Training LLMs for Code Generation: Data, Evaluation & Best Practices

In today’s AI world, large language models are becoming powerful tools not only for natural language processing but also for automatic code generation. The ability of models to deliver functional software solutions opens new horizons for developers, accelerating development, testing, and integration. However, the effectiveness of such models largely depends on the quality of training data, evaluation methods, and the application of best practices in their creation and use.

Preparing software code corpora therefore calls for a comprehensive approach that covers different programming languages, coding styles, and usage scenarios.

Key Takeaways

  • Supervised fine-tuning (SFT) improves accuracy, readability, and security for coding tasks.
  • APPS and HumanEval are essential benchmarks for measuring how often generated solutions pass unit tests.
  • PEFT/LoRA and QLoRA (NF4) enable fine-tuning on single-GPU setups.

Map user intent and success metrics for code generation LLM training

| User Intent | Description | Success Metrics | Metric Explanation |
| --- | --- | --- | --- |
| Generate syntactically correct code | User wants code that compiles without errors | Compilability Rate | Percentage of generated code snippets that pass compilation or syntax check |
| Perform specific functionality | Code should correctly solve the given task | Functional Correctness | Percentage of tests passed by generated code, or probability that at least one of k generated variants is correct |
| Optimize performance | User expects efficient and fast code | Performance Metrics (Time/Memory) | Execution time and memory usage of generated code compared to a reference solution |
| Readability and maintainability | Code should be understandable for developers | Code Readability/Maintainability Scores | Metrics of code style, comments, adherence to best practices |
| Code security | Generated code should not contain vulnerabilities | Security Vulnerability Detection | Number of potential security issues or unsafe constructs detected |
| Cross-language generation | Code should be generated in multiple programming languages | Cross-Language Accuracy | Percentage of functionally correct code across different programming languages |
| Generate specific APIs/libraries | User expects usage of certain APIs or libraries | API Coverage/Correct Usage | Percentage of correct usage of target APIs or libraries |
| Creative or alternative solutions | User wants multiple approaches to the same task | Diversity Metrics | Number of distinct working solutions that solve the same task |
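The "probability that at least one of k generated variants is correct" metric is commonly reported as pass@k. A minimal sketch of the standard unbiased estimator (generate n samples per task, count the c that pass all unit tests):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n -- total samples generated for the task
    c -- number of samples that passed all unit tests
    k -- attempt budget being evaluated
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 0, 1))   # 0.0 -- no correct samples at all
print(pass_at_k(10, 10, 1))  # 1.0 -- every sample is correct
print(pass_at_k(10, 3, 1))   # ~0.3 -- a random single draw succeeds 3 times in 10
```

Averaging this value over all tasks in a benchmark such as HumanEval gives the headline pass@k score.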

Building training data pipeline: sources, curation, and labeling

| Pipeline Stage | Description | Examples/Sources | Curation Methods | Labeling Methods |
| --- | --- | --- | --- | --- |
| Sources | Initial data for LLM training | Open-source repositories (GitHub, GitLab) | - | - |
| Curation | Cleaning, normalizing, and filtering data | Removing duplicates, filtering low-quality code, standardizing formatting, removing sensitive/private info | Syntax & semantic validation, style checks, licensing checks | - |
| Labeling | Adding metadata or categories for training | Functionality tags (sorting, math, API usage), language labels, difficulty levels | - | Manual annotation, semi-automatic labeling (e.g., using test outputs), automated scripts for labeling code features |
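The deduplication step of curation can be sketched as a simple hash-based filter (a minimal example; production pipelines typically also do near-duplicate detection, e.g., MinHash over token shingles):

```python
import hashlib

def dedupe_snippets(snippets: list[str]) -> list[str]:
    """Drop exact duplicates by hashing whitespace-normalized source text."""
    seen, unique = set(), []
    for code in snippets:
        # Collapse trivial whitespace differences before hashing.
        key = hashlib.sha256(" ".join(code.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(code)
    return unique

corpus = ["print('hi')", "print('hi')  ", "print('bye')"]
print(len(dedupe_snippets(corpus)))  # 2 -- the padded copy is dropped
```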

Dataset design for code LLMs: structure, chunking, and FIM

When designing code training data for an LLM that generates program code, it is important to properly structure and organize the programming dataset. Each dataset should include code spanning multiple programming languages, styles, and levels of complexity to ensure the model's universality. To assess the quality of data and models, it is useful to use standardized benchmarks, such as HumanEval or the MBPP benchmark, which allow for checking code correctness and the model's effectiveness on real programming tasks.

One key aspect is chunking, i.e., breaking large code files or projects into logical blocks. This allows the model to more easily assimilate the code structure, reduces the risk of losing context, and improves training performance. Chunks can be based on functions, classes, modules, or even logical segments of documentation and tests, which provides a more flexible representation of the data.
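As an illustration, function- and class-level chunking of a Python file can be sketched with the standard `ast` module (a minimal version; real pipelines also handle nested definitions, comments, tests, and other languages' parsers):

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split a Python module into top-level function/class chunks."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the node's exact original text
            chunks.append(ast.get_source_segment(source, node))
    return chunks

example = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
for chunk in chunk_python_source(example):
    print(chunk, "\n---")
```

Each resulting chunk is a self-contained logical block that can be packed into a training sequence without splitting a function mid-body.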

To improve training efficiency, FIM (Fill-in-the-Middle), an approach that allows the model to restore missing code fragments within existing blocks, is also used. Using FIM improves an LLM's ability to understand internal dependencies between lines of code and generate correct, functional output, which is especially important for code correctness on the HumanEval and MBPP benchmarks.
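A FIM training example is typically built by cutting a chunk into prefix/middle/suffix and rearranging the pieces with sentinel tokens so the model learns to generate the middle last. A minimal sketch (the `<PRE>`/`<SUF>`/`<MID>` strings are illustrative; real models such as StarCoder or Code Llama define their own special tokens):

```python
import random

def make_fim_example(code: str, rng: random.Random) -> str:
    """Rearrange code into prefix-suffix-middle (PSM) order with sentinels."""
    # Pick two cut points that carve out the "middle" span to be infilled.
    i, j = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # At training time the model sees prefix and suffix, then emits middle.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

rng = random.Random(0)
print(make_fim_example("def square(x):\n    return x * x\n", rng))
```

Concatenating prefix + middle + suffix always reconstructs the original chunk, so FIM examples can be generated on the fly from ordinary left-to-right training data.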

Model tuning and training stack with PEFT and quantization

The PEFT (Parameter-Efficient Fine-Tuning) approach updates only a small subset of model parameters while preserving the knowledge learned during pretraining. For LLMs trained on code, this is useful because it allows you to quickly adapt the model to specific programming tasks or new languages without a complete retraining of a large model.

Another key component is quantization, i.e., reducing the precision of model parameters without a significant loss of performance. Using quantization reduces memory requirements and speeds up inference, which is critical for large models working with large programming datasets. Combining PEFT with quantization enables high-performance, resource-efficient code generation.
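The core idea of quantization can be illustrated with a symmetric int8 round-trip in pure Python (a toy sketch; production schemes such as NF4 use non-uniform, block-wise quantile levels rather than one linear scale):

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights to int8 range [-127, 127] with one scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from int8 codes."""
    return [v * scale for v in q]

w = [0.31, -1.27, 0.05, 0.9]
q, scale = quantize_int8(w)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, restored))
print(q, max_err)  # int8 codes; error bounded by half a quantization step
```

Each weight now occupies 1 byte instead of 4 (fp32), at the cost of a small, bounded reconstruction error.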

The training stack for such models typically includes a pre-trained LLM, modules for processing code training data, a pipeline for evaluation on benchmarks such as HumanEval and the MBPP benchmark to assess code correctness, and optimization components such as PEFT and quantization to enable fast, efficient fine-tuning.

Hands-on configuration: from TrainingArguments to reproducibility

  • Define TrainingArguments, including learning rate, batch size, number of epochs, and weight decay.
  • Set seed values for reproducibility across PyTorch, NumPy, and random libraries.
  • Enable gradient accumulation to handle large programming datasets on limited GPU memory.
  • Configure logging and checkpointing to track training metrics and allow recovery.
  • Apply mixed-precision (FP16) training to accelerate training and reduce memory usage.
  • Use an evaluation strategy with metrics such as code correctness and pass rates on the HumanEval and MBPP benchmarks.
  • Enable distributed training if multiple GPUs or nodes are available.
  • Apply PEFT or other parameter-efficient fine-tuning methods to reduce training costs.
  • Save model configuration, tokenizer, and training arguments for exact reproducibility.
  • Document dataset splits, chunking, and preprocessing steps to ensure consistent results across runs.
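The seeding step above can be collected into a single helper (a minimal sketch; the `numpy` and `torch` calls assume those libraries are installed and are skipped otherwise):

```python
import os
import random

def set_seed(seed: int) -> None:
    """Seed every RNG a training run touches, for reproducibility."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

set_seed(42)
a = random.random()
set_seed(42)
assert a == random.random()  # re-seeding reproduces the same draws
```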

Code generation LLM training with SFT

Training LLMs for code generation using SFT (Supervised Fine-Tuning) is a fundamental approach to achieving high accuracy and robustness in models for practical programming problems. In this process, the model first learns from a large code training dataset that spans different programming languages, coding styles, and complexity levels. Then, fine-tuning occurs on specialized programming datasets, where each example consists of an input description of the problem and its corresponding correct solution.

One key aspect of SFT is the use of well-annotated datasets to verify code correctness. To assess the quality of the generated code, standardized benchmarks such as HumanEval and the MBPP benchmark are often used to verify that the model produces functionally correct solutions and adheres to expected behavior.

SFT also enables adapting large language models to specific domains or projects, reducing the risk of errors and improving code-generation efficiency. During training, it is important to control the training parameters, ensure appropriate data chunking, and apply validation on test sets to maintain a high level of code correctness.
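A single SFT example is usually a problem/solution pair serialized into one training sequence. A minimal sketch of such formatting (the template and section markers here are illustrative, not a specific library's format):

```python
def format_sft_example(task: str, solution: str) -> str:
    """Serialize a (task description, reference solution) pair into one string."""
    return (
        "### Task:\n"
        f"{task.strip()}\n\n"
        "### Solution:\n"
        f"{solution.strip()}\n"
    )

example = format_sft_example(
    "Write a function that returns the maximum of two integers.",
    "def max2(a: int, b: int) -> int:\n    return a if a >= b else b",
)
print(example)
```

During fine-tuning, the loss is typically computed only on the solution tokens, so the model learns to produce the code rather than to repeat the task description.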

Where LLMs fail when generating code: common semantic and syntactic errors

| Error Type | Description | Examples | Impact on Code Correctness/Evaluation | Mitigation Strategies |
| --- | --- | --- | --- | --- |
| Syntax Errors | Code does not follow programming language rules | Missing colons, unmatched brackets, incorrect indentation | Fails to compile; low pass rates on the HumanEval and MBPP benchmarks | Use syntax validation during training; include more syntactically correct examples in code training data |
| Variable/Scope Errors | Use of undefined or incorrectly scoped variables | Using undefined variables, shadowing issues | Causes runtime errors; fails functional tests | Include examples with proper variable scoping in the programming dataset; apply linting |
| Incorrect Logic/Semantic Errors | Code is syntactically correct but does not perform the intended task | Off-by-one errors, wrong algorithm, incorrect API usage | Fails functional tests; lowers code correctness scores | Use unit tests and functional tests from HumanEval/MBPP; emphasize problem-solving patterns in training data |
| Incomplete Implementations | Parts of a function or algorithm are missing | Missing return statements, unhandled edge cases | Fails some test cases; partial correctness | Properly chunk code training data; use FIM-style training to fill missing segments |
| Dependency/Import Errors | Incorrect library or module imports | Wrong import paths, missing packages | Runtime errors; fails execution | Include diverse examples from real projects in the programming dataset; validate dependencies |
| Performance/Inefficiency Issues | Code runs but is slow or resource-intensive | Nested loops, inefficient recursion | Low efficiency scores; may fail real-world tasks | Include performance-oriented examples; evaluate on large test inputs |
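The "syntax validation" mitigation above can be as simple as a `compile()` check when filtering generated or training-data Python (a minimal gate; it catches syntax errors only, not semantic or runtime ones):

```python
def is_valid_python(source: str) -> bool:
    """Cheap syntax gate for filtering Python snippets."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

print(is_valid_python("def f(x):\n    return x + 1\n"))  # True
print(is_valid_python("def f(x)\n    return x + 1\n"))   # False -- missing colon
```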

Prompt and instruction design for coding tasks

| Prompt Aspect | Description | Example/Implementation | Effect on Code Correctness |
| --- | --- | --- | --- |
| Problem Description | Clearly define what the task requires | "Write a function that sorts an array of integers in ascending order" | Helps the model understand the task, reducing semantic errors |
| Input/Output Examples | Provide input and expected output examples | Input: [3,1,2], Output: [1,2,3] | Guides the model to generate code that passes tests on the HumanEval and MBPP benchmarks |
| Constraints & Requirements | Specify limitations or additional requirements | "Do not use external libraries; function must handle empty arrays" | Reduces generation errors and improves code correctness |
| Style & Formatting | Instructions for coding style | Pythonic style, PEP 8, use functions/classes | Improves readability and maintainability of generated code |
| Chunked Context | Provide context for code blocks from larger projects | Include related functions or module headers from the programming dataset | Reduces errors and improves functional correctness |
| Structured Templates | Use templates for repetitive tasks | Template: problem description + examples + constraints | Helps the LLM align generation with expected outcomes, increasing code correctness |
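The aspects above can be combined into one structured template. A minimal sketch (the field layout and labels are illustrative):

```python
def build_code_prompt(description: str,
                      examples: list[tuple[str, str]],
                      constraints: list[str]) -> str:
    """Assemble a structured coding prompt from description, I/O examples,
    and constraints."""
    lines = [f"Problem: {description}", "", "Examples:"]
    for inp, out in examples:
        lines.append(f"  Input: {inp} -> Output: {out}")
    if constraints:
        lines.append("")
        lines.append("Constraints:")
        lines.extend(f"  - {c}" for c in constraints)
    return "\n".join(lines)

prompt = build_code_prompt(
    "Write a function that sorts an array of integers in ascending order",
    [("[3,1,2]", "[1,2,3]")],
    ["Do not use external libraries", "Handle empty arrays"],
)
print(prompt)
```

Keeping the template fixed across tasks makes prompts predictable for the model and makes A/B comparison of prompt variants easier.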

Using RLHF, DPO, and RAG for stronger coding performance

  • RLHF (Reinforcement Learning from Human Feedback) – learning through reinforcement based on human feedback; the model receives a score for the quality of the generated code and adjusts the generation to increase code correctness and compliance with best practices.
  • DPO (Direct Preference Optimization) – direct optimization of preferences; the model learns to prefer more correct or desirable code variants based on pairwise comparisons with a programming dataset, without the need for complex reward models.
  • RAG (Retrieval-Augmented Generation) – generation using retrieval; during code generation, the model refers to external sources, such as code training data or examples from repositories, to improve the accuracy, functionality, and consistency of the results.
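For intuition, the DPO objective on a single preference pair can be written in a few lines of pure Python (the log-probabilities here are placeholder numbers; in practice they are summed token log-probs from the policy and a frozen reference model):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a log-probability of the chosen/rejected code
    completion under the policy (pi_*) or the reference model (ref_*).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy prefers the chosen completion more strongly than the reference does:
low = dpo_loss(pi_chosen=-5.0, pi_rejected=-9.0, ref_chosen=-6.0, ref_rejected=-7.0)
# Policy prefers the rejected completion:
high = dpo_loss(pi_chosen=-9.0, pi_rejected=-5.0, ref_chosen=-6.0, ref_rejected=-7.0)
print(low < high)  # True -- the loss falls as the policy favors preferred code
```

Minimizing this loss pushes the policy to assign relatively higher probability to the preferred code variant, with `beta` controlling how far it may drift from the reference model.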

FAQ

What is the role of code training data in LLMs for code generation?

Code training data provides examples of real programming tasks that the model learns from. High-quality datasets enhance code correctness and generalization across multiple programming languages.

Why is dataset curation important for programming datasets?

Curation ensures removal of duplicates, low-quality code, and sensitive information while standardizing formatting. This improves model learning efficiency and reduces errors during code generation.

What is chunking, and why is it applied in code LLMs?

Chunking divides large code files into logical segments, such as functions or classes. It allows the model to maintain context and generate syntactically and semantically correct code.

What benefits does FIM (Fill-in-the-Middle) provide for code generation?

FIM trains the model to complete missing segments within existing code blocks. This approach improves code correctness by teaching the model internal dependencies and partial code completion.

What is the purpose of SFT (Supervised Fine-Tuning) in code LLMs?

SFT adapts pretrained LLMs to task-specific programming datasets using input-output examples. It enables the model to generate functional code that passes benchmarks such as HumanEval and the MBPP benchmark.

In what way do prompts and instructions influence code generation?

Well-structured prompts clarify the task, provide input-output examples, and specify constraints. They guide the model to produce code that is syntactically correct and meets code correctness standards.

Which types of failures are common when LLMs generate code?

LLMs often produce syntax errors, incorrect logic, incomplete implementations, or dependency issues. Evaluating outputs on curated programming datasets and standardized benchmarks helps identify and mitigate these errors.

What advantages do PEFT and quantization offer for training code LLMs?

PEFT fine-tunes only a subset of model parameters, and quantization reduces memory requirements. Together, they enable efficient training on large programming datasets while maintaining high code correctness.

What contributions do RLHF, DPO, and RAG make to coding performance?

RLHF aligns outputs with human preferences, DPO optimizes for preferred code choices, and RAG incorporates relevant retrieved code examples. These methods collectively enhance functional accuracy and code correctness in real-world coding tasks.

Why are benchmarks like HumanEval and MBPP essential for evaluation?

They provide standardized tests to measure model performance and code correctness. Using these benchmarks ensures that LLMs generalize effectively across diverse programming datasets and practical coding tasks.