Worked example: evaluating AI-generated Python code
Let us take a simple prompt:
Prompt: Write a Python Flask endpoint that accepts two numbers and returns their sum as JSON.
A typical AI-generated answer might look like this:
from flask import Flask, request, jsonify

app = Flask(__name__)


@app.route('/sum', methods=['POST'])
def sum_numbers():
    data = request.get_json()
    a = data['a']
    b = data['b']
    return jsonify({"result": a + b})


if __name__ == '__main__':
    app.run(debug=True)
At first glance, this looks fine. It is short, readable, and works for a basic request like:
{
  "a": 5,
  "b": 7
}
The response will correctly return:
{
  "result": 12
}
A casual review might approve this quickly. A rubric-based review would not.
Quick review
The code is functionally correct for the happy path, but it misses several practical concerns:
- no validation that JSON exists
- no check that a and b are present
- no type validation
- no error handling for malformed requests
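- no handling for string inputs or null values: "5" and "7" are concatenated to "57", and null raises an unhandled TypeError
- debug mode enabled unconditionally, which exposes Werkzeug's interactive debugger if the script is ever run this way in production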
- no structured logging
- no clear production configuration pattern
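The input-handling gaps in this list can be reproduced without starting a server. The sketch below assumes a hypothetical helper, naive_sum, that mirrors the arithmetic of the generated handler (data['a'] + data['b']) on an already-parsed body. Neither naive_sum nor probe is part of the generated answer; they exist only to make the failure modes visible.

```python
def naive_sum(data):
    """Mirror the generated handler's core logic on an already-parsed body."""
    return data["a"] + data["b"]


def probe(data):
    """Return the result, or the name of the exception the logic raises."""
    try:
        return naive_sum(data)
    except Exception as exc:  # broad on purpose: we only want to observe the failure
        return type(exc).__name__


print(probe({"a": 5, "b": 7}))      # 12
print(probe({"a": "5", "b": "7"}))  # 57 (string concatenation, not addition)
print(probe({"a": 5}))              # KeyError (missing field)
print(probe({"a": 5, "b": None}))   # TypeError (null value)
print(probe(None))                  # TypeError (body parsed to null)
```

Inside the Flask handler, an uncaught exception like these would typically reach the client as a generic server error rather than a clear 400 response, and the string case would return a wrong answer with a success status.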
Rubric-based scoring
- Correctness — 4/5 — Works for expected numeric inputs
- Instruction adherence — 5/5 — It does what the prompt requested
- Edge-case handling — 1/5 — Missing validation and failure handling
- Security — 2/5 — Debug mode should not be used in production
- Readability — 5/5 — Easy to read
- Maintainability — 3/5 — Simple, but lacks defensive structure
- Performance — 5/5 — Fine for this use case
- Testability — 3/5 — Testable, but not designed with error scenarios in mind
- Production readiness — 1/5 — Missing validation, logging, and safe configuration
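To see why the individual scores matter more than an overall impression, the rubric can be tallied in a few lines. This is a minimal sketch: the scores are copied from the list above, while the blocking threshold is an assumption made for illustration, not a rule stated in the rubric.

```python
from statistics import mean

# Scores copied from the rubric above (each out of 5).
scores = {
    "Correctness": 4,
    "Instruction adherence": 5,
    "Edge-case handling": 1,
    "Security": 2,
    "Readability": 5,
    "Maintainability": 3,
    "Performance": 5,
    "Testability": 3,
    "Production readiness": 1,
}

# Hypothetical gate: any criterion at or below this score blocks approval.
BLOCKING_THRESHOLD = 2

average = round(mean(scores.values()), 2)
blocking = [name for name, score in scores.items() if score <= BLOCKING_THRESHOLD]

print(average)   # 3.22
print(blocking)  # ['Edge-case handling', 'Security', 'Production readiness']
```

An average just above 3 looks acceptable on its own. Listing the criteria that fall at or below the threshold shows where the review should actually focus.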
What the code missed
This is the key lesson. The code works, but it is not production-ready. That gap is exactly what a rubric exposes.
Now compare it to an improved version:
from flask import Flask, request, jsonify
import logging
import os

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@app.route('/sum', methods=['POST'])
def sum_numbers():
    # silent=True returns None instead of raising on a missing or malformed body
    data = request.get_json(silent=True)
    # Reject anything that is not a JSON object (None, lists, bare numbers, strings)
    if not isinstance(data, dict):
        return jsonify({"error": "Invalid or missing JSON body"}), 400

    if 'a' not in data or 'b' not in data:
        return jsonify({"error": "Fields 'a' and 'b' are required"}), 400

    try:
        # float() also accepts numeric strings such as "5"
        a = float(data['a'])
        b = float(data['b'])
    except (TypeError, ValueError):
        return jsonify({"error": "'a' and 'b' must be numeric"}), 400

    result = a + b
    logger.info("Sum calculated successfully")
    return jsonify({"result": result}), 200


if __name__ == '__main__':
    # Debug stays off unless FLASK_DEBUG is explicitly set to "true"
    debug_mode = os.getenv("FLASK_DEBUG", "false").lower() == "true"
    app.run(debug=debug_mode)
This version is still simple, but it is much stronger operationally. It validates input, returns a clear 400 response for malformed requests, reads the debug setting from the environment instead of hardcoding it, and includes basic logging.
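One way to raise the testability score further is to move the validation into a plain function that the route calls. The sketch below assumes a hypothetical helper, parse_operands, which does not appear in the code above. It applies checks similar to those in the handler and returns either the two floats or an error message, so error scenarios can be exercised without an HTTP client.

```python
def parse_operands(data):
    """Validate a parsed JSON body.

    Returns ((a, b), None) on success or (None, error_message) on failure.
    """
    if not isinstance(data, dict):
        return None, "Invalid or missing JSON body"
    if "a" not in data or "b" not in data:
        return None, "Fields 'a' and 'b' are required"
    try:
        return (float(data["a"]), float(data["b"])), None
    except (TypeError, ValueError):
        return None, "'a' and 'b' must be numeric"


# Each error scenario is now a one-line check.
print(parse_operands({"a": 5, "b": 7}))    # ((5.0, 7.0), None)
print(parse_operands({"a": 5}))            # (None, "Fields 'a' and 'b' are required")
print(parse_operands({"a": "x", "b": 1}))  # (None, "'a' and 'b' must be numeric")
print(parse_operands(None))                # (None, 'Invalid or missing JSON body')
```

With a helper like this, the route would only translate the returned error into a 400 response and the returned operands into the sum.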
That is the difference between code generation and engineering judgment.