Reviewing student code in the age of LLMs

Some observations from this year’s code reviews

Code Reviews for Schools?

For as long as I can remember, our company Info Support has collaborated with a few universities. Part of that collaboration takes the form of code reviews for projects built by 2nd-year students. It’s always been a great learning experience (both for the students and for me)… but this year was quite special. Last year, GPT and LLMs were still pretty new. This year… well, that’s what this post is about. Let’s dive into my observations comparing this year’s code reviews to previous years.

Context

Before I continue, a few things of note:

  • The system they had to build was a UI + Backend, which also integrated with 2 other APIs to get data;
  • Teams had to use Spring Boot, which they had first touched in class only 2 weeks earlier;
  • We did 3 rounds of reviews, each with about a month of time in between;

Observation 1: Sheer Amount of Code ⬆️

I don’t have exact numbers on the amount of code produced (lines of code being a pretty meaningless number in many regards), but it was significantly more than in previous years. This should come as no surprise: given a tool which can spew out code at high frequency, more code was to be expected.

Observation 2: Average Quality of the code ⬆️

We could argue about what “quality” code means, but we have a pretty objective measurement in that we have used the same Sonar profile for all teams across the last couple of years. Compared to previous years, the quality of the code seems to have improved drastically when looking at those reports: far fewer bugs and vulnerabilities, especially in the last review round. It seems like teams had more time (or willingness) to tackle the issues the Sonar report flagged.

That being said, there is one thing a lot of students fell for. When you copy-paste from StackOverflow (admit it… we’ve been there), at least there are not too many inline code comments. When using ChatGPT to generate code, you often get output like this:

        // Comment
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // Comment
            if (response.statusCode() == 200 || response.statusCode() == 201) {
                String responseBody = response.body();
                System.out.println("Response: " + responseBody);
            } else {
                System.out.println("Error: " + response.statusCode());
            }
        } catch (IOException | InterruptedException e) {
            // Comment
            e.printStackTrace();
        }

Notice the interleaving of comments with a few lines of code. This is a clear sign of LLM-generated code. Some teams just plainly copy-pasted it all into their repositories, even when the code comments added no value or were nonsensical. I’ve even seen words I’ve never seen before in the Dutch language… like “VERZOEKLICHAAM”.

“VERZOEKLICHAAM” is a literal translation of “request body” into Dutch… and it sounds either like “the body you want for a summer on the beach” or like “something a Dutch serial killer would be a provider for”.

I had a lot of fun starting the feedback round with “alright folks, what is a ‘VERZOEKLICHAAM’?!” and seeing the confused looks on their faces. And this specific word didn’t show up just once: 2 distinct teams in 2 distinct review rounds added that comment to their codebase. Luckily, most teams understood that if a comment in code adds no value (or worse, confuses the hell out of people), they should remove it. And hey, at least I have an idea for a word I want to print on a t-shirt.
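As a sketch of what that feedback amounts to in practice: the same kind of HTTP call as in the snippet above, with the noise comments stripped and the error path made explicit instead of printed and ignored. The class and method names here are illustrative, not from any team’s codebase, and the example spins up a throwaway local server purely so it is self-contained.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CleanedClient {

    // Same call as the generated snippet, without the filler comments;
    // a failed request now surfaces as an exception instead of a println.
    static String fetch(HttpClient client, URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 200 || response.statusCode() == 201) {
            return response.body();
        }
        throw new IOException("Unexpected status: " + response.statusCode());
    }

    public static void main(String[] args) throws Exception {
        // Throwaway local server so the example runs without any external API.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", exchange -> {
            byte[] body = "hello".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        try {
            URI uri = URI.create("http://localhost:" + server.getAddress().getPort() + "/");
            System.out.println(fetch(HttpClient.newHttpClient(), uri)); // prints "hello"
        } finally {
            server.stop(0);
        }
    }
}
```

Notice how, with the generated comments gone, the remaining code is short enough to explain itself.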

Observation 3: Comprehension of the code 🔀

During our code reviews we try not just to give feedback on the code, but also to do a quick check on whether the teams actually know what is in their codebase. Here we noticed an interesting divergence: some teams were way ahead of students from previous years, while others had no clue what was in their own code.

This is something I’ve been seeing a lot, and I’ve talked about it with many people. Students currently live in a universe where they have an amazing coach available to them 24/7. They can ask questions of LLMs and generally get answers that are pretty decent. Students who use this power are getting seriously ahead of the curve, while others (the copy-paste enthusiasts) are falling behind.



One might assume that this makes LLMs great equalizers, solving a serious issue in education, right? Yes, but not everyone is equal here. Students commented multiple times that those with access to a paid ChatGPT subscription got way better answers than those without. While LLMs are getting better in general, it’s still worth noting that students who paid money… got better help. Those who can’t afford the subscription… tough luck?

Observation 4: Complexity 🚀

When you have an LLM capable of generating complex code, chances are it might just give you something which is way too complex for your use case. I have 2 examples here that people in the Java space might appreciate.

The first example was a team who asked an LLM: “How do I make a performant REST API in Spring?”. If you ask that question to an experienced Java developer, they might ask “what do you mean by ‘performant’?”. They might even give you a small lecture on the folly of premature optimization. But not LLMs, they are eager to please! The LLM recommended the use of Spring WebFlux!?!

Remember, these students had first worked with Spring Boot only 2 weeks earlier. Pushing WebFlux on people who are only just starting out in Spring is like dumping them in a big icy lake. Sure, they might be able to navigate their way through it… but it’s quite a shock to the system. I recommended they stick to the regular, non-reactive Spring MVC REST controllers for now, at least until they get more familiar with Spring to begin with.
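To get a feel for why that shock is real: WebFlux itself needs Reactor on the classpath, but the same blocking-versus-asynchronous mental shift can be sketched with nothing but the JDK’s CompletableFuture. All names here are illustrative.

```java
import java.util.concurrent.CompletableFuture;

public class BlockingVsAsync {

    // Stand-in for some work a request handler would do.
    static String load() {
        return "greeting";
    }

    public static void main(String[] args) {
        // Blocking style: reads top to bottom, easy to step through in a debugger.
        String direct = load();
        System.out.println(direct.toUpperCase());

        // Asynchronous style: the same work becomes a pipeline of callbacks.
        // Reactive code pushes this much further, across your whole call stack.
        CompletableFuture<String> future = CompletableFuture
                .supplyAsync(BlockingVsAsync::load)
                .thenApply(String::toUpperCase);
        System.out.println(future.join());
    }
}
```

Both print the same thing, but once the pipeline grows, stack traces and debugging in the asynchronous version get much harder for a beginner, which is exactly the icy-lake effect.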



The second example is more of an architectural concern. One of the teams was concerned about hitting an external API too many times and wanted to employ caching. If you know a bit of Spring, @Cacheable might immediately come to mind. Unfortunately for this team, they forgot to include “Spring” in their request to the LLM. So the LLM gave them a very popular caching solution: Redis!

This led the team away from a simple annotation; instead they added a whole new Redis component to their architecture. The code they wrote to keep the Redis cache in sync with what the Spring code produced was monstrous and hurt their productivity immensely!
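For comparison, the per-key behaviour @Cacheable would have given them boils down to roughly this, sketched here with a plain ConcurrentHashMap. This is a simplification: a real Spring cache provider adds eviction, expiry, and configuration on top. The names and the fake external call are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SimpleCache {

    // Counts how often the "expensive" lookup actually runs,
    // so we can see the cache doing its job.
    static final AtomicInteger calls = new AtomicInteger();

    static final Map<String, String> cache = new ConcurrentHashMap<>();

    // Stand-in for the external API call the team wanted to avoid repeating.
    static String fetchFromExternalApi(String id) {
        calls.incrementAndGet();
        return "data-for-" + id;
    }

    // Roughly what @Cacheable gives you per key: compute once, reuse afterwards.
    static String getCached(String id) {
        return cache.computeIfAbsent(id, SimpleCache::fetchFromExternalApi);
    }

    public static void main(String[] args) {
        getCached("42");
        getCached("42");
        getCached("42");
        System.out.println(calls.get()); // the external call ran only once
    }
}
```

No extra infrastructure, no synchronization code to write by hand; that contrast is the whole point of the observation.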

So in short, complexity went through the roof, leaving some students demotivated and thinking that Java is too hard, not realizing that it was their LLM of choice which made it this hard to begin with!

To the future!

Since we do these code reviews every year, I’m quite curious to see where we will end up next! Will these observations be confirmed and amplified next year… or will we see something totally different that we can’t predict yet? I guess we’ll find out in due time!