This was originally posted on LinkedIn by the CTO of Hatchways.
Lately, many companies have expressed concerns about candidates using ChatGPT in their interview process. They've tested their questions with the tool and discovered that it's effortless for candidates to use it to answer them. This has also caused some companies to become wary of take-home assessments, as it's difficult to verify whether candidates used ChatGPT for their solutions. As a result, I've often been asked to write this article: a guide on how to create a take-home assessment that's less vulnerable to ChatGPT.
Before we dive into this topic, it's important to note that ChatGPT is a powerful tool that can assist with almost any question. That in itself shouldn't be a cause for concern, as the same tool can aid job performance. The bigger concern is when companies use interview questions that ChatGPT can solve outright, producing a passable solution a candidate can submit without any critical thinking. For instance, with HackerRank questions, you can feed the whole prompt into ChatGPT and ask it to generate a solution. Simple follow-up prompts can then clean up the solution and make it unique (see an example of this here). The fear is that candidates without much coding experience can use ChatGPT to circumvent traditional plagiarism filters and move on to the next round.
But not to worry, as the nature of our jobs evolves due to ChatGPT, the interview process will also adapt to assess the new skills required in the workplace. In this article, we'll offer guidance on crafting a take-home assessment that accommodates the use of ChatGPT during the interview process.
Assess New Skills In Light of ChatGPT Usage
Many companies that rely on LeetCode-style questions use automated tests to determine if candidates should proceed to the next round. However, this approach primarily assesses a candidate's ability to produce correct solutions rather than evaluating their overall engineering competence. It also overlooks the likelihood that candidates use tools like ChatGPT to assist them.
Instead, we should assume that candidates have access to such tools and adjust what we evaluate. As generative AI tools become widely adopted, the nature of coding will change, but developers will not become obsolete; rather, the skills that distinguish good developers will evolve. Developers are likely to concentrate less on writing code and more on critically analyzing the output of tools like ChatGPT to ensure the generated code is well-written and addresses the specific problem at hand.
In this context, merely asking candidates to find a correct solution in an assessment will not suffice as an adequate evaluation, especially for questions that require minimal context (more on this in the next section). It is crucial to adapt our assessments to evaluate a broader range of skills.
For instance, rather than asking candidates to write code, consider having them assess the quality of existing code (essentially performing a code review). The code to review could even be generated by ChatGPT, as developers' jobs will increasingly involve this kind of evaluation. Alternatively, when asking candidates to code, evaluate not only the correctness of their solution but also their code quality and ability to identify edge cases.
In one example, we asked ChatGPT to write code for one of our retired assessments. The task involved writing an API route that fetched blog data from a third-party API, deduplicated the data, sorted it based on query parameters, and returned the results. Here's a snippet of code that ChatGPT provided for the sorting and deduplication task:
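(The original snippet did not survive formatting, so the sketch below is a reconstruction of that kind of output, based on the issues discussed next: `findIndex` inside a loop and field-by-field post comparison. The function name, field names, and parameters are illustrative, not the exact code ChatGPT produced.)

```javascript
// Reconstruction of ChatGPT-style output for the dedupe-and-sort task.
// Deduplication runs findIndex inside a loop (O(n^2)) and deep-compares
// several post fields instead of relying on the unique post id.
function dedupeAndSort(posts, sortBy, direction) {
  const unique = [];
  for (const post of posts) {
    // findIndex re-scans the accumulated array on every iteration,
    // comparing multiple fields rather than just the unique id
    const index = unique.findIndex(
      (p) =>
        p.id === post.id && p.title === post.title && p.author === post.author
    );
    if (index === -1) {
      unique.push(post);
    }
  }
  unique.sort((a, b) => {
    if (a[sortBy] < b[sortBy]) return direction === "desc" ? 1 : -1;
    if (a[sortBy] > b[sortBy]) return direction === "desc" ? -1 : 1;
    return 0;
  });
  return unique;
}
```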
Though the provided code functioned, it had several fundamental issues, such as using `findIndex` within a loop, leading to inefficiency, and incorrect post comparisons due to the tool not recognizing the uniqueness of post ids. When I requested GPT-4 to enhance the code, it only refactored for better readability without addressing the core problems. Even when asked to improve efficiency, the code became increasingly complex as the tool failed to understand the simplicity of comparing post ids for uniqueness. However, with effective prompt engineering, GPT-4 could be guided to improve the code. This highlights a new skill to assess: can a candidate critically review GPT-4 responses and provide accurate prompts to aid problem-solving?
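For contrast, here is a minimal sketch of the simpler approach alluded to above: since post ids are unique, a single-pass `Set` lookup deduplicates in O(n) with no field-by-field comparison. Again, the names are illustrative, not taken from the original assessment.

```javascript
// Simpler fix: treat the unique post id as the sole deduplication key,
// using a Set for O(1) membership checks in a single pass.
function dedupeAndSortById(posts, sortBy, direction = "asc") {
  const seen = new Set();
  const unique = posts.filter((post) => {
    if (seen.has(post.id)) return false;
    seen.add(post.id);
    return true;
  });
  return unique.sort((a, b) => {
    if (a[sortBy] === b[sortBy]) return 0;
    const ascending = a[sortBy] < b[sortBy] ? -1 : 1;
    return direction === "desc" ? -ascending : ascending;
  });
}
```

Recognizing that this simplification exists — and steering an AI tool toward it — is exactly the kind of critical review the assessment should surface.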
In conclusion, as AI tools like ChatGPT become increasingly prevalent in coding, it's crucial to modify assessments to evaluate a broader skill set. The focus should shift from simply seeking correct solutions to examining additional skills, assuming candidates utilize ChatGPT. This approach allows companies to effectively assess candidates' abilities in a world where AI-generated code is commonplace.
Incorporate More Context
Interview questions often lack context, usually involving writing code in isolation. For instance, most LeetCode questions are under 200 words and don't require a starting codebase. This makes it incredibly easy to paste the question into ChatGPT and receive a suitable response.
To counter this, include more context in your questions. GPT-4's ability to retain and respond to context is bounded by its context window of 32,768 tokens, or approximately 25,000 words, which is shared between the prompt and the response. A 200-word question therefore leaves roughly 24,800 words for GPT-4's answer, granting it a considerable advantage.
Incorporating more context allows you to create practical assessments that mimic real-world development tasks on existing codebases. This approach significantly increases the context while maintaining a manageable challenge length for candidates. For example, a Hatchways assessment might include a starting codebase and a ticket, requiring close to 32,000 tokens of context. Despite the extensive context needed, candidates can complete such a question in about an hour, as it focuses on understanding and retaining context rather than extensive output.
It's crucial to remember that technology will continue to improve, and this strategy alone is insufficient to prevent GPT-4 cheating. However, by providing more context, you can create an additional barrier to cheating, as copying an entire codebase and challenge into GPT-4 becomes unwieldy. When combined with evaluating diverse skills, this strategy can lead to more nuanced GPT-4 responses that necessitate critical thinking to assess their quality.
To see how we incorporate more context, you can check out the experience of a Hatchways assessment here.
Ask Follow-Up Questions In a Live Setting
An effective strategy for take-home assessments is to conduct a synchronous interview as a follow-up discussion for candidates who pass. This discussion allows you to delve into the decisions they made and gain insight into the trade-offs they considered.
This approach helps you understand their thought process and determine their comprehension, even if they used ChatGPT as an aid. It also enables you to extend the assessment and create a more seamless interview experience for the candidate.
In addition, these follow-up discussions can help you evaluate the candidate's problem-solving skills and detect cheating attempts. Be attentive to any red flags: for example, ask them to walk you through a particular decision they made, or to discuss the implications of their choices. If the candidate struggles to explain their thought process, or if their answers seem too rehearsed, it could be a sign that they received outside help. Here are a few questions you could ask in a follow-up discussion:
- Can you walk me through the reasoning behind this specific decision?
- What alternative approaches did you consider, and why did you choose this one?
- How do you think the performance of your solution would be impacted by larger datasets?
- How would you improve your solution to address potential scalability issues?
Design Questions With ChatGPT Limitations in Mind
It's a good practice to test your questions using ChatGPT to gauge how effortlessly the tool can answer them and anticipate potential responses from candidates who might paste the question into ChatGPT. This allows you to adjust your expectations and avoid accepting superficial answers that merely rely on ChatGPT's initial output.
By creating questions with ChatGPT's limitations in mind, you can exploit its current constraints. For instance, ChatGPT's knowledge cutoff is September 2021, which is outdated in the rapidly evolving software engineering field. Consider asking questions requiring more current knowledge.
Instead of using LeetCode-style questions that don't demand knowledge beyond September 2021, you can focus on assessing candidates' practical skills. Incorporating newer technologies or updated versions of existing dependencies in your assessment can give you an edge over ChatGPT.
For example, we designed an Angular assessment using the ngx-charts library. When we attempted to use ChatGPT to produce an answer, the solution encountered several issues. This was because the version of ngx-charts in our codebase was 20.1.0, whereas ChatGPT believed the latest version was 16.0.2 (a difference of four major versions).
This concept can be applied more broadly, as ChatGPT tends to default to older language syntax due to its training (e.g., Vue 2 syntax instead of Vue 3, which has significant differences and breaking changes). In such cases, candidates are better off referring to framework documentation rather than using ChatGPT, as its output may be buggy and time-consuming to fix.
It's important, though, not to directly test candidates on their knowledge of updated frameworks. Instead, by giving them a practical assessment, you evaluate their ability to work with these frameworks and locate relevant information through documentation and online resources – essential skills for developers in their daily work.
In summary, creating a practical assessment that evaluates diverse skills, includes more context, is carried through subsequent interview rounds, and is designed with ChatGPT's limitations in mind will reduce the likelihood of candidates using the tool to cheat.
Like Excel for accountants, ChatGPT can offer significant advantages to developers. But, as with accountants, we shouldn't ask engineers to perform tasks that ChatGPT can handle without any engineering expertise.
Want me to tear down your current interview question with ChatGPT? Send me a LinkedIn message and I’ll send you some ways to improve your question.