ChatGPT outperforms medical students and matches residents on neurosurgical board-style questions

William Mack

An analysis published in the Journal of Neurosurgery, which assessed ChatGPT’s (OpenAI) performance on a surrogate for neurosurgical board-style questions, has found that the natural language processing (NLP) algorithm significantly outperformed medical students and scored only slightly below residents currently studying for the written board exam with Self-Assessment Neurosurgery (SANS) questions.

Noting that “machine learning in neurosurgery has quickly become a topic of great interest”, authors William Mack (University of Southern California, Los Angeles, USA) and colleagues set out to assess the current performance of the artificial intelligence (AI) model ChatGPT on the primary written neurosurgical boards. They note that their article is “merely a proof of concept” of how such technologies could be used in neurosurgical clinical practice in the future, with a specific focus on training and continued education.

A total of 643 Congress of Neurological Surgeons (CNS) SANS questions—one of the main study mechanisms for residents preparing for the written board exam—were used as a surrogate for the overall question performance of ChatGPT. For each, the entire question stem alongside all multiple-choice responses was copied and pasted into the input section of the model one question at a time.

Because ChatGPT is an NLP model that generates its responses one word at a time based on output probabilities, the same input can produce many different automatically generated answers. Up to three regenerations were therefore attempted for each question stem to maximise the likelihood of extracting the correct answer. Similarly, because ChatGPT does not accept images as input, only the text of each question was entered; where a question included imaging features or diagrams, the stem was entered without them.

Results were calculated using several different methods, including raw overall score; overall score within three regenerations; raw overall score excluding questions referencing images/diagrams within a stem; and overall score within three regenerations excluding questions referencing images or diagrams within stems.
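The four scoring methods listed above can be illustrated with a short Python sketch. This is not the authors' code: the `QuestionResult` record, the `score` function, and the toy data are hypothetical, and the choice to score only questions the model actually answered is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuestionResult:
    """Hypothetical record for one question (not from the study)."""
    has_image: bool                 # stem references an image or diagram
    attempts: List[Optional[bool]]  # one entry per generation; None = no answer given


def was_answered(q: QuestionResult) -> bool:
    # A question counts as answered if any generation produced an answer.
    return any(a is not None for a in q.attempts)


def score(results: List[QuestionResult],
          within_regenerations: bool = False,
          exclude_images: bool = False) -> float:
    """Percentage correct under one of the four scoring methods."""
    # Assumption: unanswered questions are left out of the denominator.
    pool = [q for q in results
            if was_answered(q) and not (exclude_images and q.has_image)]
    if not pool:
        return 0.0
    if within_regenerations:
        # Correct if any of the generations was correct.
        correct = sum(any(a is True for a in q.attempts) for q in pool)
    else:
        # Raw score: only the first generation counts.
        correct = sum(q.attempts[0] is True for q in pool)
    return 100.0 * correct / len(pool)


# Toy data, invented purely to exercise the function.
toy = [
    QuestionResult(False, [True, True, True]),
    QuestionResult(False, [False, True, False]),
    QuestionResult(True, [False, False, False]),
    QuestionResult(True, [None, None, None]),
]

for regen in (False, True):
    for no_img in (False, True):
        print(regen, no_img, round(score(toy, regen, no_img), 1))
```

Allowing regenerations can only raise or maintain the score, since a question correct on the first attempt is also correct within three attempts.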

For comparative analyses, the 643 questions were completed for the first time by two postgraduate residents, neither of whom had passed the written exam at the time. Four medical students interested in neurosurgery also completed all questions to establish a “reasonable benchmark for model performance”. Mean performance per category was gathered from the SANS website to provide comparisons with all first-time test takers (a group that was not limited to residents or those who had not yet passed the boards) and so ascertain the average performance of all subscribers to the neurosurgery SANS question bank.

Despite three independent input regenerations, ChatGPT refused to answer 25 questions (3.9%), all of which included images or diagrams, Mack and colleagues report. On its first attempt, across all questions it responded to, ChatGPT answered 329 questions (53.2%) correctly. When the 166 questions containing images or diagrams were excluded, its performance increased slightly to 54.9%. When each question was evaluated within three independent input regenerations, model performance improved to 58.7%. Allowing three regenerations and also excluding questions with images or diagrams gave ChatGPT its best performance, with 60.2% of questions answered correctly.
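As a quick check on the reported figures, the arithmetic can be reproduced in a few lines of Python. The variable names are illustrative, and the denominator of 618 answered questions is an inference from the reported numbers rather than a figure stated in the article.

```python
total_questions = 643      # SANS questions entered into ChatGPT
refused = 25               # questions ChatGPT did not answer
correct_first_pass = 329   # correct on the first attempt

# Inferred denominator: questions the model actually responded to.
answered = total_questions - refused

refusal_rate = round(100 * refused / total_questions, 1)
first_pass_accuracy = round(100 * correct_first_pass / answered, 1)

print(answered, refusal_rate, first_pass_accuracy)  # prints: 618 3.9 53.2
```

Dividing by all 643 questions instead would give roughly 51.2%, so the reported 53.2% is consistent with scoring only the questions that received a response.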

Despite performing better than the four medical students (26.3%) and similarly to the two actively studying residents (61.5%) used as comparators, ChatGPT in its current state did worse than the average SANS user (69.3%). However, the model did outperform the present resident cohort in the categories of ‘functional’, ‘paediatrics’, and ‘pain/peripheral nerve’, and performed at an approximately similar level to the average SANS user in these categories, even surpassing their mean performance in pain/peripheral nerve. Proffering an explanation for this, Mack and colleagues state this could be “a reflection of the average neurosurgical trainee’s lack of sufficient knowledge in these categories relative to other categories, as opposed to ChatGPT’s inherent performance”.

They further note that ChatGPT performed “considerably worse” on ‘spine’ questions versus other categories. While this relationship is not well understood, they suggest the outcome “may be reflective of the immense amount of information publicly available about spine”, which could increase the amount of incorrect information in the training data, or of spine test questions requiring greater scrutiny of images and diagrams than questions in many other categories.

“Further research into AI and NLP models is necessary to best understand their place in the future of neurosurgical education and practice,” Mack and colleagues conclude.

