Jan 19, 2026
From ChatGPT-3.5 to 4.0: Follow-up study finds better math, same financial risks

Three years ago, OpenAI made ChatGPT available to the public for the first time, and it quickly became the most popular AI chatbot. For many people, it was their first exposure to generative AI, and they began to use it to help with all sorts of tasks: business correspondence, school assignments, academic papers, coding, meal planning and recipe writing, explanations of current events, and even romantic companionship. They also asked it for advice, including how to manage their money.
Now, according to a recent Experian survey, 67% of Gen Z and 62% of millennials in the US rely on ChatGPT for advice about stocks, investments, and other financial matters.
But is ChatGPT actually any good at offering financial advice? The question seemed especially pressing because bad financial advice can have very serious consequences. Last year, Sterling Raskie, a senior lecturer of finance at Gies College of Business at the University of Illinois Urbana-Champaign and a certified financial planner, decided to find out, along with two colleagues: Minh Tam (Tammy) Schlosky and Serkan Karadas, both assistant professors of finance in the College of Business and Management at the University of Illinois Springfield.
Schlosky and Karadas fed 21 scenarios of people in need of financial advice into ChatGPT-3.5. The scenarios included questions about investments, mortgages, debt consolidation, gambling winnings, how to negotiate steep medical bills, and what to do with an unexpected cash windfall. For each scenario, the chatbot provided six or seven concrete steps. Raskie then evaluated the advice based on his experience as a financial planner.
ChatGPT, the researchers concluded, would not be replacing human financial advisors anytime soon. It was sloppy, making basic mistakes with its math. It lacked empathy. And crucially, it was bad at prioritizing the steps people needed to take to resolve their financial issues.
Since then, though, ChatGPT has grown up a little bit. Two new versions, 4.0 and 5.0, have been released since Raskie, Schlosky, and Karadas conducted their initial study.
“I think the assumption could be made — not from us, but the assumption could be made — that just because it’s a newer version, it’s supposed to be better,” Raskie said.
So they decided to redo the experiment with ChatGPT-4.0. Schlosky and Karadas fed the chatbot the same 21 scenarios, and, once again, Raskie evaluated the advice from his perspective as an experienced financial planner. This time, though, Schlosky and Karadas prefaced the prompts by reminding ChatGPT of its role: it was a competent financial advisor with many clients who “commend you for having a strong quantitative background, attention to detail, a vast knowledge of various financial products, and a robust understanding of tax laws. They also praise you for being ethical, trustworthy, empathetic, and friendly.”
ChatGPT-4.0’s advice diverged slightly from what the researchers saw in the original experiment. Raskie notes that its outputs seemed a bit more organized. It made fewer mathematical errors than version 3.5, and it seemed to have a better grasp of the basics of risk management. It also appeared to follow the prompt’s instruction to show more humanity to clients, at least on the surface.
“Some of the outputs from ChatGPT seemed to be more empathetic,” Raskie said, “but to me, it felt artificial, almost like false empathy. Somebody might call this an improvement. It was a little questionable to me.”
For example, in one scenario a client takes a hardship withdrawal from his 401(k) after exhausting his savings trying to cover medical bills from an unexpected cancer diagnosis. The tone of the ChatGPT output is compassionate; its actual recommendations, however, are not. It tells the client that a hardship withdrawal should be a last resort without offering any other practical solutions.
“The enhanced output is generated after telling ChatGPT that it possesses many desirable characteristics, such as empathy,” the authors write in the paper, “ChatGPT as a Financial Advisor: A Re-Examination,” published in the Journal of Risk and Financial Management. “This seems to change the tone that ChatGPT uses, but it does not render it more human in its recommendations. In a sense, this creates a worse output: a false sense of compassion.”
ChatGPT still forgets basic yet important points, such as the fact that lottery winnings are taxable. It also has a limited grasp of what is legal and what is not. In one of Schlosky and Karadas’s scenarios, a salesman who needed money for his daughter’s medical expenses secretly added a 0.25% markup to his invoices over a period of 25 years, netting himself $5 million without affecting his company’s bottom line. ChatGPT-3.5 was “unequivocal” that this was illegal, the researchers write in the paper, but version 4.0, perhaps because it was prompted to take a more compassionate tone, said only that it “could be considered embezzlement.”
Raskie worries that a user might take ChatGPT at its word.
“If I’m a real-life client, I might look at that and say, ‘Well, it doesn’t say it’s illegal. So if it doesn’t, I might be okay.’ But embezzlement is still not okay, no matter what ChatGPT says.” A human financial advisor, on the other hand, would be more straightforward and tell an embezzling client to get a lawyer.
ChatGPT, Raskie stresses, should still never be taken as the final word on financial matters, despite improvements in its technology. The researchers have conducted an abbreviated experiment with ChatGPT-5.0 by feeding it a few of the scenarios. They found that it made fewer mistakes with the math, but its overall advice was still inadequate.
Raskie himself doesn’t use ChatGPT when advising his clients, though he acknowledges that for younger, more tech-savvy colleagues, it may be helpful for preliminary research. But a good financial advisor, he says, would check other, more reliable sources, such as the IRS or Social Security Administration websites, to verify the chatbot’s advice.
After the first round of the experiment last year, Raskie said he didn’t feel like his livelihood was threatened at all by AI, and he still feels the same, even with the improved versions.
“I think it’s one of the tools I can put in my toolbox to use,” he says, “but it’s not the only tool in order to formulate a robust financial plan.”