Generative AI models can produce effective retrofit decisions but are less adept at identifying which ones deliver the best result most quickly and at the lowest cost, according to an analysis by researchers at Michigan State University. The study, “Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models,” is one of the first to examine how large language models, or LLMs, perform in determining efficient and effective building energy retrofits.
Identifying the optimal retrofit solution can be critical from a cost standpoint. Light to medium retrofits can unlock between 10% and 40% in energy savings, or $0.49 to $1.94 per square foot of savings on average, according to JLL research published last September. Despite these savings, such retrofits aren’t being implemented at the scale required to meet decarbonization targets because of their capital-intensive nature, the report says. Decision-making complexity and the inadequacy of data and tools are also problems, according to the report.
To determine the potential of generative AI in addressing these limitations, MSU researchers tasked seven LLMs with generating energy retrofit decisions under two contexts: a technical context focused on maximum CO2 reduction and a sociotechnical context focused on minimum payback period.
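The two contexts amount to ranking the same candidate packages under two different objectives. The Python sketch below illustrates that idea only. It assumes simple payback is upfront cost divided by annual savings, a common definition that the study may or may not use, and every package name and figure in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Package:
    name: str              # hypothetical package label
    upfront_cost: float    # USD, hypothetical
    annual_savings: float  # USD per year, hypothetical
    co2_reduction: float   # kg CO2 per year, hypothetical


def simple_payback_years(p: Package) -> float:
    """Simple payback: upfront cost / annual savings (assumed definition)."""
    if p.annual_savings <= 0:
        return float("inf")  # a package that saves nothing never pays back
    return p.upfront_cost / p.annual_savings


def rank_technical(packages: list[Package]) -> list[Package]:
    """Technical context: largest CO2 reduction first."""
    return sorted(packages, key=lambda p: p.co2_reduction, reverse=True)


def rank_sociotechnical(packages: list[Package]) -> list[Package]:
    """Sociotechnical context: shortest payback period first."""
    return sorted(packages, key=simple_payback_years)


packages = [
    Package("heat_pump_only", 12000.0, 800.0, 2500.0),
    Package("insulation_only", 4000.0, 500.0, 900.0),
    Package("full_electrification", 30000.0, 1200.0, 4200.0),
]

print([p.name for p in rank_technical(packages)])
print([p.name for p in rank_sociotechnical(packages)])
```

With these made-up numbers the two objectives pick opposite winners, which is why the same model can score differently in each context.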
The AI-generated retrofit decisions were evaluated based on whether they matched the top-ranked retrofit measure or fell within the top three or the top five measures. The researchers then used a sample of 400 homes from ResStock 2024.2 data, spanning 49 states, to evaluate LLM performance based on accuracy, consistency, sensitivity and reasoning.
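The top-one, top-three and top-five matching described above can be sketched as a top-k accuracy calculation. This is a minimal illustration, not the researchers' code; the package IDs, rankings and predictions are hypothetical.

```python
def top_k_match(prediction: str, ranked_truth: list[str], k: int) -> bool:
    """True if the model's pick is among the k best-ranked packages."""
    return prediction in ranked_truth[:k]


def top_k_accuracy(predictions: list[str], rankings: list[list[str]], k: int) -> float:
    """Share of homes whose predicted package falls within the top k."""
    if len(predictions) != len(rankings):
        raise ValueError("need one ranking per prediction")
    hits = sum(top_k_match(p, r, k) for p, r in zip(predictions, rankings))
    return hits / len(predictions)


# Hypothetical reference rankings for four homes, best package first.
rankings = [
    ["P3", "P1", "P7", "P2", "P9", "P4"],
    ["P1", "P3", "P2", "P7", "P4", "P9"],
    ["P7", "P2", "P1", "P3", "P9", "P4"],
    ["P2", "P9", "P4", "P1", "P3", "P7"],
]
# Hypothetical model picks, one per home.
predictions = ["P3", "P2", "P9", "P7"]

for k in (1, 3, 5):
    print(f"top-{k} accuracy: {top_k_accuracy(predictions, rankings, k):.2f}")
```

A pick that misses the best package can still count as a top-three or top-five match, so accuracy can only stay level or rise as k grows.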
Researchers evaluated each LLM by issuing prompts, which included an overview of 16 potential retrofit packages and building-specific information. The overview described each retrofit measure’s features, such as heat pump efficiency, whether insulation is upgraded and whether major appliances are electrified, alongside associated costs.
Once fed these potential scenarios and building-specific information, the LLMs were assigned the role of a “house retrofit specialist” tasked with evaluating multiple buildings, comparing costs and efficiency across all packages, and identifying which retrofit delivered the greatest CO2 reduction and which provided the lowest payback period.
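A prompt of this general shape could be assembled as in the sketch below. The study's actual prompt wording is not reproduced here, so the template, the field names and the sample values are all hypothetical; only the role name and the two objectives come from the description above.

```python
ROLE = "house retrofit specialist"  # role name as described in the study

# Objective wording is illustrative, not the study's actual text.
OBJECTIVES = {
    "technical": "Identify the package with the greatest CO2 reduction.",
    "sociotechnical": "Identify the package with the lowest payback period.",
}


def build_prompt(packages: list[dict], building: dict, context: str) -> str:
    """Combine role, package overview, building data and objective into one prompt."""
    if context not in OBJECTIVES:
        raise ValueError(f"unknown context: {context}")
    lines = [f"You are a {ROLE}.", "", "Retrofit packages:"]
    for i, pkg in enumerate(packages, start=1):
        details = ", ".join(f"{key}={value}" for key, value in pkg.items())
        lines.append(f"{i}. {details}")
    lines += ["", "Building information:"]
    lines += [f"- {key}: {value}" for key, value in building.items()]
    lines += ["", OBJECTIVES[context]]
    return "\n".join(lines)


# Hypothetical inputs; the real overview covered 16 packages.
packages = [
    {"heat_pump": "high efficiency", "insulation_upgraded": True, "cost_usd": 18000},
    {"heat_pump": "none", "insulation_upgraded": True, "cost_usd": 4000},
]
building = {"state": "MI", "floor_area_sqft": 1800, "heating_fuel": "natural gas"}

print(build_prompt(packages, building, "sociotechnical"))
```

Keeping the package overview and building data fixed while swapping only the final objective line makes it easy to compare a model's answers across the two contexts.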

The study found that while LLMs are capable of producing effective retrofit decisions, they struggle to pinpoint the best one. The models are also notably stronger at identifying retrofit packages in a technical context that maximizes CO2 reduction than in a sociotechnical context that minimizes payback years.
“This difference likely reflects the relative clarity and consistency of technical optimization objectives, which are more easily captured by model reasoning, whereas sociotechnical considerations involve trade-offs between economic, behavioral, and contextual factors that may be more difficult for LLMs to interpret and prioritize accurately,” the researchers said.
Without fine-tuning, the models matched the top-ranked option as much as 54.5% of the time, with a 92.8% accuracy rate for top-five matching. There was also low overall agreement among models, with higher-accuracy models more likely to diverge from others and lower-accuracy models showing greater alignment.
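One simple way to quantify agreement among models is the share of homes on which two models pick the same package. The sketch below uses that pairwise match rate as a stand-in; the article does not say which agreement measure the study used, and the model names and picks are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(a: list[str], b: list[str]) -> float:
    """Share of homes on which two models chose the same package."""
    if len(a) != len(b):
        raise ValueError("both models must cover the same homes")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def agreement_table(picks: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Pairwise agreement for every pair of models, keyed by model names."""
    return {
        (m1, m2): pairwise_agreement(picks[m1], picks[m2])
        for m1, m2 in combinations(sorted(picks), 2)
    }


# Hypothetical picks by three models for the same four homes.
picks = {
    "model_a": ["P1", "P3", "P7", "P2"],
    "model_b": ["P1", "P3", "P2", "P2"],
    "model_c": ["P4", "P3", "P2", "P9"],
}

for pair, rate in agreement_table(picks).items():
    print(pair, f"{rate:.2f}")
```

Agreement and accuracy are separate questions: two models can agree with each other on most homes and still both be wrong.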
“Overall, while the decision logic and feature focus of LLMs are generally reasonable, their accuracy, consistency, and contextual understanding must be improved before they can be reliably applied to retrofit decision-making in practice,” the authors said.