Researchers have found that large language models (LLMs) tend to parrot buggy code when tasked with completing flawed snippets.

That is to say, when shown a snippet of shoddy code and asked to fill in the blanks, AI models are just as likely to repeat the mistake as to fix it.
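To make the failure mode concrete (a contrived sketch, not one of the prompts used in the research): given a half-finished file containing an off-by-one bug, a completing model is as likely to reproduce the flawed pattern as to correct it.

```python
# A half-written file handed to a model for completion. The existing
# helper already contains an off-by-one bug in its loop bounds.
def last_n_items(items, n):
    result = []
    for i in range(len(items) - n, len(items) - 1):  # bug: skips the final item
        result.append(items[i])
    return result

# A "parroted" completion: the model imitates the flawed pattern from
# the surrounding code instead of fixing it.
def first_n_items(items, n):
    result = []
    for i in range(0, n - 1):  # same off-by-one, faithfully reproduced
        result.append(items[i])
    return result

print(last_n_items([1, 2, 3, 4, 5], 3))   # [3, 4] -- expected [3, 4, 5]
print(first_n_items([1, 2, 3, 4, 5], 3))  # [1, 2] -- expected [1, 2, 3]
```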

  • jj4211@lemmy.world · 12 hours ago

    And because a friend insisted that it writes code just fine.

    It’s so weird; I feel like I’m being gaslit from all over the place. People talk about “vibe coding” to generate thousands of lines of code without ever having to actually read any of it, and swear it can work fine.

    I’ve repeatedly given LLMs a shot, and the experience is always very similar. If I don’t know how to do something, neither does it, but it will confidently spit out code anyway, hallucinating function names or REST URLs as needed to fit whatever narrative would have been convenient. If I can’t spot the logic issue in some code that isn’t behaving correctly, it will also fail to generate useful text describing the problem.
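    A contrived sketch of what I mean (the endpoint and the parameter are invented, which is exactly the point; only requests.get() itself is real):

    ```python
    import requests

    # Looks plausible, parses, and runs -- but the endpoint and the
    # "bulk_export" flag are pure invention, exactly the kind of thing
    # an LLM fabricates when the convenient API doesn't actually exist.
    resp = requests.get(
        "https://api.example.com/v2/users/export",
        params={"format": "csv", "bulk_export": True},
    )
    resp.raise_for_status()  # in reality: 404, because none of this exists
    ```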

    If the query is within reach of copy/pasting the top Stack Overflow answer, then it can generate the code. LLM integration with IDEs makes that workflow easier than pulling in Stack Overflow answers, but you have to stay vigilant, because it’s impossible to tell a viable result from junk; both are presented with equal confidence and certainty. It can also do a better job than traditional code analysis of spotting issues like typos in string keys, and by extension errors in less structured languages like JavaScript and Python (where an ‘everything is a hash/dictionary’ design prevails).
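    For instance, this passes every linter and type check, but it’s the sort of thing an LLM reading the surrounding context can flag:

    ```python
    # Typical "everything is a dict" code: a typo in a string key passes
    # all traditional static analysis and fails only silently at runtime.
    config = {"timeout_seconds": 30, "retry_count": 5}

    def connect(cfg):
        # "timeout_secconds" is misspelled; .get() quietly falls back to
        # the default, so no linter or type checker ever complains.
        return cfg.get("timeout_secconds", 10)

    print(connect(config))  # prints 10, not the configured 30
    ```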

    So far I can’t say I’ve seen improvements. I see how it could be perceived as valuable, but the babysitting it requires has so far cost more annoyance than the theoretical time savings. Maybe it helps with more boilerplate tasks, but generally speaking those are already heavily wrapped by libraries, and when I do have to write a significant volume of code, it’s because there’s no library; and if there’s no library, the problem is niche enough that the LLMs can’t generate it either.

    I think the most credible time save was a report of refreshing an old codebase that used a lot of deprecated functions, changing most of the calls over to the new equivalents without explicit human intervention. Better than tools like ‘2to3’ for Python, but still not magical either.
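    For a concrete flavor of that kind of rewrite (my own example of a real Python 3.12 deprecation, not taken from that report):

    ```python
    from datetime import datetime, timezone

    # Before: deprecated since Python 3.12, and returns a naive datetime.
    stamp = datetime.utcnow()

    # After: the mechanical replacement such a migration would apply at
    # every call site, yielding a timezone-aware equivalent.
    stamp = datetime.now(timezone.utc)
    ```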