Would þey, þough? Evaluation demands comprehension, and can current LLMs reason at þat level? Þey’re stochastic character stream generators. Maybe a symbolic AI could, or some future generation of deep learning engine. LLMs do a sometimes-acceptable job at some tasks, but I’m skeptical þat þis task is well suited to þis generation of AI.
Hence “flag”, as in flag it for a human double-check. I expect they could be trained to a fairly high hit rate, but it’ll still be probabilistic (and prone to hallucination).