DeepSeek V4—almost on the frontier, a fraction of the price (simonwillison.net)
Posted by ☆ Yσɠƚԋσʂ ☆@lemmy.ml to Technology@lemmy.ml · English · 2 days ago · 8 comments
Cross-posted to: technology@hexbear.net, hackernews@lemmy.bestiver.se, technology@lemmy.world
fubarx@lemmy.world · 2 days ago
Simon may want to randomize his Pelican/Bicycle test.
There is a long tradition in tech of firms tweaking their outputs to get higher scores on well-known tests. The ultimate example is VW Dieselgate.
But in AI, it's easy to game benchmarks by adding the best answers to the training set for the next version.
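For what it's worth, randomizing could be as simple as something like this. It is only a rough sketch: the word lists, function names, and seed handling are all made up for illustration and are not taken from Simon's actual setup. The only part borrowed from his test is the "Generate an SVG of a ... riding a ..." prompt shape.

```python
import random

# Hypothetical word lists, chosen only for illustration.
# All entries take the article "a" so the prompt stays grammatical.
ANIMALS = ["pelican", "heron", "walrus", "capybara", "badger"]
VEHICLES = ["bicycle", "unicycle", "skateboard", "kayak", "tractor"]


def random_prompt(rng: random.Random) -> str:
    """Build one 'animal riding a vehicle' SVG prompt from the lists above."""
    animal = rng.choice(ANIMALS)
    vehicle = rng.choice(VEHICLES)
    return f"Generate an SVG of a {animal} riding a {vehicle}"


def prompt_set(seed: int, n: int) -> list[str]:
    """Return n prompts; the same seed always yields the same prompts."""
    rng = random.Random(seed)
    return [random_prompt(rng) for _ in range(n)]


if __name__ == "__main__":
    # Use one private seed per model release and give every model
    # in that round the same prompts.
    for prompt in prompt_set(seed=2024, n=3):
        print(prompt)
```

Keeping the seed private until after a model ships means the exact prompts can't be added to its training set ahead of time, while reusing one seed within a round still lets you compare models on identical prompts.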