This is the technology worth trillions of dollars huh

  • JustTesting@lemmy.hogru.ch · 3 hours ago

    They don’t look at it letter by letter but in tokens, which are generated automatically based on how often character sequences occur. So while ‘z’ could be its own token, ‘ne’ or even ‘the’ could be treated as a single token. Of course, ‘e’ would still be a separate token when it occurs in isolation, and you could even have ‘le’ and ‘let’ as separate tokens, afaik. Each token is just a vector of numbers, say 300 or 1000 of them, that represents that token in a vector space. So ‘de’ and ‘e’ can be completely different, dissimilar vectors.

    So ‘delaware’ could look to an LLM more like de-la-w-are, or something similar (a toy sketch of the idea follows below).

    Of course, you could train it to figure out letter counts from those tokens with enough training data, though that could lower performance on other tasks, and counting letters just isn’t that important, I guess, compared to other stuff.
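    A toy sketch of that idea, not a real BPE tokenizer: the vocabulary, vector size, and function names here are made up for illustration. It just shows a greedy longest-match split into subword tokens, with each token mapped to its own unrelated vector.

```python
# Toy sketch only: a greedy longest-match split over a made-up subword
# vocabulary, plus made-up embedding vectors. Real LLMs use learned BPE
# merges and learned embeddings; this just shows that 'delaware' need not
# be seen letter by letter, and that 'de' and 'e' get unrelated vectors.
import random

VOCAB = {"de", "la", "w", "are", "e", "z", "ne", "the", "le", "let"}

def toy_tokenize(word: str) -> list[str]:
    """Greedily take the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            piece = word[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

# One dense vector per token; nothing ties the vector for 'de' to 'e'.
random.seed(0)
EMBEDDINGS = {tok: [random.uniform(-1, 1) for _ in range(8)] for tok in VOCAB}

print(toy_tokenize("delaware"))                   # ['de', 'la', 'w', 'are']
print(EMBEDDINGS["de"][:3], EMBEDDINGS["e"][:3])  # dissimilar numbers
```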

    • fading_person@lemmy.zip · 2 hours ago

      Wouldn’t that only explain errors of omission? If you ask for a letter, say D, it might omit words where that letter is hidden inside a larger token like ‘Da’ or ‘De’. But how would it return a word where the letter D isn’t even present?

    • MangoCats@feddit.it · 2 hours ago

      Of course, when the question asks “contains the letter _”, you might think an intelligent algorithm would get off its tokens and do a little letter-by-letter analysis. Related: ChatGPT is really bad at chess, but there are plenty of algorithms that are superhuman at it.
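      A minimal sketch of that letter-by-letter check, the kind of trivial helper a model could call as a tool instead of reasoning over tokens; the function names are made up for illustration.

```python
# Minimal sketch of a letter-by-letter check: scan the actual characters
# of the word rather than its tokens. Function names are illustrative.
def contains_letter(word: str, letter: str) -> bool:
    """Return True if the letter appears in the word, ignoring case."""
    return letter.lower() in word.lower()

def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter in the word, ignoring case."""
    return word.lower().count(letter.lower())

print(contains_letter("Delaware", "d"))  # True
print(count_letter("Delaware", "a"))     # 2
```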