• FaceDeer@fedia.io · +18/-1 · 7 hours ago

    Betteridge’s law of headlines.

    Modern LLMs are trained using synthetic data, which is explicitly AI-generated. This is done so that the data’s format and content can be tailored to optimize its value in the training process. Over the past few years it’s become clear that simply dumping raw data from the Internet into LLM training isn’t a very good approach. It sufficed to bootstrap AI development, but we’re kind of past that point now.

    Even if there were a problem with training new AIs, that would just mean they won’t get better until the problem is overcome. It doesn’t mean they’ll perform “increasingly poorly,” because the old models still exist; you can just keep using those.

    But lots of people really don’t like AI and want to hear headlines saying it’s going to get worse or even go away, so this bait will get plenty of clicks and upvotes. Though I’ll give credit to the body of the article: if you read more than halfway down, you’ll see it raises these same issues itself.

    • droopy4096@lemmy.ca · +4 · 4 hours ago

      I’m confused: why, then, do we have an issue with AI bots crawling the internet so aggressively that they’re practically DoS’ing sites? Even if there’s a feed of synthesized data, the contents of internet sites apparently play a role too. So feeding AI slop back into AI sounds real to me.

      • FaceDeer@fedia.io · +1 · 3 hours ago

        Raw source data is often used to produce synthetic data. For example, if you’re training an AI to be a conversational chatbot, you might produce synthetic data by giving a different AI a Wikipedia article on some subject as context and then telling it to generate questions and answers about the content of the article. That Q&A output is then used for training.

        The resulting synthetic data does not contain any of the raw source, but it’s still based on that source. That’s one way to keep the AI’s knowledge well grounded.
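
        Roughly what that looks like in code, as a minimal sketch (I’m assuming an OpenAI-compatible chat API here; the model name and prompt wording are placeholders, and real pipelines add filtering and scoring on top of this):

        ```python
        # Minimal sketch of article-grounded synthetic Q&A generation.
        # Assumes the OpenAI Python SDK (pip install openai) and OPENAI_API_KEY set;
        # model name and prompt are placeholders, not any vendor's actual pipeline.
        import json
        from openai import OpenAI

        client = OpenAI()

        def make_qa_pairs(article_text: str, n_pairs: int = 5) -> list[dict]:
            prompt = (
                f"Read the article below and write {n_pairs} question/answer pairs "
                "that can be answered from the article alone. "
                'Respond with JSON: [{"question": "...", "answer": "..."}, ...]\n\n'
                f"ARTICLE:\n{article_text}"
            )
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": prompt}],
            )
            # A production pipeline would validate and score this output
            # instead of trusting it blindly.
            return json.loads(resp.choices[0].message.content)
        ```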

        It’s a bit old at this point, but last year NVIDIA released Nemotron-4, a family of models designed specifically for this synthetic-data-generation process. NVIDIA’s page on it might help illustrate the process in a bit more detail.

      • BakedCatboy@lemmy.ml · +2 · edited · 4 hours ago

        As I understand it, back-feeding uncurated slop is a real problem, but curated slop is fine. So they can either curate slop or scrape websites, and scraping is almost free. So even though synthetic training data works, they still prefer to scrape websites because it’s easier / cheaper / free.
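
        To make “curate” a bit more concrete, here’s a toy sketch of what a curation pass might do (entirely my own illustration, not any lab’s actual pipeline): deduplicate and drop obviously low-value generations before they’re fed back into training.

        ```python
        # Toy curation pass over synthetic training examples (illustrative only).
        import hashlib

        def curate(examples: list[str], min_len: int = 40, max_len: int = 4000) -> list[str]:
            seen: set[str] = set()
            kept: list[str] = []
            for text in examples:
                digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
                if digest in seen:                         # exact-duplicate slop
                    continue
                if not (min_len <= len(text) <= max_len):  # degenerate or runaway outputs
                    continue
                if "as an AI language model" in text:      # crude refusal/boilerplate filter
                    continue
                seen.add(digest)
                kept.append(text)
            return kept
        ```

        Real pipelines go further (reward-model scoring, near-duplicate detection), but that’s the general shape of it.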

  • andallthat@lemmy.world · +28 · edited · 12 hours ago

    Basically, model collapse happens when the training data no longer matches real-world data

    I’m more concerned about LLMs collapsing the whole idea of “real-world”.

    I’m not a machine learning expert, but I do get the basic concept of training a model and then evaluating its output against real data. The whole thing rests on the idea that you have a model trained on relatively small samples of the real world and a big, clearly distinct “real world” to check the model’s performance against.

    If LLMs have already ingested basically all the information in the “real world”, and their output is so pervasive that you can’t easily tell what’s true and what’s AI-generated slop, then “how do we train our models now” is not my main concern.

    As an example, take the judges who found made-up cases cited in filings because the lawyers had used an LLM. What happens if those made-up cases get referenced in several other places, including legal textbooks used in law schools? Don’t they become part of the “real world”?
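
    For what it’s worth, the narrow statistical version of the quoted line is easy to demonstrate (a toy sketch of my own, not from the article): keep refitting a model to samples drawn from the previous model and it drifts away from the original data.

    ```python
    # Toy model-collapse demo: refit a Gaussian to its own samples, generation after generation.
    import numpy as np

    rng = np.random.default_rng(0)
    real_world = rng.normal(loc=0.0, scale=1.0, size=100)   # the "real" data

    mu, sigma = real_world.mean(), real_world.std()
    for gen in range(1, 101):
        synthetic = rng.normal(mu, sigma, size=100)   # train only on the previous model's output
        mu, sigma = synthetic.mean(), synthetic.std()
        if gen % 10 == 0:
            print(f"gen {gen:3d}: mean={mu:+.3f} std={sigma:.3f}")
    # Sampling noise compounds and the (divide-by-n) std estimate shrinks on average,
    # so after enough generations the model no longer matches the real-world data.
    ```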

    • londos@lemmy.world · +4 · 4 hours ago

      My first thought was that it would make a cool sci-fi story where future generations lose all documented history other than AI-generated slop, and factions go to war over whose version of history is correct and/or over disagreements that were entirely made up.

      And then I remembered all the real life wars of religion…

    • Khanzarate@lemmy.world · +12 · 11 hours ago

      No, because there’s still no case.

      Law textbooks that taught an imaginary case would just get a lot of lawyers in trouble, because eventually someone will want to read the whole case and will try to pull the actual record, not just a reference. Real cases aren’t susceptible to this because they’re essentially a historical record. It’s like the difference between a scan of the Declaration of Independence and a high school history book describing it. Only one of those things could be bullshitted by an LLM.

      The same applies to law schools. People reference back to cases all the time, and there’s an opposing lawyer, after all, who’d love a slam-dunk win of “your honor, my opponent is actually full of shit and making everything up”. Any lawyer trained on imaginary material as if it were reality will just fail repeatedly.

      LLMs can deceive lawyers who don’t verify their work, but lawyers are in fact required to verify their work, and the ones that have been caught using LLMs are quite literally not doing their job. If that weren’t the case, lawyers would make up cases themselves; they don’t need an LLM for that. It doesn’t happen, because it doesn’t work.

      • thedruid@lemmy.world · +6/-3 · 9 hours ago

        It happens all the time, though: made-up and false facts get accepted as truth without any verification.

        So hard disagree.

        • Khanzarate@lemmy.world · +6 · 8 hours ago

          The difference is, if this were to happen and it was later found that a fabricated case crucial to the defense had been relied on, that’s a mistrial. Maybe even a dismissal with prejudice.

          Courts are bullshit sometimes, it’s true, but it would take deliberate judge/lawyer collusion for this to occur, or the incompetence of the judge and the opposing lawyer.

          Is that possible? Sure. But the question was “will fictional LLM case law enter the general knowledge?” and my answer is “in a functioning court, no.”

          If the judge and a lawyer are colluding or if a judge and the opposing lawyer are both so grossly incompetent, then we are far beyond an improper LLM citation.

          TL;DR As a general rule, you have to prove facts in court. When that stops being true, liars win, no AI needed.

    • WanderingThoughts@europe.pub · +6/-2 · 11 hours ago

      LLMs are not going to be the future. The tech companies know it and are working on reasoning models that can look things up to fact-check themselves. These are slower, use more power, and are still a work in progress.
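
      The “look things up” part is basically retrieval bolted onto generation. A bare-bones sketch of the lookup half, using Wikipedia’s public search API (the verification step would be another model call; the function here is my own illustration):

      ```python
      # Bare-bones "look it up before you answer" sketch (retrieval half only).
      # Uses Wikipedia's public MediaWiki search API via requests; an actual
      # reasoning model would then compare its draft claim against these snippets.
      import requests

      def wikipedia_snippets(claim: str, limit: int = 3) -> list[str]:
          resp = requests.get(
              "https://en.wikipedia.org/w/api.php",
              params={
                  "action": "query",
                  "list": "search",
                  "srsearch": claim,
                  "srlimit": limit,
                  "format": "json",
              },
              timeout=10,
          )
          resp.raise_for_status()
          hits = resp.json()["query"]["search"]
          return [f'{h["title"]}: {h["snippet"]}' for h in hits]

      # The model would re-prompt itself with these snippets and revise or cite,
      # instead of answering purely from memorized training data.
      print(wikipedia_snippets("composition of the Moon"))
      ```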

      • andallthat@lemmy.world · +8/-1 · 11 hours ago

        Look up stuff where? Some things are verifiable more or less directly: the Moon is not 80% made of cheese, adding glue to pizza is not healthy, the average human hand does not have seven fingers. A “reasoning” model might do better with those than current LLMs.

        But for a lot of our knowledge, verifying means “I say X because here are two reputable sources that say X”. For that, having AI-generated text creeping in everywhere (including in peer-reviewed scientific papers, which tend to be considered reputable) is blurring the line between truth and “hallucination” for both LLMs and humans.

        • Aux@feddit.uk · +1 · 3 hours ago

          Who said that adding glue to pizza is not healthy? Meat glue is used in restaurants all the time!

  • Angel Mountain@feddit.nl · +11/-3 · 14 hours ago

    It’s not much different from how humanity learned things: always verify your sources and re-run experiments to verify their results.

  • Opinionhaver@feddit.uk · +12/-8 · 14 hours ago

    Artificial intelligence isn’t synonymous with LLMs. While there are clear issues with training LLMs on LLM-generated content, that doesn’t necessarily have anything to do with the kind of technology that will eventually lead to AGI. If AI hallucinations are already often obvious to humans, they should be glaringly obvious to a true AGI - especially one that likely won’t even be based on an LLM architecture in the first place.

    • BananaTrifleViolin@lemmy.world · +7/-3 · 13 hours ago

      I’m not sure why this is being downvoted—you’re absolutely right.

      The current AI hype focuses almost entirely on LLMs, which are just one type of model and not well-suited for many of the tasks big tech is pushing them into. This rush has tarnished the broader concept of AI, driven more by financial hype than real capability. However, LLM limitations don’t apply to all AI.

      Other neural-network architectures, for instance, don’t share the same flaws, and we’re still far from realizing their full potential. LLMs have their place, but misusing them in a race for dominance is causing real harm.

  • doodledup@lemmy.world · +4/-5 · 12 hours ago

    Most LLMs seed (watermark) their output so their operators can recognize whether something was created by them. I can see common standards for this emerging across every LLM, as it’s in the best interest of every commercial LLM vendor to know whether something is LLM output or not.
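
    For anyone wondering what “seeding” the output could mean mechanically: the usual proposal is statistical watermarking (e.g. the 2023 “green list” scheme from Kirchenbauer et al.), where the sampler is nudged toward tokens picked by a keyed hash of the preceding token, and a detector later counts how many tokens landed on that list. A toy sketch of the detector side, using words instead of real model tokens (purely illustrative):

    ```python
    # Toy watermark detector in the spirit of "green list" LLM watermarking.
    # A keyed hash of the previous word decides which next words count as "green";
    # a watermarked generator over-samples green words, so an unusually high
    # green fraction is the detection signal. Real schemes operate on model tokens.
    import hashlib

    KEY = b"shared-secret"      # known only to the model operator
    GREEN_FRACTION = 0.5        # share of the vocabulary marked green at each step

    def is_green(prev_word: str, word: str) -> bool:
        h = hashlib.sha256(KEY + prev_word.lower().encode() + b"|" + word.lower().encode())
        return int.from_bytes(h.digest()[:8], "big") / 2**64 < GREEN_FRACTION

    def green_rate(text: str) -> float:
        words = text.split()
        if len(words) < 2:
            return 0.0
        hits = sum(is_green(a, b) for a, b in zip(words, words[1:]))
        return hits / (len(words) - 1)

    # Unwatermarked text sits near GREEN_FRACTION on average; text generated with
    # the matching green-list bias scores well above it.
    print(green_rate("the quick brown fox jumps over the lazy dog"))
    ```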

    • Khanzarate@lemmy.world · +7 · 11 hours ago

      Nah, that would mean you can ask an LLM “is this real?” and get a correct answer.

      That defeats the point of a bunch of kinds of material.

      Deepfakes, for instance. International espionage, propaganda, companies who want “real people”.

      A simple is_ai checkbox of any kind is undesirable for those uses, but their output will end up back in every LLM, even one that was behaving and flagging its own output.

      You’d need every LLM to do this, and there are open-source models and foreign ones. And as has already been shown, you can’t rely on an LLM to detect generated content without such a flag.

      The correct way to do it would instead be to organize a not-AI certification for real content. But that would severely limit training data. It could happen once quantity of data isn’t the be-all and end-all for a model, but I dunno when, or if, that’ll be the case.

  • kate@lemmy.uhhoh.com · +4/-4 · 14 hours ago

    Surely if they start to get worse we’d just use the models that already exist? Didn’t click the link though.

    • Maestro@fedia.io · +7/-1 · 13 hours ago

      If you do that, then models won’t know any new information. For example, a model may think Biden is still president.

      • 3abas@lemm.ee · +6/-3 · 12 hours ago

        This is already a solved problem: we’re well past single-model systems, and any competitive AI offering can augment its information from the Internet.

          • Aux@feddit.uk · +1 · 3 hours ago

            The Internet was always full of mental diarrhea. If you can’t reason about which content is correct and which is not, AI won’t change anything in your life.