Training from scratch and retraining are expensive. Also, they want to avoid using ML outputs as training samples; they want primarily human-made works, and since the initial public release of LLMs it has become harder to build large datasets without ML-generated content in them.
There was a good paper that came out recently saying that training on ML-generated data will result in a collapse of coherence (the effect usually called "model collapse"). It's going to be really interesting; I don't know if they'll ever be able to train as easily again.
I recall spotting a few things about image generators having their training data contaminated with generated images, and the output becoming significantly worse. So yeah, I guess LLMs and image generators need natural sources, or it gets more inbred than the Habsburgs.
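For anyone curious, the collapse effect mentioned above is easy to see in a toy setting. This is only a minimal sketch, not the method from any paper: the "model" here is just a Gaussian fitted to the data, and each generation is trained only on samples from the previous generation's model. The function name `simulate_collapse` and its parameters are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations, sample_size, seed=0):
    """Return the spread (std dev) of the data at each generation."""
    rng = random.Random(seed)
    # Generation 0: stand-in for "human-made" data, drawn from N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = [statistics.pstdev(data)]
    for _ in range(generations):
        # "Train" a model: estimate mean and std dev from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # The next generation only ever sees samples produced by that model.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        spreads.append(statistics.pstdev(data))
    return spreads


if __name__ == "__main__":
    spreads = simulate_collapse(generations=200, sample_size=10)
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: std dev = {spreads[gen]:.3e}")
```

With a small sample size the spread tends to shrink toward zero as the generations go on, i.e. the variety in the original data gets lost. Real LLMs and image generators are obviously far more complicated, so take this as an intuition pump only.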
Hey, did you know your profile is set to appear as a bot and as a result many may be filtering your posts and comments? You can change this in your Lemmy settings.
Unless you are a bot... In which case where did you get your data?
I really don't know, I'm speculating, but OpenAI doesn't know either, that's for sure. So we have the most popular ML system, used by millions, based on... what exactly?
To be fair, this tweet doesn't say anything about training data, only that it can theoretically use present-day data if it looks it up online.
For GPT-4, I think it was initially trained on data up to 2021, but it has since gotten updates where data up to December 2023 was used in training. It “knows” this data and does not need to look it up.
Whether they managed to further train the initial GPT-4 model to do so or added something they trained separately is probably a trade secret.