No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
I’m not going to pretend that I can understand the minutiae as someone who basically just has an undergrad degree in IT, but it seems to be implying that exponential data is required for further advancements in generative AI.
A cool paper and all, but I can’t stop thinking about how this shit won’t matter at all to C suites and MBAs who just want to AUTOMATE AUTOMATE AUTOMATE everything under the sun. Reminds me of how the experts in whatever specialized field research a problem and business people just throw away the results and make up their own (i.e. marketing).
The conversation should have always been “Yeah your job will eventually become automated, but it’s not for the reason you think.”
Will be very interesting to see how the next few years play out
We consistently find across all our experiments that, across concepts, the frequency of a concept in the pretraining dataset is a strong predictor of the model’s performance on test examples containing that concept. Notably, model performance scales linearly as the concept frequency in pretraining data grows exponentially.
This reminds me of an older paper on how LLMs can't even do basic math when examples fall outside the training distribution (note that this was GPT-J and as far as I'm aware no such analysis is possible with GPT4, I wonder why), so this phenomenon is not exclusive to multimodal stuff. It's one thing to pre-train a large capacity model on a general task that might benefit downstream tasks, but wanting these models to be general purpose is really, really silly.
I'm of the opinion that we're approaching a crisis in AI: we've hit a barrier on what current approaches are capable of achieving, and no amount of data, labelers, tinkering with architectural minutiae or (god forbid) "prompt engineering" can fix that. My hopes are that with the bubble bursting the field will have to reckon with the need for algorithmic and architectural innovation, more robust standards for what constitutes a proper benchmark and reproducibility at the very least, and maybe, just maybe, extend its collective knowledge from other fields of study past 1960s neuroscience and explore the ethical and societal implications of its work more deeply than the oftentimes tiny obligatory ethics section of a paper. That is definitely an overgeneralization, so sorry for any researchers out here <3, I'm just disillusioned with the general state of the field.
You're correct about the C suites though, all they needed to see was one of those stupid graphs that showed line going up, with model capacity on the x axis and performance on the y axis, and their greed did the rest.
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted for during "zero-shot" evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the "Let it Wag!" benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
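To make the "exponentially more data for linear improvements" part concrete, here's a rough sketch of what a log-linear trend looks like. The intercept and slope are numbers I made up for illustration, they are not taken from the paper, and the function names are hypothetical too.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not from the paper).
INTERCEPT = 0.10  # assumed accuracy when a concept appears once in pretraining data
SLOPE = 0.12      # assumed accuracy gained per 10x increase in concept frequency


def predicted_accuracy(frequency: float) -> float:
    """Log-linear trend: accuracy = INTERCEPT + SLOPE * log10(frequency)."""
    return INTERCEPT + SLOPE * math.log10(frequency)


def frequency_needed(target_accuracy: float) -> float:
    """Invert the trend: the concept frequency a target accuracy would imply."""
    return 10 ** ((target_accuracy - INTERCEPT) / SLOPE)


if __name__ == "__main__":
    # Each row has 10x more examples than the last, but gains the same fixed step.
    for freq in (10, 100, 1_000, 10_000, 100_000):
        print(f"{freq:>7} examples -> accuracy {predicted_accuracy(freq):.2f}")
```

Under these assumed numbers, every extra 0.12 of accuracy costs ten times the examples of the previous step, which is the sample-inefficient shape the abstract is describing.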
where zero-shot learning means:
Zero-shot learning (ZSL) is a machine learning scenario in which an AI model is trained to recognize and categorize objects or concepts without having seen any examples of those categories or concepts beforehand.
Most state-of-the-art deep learning models for classification or regression are trained through supervised learning, which requires many labeled examples of relevant data classes. Models “learn” by making predictions on a labeled training dataset; data labels provide both the range of possible answers and the correct answers (or ground truth) for each training example.
While powerful, supervised learning is impractical in some real-world scenarios. Annotating large amounts of data samples is costly and time-consuming, and in cases like rare diseases and newly discovered species, examples may be scarce or non-existent.
so yeah, i agree, the paper is saying these models aren't capable of creating/using human-understandable concepts without gobs and gobs of training data, and if you try to take human supervision of those categories out of the process, then you need even more gobs and gobs of training data. edge cases and novel categories tend to spin off useless bullshit from these things.
because actual knowledge generation is a social process that these machines aren't really participants in.
but there's some speculation that the recent stock market downturn affecting tech stocks especially may be related to the capitalist class figuring out that these things aren't actually magical knowledge-worker replacement devices and won't let them make the line go up forever and ever amen. so even if the suits don't really digest the contents of this paper, they'll figure out the relevant parts eventually.