wordfreq is not just concerned with formal printed words. It collected more conversational language usage from two sources in particular: Twitter and Reddit.
Now Twitter is gone anyway, and its public APIs have shut down.
Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay.
There's still the Fediverse.
I mean, that doesn't solve the LLM pollution problem, but...
I'm going to be bold enough to say we don't have as wide of an AI/LLM issue on the Fediverse as the other platforms will have.
I'm certain that if someone did collect data from the Fediverse, it would become a hot topic, and it might not be enough data anyway, as the Fediverse is not mainstream enough. So the data and language collected here might skew in a few imaginable ways that one might find undesirable for a general model of word frequencies.
There's also the fact that people might not appreciate that data being collected. Let's be real: it's too soon for such a project to begin. The AI TREND MUST DIE as it currently lives, and its corpse must rot away completely. Now, in internet time that may not be all that long... a few to several years... the memory of the internet can be short-lived at times. It must, however, fade from the public consciousness into some obscurity first.
Once the technology no longer lies in greedy hands, development can begin anew.
I’m going to be bold enough to say we don’t have as wide of an AI/LLM issue on the Fediverse as the other platforms will have.
Why do you think that? I don't think that there is anything systemic in how the fediverse operates that will stop LLMs polluting the discourse here too. Actually I already think that they are polluting the discourse here.
I’m certain that if someone did collect data from the Fediverse, it would become a hot topic
I'd assume bad actors (or at least chaotic neutral actors) are slurping up the entire fediverse already. It is trivial to do, and nobody would know.
I mean, the whole point is that anyone can spin up a server and federate with others. I could start my own server, which would by default federate with almost all other servers. That means I wouldn't even need to write a scraper. All that data would be sent straight to my server. All I need is access to my own database at that point. With Lemmy, I'd even get users' upvote/downvote history, which is not visible in any clients AFAIK. The only barrier would be to subscribe to communities on different servers to kickstart federation.
As long as you don't run obvious spam/bot accounts, nobody would block your instance.
Alternatively, if you want to write a scraper, that's also pretty easy. Most servers are publicly accessible. Every community has an RSS feed. You don't even need an account in general. Again, the whole point is to be open and accessible, in contrast to closed-off data-misers like Facebook, Reddit, and X.
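To give a sense of how little code the RSS route takes, here is a minimal sketch that tallies word counts from a feed, in the spirit of what wordfreq did. It runs on a made-up sample document rather than a real server; the feed contents, the `count_words` name, and the crude tokenizer are all assumptions for illustration, and real feeds usually carry HTML in their descriptions that would need stripping first.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample feed; a real community feed would be fetched over HTTP.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>example community</title>
    <item>
      <title>Word lists are fun</title>
      <description>Counting words is fun and easy</description>
    </item>
    <item>
      <title>Another post</title>
      <description>Words, words, words</description>
    </item>
  </channel>
</rss>"""


def count_words(feed_xml: str) -> Counter:
    """Count lowercase word tokens in the title and description of each item."""
    root = ET.fromstring(feed_xml)
    counts = Counter()
    for item in root.iter("item"):
        for tag in ("title", "description"):
            text = item.findtext(tag) or ""
            # Crude tokenizer: runs of ASCII letters and apostrophes only.
            counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


counts = count_words(SAMPLE_FEED)
print(counts.most_common(3))
```

The counting is the easy part. Whether the people who wrote those posts would be fine with it is a separate question.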
The fediverse is friendly to users, with very little regard for what those users might do. I believe this is the correct philosophy, but I won't pretend that it doesn't leave us open to bad behavior.
Things change. There was a period before this information was easily available; this repository only goes back to 2013. Now there's a period after this information, too. Things start and eventually they end.
Here's hoping that some neat new things start up in its place.