
The enshittification of AI has led to groans at VLC's choice to use AI. I even saw a post cross my feed from someone looking for a replacement for VLC.

VLC is working on on-device, real-time captioning. This has nothing to do with generating images or video using AI. This has nothing to do with LLMs.

(edit: There are claims that VLC is using a local LLM. It will use whisper.cpp, and not OpenAI's models. I don't know which models they will be using. I cannot find any reference to VLC using an LLM.)
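For context, here's a minimal sketch of what on-device transcription with whisper.cpp looks like. The model filename and the silent placeholder audio are assumptions for illustration; VLC's actual integration will differ, but the API calls below are whisper.cpp's real ones.

```cpp
// Minimal on-device transcription sketch using whisper.cpp.
#include "whisper.h"
#include <cstdio>
#include <vector>

int main() {
    // Model path is an assumption for illustration; any local ggml
    // Whisper model file works here.
    struct whisper_context *ctx = whisper_init_from_file_with_params(
        "ggml-base.en.bin", whisper_context_default_params());
    if (!ctx) return 1;

    // whisper.cpp expects 16 kHz mono float PCM. A real player would feed
    // the decoded audio track; ten seconds of silence keeps this runnable.
    std::vector<float> pcm(16000 * 10, 0.0f);

    struct whisper_full_params params =
        whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    // Everything runs locally; no audio or text leaves the machine.
    if (whisper_full(ctx, params, pcm.data(), (int) pcm.size()) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            // Timestamped segments map directly onto caption cues.
            printf("[%lld -> %lld] %s\n",
                   (long long) whisper_full_get_segment_t0(ctx, i),
                   (long long) whisper_full_get_segment_t1(ctx, i),
                   whisper_full_get_segment_text(ctx, i));
        }
    }
    whisper_free(ctx);
    return 0;
}
```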

While human-generated captions would be preferable for accuracy, they are not always available. This means a lot of video media is inaccessible to those with hearing impairments.

What VLC is doing will contribute to accessibility in a big way.

AI transcription is still not perfect. It has its problems. But this is one of those things that we should be hoping to advance.

I'm not looking to replace humans in creating captions. I think we're very far from ever being able to do this correctly without humans. But as I said, there's a ton of video content that simply does not have captions available, human-generated or not.

So long as they're not trying to manipulate the transcription with generative AI, this is the wrong thing to demonize.


@bedast Then don't call it AI. Call it speech-to-text. But if it uses a language model to more effectively predict words based on context, rather than doing an analyzable, mechanical, local transformation, it is at least partly the "bad kind of AI": it has the capacity to introduce biases from its training data, producing output that "sounds right" but means the wrong thing. That is much worse than substituting nonsensical homophones now and then, which the reader will immediately recognize as mistakes. Same principle as why autocorrected text is worse than text with typos.
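To make that failure mode concrete, here's a toy sketch of score fusion between an acoustic model and a language model. All the numbers are invented and the fusion is deliberately simplistic; the point is only that a strong LM prior can override the acoustic evidence and emit a fluent but wrong word.

```cpp
// Toy illustration: a language-model prior overriding acoustic evidence.
#include <cstdio>

int main() {
    struct Hyp { const char *word; double acoustic; double lm; };
    // The speaker actually said "bare", and the acoustics slightly favor
    // it, but the (hypothetical) LM has seen "bear" in this context far
    // more often in its training data.
    Hyp hyps[] = {
        {"bare", /*acoustic=*/0.55, /*lm=*/0.10},
        {"bear", /*acoustic=*/0.45, /*lm=*/0.90},
    };
    const Hyp *best = nullptr;
    double best_score = -1.0;
    for (const Hyp &h : hyps) {
        double score = h.acoustic * h.lm;  // simplistic score fusion
        if (score > best_score) { best_score = score; best = &h; }
    }
    printf("emitted: %s\n", best->word);  // prints "bear": fluent but wrong
    return 0;
}
```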

@bedast Enthusiastically calling new functionality "AI" signals to your audience that you're aligned with the scams and makes them distrust you.

This is not hard.

If you have privacy respecting, on-device, non-plagiarized, ethically built statistical model based processing, DON'T CALL IT "AI".

@dalias @bedast I agree. This is why "AI" transcription is a downgrade from previous technologies. It's contributing to the plausible disinformation slop we've been drowning in lately.

I think automated captions have a place, but I'm wary of using GenAI to do it.

@dalias @bedast Speech recognition has used language models for decades now. It was one of the original applications of language models, way before they scaled up to aping Shakespeare.

But even without language models, the act of transcription is very close to generative AI, as it's the task of predicting the next text token given the previous tokens and an encoded audio sequence.
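A toy sketch of that framing, for what it's worth. The encoder and decoder here are hypothetical stubs, not any real library's API; they're only meant to show the shape of the loop.

```cpp
#include <vector>

// Hypothetical stand-ins for a real acoustic encoder and text decoder;
// stubbed with trivial bodies so the sketch compiles and runs.
static std::vector<float> encode_audio(const std::vector<float> &pcm) {
    return std::vector<float>(pcm.size());  // real code: acoustic features
}
static int predict_next_token(const std::vector<float> &audio,
                              const std::vector<int> &prev_tokens) {
    (void) audio; (void) prev_tokens;
    return 0;  // real code: argmax over the decoder's next-token distribution
}

// Transcription as autoregressive decoding: every text token is predicted
// from the encoded audio plus all previously emitted tokens, which is
// structurally the same loop that generative text models run.
static std::vector<int> transcribe(const std::vector<float> &pcm,
                                   int eos_token = 0) {
    const std::vector<float> audio = encode_audio(pcm);
    std::vector<int> tokens;
    while (true) {
        int next = predict_next_token(audio, tokens);
        if (next == eos_token) break;  // end-of-sequence: stop decoding
        tokens.push_back(next);
    }
    return tokens;
}

int main() {
    transcribe(std::vector<float>(16000, 0.0f));  // one second of silence
    return 0;
}
```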

@varavs @bedast Then don't call it "AI".

But also, question what harms are coming out of the predictive models. The more they force the output to sound natural and fix misrecognitions, the greater the chance they're altering meaning. Same as autocorrect vs typed text with typos and misspellings.

@varavs @bedast Also ask if the model is ethically and legally sound. Was it produced from professional training material with compatible license terms? Or stolen from millions of movies or YouTube videos?

@dalias @bedast @varavs Aren't basically all the embeddable models that don't have absurd spec requirements sourced & produced by university projects?

@lispi314 @bedast @dalias

Just because it came from a university doesn't mean it's "ethical" in the modern copyrightist sense. Academia did not care much about copyright, and scraped, cleaned, and used web data at will. It is only now, after commercial entities have labour-replacing AIs in the oven, that copyright maximalism has gained ground.

@varavs @dalias @bedast I must preface with the fact that I care not for copyright (I could digress at length on its inherent wrongness; consider me on the abolitionist end of that debate), but I meant this more in the sense that they are (or at least were) often either created by the projects wholesale or otherwise generated from public-domain works (most or all of Kaldi's models are like this, for instance, with a lot coming from public-domain audiobook projects).

@lispi314 @bedast @dalias
Yeah, LibriSpeech is a good counterexample. I'd forgotten about that.

Probably not enough for a modern ASR baseline, but you could do something useful with it.

@dalias @bedast Didn't mathematical/rule-based language modeling start showing massively diminishing returns two or three decades ago, or is my information wrong?

As far as I'm aware, it would be preferable to start from a rule-based language model, and then be able to specifically train a small model on a different captioned sample set of the speaker(s) to eliminate its flakiness.

@lispi314 @bedast It started showing diminishing returns when researchers figured out you could churn out degrees and products without needing any new ideas, just by throwing machine learning at the problem and ignoring all the potential harms of that.