As a longtime writer and lover of the arts, I must express my deep concern over the alarming use of Generative AI in the creative industries. Having spent decades honing my craft, it’s disheartening to see my work being used without my consent or recognition by these advanced systems.
From its beginning, Generative AI has faced resistance within the creative sectors over concerns that it might displace jobs traditionally performed by humans. Although some advocate for these technologies, a growing group of artists, writers, actors, filmmakers, and others associated with these fields is voicing concerns, claiming that their intellectual property is being misused or stolen.
The legal status of copyright and AI, particularly for chatbots and Large Language Models (LLMs), remains unclear. Their creators argue that the training data comes from openly accessible sources and that its use is therefore legitimate under fair use principles. However, this position overlooks some complexities in practice.
The Atlantic’s Alex Reisner recently published a shocking report on how LLMs have been trained on over 139,000 film and TV scripts.
According to the report, numerous AI systems have been trained on dialogue from thousands of television shows and films, with more than 53,000 movies and 85,000 TV episodes among their training materials.
This collection of dialogues, utilized by prominent corporations such as Apple and Meta, comprises lines from popular TV series like ‘The Simpsons,’ ‘The Sopranos,’ ‘Breaking Bad,’ and movies that have been nominated for the Best Picture award between 1950 and 2016.
The data also includes spoken dialogue from live awards broadcasts such as the Golden Globes and the Academy Awards, among others. This vast trove of material allows artificial intelligence to convincingly portray characters or generate entire productions without requiring a large team of scriptwriters.
To those who have thoroughly explored the technology, it’s evident that modern generative AI functions primarily as an advanced rephrasing tool. It lacks the ability to arrive at its outcomes independently; instead, it relies on gathering information, whether for text or image generation.
The very people AI is meant to replace are its lifeblood. But that is likely a topic for a different discussion. Back to the topic at hand: the works cited in the original report as being scraped are clearly copyrighted, so how do the tech companies get away with scraping all this dialogue? Well…
The AI training data is sourced not from the scripts themselves but from subtitle files available on OpenSubtitles.org. These subtitles are extracted with specialized software from sources such as DVDs, Blu-rays, and internet streaming platforms.
It’s interesting to note that subtitles can be beneficial for AI as they represent conversational speech, aiding AI systems, such as chatbots, in developing more human-like communication. This kind of data is particularly useful because well-crafted dialogue isn’t common in the typical AI training resources like scholarly texts or news articles.
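To make the idea of "subtitles as dialogue data" concrete, here is a minimal sketch of how a subtitle file can be reduced to plain lines of dialogue. It assumes the common SubRip (.srt) layout of a cue number, a timestamp line, and one or more text lines; the function name `srt_to_dialogue` is hypothetical, and this is not the pipeline any of the companies mentioned is known to use.

```python
import re

# A SubRip timestamp line looks like "00:01:02,500 --> 00:01:05,000".
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Simple formatting tags such as <i>...</i> that some subtitle files contain.
TAG = re.compile(r"<[^>]+>")


def srt_to_dialogue(srt_text: str) -> list[str]:
    """Return the spoken lines of an .srt file, without numbering or timing."""
    dialogue = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        rows = [row.strip() for row in block.splitlines()]
        # Everything after the timestamp row is subtitle text.
        for i, row in enumerate(rows):
            if TIMESTAMP.match(row):
                text_rows = rows[i + 1:]
                break
        else:
            continue  # no timestamp found, so this block is not a cue
        for row in text_rows:
            cleaned = TAG.sub("", row).strip()
            if cleaned:
                dialogue.append(cleaned)
    return dialogue


sample = (
    "1\n00:00:01,000 --> 00:00:03,000\n<i>Hello there.</i>\n\n"
    "2\n00:00:04,000 --> 00:00:06,500\nWho are you?\nI'm nobody.\n"
)
print(srt_to_dialogue(sample))
```

Note that the result carries no speaker names and no title, which is why dialogue collected this way ends up unattributed.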
Research shows that companies like Anthropic, Meta, Apple, and Nvidia have used subtitles to train their AI systems, including ChatGPT competitor Claude and models like OPT and NeMo Megatron.
Salesforce, Bloomberg, EleutherAI, and other developers have used the same data to build more than a hundred open-source artificial intelligence models. Notably, these models, which can produce human-like writing, were created without seeking explicit consent from the original authors.
Naturally, the companies did not want to comment on these findings.
The OpenSubtitles data set can be downloaded by anyone, but its contents are not immediately clear. It is a 14-gigabyte file of dialogue that does not specify who is speaking or which movie or TV show a line comes from. The raw subtitles for individual movies and episodes are spread across 446,612 separate files, stored in folders labeled with IMDb ID numbers.
Working through this trove, Reisner isolated approximately 139,000 distinct movies and TV episodes by sifting out the multiple versions of each title. He also drew on supplementary data from OpenSubtitles to link the subtitles to information about actors and directors.
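The step from hundreds of thousands of subtitle files to a smaller number of distinct titles can be sketched as a simple grouping by folder name. This is only an illustration under stated assumptions: it assumes each file sits directly inside a folder named after its IMDb ID, and the function name `group_by_title` and the `raw_subtitles` path are hypothetical. The real archive may be nested differently, and this is not Reisner's actual code.

```python
from collections import defaultdict
from pathlib import Path


def group_by_title(root: Path) -> dict[str, list[Path]]:
    """Group subtitle files by the name of the folder that contains them.

    Assumption: the immediate parent folder of each file is its IMDb ID,
    so several versions of the same movie or episode share one key.
    """
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            groups[path.parent.name].append(path)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical location of the unpacked subtitle folders.
    groups = group_by_title(Path("raw_subtitles"))
    total_files = sum(len(files) for files in groups.values())
    print(f"{total_files} files across {len(groups)} distinct titles")
```

Counting the keys rather than the files is what collapses multiple subtitle versions of one title into a single entry.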
The copyright status of subtitles remains ambiguous. It’s plausible that subtitles might be regarded as derivative works and thus afforded protection. However, the courts have yet to officially rule on this matter.
For a comprehensive understanding of the original research (including specific data), I highly recommend checking out the detailed report authored by Alex Reisner.
2024-11-26 12:44