Sam Altman and the Sorcerer's Stone
A muggle's guide to the wizarding world of the greatest showmen in tech
Originally published on LinkedIn on the same day as the two new class action lawsuits against OpenAI — one from the public over privacy, one from authors over copyrights.
Sam Altman, Emad Mostaque and David Holz, heads of some of the most hyped generative AI companies of our time, are proud to say their products are perfectly magical, thank you very much. They are the last people you'd expect to be involved in anything as mundane as licensing or piracy, because they just don't hold with such earthly categories.
In interviews they will tell you their product "learns like humans" and may very well become Godlike one day. And as many fawning journalists will tell you, on assumption and in good faith, all data these magical products magically learned from comes from the open internet.
Well.
So does this:
And it does look awfully similar to this:
Finding and downloading most books in the world for free isn't exactly difficult these days. It no longer requires you to visit dark alleys and whisper secret words, nor to install dubious software with geeky names and lewd ads, as it might have in days of yore.
Just Bing a title, follow the links, dodge the fishy ones, and there you go.
– Presto!
The everyday magic of the internet, at your service.
There are any number of justifications for copyright infringement. Some are perfectly valid. Research and reporting, for instance – such as this very text! – fall under the "fair use" exception in most legal systems that have such a concept; parody, commentary and criticism do, too. The idea being, copyright is only there to incentivize investment of human creativity for the long-term public good anyway; unless there is direct market harm to the author – why wait for heaven?
As long as the copy at hand is "lawfully obtained" anyway.
You will often hear tech CEOs invoke those higher powers – human flourishing, amplifying human ability for the greater good, democratization of creativity, innovation, inevitable progress towards Godlike technology.
After all, if better predictive text can't cure cancer, fix climate change, make us all millionaires and colonize Mars, all in one fell swoop – whatever could?
(Unless, of course, it kills us all first.)
Their proponents therefore feel mightily justified when casting the spell of "transformative fair use!" in online skirmishes – along with "Game over, Luddite!", "Take that, copyright industry!" and various riffs on the proverbial toothpaste having well and truly exited the tube.
Most are unaware that "fair use" only ever functions as a counter-spell – an "affirmative defense" in muggle court parlance.
And they were most likely fed the “Luddites as anti-tech” version of history.
But let's be nice to them.
Most mean well or don't know better.
And some of them are smart enough to get behind your firewall.
Sharing is caring: Bibliotik
Uploading a book to the interwebs is just as easy as downloading it.
Collecting ALL OF THE BOOKS, however, and in good quality, takes serious effort. Which is where – let's use a friendly euphemism here – sharing communities come in.
One of the more popular of these used to be (or potentially still is, provided you know the right people) the BitTorrent tracker Bibliotik. At its 2012 peak, it boasted 197,000 e-book titles – an impressive trove.
When Bibliotik went dark eleven years ago, parts of the community were loath to let such a cultural treasure go missing, so they brought a new band together to carry the torch.
Enter: The Eye
The Eye today is host not only to the now somewhat prosaically titled Books3 collection, but to all manner of magical media, neatly arranged and most certainly without any trace of the pesky rights management contraptions, needlessly restrictive site Terms of Service or on-page copyright information they may once have been encumbered by. The Eye also hosts a bustling Discord community, all illuminated by its holy mission:
“A persistent push towards democratizing access and research”
Now hang on, you might say. If 197,000 books is just PART of ONE of those ...
– Is that really, y'know ... legal?
After all, an author or publisher might let the occasional freebie slip, but to have all of their property re-distributed for free, at scale – should they happen to be aware of it, that is – just might be a unicorn of a different color.
This is where things start to get interesting.
The Eye are very happy to tell you they are and always were perfectly DMCA compliant, using a long page full of standard-looking, lawyery-sounding words.
Which, if nothing else, should tell you others have indeed asked that question before.
To really drive home the point, they even host a video demonstrating to the world how compliant they are:
To top off that reassurance, the front page of The Eye tells us that should they ever actually have to take anything down for whatever reason, it'll still be, y'know, around.
Thank goodness!
Won't somebody please think of the data!
One task best left to robots
Chasing down individual pirate sites would be a net loss for most authors. Some writers, photographers, artists and illustrators do sit down once a year or so for the tedious task of manually searching the net for unlicensed commercial uses of their content, and then initiate prickly letter exchanges or promptly invoice site owners accordingly.
For some, it pays off. Others relish the simple satisfaction of justice served. But most employ agents or third-party rights enforcement services, such as Identifyy.com for music, DejaVu AI for images and DMCA Takedown for mixed content, and let that be that.
After all, some tasks really ARE better left to robots. And accountants.
I'm certainly not implying here that content creators ought to instigate a riot – although such a massive nerd-on-nerd battle would no doubt be hilarious to watch (Colosseum, take note) – but this is the part where they should start paying attention.
Because this story has bigger fish to fry.
From Mystic Eye to Hugging Face
Now picture yourself in the shoes of a gifted tinkerer with time on his hands, access to this treasure trove, and a desire for recognition by his peer community of data democratizers and acolytes of the Future Space-Cancer-Curing Robot God. One who even found limited fame for having made ChatGPT's granddad perform novel tricks just before lockdown.
Why not aim higher? Why not prepare an even larger data offering to said deity? After all, the more data, the merrier.
And as the tech frontrunners repeatedly assure their acolytes, everyone does it anyway.
– Presto, Alakazam!
A dubiously sourced e-book collection, transmogrified for the noble cause of science and added to the popular open-source LLM research dataset The Pile:
But there is reason to hold on to your carrier-owl to the Magical Law Enforcement Squad just a little while yet.
© Johan C. Brandstedt. All rights reserved.
Which is incidentally the legal default for original content on the internet, unless otherwise noted. How about that?
This text is available for republication with advance written permission from the author.