A bit of context regarding Project Gutenberg. Its intake process is far from casual. Take a look at Project Gutenberg Distributed Proofreaders (PGDP, [0],[1]), one of the oldest "crowd-sourcing" projects on the net (est. 2000). As you can see from [0], every book goes through three rounds of proofing, where volunteers read each page of text and compare it to the scanned image; then through two rounds of format review, where other volunteers insert or review format markup.
From that 5-pass process the marked-up text is handed to a volunteer "post-processor" who assembles the final HTML or e-book file; then the completed book gets one more "smooth reading" pass before it is posted to PG.
This is the process that produces the books input to Standard Ebooks. That they can still find scanner errors ("tne" for "the", a typical "scanno") demonstrates how difficult it is to see those. But their presence isn't from carelessness or disregard for the value of the books.
In the 20-teens I put in hundreds of volunteer hours at PGDP in all the above roles, and it was very satisfying work. I'd recommend it to anyone wanting an online hobby that feels constructive. Volunteering time to Standard Ebooks would probably feel good as well.
The work done by Distributed Proofreaders is pretty amazing. I try to contribute my 35 pages as often as I can. The backlog there is pretty insane even while finishing upwards of 150 ebooks per month
it truly is an "online hobby that feels constructive". you get these tiny glimpses into our shared literary/cultural history while knowing that the work you're doing is for the benefit of all (benefit of the public domain)
> The backlog there is pretty insane even while finishing upwards of 150 ebooks per month
Isn't the backlog there mostly in the post-processing step, though? To the point where they're taking finished texts and running them again through the page-by-page proofreading in hope of fishing out more OCR typos and improving the format markup?
You can also contribute at Wikisource if you prefer, that doesn't really have a post-processing step and has much less of a fixed pipeline. (There are explicit "proofreading" and "verification" steps per page, but not much beyond that.)
In a similar vein, there is Wikisource.[0] Wikisource has the advantage of allowing for extensive formatting to closely match the source works due to its wiki-based format, but doesn't have quite as robust processes. Its flexibility is unparalleled though -- it covers virtually any form of scanned print work and even some old movies, and contributors can focus on whatever niches they're interested in if they want.
true, it wouldn't do a 100% job, but it would be another line of defense. the reason I was wondering about it was that the gp cited an example that was easy for humans to miss, but would be caught at once with a spell checker.
there are also statistical methods to detect words that are changed into other, valid words - check out the grammar checker in google docs for instance. again, not 100%, but every bit helps.
It would probably also throw out a lot of false positives which would take time to check. Especially in works of fiction, writers could take liberties with non-standard spelling.
Unless tne is an abbreviation, in which case it should pass. Names are a common place where people make up weird spellings, which makes spell checkers annoying. I have terrible spelling, and yet most of the time I run spellcheck it is tripping up on words that are spelled correctly but not in the dictionary (in large part because I run spell check after each revision, so the words I spelled wrong have already been fixed). "Add to dictionary" means that my dictionary is polluted with words that only apply to one document and would be wrong in the next.
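To make the trade-off in this subthread concrete, here is a minimal sketch of a dictionary-based scanno check with a per-document allowlist, so that names accepted for one book do not leak into the next. Everything in it is an assumption for illustration: the tiny word list, the function name `find_suspects`, and the similarity cutoff. It is not how PGDP or Standard Ebooks actually check texts, and it cannot catch a scanno that happens to be a valid word.

```python
import difflib
import re

# Tiny stand-in word list (assumption); a real check would load a full dictionary.
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat", "and", "then", "left"}


def find_suspects(text, known_words, allowlist=frozenset()):
    """Return (token, suggestions) for each token in neither word set.

    `allowlist` is meant to be kept per document, so accepted names and
    dialect spellings stay with the book they belong to.
    """
    candidates = sorted(known_words)
    suspects = []
    for token in re.findall(r"[A-Za-z']+", text):
        word = token.lower()
        if word in known_words or word in allowlist:
            continue
        # difflib ranks dictionary words by similarity to the unknown token.
        suggestions = difflib.get_close_matches(word, candidates, n=3, cutoff=0.6)
        suspects.append((token, suggestions))
    return suspects


if __name__ == "__main__":
    print(find_suspects("The cat sat on tne mat", KNOWN_WORDS))
    print(find_suspects("Tchekov sat", KNOWN_WORDS, allowlist={"tchekov"}))
```

Every flagged token still needs a human to look at it, which is the false-positive cost raised above; the allowlist only keeps that cost from repeating within one book.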
An LLM-based spellchecker would've caught it for sure. I am working on one here: https://github.com/pulkitsharma07/spelltastic.io. If anyone has suggestions on how this can help in Project Gutenberg's or Standard Ebooks' workflows, please reach out to me or open an issue.
I have seen that LLMs are pretty good at understanding context/domain / theme-specific terms, so their spellchecking is pretty good.
It's unclear that that would save time. If you put in enough hours to the project, you can get classified as one of those later pass proofers. That is extremely taxing work because most of the scannos have already been found by the earlier proofers. You will "complete" multiple pages without ever finding a scanno. The doubt starts to set in if you are on auto-pilot or not.
Meanwhile, in that early stage, because of the stream of errors, it is easy to pay attention and feel like you are doing rewarding work. Moreover, if you are quite quick and diligent, you can basically just read a book as volunteer work.
Also, sometimes the error is in the source material. Different editors have different opinions about what should be done there. Sometimes I had to re-add mistakes that were "fixed" by early proofers trying to correct grammar, if I recall correctly... it was a while back that I volunteered.
Editor-in-chief here, happy to answer any questions, as always. We also recently celebrated Public Domain Day with an especially notable crop of books, including The Sound and the Fury, All Quiet on the Western Front, John Steinbeck's first novel, some Hemingway, Gandhi, two Dashiell Hammett novels, and more: https://standardebooks.org/blog/public-domain-day-2025
Another question - in https://standardebooks.org/contribute/producing-an-ebook-ste... you talk about "modernising" spelling, e.g. changing "some one" to "someone". This may be against the implicit goal of making these accessible for a general reader, but I prefer to read what was originally written, and it feels like it crosses a line into editorialising rather than letting the original feel stand as-is. (Although of course these texts have already been "editorialised" by their original editors!) Totally your decision given the amount of effort that has clearly gone into this, but I'd be interested to read the rationale for that decision.
I respect this choice of modernization, and I suppose some readers enjoy it, but it makes the publisher's whole work useless to me. When a text has been altered, I can't trust it respects the intent of the author, and any style inconsistency I find may be a by-product of the publisher's mangling.
So, when I care about a book, I never read Standard Ebooks' edition.
By the way, the modernization is more than joining a few words. Sometimes, Standard Ebooks replaces the word used at the time the book was written. For instance:
This time, however, the mountain was going to [-Mahomet;-]{+Muhammad;+}
The previous quote is from Galsworthy's "Forsyte Saga". The author used many French words and French spellings – like "Tchekov" for the Russian playwright who was living in Paris. These subtleties are lost with the modernization.
I also think some alterations are plain mistakes. For instance in the same book:
if she wanted a good book she should read [-“Job”-]{+Job+};
his father was rather like Job while Job still had land.
> I also think some alterations are plain mistakes. For instance in the same book:
That one appears to not be a mistake, [0] suggests that not quoting the name of the book of the bible being referred to (so [Job] rather than ["Job"]) is the style accepted by Chicago, MLA, and APA.
I respect their choice too, but like you the reason for my question was that I feel I can't trust the end product. Alex said "We only make sound-alike changes, like to-morrow -> tomorrow", which I could just about get along with, but Mahomet -> Muhammad creates an entirely different flavour for me. As Alex said, that's fine, in that it doesn't mean the other editions aren't available, but it is a shame for me when I essentially don't want to use something that has been put together so painstakingly.
Anyone who has read books for classes in high school and above knows that even classics are routinely fucked with by publishers. Even early in the work's history. I remember even in middle school someone would invariably end up with a different publisher's edition of a book for summer reading or whatnot and we'd find changes.
Unless the book is specifically declared to be the original text - and it may have to specify which original text - they're going to be edited.
However, in electronic form it should be possible to include both in one file, or two files with the original in a repo branch once all the document structure stuff has been added. That text will never change, so merging formatting-only changes should be pretty painless.
For every book, Standard Ebooks provides a hyperlink to the original scan, a hyperlink to the original transcription, and a full revision history in which all spelling updates have been clearly marked. To me, this already seems to be going above and beyond—most ebook repositories provide less. I can’t imagine that the marginal benefit from keeping multiple parallel branches would be worth the cost in volunteer time and labor, when maintaining pristine first editions isn’t even a goal of the project.
And of course, none of this matters in the slightest for translated works, which almost by definition includes the vast majority of works ever written.
"As it was written" is a very high bar that is simply not attainable for anything other than fairly recent works in your native language.
I'm disappointed to learn of this editing in Standard Ebooks, having had the misfortune to buy a Barnes & Noble copy of the complete Sherlock Holmes that had a similar approach taken. Book looks lovely, but has an altered chapter order, Americanised spellings and lots of typos. There is a certain amount of editing needed to render the likes of Shakespeare and Samuel Pepys readable, as Middle/Old English is quite a different language, but slight variants from 150ish years ago, or dialects, or the correct spelling according to the Queen's English, add flavour and should not be altered.
That's fine! Our editions didn't erase any of the other editions you can find online and in print. You're more than welcome to select any edition that fits your reading preferences.
Apologies if that came across as at all critical. Genuinely interested in the rationale rather than it being a how-dare-you demand for you to explain yourself!
Spelling varies widely across the eras our ebooks were published in. Therefore we attempt to standardize spelling to what a modern reader might be familiar with. We only make sound-alike changes, like to-morrow -> tomorrow.
This is a common practice that editors and publishers have quietly engaged in for centuries. For example, today you are not reading Shakespeare in the way it was spelled in its first printing.
After reading this comment I couldn't help but picture medieval monks, toiling away copying old manuscripts into "modern" English. Normally a thankless task, so thank you!
Is there epub-specific html markup you could add to changed words to indicate their original spelling? Like alt text for images, but in a span around a word? There's the html "title" attribute, of course, which would work (mouseover shows the title attribute's value), but that isn't semantically correct for the purpose.
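For what it's worth, the "title" attribute fallback mentioned above is easy to sketch. The following is only an illustration under stated assumptions: the mapping uses the two examples quoted in this thread, the function name `modernize_with_titles` is made up, matching is case-sensitive, and the input is assumed to be plain text rather than existing markup. It is not a description of how Standard Ebooks' tooling works.

```python
import html
import re

# Hypothetical archaic -> modern mapping, using examples quoted in this thread.
MODERNIZATIONS = {"to-morrow": "tomorrow", "some one": "someone"}


def modernize_with_titles(text, mapping):
    """Replace archaic spellings and keep the original in a title attribute."""
    # Longest keys first so a longer phrase wins over a shorter overlapping one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")

    def wrap(match):
        original = match.group(0)
        return '<span title="{}">{}</span>'.format(
            html.escape(original, quote=True), html.escape(mapping[original])
        )

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(modernize_with_titles("Come to-morrow, if some one asks.", MODERNIZATIONS))
```

The word-boundary anchors keep "some one" from matching inside "handsome one". Whether a title attribute is the right semantic home for the original spelling is exactly the open question in the comment above.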
I appreciate this service you are doing, but it would be much much better to also have an original version with archaic spelling. Double bonus points for have optional (hidden by default) explanations of words. This would be tremendously helpful to some students.
> "Don't like it? Here is a full refund and you are free to read some other version."
That is not at all what I said.
> You can't claim to care about preserving the works while changing them, and that is changing them.
We do not and have never made that claim. We are creating our own editions of these public domain books, not engaging in historical preservation.
If you want to read classic books in their original spelling, then you must locate first editions. Editors and publishers have updated both spelling and punctuation as a matter of course for centuries. Just look at any three editions of any Jane Austen novel - and you could never read an edition of Shakespeare more recent than 1800.
I think it's important to note that in the past, typesetters and printers had a much more editorial role than the process today. Authors would submit handwritten manuscripts and the typesetters in many cases would have to fix the author's mistakes, spelling, etc. to conform the manuscripts to printing standards with the author having limited communication or ability to proof the final plates
Today, it's much easier for authors to have a greater say in the final presentation due to the digital composition process
You can't use an appeal to tradition as the argument for revision.
I don't see why anyone should care that publishers have edited in the past anyway, even in this particular discussion where my own argument is for conservation. Publishers have done all kinds of things that this very project itself criticises and pointedly set themselves apart by doing differently. So, it's a weak argument for them.
Aside from that, what any other publishers do, even if it's totally common and even universal, doesn't change the argument that they were making that they wish to suggest that those edits cross a line that fixing typos doesn't cross.
For what it’s worth, that’s also exactly how I read your response, which was (to repeat) ‘That's fine! Our editions didn't erase any of the other editions you can find online and in print. You're more than welcome to select any edition that fits your reading preferences.’
I think that Standard Ebooks is a great-sounding project, but I honestly found your response not just flippant, but passive-aggressively rude to the original poster.
But — full disclosure — I also think that it would be a good idea to preserve the spellings found in the original editions you are digitising. So perhaps I am inclined to feel the bite of your response more than someone who just doesn’t care.
> I honestly found your response not just flippant, but passive-aggressively rude to the original poster.
I didn’t read it that way at all. How would you have worded it in such a way as to sincerely express the stated sentiment without coming across as passive‐aggressively rude?
> How would you have worded it in such a way as to sincerely express the stated sentiment without coming across as passive‐aggressively rude?
Something like ‘While we understand that some people would prefer to read the original texts (modulo typos, formatting errors and the like), we think that it is preferable to modernize spelling because X, Y and Z.’
In other words, the polite response to ‘I like most of what you’re doing, but I dislike this particular thing’ is not ‘Fine! You’re free to go elsewhere,’ with an implied ‘don’t let the door hit you on the backside on your way out,’ but rather to engage and explain.
Again, I have to admit my own bias against the policy and consequent bias in favour of the original poster.
It is what you said. And for the record, I love the idea of this project. I just agree with the other poster about the location of this line that's all.
The text you have in your “quote” is a lot more snarky and rude than the original message. Did they edit their comment or something? Otherwise—why not quote an actual quote?
Considering the thrust of my comment, I don't understand the question. Obviously paraphrasing someone else's words into ones you like better is a fine and acceptable thing to do. So clearly I am just illustrating the problem by example.
The real answer is twofold.
1. We don't have a special 3rd kind of quote or other punctuation mark for reinterpreted references.
2. The real one: This is not a quote that lies as you imply. It is a new message that merely uses quotes to denote a speaker, as in a purely fictional work, where the characters' dialog is in quotes even though no actual human was actually quoted.
Are there any other conundrums and baffling mysteries I can clear up for you?
When you use that syntax it looks like you are calling out an explicit quote; you may think that it's a reasonable paraphrase but I think most readers will see what you did as a strawman instead of a paraphrase.
Better to write inline "I feel like what you said amounts to [...]" to reduce the perception that you're making up quotes that someone didn't say or even clearly imply.
No one literate is in any danger of misinterpreting this very basic technique. I don't care about anyone else because it doesn't matter, they will misinterpret regardless, deliberately.
> Obviously paraphrasing someone else's words into ones you like better is a fine and acceptable thing to do.
Wrong. Not only is it tasteless and dishonest (not "fine"), it is against the rules of this site. But regardless of whether it's allowed elsewhere, you still shouldn't do it. (See "tasteless and dishonest".)
What's the point of including books that aren't public domain yet in your collections?
It makes it hard to browse those collections to find actual books to read. The first 3 series I clicked on all said "not P.D." (at first I didn't know what "P.D." meant. Remember that your audience does not have your level of familiarity with your context; perhaps a tooltip on that badge would help).
Then I see "this book will enter public domain in 2050".
I commend you for this project, it's really awesome work. From a user's perspective, it would be great to have a filter on your various lists that restricts them to books that are available and excludes those that are not yet in your collection.
In addition to what Robin mentioned below, some of these placeholders are for books on our Wanted list. I also think it's useful to show readers that particular books are looking for volunteers to produce, and also to show that some books they might want are locked away by copyright for possibly decades. In that sense it's partly a political message.
Whenever we add a collection, the books that are in that collection but not yet in PD in the US get placeholders. But a filter might not be a bad idea.
Which ebook reader works well with standard ebooks in 2025?
(More concretely my reader is a 2nd-gen kindle which is basically useless these days and I’d love an idea of something that can display standard ebooks with all their advanced formatting)
I read on an old Kobo, using Kepub files. Their Kepub renderer is quite good.
I think Kindle's renderer hasn't changed significantly for many years, and it had always been pretty bad. I always say that Kindle seems to have been created by people who hate books.
The best renderer around is iBooks on an iPad, which as far as I can tell uses an up-to-date Webkit.
I read standard .epub files with KOReader on my Kobo Aura H2O. It's faster, nicer-looking, and more customizable than the stock reader, and the installation instructions were complete, correct, and easy to follow.
Kobo Libra 2 is a great e-reader. Works well one-handed (screen rotates for left/right hands), has buttons for page turns. Integrates with Overdrive (what Libby uses). Drawbacks are Kobo's bookstore is weaker than Amazon/Apple. Screen is also not flush which means some dust can collect in the recess.
A note for Kobo users: a lot of us (myself included) use Calibre to manage and upload our ebooks. Something about Calibre messes up Kepub files and strips out a lot of the formatting (including the book’s cover).
If I want to appreciate a nice Kepub from Standard Ebooks, I upload it directly to the Kobo.
A Kobo would be a great choice. I use a Kobo Libra 2 and love it a lot more than my old Kindle Paperwhite that got stolen: https://gl.kobobooks.com/products/kobo-libra-2 The Kobo Sage is also good because it has an 8" screen.
Fortunately, I had them backed up to a cloud folder. I remember almost deciding not to go to the trouble to back them up, but isn't that how it always works with backups? The Kobo also works with epub.
I recently purchased a Pocketbook Era. It is pretty much the perfect device for me - supports open standards and does not require any cloud account signups to start using it. It is not hostile to the user, 3rd party applications such as Koreader can be simply dropped in and they appear in the menus without any shenanigans like jailbreaking or custom launchers needed.
That’s calibre viewer, but it may require some customization to get something nice. Foliate is ok, but it’s a library. i’d say that’s OK because epub is a zip file and you need to extract it to read it.
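Since the comment above leans on the fact that an EPUB is a ZIP archive, here is a small sketch showing what that means in practice. It builds a toy archive in memory (the chapter path is a made-up example; `mimetype` and `META-INF/container.xml` are the entries the EPUB container format expects) and lists its members with Python's standard `zipfile` module. The function names are my own.

```python
import io
import zipfile


def list_epub_members(source):
    """Return the names of the files stored inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def build_demo_epub():
    """Build a toy, incomplete EPUB-like archive in memory for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", "<container/>")
        # Hypothetical chapter path, for illustration only.
        archive.writestr("text/chapter-1.xhtml", "<html/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_epub_members(build_demo_epub()):
        print(name)
```

Passing a path to a real .epub file to `list_epub_members` works the same way. Note that `zipfile` reads members in place, so a reading app does not strictly need to unpack the archive to disk first.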
I love this. However, I couldn't find an alphabetical list of authors, which is the way I wanted to browse on my first visit. Instead my only option is to show 48 on a page and paginate through, which is tedious. I know there are author pages - e.g. https://standardebooks.org/ebooks/william-makepeace-thackera... - so I presume it's feasible. An author index would significantly increase my likelihood of understanding what's available and engaging with the content.
I can tell you there is a lot of appetite for other languages. I looked at the project and the amount of stuff that would need to be rewritten to work with multiple languages was daunting. I would consider working on making your documentation and workflow functional with multiple languages.
Lots of people have tried similar projects in other languages but as far as I know none have persevered.
Personally I think it's important to have one person in charge who is able to approve of the quality of all the project's output; for now, at SE, that person is me and I'm only an expert in English.
Wonderful project! One thing I wish the website would have is being able to find the right book to read out of this enormous list — e.g. showing / sorting by Goodreads ratings (which I realize you might not want to do), or at least having some kind of a "Featured" section with the most critically acclaimed / must read books of the project on one page.
There are around a dozen collections on the (not prominently featured) collections page[1] like Le Monde's 100 Best Books of the Century and Modern Library's 100 Best Novels, etc.
Steinbeck was the first name I searched for, so this was great to see even if his major works won't be available for some time. I do wonder how badly the Steinbeck or Faulkner estates are hurt by the sudden loss of royalties? Imagine working hard to write a book to make a living and then just under a hundred years it's taken away from you. Also, AI.
Been using Standard Ebooks for a while now, but wanted to drop by here and say how great this site is! It's replaced P.G. for me (for whatever is on this site, at least) and I like the much nicer formatting on the texts. It's great on both my physical Kindle and Apple Books on my iPhone.
Each repo is a history of the ebook including editorial changes, typos fixes, and the like. Having a single repo containing thousands of ebooks and their histories would be pretty annoying to browse.
Presumably to keep the repo size reasonable. Say I want to make an ad hoc contribution to a book, if step 1 is "download this multi-gigabyte repo" then that's a fairly big hurdle.
Once you're very familiar with the process, you could get a draft of a basic prose novel ready for proofreading in a few hours. Then it has to be proofread and completed.
Beginners, and people working on more advanced books, can take much, much, much longer.
it varies widely depending on the length and type of book and how much free time the volunteer has to devote to it
Anywhere between 1 week for the simplest (straight narrative, not too much verse or endnotes) and ~1 year (thousands of endnotes, pages of verse, drama, in-line references to book titles, use of technical terms, etc)
Love this. So many in the archivist community are only interested in preservation and don't care at all about making the material accessible. Love to see a project like this prioritizing the latter.
You’re spot on with this. I recently converted a local history book from 1911 to Markdown, ePub and HTML and tracked the changes on GitHub. Only a handful of copies of this book exist in physical form and it has been photo copied (which is great).
However, I was completely shot down by the local library when I was discussing it with them. They said they already had a photo copy and didn’t need anymore digital editions, I tried to explain the benefits of having it in a machine readable format but they wouldn’t entertain it. I completed the project for me, so I wasn’t too bothered, but thought they might have been interested in archiving it but they weren’t.
My general feeling is that they didn’t like an outsider contributing and touching on a format they didn’t know so got slightly defensive.
Find an archive and make sure they're aware of the work you've done. Archivists always love meeting people who've done good work in the space they're in. Especially when they have some tech chops which is desperately lacking in the space.
Beyond that, if the material is public domain, that library is called The Internet. Post it and promote it. The only reason to seek association with a library is if you're looking for cred for some reason, and that's not the business they're in.
If it's not public domain, or if you haven't marked your derivative work public domain, then you put a library in an awkward position. Realize that these are the types of people who still post little notes by the copy machines saying what's permissible and enjoy policing it.
Most just say no for the same reason that Hollywood returns ideas and scripts unopened. They're busy and the cost/benefit isn't there.
Although the self-described online ones tend to play fast and loose, real librarians have a formal code of ethics which is worth reviewing.
Interesting. I wonder if libraries suffer a supply-chain risk and so avoid taking contributions from (non-vetted) individuals? I imagine that over time a library gets lots of offers to take "important works of literature" from cranks, and perhaps they've developed this culture to protect them from that. Pure speculation, of course.
Libraries typically don't even accept print books or CDs/DVDs. If there's a donation bin outside it probably isn't even theirs. And if stuff actually winds up with them, it just gets sold off so they can purchase material via vetted channels.
Do you "claim" a book, to make sure that no-one else is trying to work on the same book? I presume that's part of step 4 in your link, given that it would be heartbreaking to get 90% of the way through and then be beaten to it by someone who'd started at roughly the same time!
It's thanks to this site that I learned that Kobo uses a really bad renderer for epubs unless converted to their own ebook format (Kepub). It makes a huge difference in appearance and performance on a Kobo device.
I assume KOReader has a better renderer for epub but will have to test how it compares to the stock software+kepub. So far I've only used KOReader on my device.
the only issues i've found with koreader are its default margin size and its display of standard ebooks' titlepages but (I believe) these can be fixed with a fairly simple user tweaks css
And https://send.djazz.se automatically performs the conversion for you with kepubify and sends it to your ereader! No affiliation, just a happy camper chiming in
Most of the big print-on-demand companies will now make hardcovers, for about $10. You can't feed raw Gutenberg files into those mills, but these "standard ebooks" have enough formatting info for that. So that would be a useful service.
Are there any non-English books? When I go to the search page, language isn't even a pull-down option, so I'm guessing not.
There is a huge world of out-of-copyright non-English texts, and Project Gutenberg has many thousands of them. I wonder if any interest could be generated to help bring them in by posting on foreign language subreddits or something.
Just looked through the entire website to answer this question. Seems like they only accept english books :(
"Types of ebooks we don’t accept:
- Non-English-language books. Translations to English are, of course, OK."
(https://standardebooks.org/contribute/collections-policy)
I understand if the existing editors can't personally proofread the submissions, but that's why peer-review exists. Or an open-source project in general where people can post corrections. Jimbo Wales didn't need to speak a hundred languages to launch Wikipedia.
To me, that niche is already covered by Wikisource. Standard Ebooks as a project is very strict about conforming to its editorial and quality standards. Onboarding more languages would require volunteer editorial experts in those languages.
Besides, projects in other languages can absolutely build upon Standard Ebooks work, but to expect Standard Ebooks itself to support other languages is just too outside the scope and expertise of the volunteers available.
If you were to find the expert editors for the other languages, would you let them publish the works in those other languages on the Standard Ebooks website?
well, that would be up to Alex. but as that would require a pretty substantial organizational and responsibility shift, I imagine, no, he would not.
As it is now, Alex is editorially responsible for all output of Standard Ebooks. Changing that would require someone with the time and experience willing to take on all the responsibilities that Alex currently has for each of those other languages.
A well-defined focus can help management of a project, for example, by not having the participants spread too thin.
The website and toolchain are open source, so if someone would build an international version, and do it persistently, I'm sure they would link or maybe even merge the projects a bit.
The manual has some known issues on mobile, I believe there's a GitHub issue open about it. It's low priority because the manual is rarely read on mobile. PRs welcomed!
the online view is not the primary way readers are expected to read the ebooks. downloading the epub and reading on an ereader (edit: where line height and font size are customizable) is the expected and best supported method
however, contributions are very welcome and everything is hosted on GitHub if you'd like to suggest improvements; or send your thoughts on the mailing list
I think the point of parent was that the issue, the too narrow leading, is not a change that needs debating. On a mailing list, issue tracker or whatever.
Or if you think it actually was, this was not a project that I'd want to get involved in.
As someone who reads mostly ePubs, many of which suffer from issues this project aims to fix, I mean that in a very caring way.
i also don't think it needs debating. my point was that the issue, the too narrow leading in the online view, is just not going to be fixed unless someone points it out to someone that can fix it. if that's you, great! you can submit a PR to the git repo. or, if you don't have the time or don't want to go find where the line height is defined, submitting a comment to the mailing list or noting it on the issue tracker will let a volunteer fix it
from my own experience, Alex is very amenable to improvements. the online view of the ebooks is just not used by probably anyone to actually read the books (just use an ereader app or device, it's a way better experience anyway) and because of that no one has cared to point it out until now
I would love this if it were to produce viable unabridged ebooks of Francis Parkman’s “France and England in North America” vol 2-7. All the existent digital editions were poorly scanned and don’t separate footnotes from the main text.
I love this project and don't want to disparage the work that goes into it, but 900 USD, and it has to be a book that is already transcribed online? That seems a bit much to me.
That sounds quite reasonable to me. That's about what a freelance proofreader charges to edit a book, if https://thewritelife.com/how-much-to-pay-for-a-book-editor/ is correct, and that's working with a (likely Word) document which isn't poorly scanned from paper.
You can also join our Patrons Circle to have this book added to our Wanted Ebooks list, which is a list of suggestions for our volunteers to work on: https://standardebooks.org/donate#patrons-circle
Looks like a great project, and one sorely needed by people like me who find themselves trying to get hold of old books they can't get in their local library and that are too expensive to buy secondhand.
As far as I know Standard gets their raw ebooks from Project Gutenberg which has a vastly greater collection of public domain works. What they're doing is typesetting them for the average reader. But if all you're looking for is just the content, Gutenberg is the place to look for ethically clean copies.
The shadow libraries such as Anna's Archive are a treasure trove of old books, and you're not breaking any imaginary law by downloading old books which are out of copyright.
The internet archive's open library will also link to Standard Ebooks (and Gutenberg and a few others) if a version exists of a book you are looking at e.g.:
> A work that is a mere copy of another work of authorship is not copyrightable. The Office cannot register a work that has been merely copied from another work of authorship without any additional original authorship. See L. Batlin & Son, 536 F.2d at 490 (“one who has slavishly or mechanically copied from others may not claim to be an author”); Bridgeman Art Library, Ltd. v. Corel Corp., 36 F. Supp. 2d 191, 195 (S.D.N.Y. 1999) (“exact photographic copies of public domain works of art would not be copyrightable under United States law because they are not original”).
Certainly! If you add my latest Kirk/Spock slash fanfic to the end of the text, then that is transformative, so the resulting PDF is covered under copyright.
But you wrote "scan". Adding an OCR'ed text layer, or doing manual proofreading and layout ("sweat of the brow") is not sufficiently transformative to have copyright protection.
And we were specifically talking about scans of old books stored in shadow libraries.
> Of all these projects, the most amenable to automatic typesetting are those produced by Standard Ebooks and HTML Writers Guild. The benefit of using HTML Writers Guild is their semantic markup and simple document type definition (DTD) file. Standard Ebooks, as the name suggests, are brilliantly standardized and have an excellent Manual of Style that describes what to expect from the XHTML.
I like the idea. I read a bunch of classics from Gutenberg. In reality so many old books are very long and boring I ended up getting more modern books from the library instead.
Maybe TikTok ruined me but maybe these things really do literally have a shelf life. Hopefully reformatting will help. Perhaps a better way to review and find the gems would be most helpful..
Perhaps it's not just about the 'shelf life' of a book, but also the language and style it uses. The more archaic the language, and the more distant the style that the authors use, the harder it is for me to focus on the book.
Perhaps it would be useful to have expertly abridged and modernized versions of (e)books, with interpreter's notes for each change.
Did you ever consider making them public domain but still offering to charge optional $10 donation for download?
I’m interested in a similar approach for a rare book library, but funding for staff is a real challenge so we want to make some kind of revenue stream.
It surprises me that the eBook (clarification: epub) format is basically XHTML because 1) that means that every eReader needs to basically be a web browser 2) this sounds like it would make reformatting for different devices NOT easier
It makes a lot of sense when you recall that HTML and its ancestors were designed to mark up and format documents, i.e. books. One of the most fundamental elements is <p>, which stands for... paragraph.
Each renderer differs in capabilities, and most are stuck in a subset of early-2000s capabilities, so designing an ebook is very much like designing for the 90s era web. Lots of hacks are required to get the same file to look good on many different renderers, and achieving that is one of the goals of Standard Ebooks.
Yeah (i guess you mean epub), though in practice readers support only a tiny subset and epubs avoid using anything fancier than basic XHTML. Epubs that try to use fancy stuff (like most CSS outside of setting fonts - that readers can ignore either because they do not support it, or because the user wants to use another font) tend to not display correctly.
Including a web browser seems a lot easier and simpler than coming up with your own rendering system once you want to support a feature set past the trivial.
Also, xhtml is just markup. It doesn’t mean you have to support all the possible tags and styles of modern html and css. It would be a sensible choice even if you had basic needs. You just parse it into whatever representation you want.
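To illustrate the "parse it into whatever representation you want" point, here is a minimal sketch that reads an XHTML fragment with Python's standard library and keeps only headings and paragraphs. The sample document and the `extract_blocks` helper are made up for illustration; a real reader would handle far more elements and would not necessarily work this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter fragment; real EPUB content documents are larger.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body>
    <h2>Chapter 1</h2>
    <p>It was a <em>dark</em> night.</p>
    <p>Nothing stirred.</p>
  </body>
</html>"""

KEPT_TAGS = {"h1", "h2", "h3", "p"}


def extract_blocks(xhtml):
    """Return (tag, text) pairs for headings and paragraphs, ignoring the rest."""
    root = ET.fromstring(xhtml)
    blocks = []
    for element in root.iter():
        # Tags arrive as "{namespace}name"; keep only the local name.
        tag = element.tag.split("}")[-1]
        if tag in KEPT_TAGS:
            text = "".join(element.itertext())
            # Collapse runs of whitespace so the text can be reflowed freely.
            blocks.append((tag, " ".join(text.split())))
    return blocks


for tag, text in extract_blocks(SAMPLE):
    print(tag, "|", text)
```

Everything outside the kept tags (here the `<title>`, and any styling) is simply skipped, which is the sense in which a renderer can support a small subset without being a full browser.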
this also somewhat surprised me at first but I think it's obvious in hindsight, though they don't have to be a full-blown web browser (you can go read the epub specs at W3C to see what's supported)
as for (2) I'm not sure why you think it would make it less easy? being html, text reflows automatically based on screen size, font size, line height, etc
I guess I assumed that, for example, multi device support on websites for various device widths entails a bunch of CSS, which means the epub renderer would have to also do that, which basically means a whole web browser.
also that things like footnotes or anything that has a floating reference (table of contents links for example) might get very complex or require javascript
since ebooks are primarily (only?) text you don't have to worry about UI elements and such which simplifies a lot of the css
footnotes aren't really a thing with ebooks (at least as far as displaying the note on the page with the text). Because it is just a html renderer, footnotes are presented as mutual <a> elements located in the endnotes at the end of the book
The greatest surprise is that no popular web browser opens ePubs natively! This in 2025, where they all display PDFs, high resolution video, 3D games, etc.
A bigger surprise (failure) is that the EPUB folks have continued to evolve their bespoke format instead of ditching it for something that legacy browsers already know how to handle. An "EPUB" should just be a Mac-style bundle (i.e. a directory) with an XHTML file in it written to conform to a specific metadata profile.
EPUB isn't all that different from what you're describing. It's bundled as a ZIP archive with a couple of XML metadata files - and the content is split into one HTML file per chapter or section to make it easier to handle - but the idea is the same.
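As a rough illustration of that layout, the sketch below builds a toy archive in memory and then locates the package metadata file the way a reader would: by opening `META-INF/container.xml` and reading the `full-path` attribute of its `rootfile` element. The file names under `epub/` and both helper functions are assumptions made for the example, and the toy archive is far from a valid, complete EPUB.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def build_toy_epub():
    """Build a tiny, incomplete EPUB-like archive in memory for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry comes first and is stored uncompressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr(
            "META-INF/container.xml",
            '<container version="1.0" '
            'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
            '<rootfiles><rootfile full-path="epub/content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles>'
            "</container>",
        )
        archive.writestr("epub/content.opf", "<package/>")  # placeholder metadata
        archive.writestr("epub/text/chapter-1.xhtml", "<html/>")  # placeholder chapter
    return buffer.getvalue()


def package_path(epub_bytes):
    """Return the path of the package (OPF) file named in container.xml."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        container = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    return rootfile.get("full-path")


toy = build_toy_epub()
print(zipfile.ZipFile(io.BytesIO(toy)).namelist())
print(package_path(toy))
```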
Hey, ChatGPT, tell me what's wrong with this person's comment.
> [T]he third comment violates the Cooperative Principle, specifically Grice’s Maxims of Relation and Manner, and ends up implying ignorance where there is none. Let’s break it down a bit more with that framework in mind:
> VIOLATION OF GRICE’S MAXIMS
> The second ["EPUB folks have continued to evolve their bespoke format instead of ditching it for something that legacy browsers already know how to handle"] commenter criticizes EPUB for continuing to evolve a packaging format that is not browser-native. They're not confused about what EPUB is—they're lamenting that it isn’t something simpler, like a plain web bundle a browser could just open.
> The third commenter responds by explaining what EPUB is, as if that somehow rebuts the original critique.
> Factually true.
> Entirely irrelevant in context.
> This failure to meet the relevance standard creates an implicature: the previous commenter must not have understood the format they were critiquing.
> THE IMPLICATURE TRAPS THE THIRD COMMENTER
> By stating something the second commenter obviously already knows, the third commenter unintentionally shifts the conversational footing in a way that belittles rather than builds. That’s why the tone feels off: not because of overt rudeness, but because the presupposition of ignorance is baked into the structure of the reply.
> FINAL THOUGHT
> The third comment reads like an attempted “correction,” but since the original comment didn’t contain a factual error, only a value judgment or proposal, this “correction” becomes a non sequitur—one that subtly undermines the prior speaker’s credibility while failing to address their actual point. That’s what makes it rhetorically broken, even if factually fine.
There’s also an epub-namespaced set of attributes which extend XHTML with ebook specific semantics. But those typically aren’t necessary for the visual representation of books.
I would also recommend using Microsoft Edge's built-in ReadAloud (TTS) on standard ebooks. They have a mind boggling number of hyper realistic voices; more than any other browser I've tested.
What I'm missing in modern ebooks (like epub format) is more metadata. Who's talking (character data)? What emotional aspects does the scene have (angry, happy, sad, in a hurry)? Where does the conversation take place (geodata)?
I'd love to see at least:
- character: ID, Name, Gender, Age
- mood: ID, Name (Happy, Sad, Angry, ...)
- place: ID, Name, Acoustic (Outside, Inside, Cave, ...)
This could be prepared by the author, work as a glossary, enrich the whole ebook experience and also would be a great preparation to teach AI voices how to convert a book into an audiobook.
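One way to picture the proposal above is as a small glossary of records that lines of dialogue refer to by ID. The sketch below is purely hypothetical: no such EPUB vocabulary is being described, and every class, field, and sample value is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Character:
    id: str
    name: str
    gender: str
    age: int


@dataclass(frozen=True)
class Mood:
    id: str
    name: str  # e.g. "Happy", "Sad", "Angry"


@dataclass(frozen=True)
class Place:
    id: str
    name: str
    acoustic: str  # e.g. "Outside", "Inside", "Cave"


@dataclass(frozen=True)
class Line:
    """One span of dialogue, pointing at glossary entries by ID."""
    text: str
    character_id: str
    mood_id: str
    place_id: str


@dataclass
class Glossary:
    characters: Dict[str, Character] = field(default_factory=dict)
    moods: Dict[str, Mood] = field(default_factory=dict)
    places: Dict[str, Place] = field(default_factory=dict)

    def describe(self, line):
        """Resolve the IDs on a line into a human-readable summary."""
        who = self.characters[line.character_id]
        mood = self.moods[line.mood_id]
        where = self.places[line.place_id]
        return f"{who.name} ({mood.name.lower()}, {where.acoustic.lower()}): {line.text}"


# Invented sample data.
glossary = Glossary(
    characters={"c1": Character("c1", "Anna", "female", 30)},
    moods={"m1": Mood("m1", "Angry")},
    places={"p1": Place("p1", "Kitchen", "Inside")},
)
summary = glossary.describe(Line("Leave me alone!", "c1", "m1", "p1"))
print(summary)
```

Keeping the records in a separate glossary and referring to them by ID means the running text stays untouched, so a reader or text-to-speech tool could ignore the layer entirely.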
What's the point of reading a book, then? The joy of reading fiction is to try to understand the humanity in the scene. I don't need the author to force feed me all of these details. I want to wrestle with the answers, to try to grasp what it might mean.
The challenge would be balancing that metadata richness without turning the book into a spreadsheet, but if done well (maybe opt-in layers or a toggle), it could really deepen the experience
Specialization I presume, so one produces the metadata that can be consumed by another.
Also, the thing from the above post that stood out to me would be to act as a reminder for the reader. Not so much the location and emotion, but the character data. I've often found myself wondering who the character is that's appeared in a scene, forgetting that they previously appeared earlier.
We use the Flesch-Kincaid algorithm to calculate reading ease. For most books it works pretty well, but for avant-garde prose like The Sound and the Fury it fails pretty badly. It also considers Ulysses to be "fairly easy"!
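For context, here is a minimal sketch of the standard Flesch reading-ease formula, 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). Whether Standard Ebooks computes it exactly this way is an assumption; in particular, the vowel-group syllable counter below is a crude stand-in, and the function names are made up for the example.

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Score text with the Flesch reading-ease formula (higher means easier)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


simple = "The cat sat on the mat."
dense = "Institutional considerations necessitate comprehensive organizational restructuring."
print(flesch_reading_ease(simple), flesch_reading_ease(dense))
```

The formula only sees sentence length and syllable counts, so prose built from short, common words in short sentences scores as easy no matter how hard it is to follow.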
As the linked comment says, it's up to the individual contributor to inform PG of any corrections; SE does not do so as a matter of course (at least, that was the case when I last contributed).
I can't really answer that because I haven't actually tried to use an LLM on any part of the process. The vast majority of the process is semantic markup using (x)html and proofreading. The markup process could, I guess, use an LLM, but most of it is already automated using regex and linting.
My bro-in-law supported his family as a freelance editor for years while my sister was doing the "maternity leave" thing so I know there's a non-trivial amount of work that goes into book editing. Cutting out some of that human labor seems like a good thing for a volunteer project.
it's never too late to expand your "stuff I really like" further into the public domain!
there are whole generations of wonderful and insightful works that essentially disappeared from present consciousness for no reason other than for being old
A good initiative, but the "us vs them" framing — where the "them" are other people trying to do a service for people — gives off bad juju. It positions the value proposition by seemingly denigrating other providers of free ebooks.
It begins with "Other free ebooks don’t put much effort into..." which sounds extremely catty.
Maybe I'm reading too much into it, but it seems there's a way to stand on other people's shoulders and celebrate each other.
A bit of context regarding Project Gutenberg. Its intake process is far from casual. Take a look at Project Gutenberg Distributed Proofreaders (PGDP, [0],[1]), one of the oldest "crowd-sourcing" projects on the net (est. 2000). As you can see from [0], every book goes through three rounds of proofing, where volunteers read each page of text and compare it to the scanned image; then through two rounds of format review, where other volunteers insert or review format markup.
From that 5-pass process the marked-up text is handed to a volunteer "post-processor" who assembles the final HTML or e-book file; then the completed book gets one more "smooth reading" pass before it is posted to PG.
This is the process that produces the books input to Standard Ebooks. That they can still find scanner errors ("tne" for "the", a typical "scanno") demonstrates how difficult it is to see those. But their presence isn't from carelessness or disregard for the value of the books.
In the 20-teens I put in hundreds of volunteer hours at PGDP in all the above roles, and it was very satisfying work. I'd recommend it to anyone wanting an online hobby that feels constructive. Volunteering time to Standard Ebooks would probably feel good as well.
[0] https://www.pgdp.net/c/activity_hub.php
[1] https://en.wikipedia.org/wiki/Distributed_Proofreaders
The work done by Distributed Proofreaders is pretty amazing. I try to contribute my 35 pages as often as I can. The backlog there is pretty insane even while finishing upwards of 150 ebooks per month
it truly is an "online hobby that feels constructive". you get these tiny glimpses into our shared literary/cultural history while knowing that the work you're doing is for the benefit of all (benefit of the public domain)
> The backlog there is pretty insane even while finishing upwards of 150 ebooks per month
Isn't the backlog there mostly in the post-processing step, though? To the point where they're taking finished texts and running them again through the page-by-page proofreading in hope of fishing out more OCR typos and improving the format markup?
You can also contribute at Wikisource if you prefer, that doesn't really have a post-processing step and has much less of a fixed pipeline. (There are explicit "proofreading" and "verification" steps per page, but not much beyond that.)
In a similar vein, there is Wikisource.[0] Wikisource has the advantage of allowing for extensive formatting to closely match the source works due to its wiki-based format, but doesn't have quite as robust processes. Its flexibility is unparalleled though -- it covers virtually any form of scanned print work and even some old movies, and contributors can focus on whatever niches they're interested in if they want.
[0] https://en.wikisource.org/wiki/Main_Page
> doesn't have quite as robust processes
They do have a double-pass system for all works based on scanned pages, which is quite nifty. Green means two passes complete: https://en.m.wikisource.org/wiki/Index:Sophocles%27_King_Oed...
Plus you can just jump in to any work, in true wiki fashion.
I think a lot of people (my past self included) underestimate how much meticulous, behind-the-scenes work goes into something like PGDP
> In the 20-teens
That being 2013 to 2019?
out of curiosity, wouldn't an automated spell check pass help catch ocr errors? e.g. "tne" would be caught immediately.
The most confusing errors are the ones spellcheck doesn't catch because they transform a word into another valid word. And those are the ones we want least.
true, it wouldn't do a 100% job, but it would be another line of defense. the reason I was wondering about it was that the gp cited an example that was easy for humans to miss, but would be caught at once with a spell checker.
there are also statistical methods to detect words that are changed into other, valid words - check out the grammar checker in google docs for instance. again, not 100%, but every bit helps.
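The trade-off discussed in this subthread can be shown with a minimal dictionary check. The word list, the sample page, and the `flag_non_words` helper are all stand-ins invented for the example; the second line uses "arid" as a plausible misreading of "and" to show a real-word error slipping through.

```python
import re

# Tiny stand-in word list; a real checker would load a full dictionary.
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat", "and", "arid",
               "he", "will", "be", "there"}

# Invented sample page: "Tne" is a non-word, "arid" is a valid word.
PAGE = "Tne cat sat on the mat\narid he will be there"


def flag_non_words(text, known_words=KNOWN_WORDS):
    """Return (line_number, word) for each token missing from the word list."""
    flagged = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            if word.lower() not in known_words:
                flagged.append((line_number, word))
    return flagged


print(flag_non_words(PAGE))
```

Only the non-word on the first line is reported. Catching the second line needs context, which is where the statistical methods mentioned above (or a human proofer) come in.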
It would probably also throw out a lot of false positives which would take time to check. Especially in works of fiction, writers could take liberties with non-standard spelling.
Unless tne is an abbreviation and so it should pass. Names are a common place where people make up weird spellings and so spell checkers are annoying. I have terrible spelling, and yet most of the time I run spellcheck it is tripping up on words that are spelled correctly but not in the dictionary (in large part because I run spell check after each revision: words spelled wrong. "Add to dictionary" means that my dictionary is polluted with words that only apply to one document and would be wrong in the next)
An LLM-based spellchecker would've caught it for sure. I am working on one here: https://github.com/pulkitsharma07/spelltastic.io, If anyone has suggestions on how this can help in Project Gutenberg / Standard Ebook's workflows, please reach out to me / open an issue.
I have seen that LLMs are pretty good at understanding context/domain / theme-specific terms, so their spellchecking is pretty good.
the distributed proofreaders process does include a mandatory spellcheck
The amount of this that could be trivially automated fills me with rage.
Even just automated flagging of common errors would save 1000s of volunteer hours.
It's unclear that that would save time. If you put in enough hours to the project, you can get classified as one of those later pass proofers. That is extremely taxing work because most of the scannos have already been found by the earlier proofers. You will "complete" multiple pages without ever finding a scanno. The doubt starts to set in if you are on auto-pilot or not.
Meanwhile, in that early stage, because of the stream of errors, it is easy to pay attention and feel like you are doing rewarding work. Moreover, if you are quite quick and diligent, you can basically just read a book as volunteer work.
Also, sometimes the error is in the source material. Different editors have different opinions about what should be done there. Sometimes I had to re-add mistakes that were "fixed" by early proofers trying to correct grammar, if I recall correctly... it was a while back that I volunteered.
Editor-in-chief here, happy to answer any questions, as always. We also recently celebrated Public Domain Day with an especially notable crop of books, including The Sound and the Fury, All Quiet on the Western Front, John Steinbeck's first novel, some Hemingway, Gandhi, two Dashiell Hammett novels, and more: https://standardebooks.org/blog/public-domain-day-2025
Another question - in https://standardebooks.org/contribute/producing-an-ebook-ste... you talk about "modernising" spelling, e.g. changing "some one" to "someone". This may be against the implicit goal of making these accessible for a general reader, but I prefer to read what was originally written, and it feels like it crosses a line into editorialising rather than letting the original feel stand as-is. (Although of course these texts have already been "editorialised" by their original editors!) Totally your decision given the amount of effort that has clearly gone into this, but I'd be interested to read the rationale for that decision.
I respect this choice of modernization, and I suppose some readers enjoy it, but it makes the publisher's whole work useless to me. When a text has been altered, I can't trust it respects the intent of the author, and any style inconsistency I find may be a by-product of the publisher's mangling.
So, when I care about a book, I never read Standard Ebooks' edition.
By the way, the modernization is more than joining a few words. Sometimes, Standard Ebooks replaces the word used at the time the book was written. For instance:
The previous quote is from Galsworthy's "Forsyte Saga". The author used many French words and French spellings – like "Tchekov" for the Russian playwright that was living in Paris. These subtleties are lost with the modernization. I also think some alterations are plain mistakes. For instance in the same book:
> I also think some alterations are plain mistakes. For instance in the same book:
That one appears to not be a mistake, [0] suggests that not quoting the name of the book of the bible being referred to (so [Job] rather than ["Job"]) is the style accepted by Chicago, MLA, and APA.
[0] https://en.wikipedia.org/wiki/Bible_citation#Common_formats
I respect their choice too, but like you the reason for my question was that I feel I can't trust the end product. Alex said "We only make sound-alike changes, like to-morrow -> tomorrow", which I could just about get along with, but Mahomet -> Muhammad creates an entirely different flavour for me. As Alex said, that's fine, in that it doesn't mean the other editions aren't available, but it is a shame for me when I essentially don't want to use something that has been put together so painstakingly.
Anyone who has read books for classes in high school and above knows that even classics are routinely fucked with by publishers. Even early in the work's history. I remember even in middle school someone would invariably end up with a different publisher's edition of a book for summer reading or whatnot and we'd find changes.
Unless the book is specifically declared to be the original text - and it may have to specify which original text - they're going to be edited.
However, in electronic form it should be possible to include both in one file, or two files with the original in a repo branch once all the document structure stuff has been added. That text will never change, so merging formatting-only changes should be pretty painless.
For every book, Standard Ebooks provides a hyperlink to the original scan, a hyperlink to the original transcription, and a full revision history in which all spelling updates have been clearly marked. To me, this already seems to be going above and beyond—most ebook repositories provide less. I can’t imagine that the marginal benefit from keeping multiple parallel branches would be worth the cost in volunteer time and labor, when maintaining pristine first editions isn’t even a goal of the project.
And of course, none of this matters in the slightest for translated works, which almost by definition includes the vast majority of works ever written.
"As it was written" is a very high bar that is simply not attainable for anything other than fairly recent works in your native language.
I'm disappointed to learn of this editing in Standard Ebooks, having had the misfortune to buy a Barnes & Noble copy of the complete Sherlock Holmes that had a similar approach taken. Book looks lovely, but has an altered chapter order, Americanised spellings and lots of typos. There is a certain amount of editing needed to render the likes of Shakespeare and Samuel Pepys readable, as Middle/Old English is quite a different language, but slight variants from 150ish years ago, or dialects, or the correct spelling according to the Queen's English, add flavour and should not be altered.
That's fine! Our editions didn't erase any of the other editions you can find online and in print. You're more than welcome to select any edition that fits your reading preferences.
Apologies if that came across as at all critical. Genuinely interested in the rationale rather than it being a how-dare-you demand for you to explain yourself!
Spelling varies widely across the eras our ebooks were published in. Therefore we attempt to standardize spelling to what a modern reader might be familiar with. We only make sound-alike changes, like to-morrow -> tomorrow.
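As a rough sketch of what a sound-alike change such as to-morrow -> tomorrow could look like when automated, the snippet below applies a small replacement table with word boundaries and keeps the capitalization of the first letter. The table and the `modernize` function are hypothetical and are not Standard Ebooks' actual tooling or word list.

```python
import re

# Hypothetical mapping for illustration only.
MODERN_SPELLINGS = {
    "to-morrow": "tomorrow",
    "to-day": "today",
    "to-night": "tonight",
}

# Longest keys first so a longer spelling is never shadowed by a shorter one.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(MODERN_SPELLINGS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def modernize(text):
    """Replace each archaic spelling, preserving an initial capital letter."""
    def replace(match):
        original = match.group(0)
        modern = MODERN_SPELLINGS[original.lower()]
        return modern.capitalize() if original[0].isupper() else modern

    return _PATTERN.sub(replace, text)


print(modernize("To-morrow we sail; to-night we rest."))
```

Hyphenated forms like these are safe to match mechanically. A pair such as "some one" is not, because "some one of them" should stay as it is, so changes of that kind still need a human look.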
This is a common practice that editors and publishers have quietly engaged in for centuries. For example, today you are not reading Shakespeare in the way it was spelled in its first printing.
A wonderful project!
After reading this comment I couldn't help but picture medieval monks, toiling away copying old manuscripts into "modern" English. Normally a thankless task, so thank you!
Is there epub-specific html markup you could add to changed words to indicate their original spelling? Like alt text for images, but in a span around a word? There's the html "title" attribute, of course, which would work (mouseover shows the title attribute's value), but that isn't semantically correct for the purpose.
No, there are too many things to track, but all of it is in the git history. Editorial changes have a commit message prefaced with [Editorial].
And you're for sure not speaking it like he would have
Fair enough - thanks for the explanation.
> For example, today you are not reading Shakespeare in the way it was spelled in its first printing.
However, we call modernised Shakespeare “abridged”.
Abridged means shortened, not modernized.
I appreciate this service you are doing, but it would be much much better to also have an original version with archaic spelling. Double bonus points for having optional (hidden by default) explanations of words. This would be tremendously helpful to some students.
[flagged]
> "Don't like it? Here is a full refund and you are free to read some other version."
That is not at all what I said.
> You can't claim to care about preserving the works while changing them, and that is changing them.
We do not and have never made that claim. We are creating our own editions of these public domain books, not engaging in historical preservation.
If you want to read classic books in their original spelling, then you must locate first editions. Editors and publishers have updated both spelling and punctuation as a matter of course for centuries. Just look at any three editions of any Jane Austen novel - and you could never read an edition of Shakespeare more recent than 1800.
That’s how I read it. What do you mean then? It sounds like the only edition you may offer is the editorialized one, if applicable.
As someone who writes I greatly dislike this. These are my words, not yours.
A translation across time and generations is a completely different matter.
I think it's important to note that in the past, typesetters and printers had a much more editorial role than the process today. Authors would submit handwritten manuscripts and the typesetters in many cases would have to fix the author's mistakes, spelling, etc. to conform the manuscripts to printing standards with the author having limited communication or ability to proof the final plates
Today, it's much easier for authors to have a greater say in the final presentation due to the digital composition process
You can't use an appeal to tradition as the argument for revision.
I don't see why anyone should care that publishers have edited in the past anyway, even in this particular discussion where my own argument is for conservation. Publishers have done all kinds of things that this very project itself criticises and pointedly set themselves apart by doing differently. So, it's a weak argument for them.
Aside from that, what any other publishers do, even if it's totally common and even universal, doesn't change the argument they were making: that those edits cross a line that fixing typos doesn't cross.
By the time they reach the public domain they aren't though, and the public can and should do with them as they see fit
Modernizing / adapting is the least damaging change to be done here
For what it’s worth, that’s also exactly how I read your response, which was (to repeat) ‘That's fine! Our editions didn't erase any of the other editions you can find online and in print. You're more than welcome to select any edition that fits your reading preferences.’
I think that Standard Ebooks is a great-sounding project, but I honestly found your response not just flippant, but passive-aggressively rude to the original poster.
But, full disclosure, I also think that it would be a good idea to preserve the spellings found in the original editions you are digitising. So perhaps I am inclined to feel the bite of your response more than someone who just doesn’t care.
> I honestly found your response not just flippant, but passive-aggressively rude to the original poster.
I didn’t read it that way at all. How would you have worded it in such a way as to sincerely express the stated sentiment without coming across as passive‐aggressively rude?
> How would you have worded it in such a way as to sincerely express the stated sentiment without coming across as passive‐aggressively rude?
Something like ‘While we understand that some people would prefer to read the original texts (modulo typos, formatting errors and the like), we think that it is preferable to modernize spelling because X, Y and Z.’
In other words, the polite response to ‘I like most of what you’re doing, but I dislike this particular thing’ is not ‘Fine! You’re free to go elsewhere,’ with an implied ‘don’t let the door hit you on the backside on your way out,’ but rather to engage and explain.
Again, I have to admit my own bias against the policy and consequent bias in favour of the original poster.
It is what you said. And for the record, I love the idea of this project. I just agree with the other poster about the location of this line that's all.
The text you have in your “quote” is a lot more snarky and rude than the original message. Did they edit their comment or something? Otherwise—why not quote an actual quote?
Considering the thrust of my comment, I don't understand the question. Obviously paraphrasing someone else's words into ones you like better is a fine and acceptable thing to do. So clearly I am just illustrating the problem by example.
The real answer is twofold.
1. We don't have a special 3rd kind of quote or other punctuation mark for reinterpreted references.
2. The real one: This is not a quote that lies as you imply. It is a new message that merely uses quotes to denote a speaker, as in a purely fictional work, where the characters' dialog is in quotes even though no actual human was actually quoted.
Are there any other conundrums and baffling mysteries I can clear up for you?
When you use that syntax it looks like you are calling out an explicit quote; you may think that it's a reasonable paraphrase but I think most readers will see what you did as a strawman instead of a paraphrase.
Better to write inline "I feel like what you said amounts to [...]" to reduce the perception that you're making up quotes that someone didn't say or even clearly imply.
No one literate is in any danger of misinterpreting this very basic technique. I don't care about anyone else because it doesn't matter, they will misinterpret regardless, deliberately.
“I wanted a pure fictional speaker to argue against.”
Ok, thanks, that makes sense.
Ah but I did paraphrase, and you did not. My paraphrasing was not a lie, and yours is.
> Obviously paraphrasing someone else's words into ones you like better is a fine and acceptable thing to do.
Wrong. Not only is it tasteless and dishonest (not "fine"), it is against the rules of this site. But regardless of whether it's allowed elsewhere, you still shouldn't do it. (See "tasteless and dishonest".)
What's the point of including books that aren't public domain yet in your collections?
It makes it hard to browse those collections to find actual books to read. The first 3 series I clicked on all said "not P.D." (which at first I didn't know what "P.D" meant - remember your audience does not have your level of familiarity with your context, perhaps a tooltip on that badge would help)..
Then I see "this book will enter public domain in 2050"..
I commend you for this project, it's really awesome work.. From a user's experience, it would be great to have a filter on your various lists that restricts only to books that are available, and excludes these books that are not yet in your collection.
In addition to what Robin mentioned below, some of these placeholders are for books on our Wanted list. I also think it's useful to show readers that particular books are looking for volunteers to produce, and also to show that some books they might want are locked away by copyright for possibly decades. In that sense it's partly a political message.
It sounds like implementing the filter gp suggested would still send the political message though.
Whenever we add a collection, the books that are in that collection but not yet in PD in the US get placeholders. But a filter might not be a bad idea.
Which ebook reader works well with standard ebooks in 2025?
(More concretely my reader is a 2nd-gen kindle which is basically useless these days and I’d love an idea of something that can display standard ebooks with all their advanced formatting)
Thanks!
I read on an old Kobo, using Kepub files. Their Kepub renderer is quite good.
I think Kindle's renderer hasn't changed significantly for many years, and it had always been pretty bad. I always say that Kindle seems to have been created by people who hate books.
The best renderer around is iBooks on an iPad, which as far as I can tell uses an up-to-date Webkit.
I'd suggest KOReader, on various devices, as the best renderer and interface.
I read standard .epub files with KOReader on my Kobo Aura H2O. It's faster, nicer-looking, and more customizable than the stock reader, and the installation instructions were complete, correct, and easy to follow.
Thanks! I don’t like reading on a backlit screen (hurts the eyes) so iPad is a no-go, but a kobo would probably work!
Kobo Libra 2 is a great e-reader. Works well one-handed (screen rotates for left/right hands), has buttons for page turns. Integrates with Overdrive (what Libby uses). Drawbacks are Kobo's bookstore is weaker than Amazon/Apple. Screen is also not flush which means some dust can collect in the recess.
I also use a Kobo and occasionally an iPad. Do you know if it's possible to sync progress between the two?
I've been meaning to try calibre-web, but I'm doubtful iBooks will support OPDS.
A note for Kobo users: a lot of us (myself included) use Calibre to manage and upload our ebooks. Something about Calibre messes up Kepub files and strips out a lot of the formatting (including the book’s cover).
If I want to appreciate a nice Kepub from Standard Ebooks, I upload it directly to the Kobo.
A Kobo would be a great choice. I use a Kobo Libra 2 and love it a lot more than my old Kindle Paperwhite that got stolen: https://gl.kobobooks.com/products/kobo-libra-2 The Kobo Sage is also good because it has an 8" screen.
Standard Ebooks offers Kepub files for Kobo devices, which use Kobo's more advanced WebKit-based renderer: https://standardebooks.org/help/how-to-use-our-ebooks#kobo-f...
What did you do with purchased books you had in your kindle? Rebuy them? Just “let them go”?
Thanks for the recommendation!
Fortunately, I had them backed up to a cloud folder. I remember almost deciding not to go to the trouble to back them up, but isn't that how it always works with backups? The Kobo also works with epub.
I recently purchased a Pocketbook Era. It is pretty much the perfect device for me - supports open standards and does not require any cloud account signups to start using it. It is not hostile to the user, 3rd party applications such as Koreader can be simply dropped in and they appear in the menus without any shenanigans like jailbreaking or custom launchers needed.
In my ideal world all devices would be like this.
KOReader for Kindle? https://github.com/koreader/koreader
It does a good job of modernising old Kindles.
Piggybacking: for computers, what is a good epub viewer?
What I'm personally looking for:
- Linux and/or OS X
- No ‘import’ requirement (a viewer, not a collection manager)
- Single page or continuous (no forced double spread)
- No required animations
- At least basic control over font size, spacing, margins.
- Keyboard navigation (at least next/previous page)
Check out Foliate, it's a really nice reader and Standard Ebooks display quite nicely using Foliate IMO.
For Linux, Foliate is very nice.
Apple Books on macOS is pretty nice
That’s calibre viewer, but it may require some customization to get something nice. Foliate is ok, but it’s a library. I’d say that’s OK because epub is a zip file and you need to extract it to read it.
Zathura is nice. Has vim bindings and a minimal UI.
Alexandria.
OS X: FB Reader
For Android, Moon Reader Pro.
Unmatched UI tweaking features which make reading a pleasure. Syncs bookmarks with cloud services, thus across different devices.
My Kindle is 8 years old and works excellent with standard ebooks. I think you can select any device that you prefer and it will be good.
Oh so you have one of the new Kindles!!
For reference my gen 2 kindle is 16 years old.
I love this. However, I couldn't find an alphabetical list of authors, which is the way I wanted to browse on my first visit. Instead my only option is to show 48 on a page and paginate through, which is tedious. I know there are author pages - e.g. https://standardebooks.org/ebooks/william-makepeace-thackera... - so I presume it's feasible. An author index would significantly increase my likelihood of understanding what's available and engaging with the content.
We don't have a list of authors yet, but that's a good idea to add!
You could reuse whatever process generates the sitemap: https://standardebooks.org/sitemap
All the author pages come before any pages with books from those authors.
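A minimal sketch of that idea in Python, assuming the sitemap follows the standard sitemaps.org XML schema and that author pages are the URLs with exactly one path segment after `/ebooks/`. The sample XML, the `example.org` URLs, and the function name `author_urls` are made up for illustration; none of this is taken from the Standard Ebooks codebase.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by the sitemaps.org schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data, not the real sitemap contents.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/ebooks/author-two</loc></url>
  <url><loc>https://example.org/ebooks/author-one</loc></url>
  <url><loc>https://example.org/ebooks/author-one/some-book</loc></url>
  <url><loc>https://example.org/collections</loc></url>
</urlset>"""


def author_urls(sitemap_xml: str) -> list[str]:
    """Return sorted URLs whose path is exactly /ebooks/<author-slug>."""
    root = ET.fromstring(sitemap_xml)
    found = set()
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        segments = [s for s in urlparse(url).path.split("/") if s]
        # Assumption: one segment after "ebooks" means an author page,
        # two or more means a book page.
        if len(segments) == 2 and segments[0] == "ebooks":
            found.add(url)
    return sorted(found)


if __name__ == "__main__":
    for url in author_urls(SAMPLE_SITEMAP):
        print(url)
```

Sorting by URL gives an index ordered by slug (first name first), so a real author index would probably want to sort on a proper name field instead.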
https://standardebooks.org/bulk-downloads/authors
Links in the first column.
Hi, Alex. Is there any way to browse the ebooks filtered by language? I tried to find some texts in French, but it doesn't seem to have any.
Standard Ebooks only works on English-language books, as typography varies between languages and we're only experts in English.
I can tell you there is a lot of appetite for other languages. I looked at the project and the amount of stuff that would need to be rewritten to work with multiple languages was daunting. I would consider working on making your documentation and workflow functional with multiple languages.
Lots of people have tried similar projects in other languages but as far as I know none have persevered.
Personally I think it's important to have one person in charge who is able to approve of the quality of all the project's output; for now, at SE, that person is me and I'm only an expert in English.
Project Runeberg seems to be still going after 30-odd years.
Project Runeberg is trying to be a nordic Project Gutenberg, not a nordic Standard Ebooks.
Enlightening comment!
Same for me. I think it's English only.
I am from India. Could you add local UPI based donation option at some point? Not everyone has card here.
Great work! Gutenberg project books have always been a pain to read. Thank you for caring!
Wonderful project! One thing I wish the website would have is being able to find the right book to read out of this enormous list — e.g. showing / sorting by Goodreads ratings (which I realize you might not want to do), or at least having some kind of a "Featured" section with the most critically acclaimed / must read books of the project on one page.
There are around a dozen collections on the (not prominently featured) collections page[1] like Le Monde's 100 Best Books of the Century and Modern Library's 100 Best Novels, etc.
1. <https://standardebooks.org/collections>
Steinbeck was the first name I searched for, so this was great to see even if his major works won't be available for some time. I do wonder how badly the Steinbeck or Faulkner estates are hurt by the sudden loss of royalties? Imagine working hard to write a book to make a living and then just under a hundred years it's taken away from you. Also, AI.
Been using Standard Ebooks for a while now, but wanted to drop by here and say how great this site is! It's replaced P.G. for me (for whatever is on this site, at least) and I like the much nicer formatting on the texts. It's great on both my physical Kindle and Apple Books on my iPhone.
Is there an API or downloadable catalog of the titles? Happy to feature them on meetnewbooks.com so more readers can find them.
Yes, we have complete feeds available for our Patrons: https://standardebooks.org/feeds
Really appreciate the work Standard Ebooks puts into making these texts not just available, but readable
I’d love to know more about the pattern of keeping each book in individual repos, rather than in a singular repo.
Each repo is a history of the ebook including editorial changes, typos fixes, and the like. Having a single repo containing thousands of ebooks and their histories would be pretty annoying to browse.
Presumably to keep the repo size reasonable. Say I want to make an ad hoc contribution to a book, if step 1 is "download this multi-gigabyte repo" then that's a fairly big hurdle.
In your opinion, what is the ebook reader you like the most ?
Roughly speaking, how long does it take you to produce a single ebook?
Once you're very familiar with the process, you could get a draft of a basic prose novel ready for proofreading in a few hours. Then it has to be proofread and completed.
Beginners, and people working on more advanced books, can take much, much, much longer.
it varies widely depending on the length and type of book and how much free time the volunteer has to devote to it
Anywhere between 1 week for the simplest (straight narrative, not too much verse or endnotes) and ~1 year (thousands of endnotes, pages of verse, drama, in-line references to book titles, use of technical terms, etc)
ooo tempted to reprint faulkner as part of a small press, thanks for the idea
I recently started on my first title contribution to the project, it’s a rewarding experience https://github.com/stoyan/edith-wharton_the-custom-of-the-co... It’s HTML all the way down
The step-by-step: https://standardebooks.org/contribute/producing-an-ebook-ste...
In a nutshell: start with a Project Gutenberg text, clean it up to a high standard, have it peer reviewed and published
Love this. So many in the archivist community are only interested in preservation and don't care at all about making the material accessible. Love to see a project like this prioritizing the latter.
You’re spot on with this. I recently converted a local history book from 1911 to Markdown, ePub and HTML and tracked the changes on GitHub. Only a handful of copies of this book exist in physical form and it has been photo copied (which is great).
However, I was completely shot down by the local library when I was discussing it with them. They said they already had a photocopy and didn’t need any more digital editions. I tried to explain the benefits of having it in a machine-readable format but they wouldn’t entertain it. I completed the project for myself, so I wasn’t too bothered, but I thought they might have been interested in archiving it. They weren’t.
My general feeling is that they didn’t like an outsider contributing and touching on a format they didn’t know so got slightly defensive.
Find an archive and make sure they're aware of the work you've done. Archivists always love meeting people who've done good work in the space they're in. Especially when they have some tech chops which is desperately lacking in the space.
Beyond that, if the material is public domain, that library is called The Internet. Post it and promote it. The only reason to seek association with a library is if you're looking for cred for some reason, and that's not the business they're in.
If it's not public domain, or if you haven't marked your derivative work public domain, then you put a library in an awkward position. Realize that these are the types of people who still post little notes by the copy machines saying what's permissible and enjoy policing it.
Most just say no for the same reason that Hollywood returns ideas and scripts unopened. They're busy and the cost/benefit isn't there.
Although the self-described online ones tend to play fast and loose, real librarians have a formal code of ethics which is worth reviewing.
https://www.ala.org/tools/ethics
Interesting. I wonder if libraries suffer a supply-chain risk and so avoid taking contributions from (non-vetted) individuals? I imagine that over time a library gets lots of offers to take "important works of literature" from cranks, and perhaps they've developed this culture to protect them from that. Pure speculation, of course.
Libraries typically don't even accept print books or CDs/DVDs. If there's a donation bin outside it probably isn't even theirs. And if stuff actually winds up with them, it just gets sold off so they can purchase material via vetted channels.
https://www.betterworldbooks.com/go/donate
Thanks for doing this. We need more people to take initiative like this!
can you share the links to your project?
Do you "claim" a book, to make sure that no-one else is trying to work on the same book? I presume that's part of step 4 in your link, given that it would be heartbreaking to get 90% of the way through and then be beaten to it by someone who'd started at roughly the same time!
Yes, you signal your intent on the mailing list subject to approval by the editor-in-chief
Exactly, you do get approval before you start, as step 4 says: https://standardebooks.org/contribute/producing-an-ebook-ste...
In my case I picked a title from the project’s wishlist and almost started, but searching the mailing list showed that someone had just started. I found another title by the same author: https://groups.google.com/g/standardebooks/c/IP0emhSQ6Bw/m/B...
Some of the higher ranking previous discussions:
2017, 441 points, 97 comments https://news.ycombinator.com/item?id=14570035
2019, 820 points, 131 comments https://news.ycombinator.com/item?id=20594802
2022, 1578 points, 256 comments https://news.ycombinator.com/item?id=32215324
2024, 701 points, 154 comments https://news.ycombinator.com/item?id=38831219
It's thanks to this site that I learned that Kobo uses a really bad renderer for epubs unless converted to their own ebook format (Kepub). It makes a huge difference in appearance and performance on a Kobo device.
https://standardebooks.org/help/how-to-use-our-ebooks#kobo-f...
You don't even have to convert it, just rename the extension to .kepub.epub. https://github.com/kobolabs/epub-spec?tab=readme-ov-file#sid...
This is not entirely correct - Kobo also expects a bunch of special <span>s inserted for things like highlighting and page numbers to work.
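For what it's worth, the rename trick is easy to script. A rough Python sketch, assuming only what the two comments above say: the helper name `kepub_name` is made up, it only computes the new filename (pass the result to `Path.rename` to actually rename), and it adds none of the Kobo-specific `<span>`s, so features like highlighting may still misbehave on a file that was merely renamed.

```python
from pathlib import Path


def kepub_name(epub_path: str) -> Path:
    """Return the path with a .kepub.epub extension instead of .epub."""
    path = Path(epub_path)
    if path.name.lower().endswith(".kepub.epub"):
        return path  # already named the way Kobo expects
    if path.suffix.lower() != ".epub":
        raise ValueError(f"not an .epub file: {path}")
    # "emma.epub" has stem "emma", so this yields "emma.kepub.epub".
    return path.with_name(path.stem + ".kepub.epub")


if __name__ == "__main__":
    print(kepub_name("books/emma.epub"))
```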
It kills me that Kobo is so close to having plain epubs rendered with Webkit but for some reason they just won't take the leap!
I discovered this too. However, I now use Plato Reader on my Kobo with standard ePub and it’s lovely.
I assume KOReader has a better renderer for epub but will have to test how it compares to the stock software+kepub. So far I've only used KOReader on my device.
the only issues i've found with koreader are its default margin size and its display of standard ebooks' titlepages, but (I believe) these can be fixed with a fairly simple user-tweak css
You can set default margins in the user interface of KOReader too.
You can use kepubify to convert epubs to kepubs (and calibre will do this as well)
https://pgaskin.net/kepubify/
And https://send.djazz.se automatically performs the conversion for you with kepubify and sends it to your ereader! No affiliation, just a happy camper chiming in
Wow I never knew this!
Yeah, if you just load normal epubs it defaults to an old version of Adobe Digital Editions unfortunately.
Yes, though I understand Kobo is working on correcting these issues with the epub format.
Are they? Where have you heard that?
Recently Calibre was updated to convert things to kepub when loading to Kobo devices - see https://www.omgubuntu.co.uk/2025/03/calibre-update-convert-k... - but I haven't heard anything about Kobo itself doing anything to improve this.
I love Standard Ebooks.
See also Global Grey ebooks: https://www.globalgreyebooks.com/ One woman has formatted hundreds of ebooks herself.
Most of the big print-on-demand companies will now make hardcovers, for about $10. You can't feed raw Gutenberg files into those mills, but these "standard ebooks" have enough formatting info for that. So that would be a useful service.
What are some examples of companies that do this?
Are there any non-English books? When I go to the search page, language isn't even a pull-down option, so I'm guessing not.
There is a huge world of out-of-copyright non-English texts, and Project Gutenberg has many thousands of them. I wonder if any interest could be generated to help bring them in by posting on foreign language subreddits or something.
Just looked through the entire website to answer this question. Seems like they only accept English books :( "Types of ebooks we don’t accept: - Non-English-language books. Translations to English are, of course, OK." (https://standardebooks.org/contribute/collections-policy)
Weird. Why the explicit rule against them?
I understand if the existing editors can't personally proofread the submissions, but that's why peer-review exists. Or an open-source project in general where people can post corrections. Jimbo Wales didn't need to speak a hundred languages to launch Wikipedia.
To me, that niche is already covered by Wikisource. Standard Ebooks as a project is very strict about conforming to its editorial and quality standards. Onboarding more languages would require volunteer editorial experts in those languages.
Besides, projects in other languages can absolutely build upon Standard Ebooks work, but to expect Standard Ebooks itself to support other languages is just too outside the scope and expertise of the volunteers available.
If you were to find expert editors for the other languages, would you let them publish works in those languages on the Standard Ebooks website?
well, that would be up to Alex. but as that would require a pretty substantial organizational and responsibility shift, I imagine, no, he would not.
As it is now, Alex is editorially responsible for all output of Standard Ebooks. Changing that would require someone with the time and experience willing to take on all the responsibilities that Alex currently has for each of those other languages.
A well-defined focus can help management of a project, for example, by not having the participants spread too thin.
The website and toolchain are open source, so if someone would build an international version, and do it persistently, I'm sure they would link or maybe even merge the projects a bit.
Answered here: https://news.ycombinator.com/item?id=43601273
Love that they're using Git and keeping everything open. It's rare to see such a thoughtful blend of literary love and modern tooling
That website is hopefully not an indication of how these ebooks will look on my mobile.
A screenshot from the typography section:
https://ibb.co/nqhyTR3M
The manual has some known issues on mobile, I believe there's a GitHub issue open about it. It's low priority because the manual is rarely read on mobile. PRs welcomed!
if you're reading a style manual it might :)
but no, the manual itself is not really mobile-friendly. you can check what an actual ebook would look like though:
https://standardebooks.org/ebooks/louis-couperus/the-tour/al...
Much too tight leading for a book text.
This is a leading you'd see on the ingredients list of an energy bar packaging.
The other choices are fine.
Caveat: I studied typography and worked in that field for a decade.
the online view is not the primary way readers are expected to read the ebooks. downloading the epub and reading on an ereader (edit: where line height and font size are customizable) is the expected and best supported method
however, contributions are very welcome and everything is hosted on GitHub if you'd like to suggest improvements; or send your thoughts on the mailing list
But if they have an online view, why not make it readable? The suggestion above about the line height is presumably a 1-line CSS change.
presumably, which is why i encouraged submitting a note to the mailing list or the standardebooks/web repo on github
I think the point of parent was that the issue, the too narrow leading, is not a change that needs debating. On a mailing list, issue tracker or whatever.
Or if you think it actually was, this was not a project that I'd want to get involved in.
As someone who reads mostly ePubs, many of which suffer from issues this project aims to fix, I mean that in a very caring way.
i also don't think it needs debating. my point was that the issue, the too narrow leading in the online view, is just not going to be fixed unless someone points it out to someone that can fix it. if that's you, great! you can submit a PR to the git repo. or, if you don't have the time or don't want to go find where the line height is defined, submitting a comment to the mailing list or noting it on the issue tracker will let a volunteer fix it
from my own experience, Alex is very amenable to improvements. the online view of the ebooks is just not used by probably anyone to actually read the books (just use an ereader app or device, it's a way better experience anyway) and because of that no one has cared to point it out until now
For those who are into ebooks and audiobooks, I’ve been having a blast with the app Storyteller: https://storyteller-platform.gitlab.io/storyteller/
You can self host the server, and it will create epub3s with the audiobook and ebook synced up.
Then you use the mobile app to listen and read the books. It works way better than whispersync from kindle.
Read on your boox e reader then switch to your phone and listen and everything syncs up seamlessly.
Where do you find the books to host?
Also your link has an erroneous .com
You can get drm free audiobooks from libro and you can strip drm from kindle and audible books with calibre and libation.
I would love this if it were to produce viable unabridged ebooks of Francis Parkman’s “France and England in North America” vol 2-7. All the existent digital editions were poorly scanned and don’t separate footnotes from the main text.
If you have the cash, you can pay them to do so! Scroll down to “SPONSOR A NEW EBOOK”:
https://standardebooks.org/donate
> Sponsoring a new ebook of your choice calls for a donation of $900 + $0.02 per word over the first 100,000
I love this project and don't want to disparage the work that goes into it, but 900 USD, and it has to be a book that is already transcribed online? That seems a bit much to me.
You’re paying a human to remaster the book word for word and hand transform it into epub html paragraph by paragraph.
How much less would you do it for?
That sounds quite reasonable to me. That's about what a freelance proofreader charges to edit a book, if https://thewritelife.com/how-much-to-pay-for-a-book-editor/ is correct, and that's working with a (likely Word) document which isn't poorly scanned from paper.
If you pooled the funds with 10 other people who want the book, it would be $90 each. Or imagine pooling it with 100 people.
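To put numbers on the quoted pricing and the pooling idea, here is a small Python sketch. It assumes the quote means "$900 flat, plus $0.02 for each word beyond the first 100,000"; that reading, and the function names, are my own, so check the donate page before relying on it. Amounts are kept in integer cents to avoid floating-point rounding.

```python
BASE_FEE_CENTS = 90_000   # $900 flat fee, from the quoted text
PER_WORD_CENTS = 2        # $0.02 per extra word, from the quoted text
INCLUDED_WORDS = 100_000  # words covered by the flat fee


def sponsorship_cost_cents(word_count: int) -> int:
    """Total cost in cents under the assumed reading of the pricing."""
    extra_words = max(0, word_count - INCLUDED_WORDS)
    return BASE_FEE_CENTS + PER_WORD_CENTS * extra_words


def share_per_backer(word_count: int, backers: int) -> float:
    """Each backer's share in dollars if the cost is split evenly."""
    return sponsorship_cost_cents(word_count) / backers / 100


if __name__ == "__main__":
    for words in (80_000, 150_000):
        total = sponsorship_cost_cents(words) / 100
        print(f"{words} words: ${total:.2f} total, "
              f"${share_per_backer(words, 10):.2f} each for 10 backers")
```

Under that reading, a book of 100,000 words or fewer costs the flat $900 ($90 each for 10 backers), while a 150,000-word book costs $1,900.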
You can also join our Patrons Circle to have this book added to our Wanted Ebooks list, which is a list of suggestions for our volunteers to work on: https://standardebooks.org/donate#patrons-circle
Looks like a great project, and one sorely needed by people like me who find themselves trying to get hold of old books they can't get in their local library and that are too expensive to buy secondhand.
As far as I know Standard gets their raw ebooks from Project Gutenberg which has a vastly greater collection of public domain works. What they're doing is typesetting them for the average reader. But if all you're looking for is just the content, Gutenberg is the place to look for ethically clean copies.
Tracking down older or out-of-print books can be weirdly frustrating, especially when prices for secondhand copies get absurd
The shadow libraries such as Anna's Archive are a treasure trove of old books, and you're not breaking any imaginary law by downloading old books which are out of copyright.
If a book is out of copyright you can usually find the scan on Internet Archive. No need to look elsewhere at all.
The internet archive's open library will also link to Standard Ebooks (and Gutenberg and a few others) if a version exists of a book you are looking at e.g.:
https://openlibrary.org/books/OL37044523M/The_Woodlanders
If a book is still in copyright, chances are you’ll find it there as well.
Scans suck though, even a badly OCR’ed EPUB is way better.
The scans can have a different copyright date than the book itself.
There is no copyright on scans.
Scanning is not transformative and does not result in a derivative work which can be protected by copyright law.
https://en.wikipedia.org/wiki/Wikipedia:Scanning_an_image_do...
https://law.stackexchange.com/questions/1214/who-owns-a-copy... points us to read the Compendium of US Copyright Office Practices at https://www.copyright.gov/comp3/docs/compendium.pdf
> 313.4(A) Mere Copies
> A work that is a mere copy of another work of authorship is not copyrightable. The Office cannot register a work that has been merely copied from another work of authorship without any additional original authorship. See L. Batlin & Son, 536 F.2d at 490 (“one who has slavishly or mechanically copied from others may not claim to be an author”); Bridgeman Art Library, Ltd. v. Corel Corp., 36 F. Supp. 2d 191, 195 (S.D.N.Y. 1999) (“exact photographic copies of public domain works of art would not be copyrightable under United States law because they are not original”).
A pdf file can contain more than just the raw images of the pages.
Certainly! If you add my latest Kirk/Spock slash fanfic to the end of the text, then that is transformative, so the resulting PDF is covered under copyright.
But you wrote "scan". Adding an OCR'ed text layer, or doing manual proofreading and layout ("sweat of the brow") is not sufficiently transformative to have copyright protection.
And we were specifically talking about scans of old books stored in shadow libraries.
https://dave.autonoma.ca/blog/2020/04/11/project-gutenberg-p...
> Of all these projects, the most amenable to automatic typesetting are those produced by Standard Ebooks and HTML Writers Guild. The benefit of using HTML Writers Guild is their semantic markup and simple document type definition (DTD) file. Standard Ebooks, as the name suggests, are brilliantly standardized and have an excellent Manual of Style that describes what to expect from the XHTML.
How about https://en.wikisource.org/wiki/Main_Page ?
It's not very obvious, but Wikisource provides EPUBs via the Tools menu for every book.
I really miss some kind of organization. I can't find a most downloaded page, or even some recommendations lists.
Does anyone know why the beautiful covers disappear when I import these books in standard format into Calibre?
I like the idea. I read a bunch of classics from Gutenberg. In reality so many old books are very long and boring I ended up getting more modern books from the library instead.
Maybe TikTok ruined me but maybe these things really do literally have a shelf life. Hopefully reformatting will help. Perhaps a better way to review and find the gems would be most helpful..
Perhaps it's not just about the 'shelf life' of a book, but also the language and style they use. The more archaic the language, and the more distant the style that the authors use, the harder it is for me to focus on the book.
Perhaps it would be useful to have expertly abridged and modernized versions of (e)books, with interpreter's notes for each change.
> Perhaps it would be useful to have expertly abridged and modernized versions of (e)books, with interpreter's notes for each change.
A good AI can do this for you nowadays. So if anything it's nice to have the original version available.
Did you ever consider making them public domain but still offering to charge optional $10 donation for download?
I’m interested in a similar approach for a rare book library, but funding for staff is a real challenge so we want to make some kind of revenue stream.
Standard Ebooks grew out of a pay-what-you-want experiment that Alex did ~10 years ago
It surprises me that the eBook (clarification: epub) format is basically XHTML because 1) that means that every eReader needs to basically be a web browser 2) this sounds like it would make reformatting for different devices NOT easier
It makes a lot of sense when you recall that HTML and its ancestors were designed to mark up and format documents, i.e. books. One of the most fundamental elements is <p>, which stands for... paragraph.
Each renderer differs in capabilities, and most are stuck in a subset of early-2000s capabilities, so designing an ebook is very much like designing for the 90s era web. Lots of hacks are required to get the same file to look good on many different renderers, and achieving that is one of the goals of Standard Ebooks.
Yeah (i guess you mean epub), though in practice readers support only a tiny subset and epubs avoid using anything fancier than basic XHTML. Epubs that try to use fancy stuff (like most CSS outside of setting fonts - that readers can ignore either because they do not support it, or because the user wants to use another font) tend to not display correctly.
Including a web browser seems a lot easier and simpler than coming up with your own rendering system once you want to support a feature set past the trivial.
Also, xhtml is just markup. It doesn’t mean you have to support all the possible tags and styles of modern html and css. It would be a sensible choice even if you had basic needs. You just parse it into whatever representation you want.
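To illustrate the "just parse it" point, a minimal Python sketch that treats a chapter as plain XML and pulls out the paragraph text with no browser involved. The sample chapter and the function name `paragraphs` are invented for the example. One caveat: a strict XML parser like this one will reject named HTML entities such as `&nbsp;` unless they are declared, so a real reader needs more care than this.

```python
import xml.etree.ElementTree as ET

# Tags in XHTML documents live in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical chapter, invented for this example.
SAMPLE_CHAPTER = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter I</title></head>
  <body>
    <h2>Chapter I</h2>
    <p>It was a <em>dark</em> and stormy night.</p>
    <p>The rain fell in torrents.</p>
  </body>
</html>"""


def paragraphs(xhtml: str) -> list[str]:
    """Return the text of every <p>, with inline markup flattened."""
    root = ET.fromstring(xhtml)
    return ["".join(p.itertext()).strip() for p in root.iter(XHTML_NS + "p")]


if __name__ == "__main__":
    for text in paragraphs(SAMPLE_CHAPTER):
        print(text)
```

From here a renderer can lay the text out however it likes, which is the sense in which the markup itself does not force a full browser on the device.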
> Also, xhtml is just markup.
And so it's not a programming language runtime (i.e. javascript or wasm), nor a css renderer, nor a bunch of web-apis.
It's these things, not the (X)HTML parsing and rendering that makes a browser the complex thing it is.
this also somewhat surprised me at first but I think it's obvious in hindsight, though they don't have to be a full-blown web browser (you can go read the epub specs at W3C to see what's supported)
as for (2) I'm not sure why you think it would make it less easy? being html, text reflows automatically based on screen size, font size, line height, etc
I guess I assumed that, for example, multi device support on websites for various device widths entails a bunch of CSS, which means the epub renderer would have to also do that, which basically means a whole web browser.
also that things like footnotes or anything that has a floating reference (table of contents links for example) might get very complex or require javascript
since ebooks are primarily (only?) text you don't have to worry about UI elements and such which simplifies a lot of the css
footnotes aren't really a thing with ebooks (at least as far as displaying the note on the page with the text). Because it is just a html renderer, footnotes are presented as mutual <a> elements located in the endnotes at the end of the book
The greatest surprise is that no popular web browser opens ePubs natively! This in 2025, where they all display PDFs, high resolution video, 3D games, etc.
A bigger surprise (failure) is that the EPUB folks have continued to evolve their bespoke format instead of ditching it for something that legacy browsers already know how to handle. An "EPUB" should just be a Mac-style bundle (i.e. a directory) with an XHTML file in it written to conform to a specific metadata profile.
EPUB isn't all that different from what you're describing. It's bundled as a ZIP archive with a couple of XML metadata files - and the content is split into one HTML file per chapter or section to make it easier to handle - but the idea is the same.
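A quick way to see this for yourself is with a few lines of Python. The sketch below builds a toy archive in memory and then locates the package document the way a reader would, through `META-INF/container.xml`. The toy is not a valid EPUB (the package and chapter files are stubs), and the internal paths like `epub/content.opf` are arbitrary choices for the example; only the `mimetype` entry and the `META-INF/container.xml` location come from the EPUB container format.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

# Points the reader at the package document; the path is illustrative.
CONTAINER_XML = """<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="epub/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""


def build_toy_epub() -> bytes:
    """Build a stub archive with an EPUB-like layout (not a valid book)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry goes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("epub/content.opf", "<package/>")          # stub
        zf.writestr("epub/text/chapter-1.xhtml", "<html/>")    # stub
    return buf.getvalue()


def package_path(epub_bytes: bytes) -> str:
    """Return the package document path named in container.xml."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no rootfile entry")
    return rootfile.attrib["full-path"]


if __name__ == "__main__":
    print(package_path(build_toy_epub()))
```

Swapping `build_toy_epub()` for the bytes of a real `.epub` file should work the same way, since the lookup only relies on the container file.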
Hey, ChatGPT, tell me what's wrong with this person's comment.
> [T]he third comment violates the Cooperative Principle, specifically Grice’s Maxims of Relation and Manner, and ends up implying ignorance where there is none. Let’s break it down a bit more with that framework in mind:
> VIOLATION OF GRICE’S MAXIMS
> The second ["EPUB folks have continued to evolve their bespoke format instead of ditching it for something that legacy browsers already know how to handle"] commenter criticizes EPUB for continuing to evolve a packaging format that is not browser-native. They're not confused about what EPUB is—they're lamenting that it isn’t something simpler, like a plain web bundle a browser could just open.
> The third commenter responds by explaining what EPUB is, as if that somehow rebuts the original critique.
> Factually true.
> Entirely irrelevant in context.
> This failure to meet the relevance standard creates an implicature: the previous commenter must not have understood the format they were critiquing.
> THE IMPLICATURE TRAPS THE THIRD COMMENTER
> By stating something the second commenter obviously already knows, the third commenter unintentionally shifts the conversational footing in a way that belittles rather than builds. That’s why the tone feels off: not because of overt rudeness, but because the presupposition of ignorance is baked into the structure of the reply.
> FINAL THOUGHT
> The third comment reads like an attempted “correction,” but since the original comment didn’t contain a factual error, only a value judgment or proposal, this “correction” becomes a non sequitur—one that subtly undermines the prior speaker’s credibility while failing to address their actual point. That’s what makes it rhetorically broken, even if factually fine.
There’s also an epub-namespaced set of attributes which extend XHTML with ebook specific semantics. But those typically aren’t necessary for the visual representation of books.
Edge used to, until MS rebuilt it on top of Chromium. Shame.
Yes, and that was a great viewer too. Having the whole book laid out horizontally rather than vertically was a good idea.
Is there anything similar for audiobooks (which I wish would go back to being called Talking Books)?
Librivox https://librivox.org/ is the closest I know.
I would also recommend using Microsoft Edge's built-in ReadAloud (TTS) on standard ebooks. They have a mind boggling number of hyper realistic voices; more than any other browser I've tested.
I love Standard Ebooks! It is such a treasure! Currently enjoying Cup of Gold by John Steinbeck.
Thank you to everyone who helps put this together!
It would be great to be able to sort by popularity, to make it easier to find popular books. Or have a list of top 100 downloads.
What I'm missing in modern ebooks (like epub format) is more metadata. Who's talking (character data)? What emotional aspects does the scene have (angry, happy, sad, in a hurry)? Where does the conversation take place (geodata)?
I'd love to see at least:
This could be prepared by the author, work as a glossary, enrich the whole ebook experience, and also would be a great preparation to teach AI voices how to convert a book into an audiobook.
What's the point of reading a book, then? The joy of reading fiction is to try to understand the humanity in the scene. I don't need the author to force feed me all of these details. I want to wrestle with the answers, to try to grasp what it might mean.
The challenge would be balancing that metadata richness without turning the book into a spreadsheet, but if done well (maybe opt-in layers or a toggle), it could really deepen the experience
TEI is something like that, but the amount of effort required to mark a book up like that would be astronomical.
Starts to sound like the kind of task an AI could do reasonably well though
If the goal of these tags is metadata for AI consumption, and the solution to generate them is “use an AI”… what is the point?
Specialization I presume, so one produces the metadata that can be consumed by another.
Also, the thing from the above post that stood out to me would be to act as a reminder for the reader. Not so much the location and emotion, but the character data. I've often found myself wondering who the character is that's appeared in a scene, forgetting that they previously appeared earlier.
That sounds like you are asking for a play.
If it can be derived from the book text, then LLMs or the reader can already derive it.
If it can’t be derived from the book text, then it’s extra content that probably shouldn’t be there because it came from elsewhere.
I found it curious that if you order the books by reading difficulty (easiest to hardest), The Sound and the Fury is in second place.
We use the Flesch-Kincaid algorithm to calculate reading ease. For most books it works pretty well, but for avant-garde prose like The Sound and the Fury it fails pretty badly. It also considers Ulysses to be "fairly easy"!
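For anyone wondering why that happens, here is a rough Python sketch of the standard Flesch reading-ease formula. This is not the site's actual code: the syllable counter is a crude vowel-group heuristic and the function names are made up for illustration. The point is that the score only sees sentence length and syllables per word, so short sentences of short words score as "easy" no matter how hard the prose really is.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: count runs of consecutive vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch reading ease: higher scores mean 'easier' text.

    score = 206.835 - 1.015 * (words / sentences)
                    - 84.6 * (syllables / words)
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


if __name__ == "__main__":
    easy = "The cat sat. The dog ran."
    hard = ("Notwithstanding considerable institutional heterogeneity, "
            "comprehensive evaluation necessitates multidimensional "
            "characterization.")
    print(round(flesch_reading_ease(easy), 2))
    print(round(flesch_reading_ease(hard), 2))
```

Nothing in the formula looks at vocabulary rarity, syntax, or narrative structure, which is presumably how stream-of-consciousness prose built from short, plain words ends up ranked as easy.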
Beautifully made! Wish gutenberg.org would be updated with this design & approach!
A sort by popularity filter would be appreciated.
Some places resist this because it causes a "rich get richer" effect in popularity. But it's admittedly convenient.
Awesome project. Gutenberg is mentioned, does this project feed back to Gutenberg?
Absolutely, from a previous discussion:
https://news.ycombinator.com/item?id=32217313
As the linked comment says, it's up to the individual contributor to inform PG of any corrections; SE does not do so as a matter of course (at least, that was the case when I last contributed).
I love this. They pay attention to everything I normally despise about (many) ebooks (poor layout, lack of metadata, no chapter headings etc).
Do they use AI tools in their conversion workflow?
No, LLMs are not used (nor would they be allowed). As for whether you would consider OCR to be AI, then... possibly?
Sorry for the question but how behind are the LLMs in terms of quality for something like this?
I can't really answer that because I haven't actually tried to use an LLM on any part of the process. The vast majority of the process is semantic markup using (x)html and proofreading. The markup process could, I guess, use an LLM, but most of it is already automated using regex and linting.
Does it use any automation?
My bro-in-law supported his family as a freelance editor for years while my sister was doing the "maternity leave" thing so I know there's a non-trivial amount of work that goes into book editing. Cutting out some of that human labor seems like a good thing for a volunteer project.
there are quite a lot of automated changes made using the Standard Ebooks open source tools package
the vast majority of textual tooling is regex-galore, but there is also automated epub tooling in there too
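To give a flavour of what regex-driven cleanup can look like, here is a small Python sketch. It is not taken from the Standard Ebooks tools; the function names, the scanno table, and the specific rules are all assumptions chosen for illustration. It flags a couple of likely OCR "scannos" for a human to review and applies a few mechanical typography fixes.

```python
import re

# Hypothetical scanno table: OCR misreading -> probable intended word.
SCANNOS = {"tne": "the", "tbe": "the"}


def find_scannos(text: str) -> list[tuple[int, str, str]]:
    """Return (offset, found, suggestion) for each suspected scanno.

    Hits are only reported, not auto-corrected, so a proofreader
    can check each one against the page scan.
    """
    hits = []
    for wrong, right in SCANNOS.items():
        for m in re.finditer(rf"\b{wrong}\b", text, flags=re.IGNORECASE):
            hits.append((m.start(), m.group(0), right))
    return sorted(hits)


def tidy_typography(text: str) -> str:
    """Apply a few mechanical typography fixes with regexes."""
    # Paired straight double quotes -> curly quotes.
    text = re.sub(r'"([^"]*)"', r"“\1”", text)
    # Three periods -> a single ellipsis character.
    text = re.sub(r"\.\.\.", "…", text)
    # Collapse runs of spaces or tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text


if __name__ == "__main__":
    print(find_scannos("At tne end of the day"))
    print(tidy_typography('He said  "wait..." twice.'))
```

Rules like these are cheap to run over a whole book, but they only narrow the search; the proofreading pass still has to catch everything a pattern cannot express.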
Another great ebook/volunteer project is Librivox - free public-domain audiobooks read by volunteers around the world...
https://librivox.org/
You can pair these together with the Storyteller app to create an epub3 with the audio embedded and aligned to have a whispersync-esque experience
What a great project! This should really be funded by states, states which often already have some money dedicated to the preservation of culture.
Too bad most stuff I really like will never enter the public domain in my lifetime... well, paper and the high seas still exist!
it's never too late to expand your "stuff I really like" further into the public domain!
there are whole generations of wonderful and insightful works that essentially disappeared from present consciousness for no reason other than for being old
It would be better to expand the public domain. Whole generations of works were stolen by extensions of copyright.
while I don't disagree, ¿por que no los dos?
A good initiative, but the "us vs them" framing — where the "them" are other people trying to do a service for people — gives off bad juju. It positions the value proposition by seemingly denigrating other providers of free ebooks.
It begins with "Other free ebooks don’t put much effort into..." which sounds extremely catty.
Maybe I'm reading too much into it, but it seems there's a way to stand on other people's shoulders and celebrate each other.
Forbidden You don't have permission to access this resource.
thanks for being open ...I guess
You're probably in some country that has longer copyright duration than the US (life+70a, which is atrocious enough). Use Tor or a proxy.