Unoffical empeg BBS

#357679 - 22/02/2013 19:44 Creating an eBook
tanstaafl.
carpal tunnel

Registered: 08/07/1999
Posts: 5549
Loc: Ajijic, Mexico
Does anybody here have experience or suggestions related to creating eBooks?

I have a few books I wish to convert from old paperbacks to eBooks suitable for reading on a Kindle.

Here's how I have been doing it:

1) I take my jigsaw and cut the binding off the paperback.
2) I run the loose pages through my scanner: 30 sheets per minute (double-sided!) at 300 dpi.
3) I save the scanned images into a single PDF file.
4) I run Adobe Acrobat's OCR module on the PDF file.
5) I save the OCR'd PDF file as an MS-Word editable RTF file.
6) I use MS-Word to strip out headers and footers, i.e., page numbers and Author/Title on each page using global search and replace, and then format chapter headings to my liking by putting in forced page breaks and bold type.
7) I save the file, and import it into Calibre where I can convert it into *.mobi format for the Kindle.

While this seems very complicated and involved, the whole process, including the "carpentry" :), takes little more than an hour.
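The header and footer cleanup in step 6 could also be done on plain text exported from the OCR step. Below is a minimal sketch, assuming the running heads are known in advance and each one sits on its own line, possibly with a page number next to it. The function name `strip_running_heads` and the sample lines are made up for illustration; this is not something MS-Word or Calibre does for you.

```python
import re

def strip_running_heads(lines, running_heads):
    """Drop lines that are bare page numbers or known running heads.

    running_heads: the author/title strings printed at the top of pages
    (an assumption: you supply these yourself after looking at the scan).
    """
    heads = {h.strip().lower() for h in running_heads}
    kept = []
    for line in lines:
        stripped = line.strip()
        if re.fullmatch(r"\d{1,4}", stripped):
            continue  # a line that is only a page number
        # Remove a page number before or after the text, then compare.
        core = re.sub(r"^\d{1,4}\s+|\s+\d{1,4}$", "", stripped).lower()
        if core in heads:
            continue  # a running head such as "EXAMPLE TITLE 12"
        kept.append(line)
    return kept

# Made-up sample lines, not taken from any real book.
sample = [
    "EXAMPLE TITLE 12",
    "He looked at the sea.",
    "13",
    "A. N. Author",
    "The boat was ready.",
]
cleaned = strip_running_heads(sample, ["Example Title", "A. N. Author"])
```

Body text that happens to match a running head exactly would also be dropped, so a quick look at what was removed is still worthwhile.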

However… (you knew there'd be a 'however', didn't you? :) )

The OCR and saving as RTF is less than robust. In particular, Adobe's OCR does an inadequate job of dealing with ligatures ("fi" may be read as "h", for example) and does not do well parsing dialog. For instance:

"If you set the cat on fire, he won't like it."
"Well, I don't have any matches so I can't do it."

Will very likely come out as

"If you set the cat on fire, he won't like it." "Well, I don't have any matches so I can't do it."

I can fix these with a global search for quote-space-quote, replaced with quote-carriagereturn-tab-quote, but many other deficiencies are harder to deal with. Thoroughly proofreading and correcting the scanning + OCR errors (not to mention the errors in the original document!) can take six to ten hours.
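The quote-space-quote replacement described above can be sketched as a regular expression, for anyone working on plain text rather than in MS-Word. This assumes a closing quote followed by spaces and an opening quote always marks a change of speaker; the name `split_dialog` is just for illustration.

```python
import re

# Closing quote, then spaces or tabs, then opening quote.
# Handles straight quotes and curly quotes (U+201C / U+201D).
RUN_ON_DIALOG = re.compile(r'(["\u201d])[ \t]+(["\u201c])')

def split_dialog(text):
    """Put each run-on line of dialog on its own tab-indented line."""
    return RUN_ON_DIALOG.sub(r'\1\n\t\2', text)

merged = ('"If you set the cat on fire, he won\'t like it." '
          '"Well, I don\'t have any matches so I can\'t do it."')
fixed = split_dialog(merged)
```

Two quoted phrases that sit side by side inside one sentence would be split as well, so this is a first pass rather than a substitute for proofreading.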

I only have a few books I wish to convert, and don't wish to spend money on some powerful, difficult to learn and use dedicated bookmaker program. Instead, I am looking for suggestions that might make what I am doing more efficient and transparent.

Ideas?

tanstaafl.
_________________________
"There Ain't No Such Thing As A Free Lunch"

#357682 - 22/02/2013 20:02 Re: Creating an eBook [Re: tanstaafl.]
drakino
carpal tunnel

Registered: 08/06/1999
Posts: 7868
No experience in the way you are doing it, but it sounds like you need an OCR solution that preserves layout. Using key words in that regard may help you find something.

If you also want a project, this site looks to be a good resource on building a book scanning rig: http://www.diybookscanner.org . They also have some forums, and some of the software discussion there may interest you.

All of my ebook experience has been using already digitized text sources and using Pages or iBooks Author to save to ePub on a Mac. I know your anti-Mac views, but I figured I'd mention those two apps for others here.

#357683 - 22/02/2013 21:14 Re: Creating an eBook [Re: drakino]
tanstaafl.
carpal tunnel

Registered: 08/07/1999
Posts: 5549
Loc: Ajijic, Mexico
Originally Posted By: drakino
If you also want a project, this site looks to be a good resource on building a book scanning rig: http://www.diybookscanner.org .
Naahhh... been there (to the website), done that. :)

Their rig can do 14 pages per minute (20 pages with dual cameras) and then requires manually interleaving the left/right filenames before importing into a PDF. My "rig" does 60 pages per minute and saves the completed scan into the PDF in one step. Then they're using Adobe Acrobat 9 for OCR, just like I'm doing.
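For what it's worth, the manual interleaving step on a dual-camera rig can be scripted. Here is a minimal sketch, assuming each camera writes zero-padded filenames that sort in shooting order and that the left-hand page comes first in reading order. The filenames and the function name `interleave_pages` are hypothetical, not taken from the diybookscanner software.

```python
from itertools import chain

def interleave_pages(left_files, right_files):
    """Merge left-camera and right-camera images into reading order."""
    if len(left_files) != len(right_files):
        raise ValueError("left and right cameras produced different page counts")
    # Sort each side, pair them up spread by spread, then flatten the pairs.
    pairs = zip(sorted(left_files), sorted(right_files))
    return list(chain.from_iterable(pairs))

# Hypothetical filenames for a two-spread scan.
ordered = interleave_pages(
    ["left_0002.jpg", "left_0001.jpg"],
    ["right_0001.jpg", "right_0002.jpg"],
)
```

The resulting list can then be handed to whatever tool builds the PDF.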

Of course, their setup has the advantage of not destroying the original book with a jigsaw! I'm not converting rare or valuable books, just books that aren't available in eBook format at the moment. Example: "The Shipkiller" by Justin Scott.

I'll scan through their forum, though, and see if I can pick up any pointers.

Thanks for reminding me of that site.

#357696 - 24/02/2013 04:44 Re: Creating an eBook [Re: tanstaafl.]
tanstaafl.
carpal tunnel

Registered: 08/07/1999
Posts: 5549
Loc: Ajijic, Mexico
Originally Posted By: tanstaafl.
I'll scan through their forum, though, and see if I can pick up any pointers.
Okay, I did pick up one pointer that paid off. The forum members seem pretty fond of the ABBYY program for OCR, and since it's on sale for half price through the end of February, I sprang $30 for it. At a rough approximation, it cut my OCR errors by a factor of 10 compared with the OCR built into Adobe Acrobat.

Reading through the forum, looking at the questions being asked, I don't think there's much there that they can teach me. Well, that's a bit unfair: they are focused more on getting their (IMHO) somewhat Rube Goldberg machines to work properly, so they're more into the hardware end than the software end. They're doing non-destructive eBook conversions, whereas I am destroying my books, which greatly simplifies the process.

I can highly recommend the ABBYY program for OCR. It does what I call "Intelligent OCR" in that it has dictionaries for about 50 languages built into it, and rather than converting letter by letter, it does it word by word, so that if it sees "floor" with the troublesome "fl" ligature, it doesn't make it into "hoor" or "Aoor".
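To illustrate the word-by-word idea, here is a toy sketch of dictionary-aided ligature repair. It is not ABBYY's actual algorithm, which isn't described here. The substitution table and the tiny word list are assumptions based only on the "hoor" / "Aoor" examples above.

```python
# Assumed misreadings: the character the OCR produced, mapped to the
# ligatures it may have replaced. Extend this after inspecting real output.
LIGATURE_FIXES = {"h": ("fi", "fl"), "A": ("fl",)}

def repair_word(word, dictionary):
    """Return a dictionary word obtained by undoing one ligature misread.

    Words already in the dictionary, or with no valid repair, are returned
    unchanged.
    """
    if word.lower() in dictionary:
        return word
    for i, ch in enumerate(word):
        for ligature in LIGATURE_FIXES.get(ch, ()):
            candidate = word[:i] + ligature + word[i + 1:]
            if candidate.lower() in dictionary:
                return candidate
    return word

# A stand-in for a real word list.
WORDS = {"floor", "fire", "the", "he"}
repaired = [repair_word(w, WORDS) for w in ["hoor", "Aoor", "hre", "the", "xyz"]]
```

A misread that happens to be a real word in its own right slips through, which is one reason proofreading is still needed.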

#357698 - 24/02/2013 10:01 Re: Creating an eBook [Re: tanstaafl.]
andy
carpal tunnel

Registered: 10/06/1999
Posts: 5916
Loc: Wivenhoe, Essex, UK
ABBYY is what is included as part of the software package for my Fujitsu ScanSnap document scanner that I recently started using.

I can't recommend this ScanSnap enough: many hundreds of pages later, not a single double feed. It takes only a few seconds to scan both sides of a multi-page document, OCR it, and import it into Evernote. Highly recommended.
_________________________
Remind me to change my signature to something more interesting someday

#357699 - 24/02/2013 12:25 Re: Creating an eBook [Re: andy]
larry818
old hand

Registered: 01/10/2002
Posts: 1039
Loc: Fullerton, Calif.
Which Fujitsu ScanSnap do you have and like? All their scanners are called ScanSnap. I'm needing something with a feeder to replace my Agfa SnapScan.

#357700 - 24/02/2013 12:29 Re: Creating an eBook [Re: larry818]
larry818
old hand

Registered: 01/10/2002
Posts: 1039
Loc: Fullerton, Calif.
Various things work better or worse in different versions of Acrobat. For example, the measurement tool doesn't work at all in any version other than 8. I use that a lot, so I'm stuck on 8.

The OCR in version 8 works very well, you might give that a try.

BTW, the last version of Acrobat that had a decent print-to-PDF utility was 4. I still use 4 for printing.

#357701 - 24/02/2013 12:30 Re: Creating an eBook [Re: andy]
K447
old hand

Registered: 29/05/2002
Posts: 799
Loc: near Toronto, Ontario, Canada
I can also recommend the Fujitsu ScanSnap product. I used an S500 unit to scan multiple instruction manuals, each maybe 1 to 2 cm thick, double sided. I trimmed off the glued binding, then fed the pages through in chunks. Sometimes I would slip in the next stack as it was pulling the last page of the prior stack. Even if it paused after exhausting the current stack, it would carry on when I added more pages, with all the pages going into a single output PDF (configurable).

Very few OCR errors, excellent rendering of the diagrams and charts, direct to searchable PDF output. One pass does both sides, colour, grey scale, or B&W. I opted for higher DPI scan quality rather than most compact output file, and I suspect that also helped the OCR accuracy.

Fairly quick overall.

I have yet to unpack my newer Fujitsu S1500 model, but I expect it to be as good or better, both software and hardware.

#357702 - 24/02/2013 13:12 Re: Creating an eBook [Re: larry818]
andy
carpal tunnel

Registered: 10/06/1999
Posts: 5916
Loc: Wivenhoe, Essex, UK
Originally Posted By: larry818
Which Fujitsu ScanSnap do you have and like? All their scanners are called ScanSnap. I'm needing something with a feeder to replace my Agfa SnapScan.


The iX500, not cheap at around £400. However, it has already saved me time, and even if it only saves me a few hours a year of frantic searching for lost paperwork, it will pay for itself in a couple of years.

(I'm using it on the Mac; I don't know how their software package for Windows compares.)


Edited by andy (24/02/2013 13:14)