gbDocumentCapture - Discussion

  • gbDocumentCapture - Discussion

    This thread discusses the gbDocumentCapture app I posted in the Source Code forum.

    It allows you to capture the text from all pages in a document. It does so by capturing an image of the document found in the middle of the Desktop, then using the free Tesseract library to extract the text from the image. gbDocumentCapture walks through the document, capturing and merging the text from each page.
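For anyone who wants to see the shape of that capture/extract/merge loop, here is a minimal sketch in Python. It is not the gbDocumentCapture source. The three callables (`capture_page`, `extract_text`, `next_page`) are hypothetical stand-ins for the screen grab, the Tesseract call, and the page-turn keystroke, and stopping when the captured image no longer changes is my assumption about how to detect the last page.

```python
from typing import Callable, List, Optional


def capture_document(
    capture_page: Callable[[], bytes],
    extract_text: Callable[[bytes], str],
    next_page: Callable[[], None],
    max_pages: int,
) -> str:
    """Capture, OCR, and merge pages until the image stops changing."""
    pages: List[str] = []
    previous_image: Optional[bytes] = None
    for _ in range(max_pages):
        image = capture_page()
        if image == previous_image:
            # Assumed end-of-document signal: turning the page changed nothing.
            break
        pages.append(extract_text(image).strip())
        previous_image = image
        next_page()
    # Merge the per-page text, separated by a blank line.
    return "\n\n".join(pages)
```

Passing the three steps in as functions keeps the loop independent of any particular screen-capture or OCR library, so each piece can be swapped or tested on its own.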

    I like how this turned out. I'll put some more effort into giving it a much better interface along with some other capabilities that come to mind.

    Folks who want to convert a document, such as a Kindle book, into a text file will find this useful - particularly for books with DRM protection.

  • #2
    Some apps, such as Kindle for PC, do not support a Ctrl-A shortcut. Instead, you have to manually select the page content before using Ctrl-C to copy the selection. Also, Kindle displays only one page of content at a time, meaning you have to select text one page at a time.

    So trying to manually save a text copy of the entire document would be very slow. With gbDocumentCapture, the capture of all pages in the document is automated. It's not a particularly fast process - around 2s per page - but because it is automated you can walk away and let it work unattended.


    • #3
      Oh ... and for those who wonder about the ethics of converting a Kindle book to text ... I reached out a couple of years ago with that question and they told me that they were ok with it as long as I used the extracted text for personal use only. I still have the email from them.

      There are a number of apps out there which do conversion of Kindle books to multiple formats, so I don't seem to be doing anything that Amazon polices to any extent.

      I will admit that I don't plan to ask them again


      • #4
        Hmmm... after all that happy feeling about how cool it was to use Tesseract to extract text, I wondered if I couldn't simply use %WM_GetText?

        Off to give it a try ...

        ... added ... in first tests using %WM_GetText and %EM_GetTextEX, neither the Notepad nor the Kindle app returned the text. Time for bed ... will try something more tomorrow.


        • #5
          So trying to manually save a text copy of the entire document would be very slow.
          Kindle is not a document management tool. If we could all read more than one page at a time, it might have been.
          Michael Mattias
          Tal Systems (retired)
          Port Washington WI USA
          http://www.talsystems.com


          • #6
            Also, some books on Kindle limit the total number of copy-and-paste operations permitted, so copying a book page by page is not even possible.


            • #7
              Howdy, James!

              Yes, all the more reason my approach has merit - it bypasses the limitations that the Kindle app presents. But I want much better speed.

              I've not done any testing on speeding up the process. It worked for me at 2s per page. I'll test it at 0.5, 1.0 and 1.5. I could even add a timer to get an actual result.
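A timer for that kind of test could look something like the sketch below. It is only an illustration: `process_page` is a hypothetical function that handles one page using the given delay, and the harness reports the average wall-clock time per page for each delay setting.

```python
import time
from typing import Callable, Dict, Iterable


def time_delays(
    process_page: Callable[[float], None],
    delays: Iterable[float],
    pages: int = 3,
) -> Dict[float, float]:
    """Return the measured average seconds per page for each delay setting."""
    results: Dict[float, float] = {}
    for delay in delays:
        start = time.perf_counter()
        for _ in range(pages):
            process_page(delay)  # hypothetical: capture + extract one page
        results[delay] = (time.perf_counter() - start) / pages
    return results
```

Calling `time_delays(process_page, [0.5, 1.0, 1.5])` would then show whether the shorter delays actually shorten the per-page time or whether the extraction step dominates regardless.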

              There might be a way to enlist the help of threads. I'll look into that as well.
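One way threads might help, sketched in Python rather than taken from the app: hand each captured image to a small pool of worker threads so several pages are being extracted at once. `extract_text` is again a hypothetical stand-in for the Tesseract call. This only pays off if that call releases Python's global interpreter lock, for example by running Tesseract as an external process, which is an assumption here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def extract_in_parallel(
    images: Iterable[bytes],
    extract_text: Callable[[bytes], str],
    workers: int = 4,
) -> List[str]:
    """Run the extraction step on several page images concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, so pages stay in sequence
        # even when a later page finishes before an earlier one.
        return list(pool.map(extract_text, images))
```

The page order matters for the merged text, which is why the sketch uses `map` instead of collecting results as they complete.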


              • #8
                Kindle is about distribution with royalties and copyright protection.

                Discovering how to make copies will simply force Amazon to a more secure version.

                Cheers,
                Dale


                • #9
                  Howdy, Dale!

                  Yes, you are right. As best I can tell, Amazon has simply increased their encryption each time companies have caught up to the latest encryption method.

                  I would guess that Kindle doesn't worry all that much about the issue because 99%+ of their customers probably don't have any interest in doing anything except buying, downloading, and reading the books in the Kindle app. That's how I do it.

                  Those folks who simply want to read their purchased book in another reader, such as by using Calibre, don't seem to be a target of Amazon, even though Calibre has a page that says there are roughly 3M active installations of Calibre.

                  I read also that other countries don't all utilize DRM as Amazon does. Yet, I'd assume Amazon sells worldwide. I wonder how book DRM is handled in those cases?


                  • #10
                    On my PC, with a small window of text, a Capture/Extraction took 0.7s. On a large window of text, a Capture/Extraction took 1.4s.

                    Further tests showed that the extraction step using Tesseract takes over 95% of the total Capture/Extraction time.
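To put those numbers in perspective, here is a small back-of-the-envelope calculation. The 300-page book is a made-up example, and 0.95 is used as the extraction share because the measurement says "over 95%", so the non-extraction figure is an upper bound.

```python
from typing import Tuple


def estimate_seconds(
    pages: int, seconds_per_page: float, ocr_share: float = 0.95
) -> Tuple[float, float, float]:
    """Return (total, extraction, everything else) in seconds."""
    total = pages * seconds_per_page
    ocr = total * ocr_share
    return total, ocr, total - ocr


# Hypothetical 300-page book at the small-window and large-window timings.
for per_page in (0.7, 1.4):
    total, ocr, other = estimate_seconds(300, per_page)
    print(f"{per_page}s/page: total {total:.0f}s, extraction {ocr:.0f}s, other {other:.0f}s")
```

At these rates a 300-page book would take roughly 3.5 to 7 minutes, almost all of it in the extraction step, so speeding up the capture step alone would change very little.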


                    • #11
                      I read also that other countries don't all utilize DRM as Amazon does. Yet, I'd assume Amazon sells worldwide. I wonder how book DRM is handled in those cases?
                      If Kindle's protection is strong enough, it does not matter what other countries do about DRM. Books can be bought, and Amazon has a reasonable expectation of no copies (even if they can't prosecute a copier).

                      added - OCR would be slower, and is a "last ditch" way to copy.

                      Cheers,
                      Dale
