Protected PDFs - A Rant and Solution

by Aaron Smith on Tuesday, September 4 2007

Before I begin, let me make something perfectly clear: GW Micro does not condone the act of hacking or circumventing security restrictions explicitly applied to protect content in Adobe PDF files. If an author set a password on a PDF document, they probably did so for a reason, and we're not in the business of defrauding those trying to safeguard their livelihood.

With that in mind, on to my rant. We tout support for protected PDFs in Window-Eyes, so what the heck am I going on about? PDF protection isn't as simple as on or off. Using Adobe Acrobat, when an author makes the conscious decision to protect a PDF document, they can choose to add a password, restrict editing and printing, restrict copying images, text, and other content, and (hold on to your seats) restrict text access for screen readers. Yep, you heard that last one correctly. Adobe provides authors with the ability (pun intended -- you'll know why in a second) to explicitly deny access to assistive technology. This aberration is clearly marked with a check box labeled, "Enable text access for screen reader devices for the visually impaired." I applaud Adobe for taking the lead in creating accessible electronic documentation by providing access to PDF documents, but I will never understand the inclusion of an option that lets someone decide whether or not accessibility should exist. That check box should never have been created, and it needs to be removed. Accessibility should not be decided by the click of a mouse button from the hand of a sighted person who doesn't have the first clue as to why a blind person needs access to a PDF in the first place. Accessibility should not be optional, and that scenario is precisely why I have no objections to providing a solution to access restricted information, assuming that you legally own the PDFs that you need access to.

Let me make more perfectly clear what I previously made perfectly clear: we are not looking to break the security model of PDF files. We’re not talking about removing passwords, or enabling the ability to modify the text of a PDF. We’re not out to let you print whenever you want to print, copy whenever you want to copy, or anything along those lines. Protected PDFs are a decent way to protect content, just like password protected Word documents, password protected ZIP files, secure web pages, emails, and so on. We are highly sensitive to the need for security, and even implement our own security models wherever we can. We are instead simply offering a solution that provides access to text that has been unduly restricted, most likely due to the ignorance of the individual who enabled the restrictive security settings. And, once the process is all said and done, it’s really no different than printing a PDF, scanning the result, and OCR’ing it into your favorite word processor. In fact, if the printing security restriction has been enabled, this trick won’t work anyway.

I think I’ve disclaimed enough, so let’s move on. Although there are various means to access protected PDF text (many of them quite actionable if you don't legally own the PDF in question), I'm going to discuss one that uses the Microsoft Office Document Imaging feature available with Microsoft Office 2003 and up. The basic gist of the process involves printing a PDF to the Microsoft Office Document Image Writer, and then using the OCR features of the Microsoft Office Document Imaging application to provide the text to Microsoft Word.

First, make sure you have Microsoft Office 2003 installed along with the Microsoft Office Document Imaging feature (which, I believe, is installed by default, at least with the Professional Edition of Microsoft Office 2003). Next, make sure you have either Adobe Reader or Adobe Acrobat installed, which you would need anyway to read non-protected PDF files. Finally, you'll need the PDF file that you can't read through normal Adobe means.

Here’s the step by step:

  1. Open the restricted PDF file.
  2. Press CTRL-P to print.
  3. Select the Microsoft Office Document Image Writer from the printer name combo box, and press ENTER.
  4. Enter a file name to print to. The extension should be .MDI (for Microsoft Document Imaging Format). Once the document has printed, close the PDF file.
  5. Open the Microsoft Office Document Imaging utility (usually located in the Start Menu, under Programs, Microsoft Office, Microsoft Office Tools).
  6. Press CTRL-O to bring up the Open dialog.
  7. Type in the path and file name of the MDI you saved in step 4, and press ENTER.
  8. Press ALT-T for Tools.
  9. Arrow down to Send Text to Word, and press ENTER.
  10. Press ENTER to begin the conversion with the default options. If you’re presented with a dialog stating, “You must re-run OCR before performing this operation,” simply confirm by selecting the OK button.
    The conversion process will begin. You can use the Window-Eyes progress hot key (CTRL-INS-B by default) to interrogate the progress.
  11. Once the conversion is complete, Microsoft Word will be open (for me, it opened in the background) with the text of the PDF file available for your perusal. Once you locate the Microsoft Word window, you can close the Microsoft Office Document Imaging utility.

There are a few things to note about this process.

  • OCR results are only as good as the OCR engine, and OCR is never a complete replacement for the original text. In other words, don’t expect perfect text accuracy.
  • This process does not remove password protection. If you have a password protected PDF, you will still need to know the password to perform this task.
  • If a PDF author has restricted copying text, this method will enable the OCR’d text to be copied. Acrobat itself warns about this when you enable the copying restriction: “All Adobe products enforce the restrictions set by the Permissions Password. However, not all third-party products fully support and respect these settings. Recipients using such third-party products might be able to bypass some of the restrictions you have set.”
  • If the printing security restriction has been enabled, you cannot print the PDF, so this method will not work.
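
The last two notes come down to which permission flags the PDF author left enabled. As a rough sketch only (it is not part of the procedure above), the PDF specification stores those flags as a single integer, the /P entry of the file's encryption dictionary, where bit 3 governs printing, bit 5 governs copying, and bit 10 governs text extraction for accessibility. The function names below are hypothetical, and reading the /P value out of a real file requires a PDF library that is not shown here.

```python
# Hypothetical helpers, not from the original post. They decode the /P
# permission integer defined by the PDF specification (bits are numbered
# from 1, starting at the low-order bit).

PRINT = 1 << 2          # bit 3: print the document
COPY = 1 << 4           # bit 5: copy or extract text and graphics
ACCESSIBILITY = 1 << 9  # bit 10: extract text for screen readers


def describe_permissions(p):
    """Return the three permissions discussed in this post as booleans."""
    flags = p & 0xFFFFFFFF  # /P is a signed 32-bit value; view it unsigned
    return {
        "print": bool(flags & PRINT),
        "copy": bool(flags & COPY),
        "screen_reader": bool(flags & ACCESSIBILITY),
    }


def print_and_ocr_is_usable(p):
    """True when screen reader access is denied but printing is allowed,
    which is the situation the print-then-OCR method is meant for."""
    perms = describe_permissions(p)
    return perms["print"] and not perms["screen_reader"]
```

For example, `print_and_ocr_is_usable(4)` returns True (printing allowed, screen reader access denied), while a value of 0 returns False because the document cannot be printed at all.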

Although I’ve been discussing this method for use with restricted PDFs, it will also work fairly well with PDFs that contain nothing but images. If you don’t have access to another utility that boasts PDF OCR capabilities, this may be a good solution for you.

For example, I took a screen shot of a web page, and created a PDF out of it; the PDF contained nothing but an image of what was on my screen. I ran it through this process, and for the most part, the text on the web page was readable.

PDF files, in general, are very accessible despite their enigmatic stigma. Adobe even provides its own methods of tweaking accessibility settings (for example, changing reading order or overriding tagged order). There’s even an Accessibility Quick Check in Adobe Reader (with more detailed Accessibility tools in the full Adobe Acrobat) for examining documents, and reporting problems to the PDF author.

Now you have an additional resource when you encounter a not-so-friendly PDF file that doesn’t live up to good accessibility standards.

Do you have any other tips for reading PDF files?

