Sam Young: Non-OCRing a pdf

How about this for an interesting problem: can we un-OCR a pdf?

As part of my work, I provide external and independent vocational opinions for claimants who are taking the Accident Compensation Corporation (ACC) - New Zealand's national accident insurer, the government department which manages our insurance - to review. I will receive a huge file of claimant information from their advocate as a pdf.

The advocates will have been sent their client's entire file from ACC, which they will then forward to me. These ACC files have always been huge. I may receive a 1,000 or 2,000 page file, in no particular order, with material repeated several times. As a result, it can take HOURS to navigate through the dross to find the information you need to make a careful, systematic and considered opinion. I need to be quick, because the claimant has to pay for my time. And even worse, claimants only get a $500 reimbursement for my report. I could easily spend ten hours just reading the documentation: ...oops, already spent more than their allowance and we haven't even got to constructing any argument yet.

I used to simply OCR the ACC file so that I could find certain document titles, which I would then bookmark, to make file navigation easier. Did you notice the past tense there? Unfortunately, what has now started to happen is that the entire pdf file will have been doctored so that it becomes non-searchable. Worse, because it has been OCRed already, you can't re-OCR it. Grrr!

So when I have just received a 2,224-page file, and the claimant is on the bones of their bum, I need a quick fix so that I can 'un-OCR' the file, then re-OCR it so it becomes searchable once more.

Thankfully, OCRing and un-OCRing is something that lawyers like to do, to make discovery harder for the other side. More or less exactly what ACC has been doing ...behaviour that I would call vindictive, malicious, capricious, aggressive, and willfully time-wasting. It is even more reprehensible when you consider that claimants and their employers are tax-payers, and that ACC is the public servant which administers the redistribution of forward-paid taxes to people in need.

So, a tricksy, devious lawyer posted online how to make a mess of OCRing. All we need do is to reverse the process. Providing we have Adobe Acrobat, we are in business; as follows (Borstein, 15 April 2008):

Go to File | Print.
When the Print dialogue box opens, select Adobe pdf in the Printer name field.
Click the Advanced button at the bottom left of the Print dialogue box.
When the Advanced Print Setup dialogue box opens, tick the "Print as Image" option; select 600 dots per inch (dpi) for your new pdf (I would go as high as you can so that it will re-OCR well later).
Click the OK button to go back to the Print dialogue box.
Click OK again to print the existing pdf to a new pdf, and save your un-OCRed file when prompted.
Once processed, open the new file.
Click OCR to reprocess.
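
For anyone without Adobe Acrobat, here is a rough sketch of a possible alternative. It is not Borstein's method and I have not tested it on an ACC file. It assumes the open-source OCRmyPDF tool is installed, and relies on its --force-ocr option, which rasterises each page and then OCRs it again in one pass. The function names build_command and reocr are my own, for illustration only.

```python
import shutil
import subprocess
import sys


def build_command(source_pdf, target_pdf):
    """Build the OCRmyPDF call; --force-ocr rasterises every page before OCR."""
    return ["ocrmypdf", "--force-ocr", source_pdf, target_pdf]


def reocr(source_pdf, target_pdf):
    """Run OCRmyPDF, failing early if it is not installed on this machine."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on this machine")
    subprocess.run(build_command(source_pdf, target_pdf), check=True)


# Usage (assumed file names): python reocr.py acc_file.pdf acc_file_searchable.pdf
if __name__ == "__main__" and len(sys.argv) == 3:
    reocr(sys.argv[1], sys.argv[2])
```

As with the Acrobat route, expect a long wait on a file with thousands of pages.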

NB: the printing process will take QUITE a while, particularly if your dpi is very high, or if you have many pages. Documents with over 2000 pages may take 4 or 5 hours. If the print fails, chop the document into smaller pieces and print each piece separately. You may hardly be able to notice that there is any processing going on, except that the Adobe Acrobat window will not take focus. If you minimise any windows you have open, you may note a wee message box like the one illustrating this post, which will tell you how far through the print process you are.
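
If you do need to chop the file up, a small sketch like the one below can work out the page ranges for you, along with a very rough time guess for each piece. The chunk size of 500 pages is an arbitrary assumption of mine, and the time estimate simply scales the "2000 pages takes 4 or 5 hours" figure in a straight line, which your machine may not honour.

```python
def page_ranges(total_pages, chunk_size):
    """Return 1-based, inclusive (first, last) page ranges covering the file."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def estimated_hours(pages, pages_per_batch=2000, hours_per_batch=(4, 5)):
    """Scale the rough '2000 pages takes 4 or 5 hours' figure linearly."""
    low, high = hours_per_batch
    return (pages / pages_per_batch * low, pages / pages_per_batch * high)


if __name__ == "__main__":
    # 2224 pages is the file size mentioned in this post; 500 is an assumed chunk size.
    for first, last in page_ranges(2224, 500):
        low, high = estimated_hours(last - first + 1)
        print(f"pages {first}-{last}: roughly {low:.1f} to {high:.1f} hours")
```

Each range can then be typed into the Pages field of the Print dialogue box, one print job per piece.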

Be patient, and you will be rewarded!

Sam

Reference: Borstein, R. (15 April 2008). Creating a Non-Searchable PDF from Office Documents. Retrieved 7 April 2018 from http://blogs.adobe.com/acrolaw/2008/04/creating_a_nonsearchable_pdf_fro/

Friday, 1 June 2018
