Nov 13, 2009

Trying to go paperless

Almost since the inception of personal computing, the paperless office has been an unrealized promise. I’ve been wanting to move in that direction for many years, but it’s still not easy to do. Today, my office and basement are filled to capacity with documents, scientific papers, and books. I’m out of space, and managing everything is killing my productivity. I throw away and give away what I can, but that only makes a small dent in the problem. Some stuff I’ll always have to keep as paper, sadly. The remaining stuff, however, needs to move to the computer.

The rest of this article covers the software and hardware I’m using to at least start this transition. I’m primarily Macintosh based, so much of this will relate to Macintosh software. Feel free to contact me to suggest Windows and Linux alternatives for the same tasks. The following are just some quick thoughts…

Software

My software requirements appear relatively simple. I need to scan and OCR documents and convert them to PDF. Once I have the PDF, I need to be able to organize the documents quickly and with a high degree of flexibility. Simple, right?

No, not simple at all. I don’t have any application that fits the complete bill for this work. Frustrating, to say the least. Part of the problem, in reality, is me. Just as the IRS struggles to write a simple tax code, my document management reality is too complicated for any one application to cover. Unfortunate, but in the end, I hope it will not be too bad.

Here are quick reviews of some applications I’m trying out in my work environment.

Evernote (www.evernote.com)

At least on the Macintosh (I don’t know about Linux and Windows), notebook applications are flooding the market: DevonThink, Yojimbo, Evernote, etc. The idea behind all of these applications is to capture all those little pieces of information you gather each day and put them into an organized system that helps you find them again. Evernote is nice because it’s cross platform, stores data in the cloud (if you want), and is free for certain usage levels. Originally, I was going to make Evernote my PDF document solution, but that experiment was a failure. Evernote bogged down quickly with a large number of PDFs and simply won’t import large documents (25 MB or larger). Even worse, there is no simple way to export the PDFs once they’re in the system; you have to export them in an HTML-style folder system and then pull the PDFs out from that. Yuk. You can find the PDFs within Evernote’s file system under ~/Library/Application Support/Evernote, but the file names are encoded and fairly meaningless to a human reader. That said, this system is fine for most text and HTML documents. Collecting tidbits of information together is still a strong use case for Evernote. So, Evernote will likely remain a tool I continue to use, but it will not be in my paperless office workflow.

Endnote (www.endnote.com)

Endnote is a reference manager that has been around for a long time. It’s a powerful database solution that’s a great tool for writing scientific papers. For a while now, Endnote has included the ability to embed a PDF of a paper along with its reference. As it turned out, however, this process is clumsy, slow, and annoying. Like Evernote, Endnote renames the PDFs as they are included in the database. Great for the database; not great for human readers. Since the goal of Endnote is academic, it’s not good for many day-to-day PDFs either. So, Endnote won’t do for managing PDFs, though I will continue to use it to manage references and write papers.

Papers (www.mekentosj.com)

Papers is an example of what a modern Endnote could have been. Papers is a PDF document-centric reference manager. For citations, it falls far short of Endnote and has a lot of problems. For example, much of the clever interface is too clever for its own good. Just try to get a Jr. on the end of an author’s name! Not to mention, it essentially thinks all PDFs are journal articles and does not contain enough fields to properly cite a reference (this could be my noobie-ness with the app, though). It’s a bit buggy as well. But despite these shortcomings, it’s a powerful program. In particular, you can use online resources, such as Google Scholar and Google Books, to identify the PDFs and quickly give them a citation. Viewing PDFs is quite nice. And just to make you more mobile, there is an iPhone app as well. So, I suspect I’ll spend a great deal of time in this app, but I view Papers as a companion app to Endnote rather than a replacement.

Adobe Acrobat (www.adobe.com)

The main reason I have the pro version of Acrobat these days is for its OCR capabilities. I can scan a document to PDF and quickly OCR it right within Acrobat (and make sure that I keep the original images and hide the OCR text). OCR is not optional for a paperless office; it’s required. Modern operating systems like Snow Leopard index the contents of files, and that index is what makes the files easy to find.

Yep (www.ironicsoftware.com)

This application is specifically for PDF management. It’s designed to avoid dealing with the directory structure directly and supports tagging using their OpenMeta standard. The idea here is that the combination of Spotlight searching and tags provides enough resources to quickly find the material you need. I’m still trying out the demo version of this app, but I plan to purchase their bundle, which goes beyond PDFs. Part of the advantage of Yep is that it doesn’t take control of your PDFs; it leaves them where they are. Other solutions, including Papers and Evernote above, move your files into their own system. Papers at least mimics a system I would have used for sorting papers (based on publication years and author names). However, Yep will certainly take some getting used to. It also plays nicely with Papers, since it can actually search within the Papers folder structure. The latest version of Yep also supports Microsoft Word and Pages files along with PDFs. Yep even supports scanning, although it doesn’t OCR.

The obvious question is why I would get both Yep and Papers. The answer is that my PDF documents play different roles. Papers handles research papers; Yep handles everything else that’s a PDF. This difference is critical. Scientific papers have a large set of metadata associated with them: their citations. Yep doesn’t support this at all, whereas Papers does. Plus, Papers is partially designed for reading the PDFs as well - it has a full screen mode, for example. Yep is about management and passes off the viewing to other applications. So, together, the apps do nicely…

Xcode (www.apple.com)

Apple’s software development environment has an API for working with PDFs (probably what Papers and Yep use). I use this API for something very specific. I commonly scan double-sided, multi-page documents using a document feeder. However, I can only scan one side at a time. As a result, if I flip the stack and scan the back sides, the pages are seriously out of order. So, I use the API to rearrange the pages automatically. Otherwise, I’d have to manually sort the pages in Acrobat or some other app. I did have an Acrobat script that used to do this work, but this solution seems a bit faster.
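The reordering itself is simple enough to sketch. The Python below is only an illustration of the idea, not the Xcode code described above, and it assumes one particular scanning routine: all the front sides are scanned first, then the stack is flipped and fed again, so the back sides arrive in reverse order. The function name duplex_page_order and its backs_reversed parameter are made up for this example.

```python
def duplex_page_order(total_pages, backs_reversed=True):
    """Return 0-based indices into the scanned file, in reading order.

    Assumed scan layout: the first half of the file holds the front
    sides in order, and the second half holds the back sides.
    """
    if total_pages % 2 != 0:
        raise ValueError("expected one front and one back per sheet")
    sheets = total_pages // 2
    order = []
    for sheet in range(sheets):
        front = sheet
        if backs_reversed:
            # Flipping the whole stack puts the last sheet's back first.
            back = total_pages - 1 - sheet
        else:
            # Backs were fed in the same order as the fronts.
            back = sheets + sheet
        order.extend([front, back])
    return order


print(duplex_page_order(6))  # [0, 5, 1, 4, 2, 3]
```

The resulting list can then be handed to whatever PDF tool does the actual page shuffling: take scanned page 0 first, then scanned page 5, and so on.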

Hardware

Scanner

Right now, I have an HP all-in-one with a document feeder. It’s a bit slow, but it works. It’s also very noisy, so I can only use it when the kids are awake. Finally, it’s big, so I can’t move it to where the documents are; I have to move the documents to the scanner. This combination discourages me from getting all the scan work done. If I had the money, I’d probably get a small Fujitsu portable scanner that can scan double-sided paper.

Storage

Today, I have a great deal of storage on a Drobo. I like the device because it’s easy to use, flexible, and relatively safe for data storage. The data on the Drobo are well protected from a drive failure, but not from a Drobo hardware failure. So, the Drobo is about reducing risk, not eliminating it. But even without the Drobo, all of my documents probably take up less than 2 GB today. If I get everything converted, it would probably be less than 200 GB. Even today, 200 GB is a fairly trivial amount of space. Thus, backing up this content should be relatively easy, and it may be a good candidate for cloud storage, such as Amazon S3 (http://aws.amazon.com/s3/), Mozy (www.mozy.com), or another such service.

Summary

Today, there is no single solution that works for my paperless office needs. Such a solution will probably never exist because some documents must be treated differently than the rest. Furthermore, you have to have solutions for each step: 1) acquiring the document, via external source or scanning, 2) ensuring you have the text (e.g. OCR), 3) naming and filing the content in a meaningful way, and 4) having an easy way to pull it out of the massive amounts of content on your system.
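For step 3, one way to keep file names meaningful to a human reader is to build them from a few known facts about each document. The Python below is a minimal sketch of that idea. The date_source_title pattern and the function name archive_filename are assumptions chosen for illustration; none of the applications above requires or produces this scheme.

```python
import re
from datetime import date


def archive_filename(doc_date, source, title, ext="pdf"):
    """Build a sortable, human-readable file name (hypothetical scheme)."""

    def slug(text):
        # Collapse anything that is not a letter or digit into one hyphen.
        return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-").lower()

    # The ISO date goes first so an alphabetical sort is also chronological.
    return f"{doc_date.isoformat()}_{slug(source)}_{slug(title)}.{ext}"


print(archive_filename(date(2009, 11, 13), "HP", "All-in-One Warranty Card"))
# 2009-11-13_hp_all-in-one-warranty-card.pdf
```

Names like this stay readable in the Finder and still turn up in a Spotlight search, whichever application ends up managing the files.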