From: Mike Marchywka
Subject: RE: pdftk and apropos - general questions
Date: Wed, 4 Mar 2009 15:33:07 -0500
In-Reply-To: <20090304175648.GA5388@KCJs-Computer>

----------------------------------------
> Date: Wed, 4 Mar 2009 09:56:49 -0800
> From: garyjohn AT spocom DOT com
> To: cygwin AT cygwin DOT com
> Subject: Re: pdftk and apropos - general questions
>
> On 2009-03-04, Mike Marchywka wrote:
>
>>> Mike Marchywka wrote:
>>>> I've had a persistent problem getting apropos to work
>>>> as it never finds anything appropriate. Is there
>>>> something I need to do to make this work?
>>>>
>>> After each setup session, you need to run, /usr/sbin/makewhatis -u.
>>
>>
>> Thanks but I did get that far after earlier hints and you list
>> below is about what I ended up with too. One problem
>> I ran into was trying to extract sensical text from the
>> IRS instructions.
>
> I have that problem with the printed versions.
>
>> I used the pdftotext utility IIRC from
>>
>> http://www.foolabs.com/xpdf/download.html
>>
>> and it didn't seem to be able to separate multi-column text
>> automatically ( with sed and awk I got what I needed but what
>> a mess).
>
> Did you use the -layout option to pdftotext?
> It makes a huge
> difference on the documents I've converted, but they've all been
> single column.

I played with the options, but I'm not sure the information is in the
source PDF. I don't imagine the authors really cared too much about
layout. IIRC, selection gave rectangles of the whole page width, but
also IIRC, in scientific papers the selection normally went column by
column. Somewhere between intelligent formatting and scanned PDF is
probably the authoring tool that just puts out blocks of text that
can't be extracted properly (probably even by design, to stop people
from using information without the pictures that someone spent a lot of
time authoring).

I did try pdftk on an f1040.pdf download, but I finally had to install
Acrobat Reader to look at the form and fill it in. pdftk let me examine
the filled-in form, but there was no immediate way to identify form
fields; I have to look for meaningful names etc. I guess if I could
enter the input data into something I could use, it would be worthwhile
to write a script to fill out the form. I'll use a web form for a few
lines of input, but if I have to type 100 numbers into an information
black hole I'm happy to kill a tree or two.

>
> Regards,
> Gary
>
> --
> Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
> Problem reports: http://cygwin.com/problems.html
> Documentation: http://cygwin.com/docs.html
> FAQ: http://cygwin.com/faq/
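
The form-filling script idea mentioned above could be sketched roughly as follows. This is only an illustration, not something from the thread: it assumes the field names have already been found with pdftk's dump_data_fields operation, that the values are plain ASCII, and that the resulting file is handed to pdftk's fill_form operation. The field names f1_01 and f1_02 and the helper names fdf_escape and build_fdf are hypothetical.

```python
def fdf_escape(text):
    """Escape backslashes and parentheses for an FDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_fdf(fields):
    """Return minimal FDF text for a dict mapping field names to values."""
    entries = "\n".join(
        "<< /T ({}) /V ({}) >>".format(fdf_escape(name), fdf_escape(str(value)))
        for name, value in fields.items()
    )
    return (
        "%FDF-1.2\n"
        "1 0 obj\n"
        "<< /FDF << /Fields [\n"
        + entries +
        "\n] >> >>\n"
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    )


if __name__ == "__main__":
    # Hypothetical field names; the real ones would come from
    #   pdftk f1040.pdf dump_data_fields
    sample = {"f1_01": "Mike", "f1_02": "1234.56"}
    # Print the FDF text; redirect it to a file such as data.fdf.
    print(build_fdf(sample), end="")
```

Saved to data.fdf, the output could then be applied with something like `pdftk f1040.pdf fill_form data.fdf output filled.pdf`. Matching the cryptic field names to the lines of the form remains the manual part.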