Mail Archives: cygwin/2009/03/04/15:33:29

X-Recipient: archive-cygwin AT delorie DOT com
X-SWARE-Spam-Status: No, hits=-2.0 required=5.0 tests=AWL,BAYES_00,SPF_PASS
X-Spam-Check-By: sourceware.org
Message-ID: <BLU113-W29F6E906F6793615A4A75DBEA70@phx.gbl>
From: Mike Marchywka <marchywka AT hotmail DOT com>
To: <cygwin AT cygwin DOT com>
Subject: RE: pdftk and apropos - general questions
Date: Wed, 4 Mar 2009 15:33:07 -0500
In-Reply-To: <20090304175648.GA5388@KCJs-Computer>
References: <BLU113-W74226535EC192149C5AEABEA60 AT phx DOT gbl> <49AE9494 DOT 1000804 AT veritech DOT com> <BLU113-W51FC38A48F454394262F2CBEA70 AT phx DOT gbl> <20090304175648 DOT GA5388 AT KCJs-Computer>
MIME-Version: 1.0
X-IsSubscribed: yes
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com
Note-from-DJ: This may be spam




----------------------------------------
> Date: Wed, 4 Mar 2009 09:56:49 -0800
> From: garyjohn AT spocom DOT com
> To: cygwin AT cygwin DOT com
> Subject: Re: pdftk and apropos - general questions
>
> On 2009-03-04, Mike Marchywka wrote:
>
>>> Mike Marchywka wrote:
>>>> I've had a persistent problem getting apropos to work
>>>> as it never finds anything appropriate. Is there
>>>> something I need to do to make this work?
>>>>
>>> After each setup session, you need to run, /usr/sbin/makewhatis -u.
>>
>>
>> Thanks but I did get that far after earlier hints and you list
>> below is about what I ended up with too. One problem
>> I ran into was trying to extract sensical text from the
>> IRS instructions.
>
> I have that problem with the printed versions.
>
>> I used the pdftotext utility IIRC from
>>
>> http://www.foolabs.com/xpdf/download.html
>>
>> and it didn't seem to be able to separate multi-column text
>> automatically ( with sed and awk I got what I needed but what
>> a mess).
>
> Did you use the -layout option to pdftotext? It makes a huge
> difference on the documents I've converted, but they've all been
> single column.

I played with the options but I'm not sure the information
is in the source PDF. I don't imagine the authors really cared
too much about layout. IIRC, selection gave rectangles of the whole page
width, whereas, also IIRC, in scientific papers the selection normally
went column by column. Somewhere between intelligent formatting
and scanned PDF is probably the authoring tool that just
puts out blocks of text that can't be extracted properly
(probably even by design, to stop people from using information
without the pictures that someone spent a lot of time authoring).

I did try pdftk on an f1040.pdf download
but I finally had to install Acrobat Reader to look
at the form and fill it in. pdftk let me examine the
filled-in form but there was no immediate way to
identify form fields; I have to look for meaningful names etc.
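
As a sketch of how the field names could be pulled out for inspection:
pdftk has a dump_data_fields operation that prints one record per form
field, with records separated by "---" lines and each line in
"Key: Value" form. The Python below parses that text. The function
name parse_fields is made up for this example, and the simple
dictionary keeps only the last value when a key repeats within one
record.

```python
def parse_fields(dump):
    """Parse the text printed by `pdftk form.pdf dump_data_fields`
    into a list of dictionaries, one per form field."""
    fields, current = [], {}
    for line in dump.splitlines():
        if line.strip() == "---":
            # a separator line starts a new record
            if current:
                fields.append(current)
            current = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            current[key.strip()] = value.strip()
    if current:
        fields.append(current)
    return fields
```

Printing the FieldName and FieldType of each record at least gives a
list to match against the visible form, though the names themselves
may still not be meaningful.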

I guess if I could enter input data into something I could use
it would be worthwhile writing a script to fill out the form.
I'll use a web form for a few lines of input but if I have
to type 100 numbers into an information black hole I'm
happy to kill a tree or two.
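
A minimal sketch of what such a script might look like, assuming the
field names are already known: pdftk's fill_form operation takes an
FDF file, and the Python below writes a bare-bones one from a
dictionary. The function names and the field names in the usage note
are hypothetical, and the FDF layout is the minimal form I believe
pdftk accepts rather than anything checked against the IRS form.

```python
def fdf_escape(text):
    """Escape backslashes and parentheses for an FDF string literal."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_fdf(values):
    """Build minimal FDF text from a {field name: value} dictionary."""
    entries = "\n".join(
        "<< /T (%s) /V (%s) >>" % (fdf_escape(name), fdf_escape(str(value)))
        for name, value in values.items()
    )
    return (
        "%FDF-1.2\n"
        "1 0 obj\n"
        "<< /FDF << /Fields [\n" + entries + "\n] >> >>\n"
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    )
```

The result of something like build_fdf({"wages": 1234}) would be saved
to a file, say data.fdf, and applied with
"pdftk f1040.pdf fill_form data.fdf output filled.pdf".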




>
> Regards,
> Gary
>
>
>
> --
> Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
> Problem reports: http://cygwin.com/problems.html
> Documentation: http://cygwin.com/docs.html
> FAQ: http://cygwin.com/faq/
>


