From: karl AT aspodata DOT se
To: geda-user AT delorie DOT com
Subject: [geda-user] Re: pdf table extraction
Comments: In-reply-to gedau AT igor2 DOT repo DOT hu message dated "Fri, 04 Sep 2015 13:06:09 +0200."
Date: Fri, 4 Sep 2015 13:21:33 +0200 (CEST)

Igor2:
> On Fri, 4 Sep 2015, karl AT aspodata DOT se wrote:
>
> > Igor2:
> > [ about tables in pdf's ]
> >
> > It's true that pdf doesn't have a table structure.
> >
> > I have some experimental code to extract tables from pdf's, it is in:
> >
> >  http://turkos.aspodata.se/git/openhw/pdftosym/Experimental/
>
> Thanx, will check it out. What you wrote suggests your script works
> similarly to mine.

Yes, but I got the impression that you used the graphical elements in
the file, and that you possibly used pdftohtml in "html" mode, which
doesn't give you the text positions. I have been working purely on
the textual part.

And beware that the code above is a big mess. Perhaps you can have a
look at:

 http://turkos.aspodata.se/computing/pdfextr.pl

which is a little less messy. It extracts things from an invoice
(sorry, I can't provide you with example input data).

Regards,
/Karl Hammar

-----------------------------------------------------------------------
Aspö Data
Lilla Aspö 148
S-742 94 Östhammar
Sweden
+46 173 140 57
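
To illustrate the point about text positions: pdftohtml's "-xml" mode
writes each text fragment as a <text> element with top/left/width/height
attributes, and those coordinates are enough to group fragments into
table rows. The following is only a minimal sketch of that idea, not the
code from either script mentioned above. The sample XML is made up, and
the names rows_from_xml, SAMPLE_XML and tolerance are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape produced by "pdftohtml -xml";
# the coordinates and cell contents are invented for illustration.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="40" height="12" font="0">Pin</text>
<text top="101" left="150" width="60" height="12" font="0">Name</text>
<text top="120" left="150" width="60" height="12" font="0">VCC</text>
<text top="120" left="50" width="40" height="12" font="0">1</text>
<text top="140" left="50" width="40" height="12" font="0">2</text>
<text top="141" left="150" width="60" height="12" font="0"><b>GND</b></text>
</page>
</pdf2xml>"""


def rows_from_xml(xml_text, tolerance=3):
    """Group text fragments into rows by their 'top' coordinate.

    Fragments whose 'top' lies within `tolerance` units of the first
    fragment in a row are treated as the same row (an assumption that
    will not hold for every document). Cells are ordered by 'left'.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        items = []
        for t in page.iter("text"):
            # itertext() also picks up text nested in <b>, <i>, <a> ...
            content = "".join(t.itertext()).strip()
            if content:
                items.append((float(t.get("top")), float(t.get("left")), content))
        items.sort()  # top to bottom, then left to right

        current, current_top = [], None
        for top, left, content in items:
            if current_top is not None and top - current_top > tolerance:
                # Vertical gap too large: close the row and start a new one.
                rows.append([c for _, c in sorted(current)])
                current, current_top = [], None
            if current_top is None:
                current_top = top
            current.append((left, content))
        if current:
            rows.append([c for _, c in sorted(current)])
    return rows


if __name__ == "__main__":
    for row in rows_from_xml(SAMPLE_XML):
        print(" | ".join(row))
```

For a real file, the XML would come from running something like
"pdftohtml -xml input.pdf" first. A fixed tolerance is the simplest
choice. Real tables with wrapped cells or uneven baselines would likely
need column detection on the "left" values as well.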