From: karl AT aspodata DOT se
To: geda-user AT delorie DOT com
Subject: pdf table extraction (was Re: [geda-user] Interesting blog post from a commercial EDA vendor - pdf)
Comments: In-reply-to gedau AT igor2 DOT repo DOT hu message dated "Fri, 04 Sep 2015 06:00:42 +0200."
Message-Id: <20150904095423.31827809DB80@turkos.aspodata.se>
Date: Fri, 4 Sep 2015 11:54:22 +0200 (CEST)
Reply-To: geda-user AT delorie DOT com

Igor2:
[ about tables in pdf's ]

It's true that pdf doesn't have a table structure. I have some
experimental code to extract tables from pdf's; it is in:

 http://turkos.aspodata.se/git/openhw/pdftosym/Experimental/

///

If you use the "-xml" argument to pdftohtml, you get the positions of
the text. What's missing in the output below is text rotation; the pdf
below has vertical text in the headers. It could be useful to patch
pdftohtml to get that info. It would also be useful to know the font
metrics, so you can tell whether two text elements are separated by a
simple space, i.e. belong to the same text, or by more, i.e. possibly
belong to different columns.

Example:

 pdftohtml -f 40 -l 51 -c -xml ~/Net/http/www.st.com/internet/com/TECHNICAL_RESOURCES/TECHNICAL_LITERATURE/DATASHEET/CD00237391.pdf a

generates a.xml:

 Pinouts and pin description STM32F205xx, STM32F207xx ... BAT ... 176

///

That is rather simple to parse, so you get one array with fontspecs
(to get the size) and one for the text with page number and position
(and font size).
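As a rough illustration of that parsing step, here is a minimal Python sketch. It is not the code from the pdftosym repository. It assumes the usual layout of "pdftohtml -xml" output: "page" elements with a "number" attribute, "fontspec" elements with "id" and "size", and "text" elements with "top", "left", "width", "height" and "font". The names TextItem, parse_pdf2xml and SAMPLE are made up for this example, and the sample values are invented, not taken from the datasheet.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class TextItem:
    """One <text> element plus the page it was found on (hypothetical type)."""
    page: int
    top: int
    left: int
    width: int
    height: int
    font_size: int
    text: str


def parse_pdf2xml(xml_data):
    """Return (font_sizes, items) from pdftohtml -xml output.

    font_sizes maps fontspec id -> size; items is a list of TextItem
    sorted by page, then top, then left.
    """
    root = ET.fromstring(xml_data)

    # First pass: collect every fontspec, wherever it is declared.
    font_sizes = {}
    for spec in root.iter("fontspec"):
        font_sizes[spec.get("id")] = int(spec.get("size"))

    # Second pass: collect the text elements page by page.
    items = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            items.append(TextItem(
                page=page_no,
                top=int(node.get("top")),
                left=int(node.get("left")),
                width=int(node.get("width")),
                height=int(node.get("height")),
                # Unknown font ids fall back to size 0 (an assumption).
                font_size=font_sizes.get(node.get("font"), 0),
                # itertext() also picks up text inside <b>, <i>, <a>.
                text="".join(node.itertext()).strip(),
            ))

    items.sort(key=lambda t: (t.page, t.top, t.left))
    return font_sizes, items


# Invented sample in the assumed format, only to exercise the parser.
SAMPLE = """<pdf2xml>
<page number="40" position="absolute" top="0" left="0" height="1263" width="892">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<fontspec id="1" size="8" family="Times" color="#000000"/>
<text top="200" left="300" width="40" height="10" font="1"><b>PA0</b></text>
<text top="200" left="100" width="20" height="10" font="1">34</text>
<text top="60" left="80" width="250" height="16" font="0">Sample header</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    fonts, texts = parse_pdf2xml(SAMPLE)
    for t in texts:
        print(t.page, t.top, t.left, t.font_size, repr(t.text))
```

For a real file you would pass the contents of a.xml instead of SAMPLE, e.g. parse_pdf2xml(open("a.xml", "rb").read()). Keeping the items sorted by page, top and left makes the later row and column detection easier.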
 - sort the "text" elements by top and left
 - find the same text at the same position on different pages; that is
   the page header and footer, and it's probably not part of the table,
   so remove it along with the page counter (use some heuristics to
   find that)
 - since you have the top and height of the text elements, you can now
   find text elements that overlap vertically; they are your table
   lines
 - sometimes you have to merge several lines, e.g. the last col. could
   be multiline
 - basically use the same procedure to find the limits of the columns
 - possibly identify sub/superscripts and possibly remove them

///

With that procedure I could generate something resembling:

 http://turkos.aspodata.se/git/openhw/pdftosym/stm32f100h.tbl

and then

 http://turkos.aspodata.se/git/openhw/pdftosym/stm32f100h.pins

which I could use as input to

 http://turkos.aspodata.se/git/openhw/pdftosym/symtopin.pl

to generate footprints with.

===================

It has been some time since I worked on "pdftosym"; maybe we could toss
around some ideas.

Regards,
/Karl Hammar

-----------------------------------------------------------------------
Aspö Data
Lilla Aspö 148
S-742 94 Östhammar
Sweden
+46 173 140 57