Mail Archives: geda-user/2017/03/21/06:38:05
John:
> On 03/20/2017 10:12 AM, karl AT aspodata DOT se wrote:
> > As a proof of concept I have made pdfextr.pl [1], with which, using [2] as
> > the input data file, I can produce [3]:
> >
> > ./pdfextr.pl run=stm32 table=27,31 stm32f105r8.pdf > stm32f105r8.table
>
> So, the table=27,31 tells it which pages to use to extract the text from.
Yes. I could potentially search for the "list of tables" and some line
containing "table 5. Pin definition" or the like, follow the page
reference to the correct page, and have something that identifies the
end of the table.
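
To illustrate the "list of tables" idea, here is a minimal Python sketch
(not code from pdfextr.pl, which is Perl). It assumes the list of tables
has already been extracted as plain text lines of the form
"Table N. Title ..... page"; the sample lines, the function name
find_table_page and the regular expression are all made up for
illustration, and real datasheets will need a looser pattern.

```python
import re

# Assumed layout of one "list of tables" entry:
#   Table 5. Pin definitions ........ 27
# Dot leaders may be written "....." or ". . . .".
TOC_LINE = re.compile(r"^\s*Table\s+(\d+)\.\s+(.+?)[\s.]+(\d+)\s*$",
                      re.IGNORECASE)

def find_table_page(lines, keyword):
    """Return the page number of the first entry whose title contains
    keyword (case-insensitive), or None if there is no such entry."""
    for line in lines:
        m = TOC_LINE.match(line)
        if m and keyword.lower() in m.group(2).lower():
            return int(m.group(3))
    return None

# Hypothetical sample data, not taken from any real datasheet.
SAMPLE_TOC = [
    "List of tables",
    "Table 4. Timer features ........ 20",
    "Table 5. Pin definitions ........ 27",
    "Table 6. Memory map . . . . 35",
]
```

Finding the end of the table is a separate problem; one option under the
same assumptions is to stop at the page of the next entry in the list.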
> Sounds like a great start for making a symbol.
:)
> What kind of tables does it work on?
Mind you, it probably only works on the table in the file given above.
But for tables with a similar layout, I could identify package names
(LQFP100 etc.) in the headers and adjust the logic to that. I could also
include logic to identify other header names with some kind of
dictionary.
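
As a rough sketch of what such header recognition might look like (in
Python, not the Perl of pdfextr.pl): a regular expression for package
names plus a small dictionary of known header texts. The list of package
families, the dictionary entries and the name classify_header are
assumptions for illustration only.

```python
import re

# Assumed package families; extend as needed for other datasheets.
PACKAGE_RE = re.compile(r"^(?:LQFP|TQFP|LFBGA|UFBGA|BGA|QFN|WLCSP)\d+$")

# Hypothetical dictionary mapping header texts to column roles.
HEADER_ALIASES = {
    "pin name": "name",
    "type": "type",
    "main function (after reset)": "function",
}

def classify_header(text):
    """Classify one header cell as a package column, a known field,
    or unknown. Whitespace runs are collapsed before matching."""
    norm = " ".join(text.split())
    if PACKAGE_RE.match(norm.upper()):
        return ("package", norm.upper())
    role = HEADER_ALIASES.get(norm.lower())
    if role is not None:
        return ("field", role)
    return ("unknown", norm)
```

Unknown headers are returned as-is, so a caller can log them and the
dictionary can grow as new datasheets are tried.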
The core code basically works on any table, though the program
pdftohtml, which provides a dump of the text together with bounding
boxes, sometimes groups together tokens that belong to different
columns, and it doesn't provide bounding boxes for rotated text.
So I would like to find another program, or adjust pdftohtml, to make
the cell finding process easier; currently I have to include some
guessing code. Finding out where the table's ruling lines go would also
be helpful in assigning stray text to cells.
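
For readers who want to look at such a dump themselves, here is a small
Python sketch that reads text boxes from pdftohtml's XML output mode
(pdftohtml -xml). Whether pdfextr.pl uses that mode is an assumption; the
sample XML below is hand-written in the shape that mode produces
(<page> elements containing <text top= left= width= height=> elements),
and read_boxes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of "pdftohtml -xml" output.
SAMPLE = """<pdf2xml>
<page number="27" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="40" height="12" font="0"><b>Pins</b></text>
<text top="120" left="52" width="20" height="12" font="1">14</text>
</page>
</pdf2xml>"""

def read_boxes(xml_text):
    """Return one dict per non-empty <text> element, with its page
    number, bounding box and text content."""
    boxes = []
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for t in page.iter("text"):
            # Text may be wrapped in <b>, <i> or <a>; join all pieces.
            text = "".join(t.itertext()).strip()
            if not text:
                continue
            boxes.append({
                "page": page_no,
                "left": int(t.get("left")),
                "top": int(t.get("top")),
                "width": int(t.get("width")),
                "height": int(t.get("height")),
                "text": text,
            })
    return boxes
```

Each <text> element is one token group as pdftohtml chose to cut it, so a
group that spans two columns shows up here as a single wide box.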
Also, the run=xxx argument lets you switch the final recognition
and editing code.
> How do you recognize them from the pdf appearance
> in a pdf reader?
The program finds lines in the text, i.e. sequences of text that overlap
vertically, and then finds columns, i.e. parts of the lines that
overlap horizontally. That basically gives me the table cells; then I
just have to add heuristics for how the real world behaves...
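
The overlap idea can be sketched in a few lines of Python. This is only
an illustration of the principle described above, not the code in
pdfextr.pl: boxes are grouped into text lines when their vertical extents
overlap, and into columns when their horizontal extents overlap. The Box
type, the helper names and the sample coordinates and cell values are
made up.

```python
from collections import namedtuple

Box = namedtuple("Box", "left top width height text")

def cluster(boxes, lo, hi):
    """Group boxes whose [lo, hi) intervals overlap, chaining
    transitively. Groups come out ordered by their lower bound."""
    groups, end = [], None
    for b in sorted(boxes, key=lo):
        if groups and lo(b) < end:
            groups[-1].append(b)
            end = max(end, hi(b))
        else:
            groups.append([b])
            end = hi(b)
    return groups

def to_table(boxes):
    """Return a list of rows, each a list with one string per column."""
    rows = cluster(boxes, lambda b: b.top, lambda b: b.top + b.height)
    cols = cluster(boxes, lambda b: b.left, lambda b: b.left + b.width)
    col_of = {id(b): c for c, col in enumerate(cols) for b in col}
    table = []
    for row in rows:
        cells = [""] * len(cols)
        for b in sorted(row, key=lambda b: b.left):
            c = col_of[id(b)]
            cells[c] = (cells[c] + " " + b.text).strip()
        table.append(cells)
    return table

# Made-up sample: three text lines, three columns, slightly misaligned.
SAMPLE_BOXES = [
    Box(50, 100, 40, 12, "Pin"), Box(120, 101, 60, 12, "Name"),
    Box(200, 100, 30, 12, "Type"),
    Box(52, 120, 20, 12, "14"), Box(118, 121, 50, 12, "PA0"),
    Box(201, 120, 25, 12, "I/O"),
    Box(51, 140, 20, 12, "15"), Box(119, 140, 30, 12, "PA1"),
    Box(202, 141, 25, 12, "I/O"),
]
```

Note that a single box spanning two columns chains those columns into
one, which is exactly where heuristics become necessary.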
Regards,
/Karl Hammar
-----------------------------------------------------------------------
Aspö Data
Lilla Aspö 148
S-742 94 Östhammar
Sweden
+46 173 140 57