X-Authentication-Warning: delorie.com: mail set sender to geda-user-bounces using -f
X-Recipient: geda-user AT delorie DOT com
Date: Fri, 4 Sep 2015 06:00:42 +0200 (CEST)
X-X-Sender: igor2 AT igor2priv
To: "Ouabache Designworks (z3qmtr45 AT gmail DOT com) [via geda-user AT delorie DOT com]" <geda-user AT delorie DOT com>
X-Debug: to=geda-user AT delorie DOT com from="gedau AT igor2 DOT repo DOT hu"
From: gedau AT igor2 DOT repo DOT hu
Subject: Re: [geda-user] Interesting blog post from a commercial EDA vendor
 - pdf
In-Reply-To: <CAOP4iL3YWQ_MH3HNnyDHMGCGeYFBmazwcw7Af_GATQzAUQJ57g@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1509040545240.6924@igor2priv>
References: <CAOP4iL3YWQ_MH3HNnyDHMGCGeYFBmazwcw7Af_GATQzAUQJ57g AT mail DOT gmail DOT com>
User-Agent: Alpine 2.00 (DEB 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Reply-To: geda-user AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: geda-user AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk


On Thu, 3 Sep 2015, Ouabache Designworks (z3qmtr45 AT gmail DOT com) [via geda-user AT delorie DOT com] wrote:

>
>https://medium.com/@zakhomuth/disrupting-electronic-design-automation-8988f72299e3

Btw, somewhat off-topic, a part that geda-user discussions usually don't
cover: PDF datasheets. I really like his rant on how useless it is to
distribute data as PDF.

I face that problem from time to time. Last December I had it with an ARM
Cortex. I wanted to extract the register names, bit names and magic values
(e.g. this bit in this register always has to be 1). The C sources and
other material come with an EULA that doesn't let me do what I want. The
datasheet is a PDF. Most of the relevant data is in almost uniform
tables.

I thought I'd just convert the PDF to HTML and extract <table> nodes... I
laugh at this idea in retrospect. I tried various tools and various
settings. Never got a <table>. It turned out the PDF just draws the
borders and draws the text separately. The rendering only looks like a
table. The HTML some tools produce looks the same as the PDF, but in
practice there is no table in that HTML, just a big background bitmap
with the lines, and the text printed onto it at pixel coords.

I ended up with a "table mapping" script that takes the bitmap, scans
lines and columns to map cell coordinates, then reads all the text from
the HTML and determines which cell each piece of text is in.
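
A minimal sketch of that kind of cell mapping, in Python. This is not the
original script: the bitmap representation (rows of 0/1 pixels), the
min_fill threshold, the assumption that every border spans the whole
table, and all function names are made up for illustration.

```python
from bisect import bisect_right


def find_lines(bitmap, axis, min_fill=0.9):
    """Return indices of pixel rows (axis=0) or columns (axis=1) that are
    mostly dark, i.e. likely table borders.

    bitmap is a list of rows, each a list of 0 (white) / 1 (dark) pixels.
    A run of adjacent dark rows/columns is reported once, by its first index.
    """
    height, width = len(bitmap), len(bitmap[0])
    if axis == 0:
        fills = [sum(row) / width for row in bitmap]
    else:
        fills = [sum(bitmap[y][x] for y in range(height)) / height
                 for x in range(width)]
    lines, prev_dark = [], False
    for i, fill in enumerate(fills):
        dark = fill >= min_fill
        if dark and not prev_dark:
            lines.append(i)
        prev_dark = dark
    return lines


def cell_of(x, y, col_lines, row_lines):
    """Map a pixel coordinate to (row, col) cell indices, or None if the
    point is outside the grid."""
    col = bisect_right(col_lines, x) - 1
    row = bisect_right(row_lines, y) - 1
    if 0 <= col < len(col_lines) - 1 and 0 <= row < len(row_lines) - 1:
        return row, col
    return None


def map_texts(bitmap, texts):
    """texts: list of (x, y, string) as positioned in the HTML output.
    Returns {(row, col): [strings]}."""
    row_lines = find_lines(bitmap, 0)
    col_lines = find_lines(bitmap, 1)
    cells = {}
    for x, y, s in texts:
        cell = cell_of(x, y, col_lines, row_lines)
        if cell is not None:
            cells.setdefault(cell, []).append(s)
    return cells
```

Real datasheet tables have merged cells and partial borders, so a fixed
threshold like this would need per-table tuning at the very least.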

And this is only the first step in converting the data of a datasheet
to a machine-readable form, on the lowest level... Upper levels, in
separate scripts, took the table map and tried to read the header and
convert the info into a register description.
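
As a rough illustration of such an upper level (again only a sketch, not
the original scripts): assuming the table map is a dict of
{(row, col): [strings]} and that row 0 holds the column headers, each
later row can be turned into one record. The function name and the
sample headers in the usage note are hypothetical.

```python
def table_to_fields(cells):
    """Turn a table map {(row, col): [strings]} into a list of dicts.

    Row 0 is treated as the header; every later row becomes one dict
    keyed by the lower-cased header text. Missing cells become "".
    """
    n_rows = max(r for r, _ in cells) + 1
    n_cols = max(c for _, c in cells) + 1

    def text(r, c):
        # Join all text fragments that landed in the same cell.
        return " ".join(cells.get((r, c), [])).strip()

    header = [text(0, c).lower() for c in range(n_cols)]
    return [
        {header[c]: text(r, c) for c in range(n_cols)}
        for r in range(1, n_rows)
    ]
```

For example, a map with header cells "Bit" and "Name" and one data row
"7", "EN" would give a single record with the keys "bit" and "name".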

I agree with the Upverter guy. In the age of thousand-page datasheets, a
non-machine-readable format is a bug that needs to be fixed. On the other
hand, I'm highly sceptical about vendors being cooperative on this.

Regards,

Igor2