Mail Archives: opendos/1998/10/13/05:23:24

X-Authentication-Warning: central.caverock.co.nz: viking set sender to flying-brick.caverock.net.nz!viking using -f
Date: Tue, 13 Oct 1998 22:16:01 +1300 (NZDT)
From: Eric Gillespie <viking AT flying-brick DOT caverock DOT net DOT nz>
To: opendos AT delorie DOT com
Subject: Re: Sed script for stripping HTML
In-Reply-To: <Pine.LNX.3.96.981001203656.567A-100000@brick.flying-brick.caverock.net.nz>
Message-ID: <Pine.LNX.3.96.981013212426.1121A-100000@brick.flying-brick.caverock.net.nz>
MIME-Version: 1.0
Reply-To: opendos AT delorie DOT com

Subject was: Re: DRDOS suggestions
On Thu, 1 Oct 1998, Eric Gillespie wrote about Win95:

:... but just try doing this with a mouse...: 
:
:sed 's/<.[:+"?&A-Za-z0-9.= #~,_;%\/\-]*>//g' old.html > new.txt
:
:If someone could suggest any improvements (this won't even WORK under
:COMMAND.COM, but it does under bash - something to do with handling the " 
:symbol), I'd be pleased to hear them...

Sean King (maybe from the list?) confirmed it was indeed a problem with the way
COMMAND.COM handles the " character, and suggested escaping each one as \" - I
think I tried this but it didn't work, don't know why.  Seems strange that it
worked under bash (in DOS!) though...

He also mentioned that DOS versions of SED expect their arguments in double
quotes.  An example from my DOS machine:

     zoom:# sed "s/...." old.file > new.file

                ^      ^   not ' but "

...well, I came up with a way to eliminate the problem, AND I managed to
simplify it too - though I imagine someone who knows more about regex parsing
would have found it sooner.

I finally clicked when I realised I could cover a whole range of
characters - in fact everything between a space and a squiggle (~, 0x7E) - in
two ranges plus a couple of single characters (leaving out the < and >,
because they're the delimiters after all...).  So I came up with the shorter
(and easier to remember) version in a matter of a few minutes of working
between two consoles on Linux... (gee, can't you tell I love it?)

~$ sed 's/<.[?-~\ !-;=]*>//g' origfile.html > newfile.txt

or for DOS, it's

    sed "s/<.[?-~\ !-;=]*>//g" origfile.htm > newfile.txt

and it works quite well under DR-DOS when you replace the single quotes with
double quotes; otherwise OpenDOS throws a wobbly about "Invalid Directory
specified" - at least OpenDOS 7.02 does.

Could someone confirm with MS-DOS or Win95/DOS 7?

P.S.  I was using the DJGPP version of sed (1.18)

And before anyone asks, I'll break it down here and now...

sed     <--- progname (gobvious, aye?)
' or "  <--- depending on OS (Unix = ', DOS = ")
s       <--- substitute command
/       <--- begin pattern
<       <--- first char to match
.       <--- any single character (so a tag needs at least one char)
[       <--- begin a set
?-~     <--- first subset
\       <--- a backslash plus a space (in a POSIX set the \ is literal,
             so the space needs no escape)
!-;     <--- second subset of chars to eliminate
=       <--- the last character to eliminate (all between 0x20 and 0x7e
             except for the two angle brackets)
]       <--- close set
*       <--- any number of chars of the preceding set
>       <--- last char to match
//      <--- end pattern, and specify empty replace string
g       <--- do it right through the file...
' or "  <--- depending on OS (Unix = ', DOS = ") finishes the command
old.html     original filename
>       <--- redirect (unless you want it all going to screen?)
new.txt      new filename
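
The claim in the breakdown that the set covers everything from 0x20 to 0x7e
except the two angle brackets can be checked mechanically.  This is only a
sketch: the function name set_leftovers is hypothetical, and it assumes an
awk whose printf "%c" turns a number into the character with that code.

```shell
# set_leftovers is a hypothetical helper: it prints whichever printable
# ASCII characters are NOT covered by the set used in the sed command.
set_leftovers() {
    # Build one line holding every character from 0x20 to 0x7e ...
    awk 'BEGIN { for (i = 32; i <= 126; i++) printf "%c", i; print "" }' |
    # ... then delete every character that belongs to the set.
    LC_ALL=C sed 's/[?-~\ !-;=]//g'
}

set_leftovers
# prints: <>
```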


I haven't worked it out for removing stuff across lines yet (for removing
Javascript and Java...) but it's 22:15 NZDT here, my wife's getting my
tablets, and I'm off to bed - ahhh the pleasures of married life...
&@$#((^%% NO CARRIER
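
One possible approach to tags that span lines, offered as a sketch rather
than as the author's method: read the whole file into sed's pattern space
and then delete anything from a < up to the next >.  It swaps the set above
for the simpler [^>], which also matches a newline.  The name
strip_tags_multiline is hypothetical, and this has not been checked against
the DJGPP sed 1.18 mentioned earlier.  It removes the tags only; the text
between <script> and </script> is left in place.

```shell
# strip_tags_multiline is a hypothetical helper, not from the original post.
strip_tags_multiline() {
    # :a    label marking the start of the loop
    # $!N   if this is not the last line, append the next line
    # $!ba  if this is still not the last line, go round again
    # With the whole input in the pattern space, [^>] can cross newlines.
    LC_ALL=C sed -e ':a' -e '$!N' -e '$!ba' -e 's/<[^>]*>//g'
}

printf '%s\n' '<a' 'href="x.htm">link</a> text' | strip_tags_multiline
# prints: link text
```

Holding the whole file in memory is fine for ordinary web pages but may be a
problem for very large files on a small DOS machine.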

 /|   _,.:*^*:.,   |\  Cheers from the Viking family, including Marmalade 
| |_/'  viking@ `\_| | Running Linux and OpenDOS in Christchurch!
|    flying-brick    | $FunnyMail  Bilbo   : Now far ahead the Road has gone,
 \_.caverock.net.nz_/     5.39    in LOTR  : Let others follow it who can!



