Date: Tue, 13 Oct 1998 22:16:01 +1300 (NZDT)
From: Eric Gillespie
To: opendos AT delorie DOT com
Subject: Re: Sed script for stripping HTML
Reply-To: opendos AT delorie DOT com

Subject was: Re: DRDOS suggestions

On Thu, 1 Oct 1998, Eric Gillespie wrote about Win95:

:... but just try doing this with a mouse...:
:
:sed 's/<.[:+"?&A-Za-z0-9.= #~,_;%\/\-]*>//g' old.html > new.txt
:
:If someone could suggest any improvements (this won't even WORK under
:COMMAND.COM, but it does under bash - something to do with handling the "
:symbol), I'd be pleased to hear them...

Sean King (maybe from the list?) confirmed it was indeed a problem in the
way that COMMAND.COM handles ", and suggested quoting them, \" - I think I
tried this but it didn't work, don't know why. Seems strange it worked under
bash (in DOS!) though... He also mentioned that DOS versions of SED expect
their arguments in double quotes, an example from my dos machine:

zoom:# sed "s/...." old.file > new.file
           ^      ^
           not ' but "

...well, I came up with a way to eliminate the problem, AND I managed to
simplify it too - though I could understand it would have been found by
someone else who knows more about regex parsing. I finally clicked when I
realised I could eliminate a whole range of characters - in fact everything
between a space and a squiggle =7E (~) in two ranges (leave out the < and >,
because they're the delimiters after all...) so I came up with the shorter
(and easier to remember) version in a matter of a few minutes of working
between two consoles on Linux... (gee, can't you tell I love it?)

~$ sed 's/<.[?-~\ !-;=]*>//g' origfile.html > newfile.txt

or for DOS, it's

sed "s/<.[?-~\ !-;=]*>//g" origfile.htm > newfile.txt

and it works quite well under DR-DOS when you replace the single quotes with
double quotes, otherwise OpenDOS throws a wobbly about Invalid Directory
specified - at least OpenDOS 7.02 does. Could someone confirm with MS-DOS or
Win95/DOS 7?

P.S. I was using the DJGPP version of sed (1.18)

And before anyone asks, I'll break it down here and now...

sed        <--- progname (gobvious, aye?)
' or "     <--- depending on OS (Unix = ', DOS = ")
s          <--- substitute command
/          <--- begin pattern
<          <--- first char to match
.          <--- any one character (so at least one char must follow the <)
[          <--- begin a set
?-~        <--- first subset
\          <--- an escaped space
!-;        <--- second subset of chars to eliminate
=          <--- the last character to eliminate (all between 0x20 and 0x7e
                except for the two angle brackets)
]          <--- close set
*          <--- any number of chars of the preceding set
//         <--- end pattern, and specify empty replace string
g          <--- do it for every match on each line, not just the first...
' or "     <--- depending on OS (Unix = ', DOS = ") finishes the command
old.html        original filename
>          <--- redirect (unless you want it all going to screen?)
new.txt         new filename

I haven't worked it out for removing stuff across lines yet (for removing
Javascript and Java...) but it's 22:15 NZDT here, my wife's getting my
tablets, and I'm off to bed - ahhh the pleasures of married life...

&@$#((^%% NO CARRIER

 /|   _,.:*^*:.,   |\           Cheers from the Viking family, including Marmalade
| |_/'  viking@   `\_| |        Running Linux and OpenDOS in Christchurch!
|    flying-brick    | $FunnyMail Bilbo : Now far ahead the Road has gone,
 \_.caverock.net.nz_/    5.39   in LOTR : Let others follow it who can!
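
As a quick check of the breakdown above, here is a minimal sketch that runs
the same expression on two made-up sample lines under a Unix-style shell. The
function name strip_tags and the sample input are illustrative assumptions,
not part of the original post, and LC_ALL=C is added only so that the two
ranges are read as plain ASCII ranges whatever the locale.

```shell
#!/bin/sh
# strip_tags is a hypothetical helper name; the sed expression is the one
# from the post. LC_ALL=C is an added assumption to keep ?-~ and !-; as
# plain ASCII ranges.
strip_tags() {
    LC_ALL=C sed 's/<.[?-~\ !-;=]*>//g'
}

# Tags that open and close on the same line are removed.
printf '%s\n' '<p>Hello <b>world</b></p>' | strip_tags
# prints: Hello world

# A tag split across two lines is not removed, because sed applies the
# expression to one line at a time and the set cannot cross a line break.
printf '%s\n' '<a href="x.html"' 'class="y">link</a>' | strip_tags
# prints: <a href="x.html"
#         class="y">link
```

The second case shows the limitation mentioned at the end of the post: only
the closing </a> is stripped, and the opening tag that spans two lines is
left behind in pieces.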