From: Brian Dessent
Date: Tue, 10 Jun 2008 19:04:45 -0700
To: René Berber
CC: cygwin AT cygwin DOT com
Subject: Re: Extra spaces in text files in cygwin

René Berber wrote:

> If you like to look at what it really is, try:
>
> $ od -tx2z Document.txt
> 0000000 feff 0054 0068 0069 0073 0020 0069 0073  >..T.h.i.s. .i.s.<
> 0000020 0020 0061 0062 0063 0020 0066 0069 006c  > .a.b.c. .f.i.l.<
> 0000040 0065 000d 000a                           >e.....<
> 0000046
>
> So your spaces are really null bytes (some fonts put little smileys), vi
> was wrong no CR in there.

Sure there is, 000d 000a is \r \n in UTF-16.

> As pointed out by Gary Johnson, `cat Document.txt` doesn't result in
> spaced text, it just shows "˙ŝThis is abc file" (this is using mrxvt and
> Bitstream Vera Sans mono font).

Those NUL bytes are still being printed; it's just that your particular combination of terminal and font doesn't show anything for them. They're still there in the output stream.

> Better use the file command to see what it is. And no, there are no
> converting software that I know of, Cygwin 1.5.x just doesn't support
> wide characters.

Sure there is: iconv.
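As a rough sketch of what that conversion might look like: the script below rebuilds a file like the one in the od dump and converts it to UTF-8. The temporary directory and the output file names are made up for illustration, and it assumes an iconv that accepts the encoding names UTF-8, UTF-16 and UTF-16LE (GNU iconv does).

```shell
# Scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d)

# Recreate a file like the one in the od dump: BOM (bytes ff fe),
# then the text as little-endian UTF-16, ending in CR LF.
printf '\377\376' > "$tmp/Document.txt"
printf 'This is abc file\r\n' | iconv -f UTF-8 -t UTF-16LE >> "$tmp/Document.txt"

# Plain "UTF-16" as the source encoding lets iconv pick the byte order
# from the BOM; the BOM is consumed rather than copied to the output.
iconv -f UTF-16 -t UTF-8 "$tmp/Document.txt" > "$tmp/Document.utf8.txt"

# The CR survives the conversion; drop it if Unix line endings are wanted.
tr -d '\r' < "$tmp/Document.utf8.txt" > "$tmp/Document.unix.txt"

cat "$tmp/Document.unix.txt"
```

The UTF-16 file comes out at 38 bytes, which is the 0000046 (octal) that od reports as the final offset above.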
And this is not a matter of Cygwin supporting or not supporting something -- that would be true if we were talking about wide characters in the filenames. But we're talking about the file's contents, and what an app does with the bytes is up to it, not Cygwin. For example, vi is a Cygwin app and can read the UTF-16 file just fine, displaying the characters as ASCII. So in this case it depends on the app, not the libc.

And as already stated, the Unix tradition is to use UTF-8 since it fits into the "a string is a null-terminated series of bytes" definition that is borrowed from C. But anyway you can freely transform anything to anything with iconv.

Brian
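For what it's worth, the point about null-terminated strings is easy to check from the shell. The following is only a sketch: it counts the NUL bytes in the same text encoded as UTF-16LE and as UTF-8. The variable names are made up, and it assumes an iconv that knows the names UTF-8 and UTF-16LE.

```shell
text='This is abc file'

# tr -cd '\000' deletes every byte except NUL, so wc -c counts the NULs.
# The second tr strips the padding some wc implementations add.
nul_utf16=$(printf '%s' "$text" | iconv -f UTF-8 -t UTF-16LE | tr -cd '\000' | wc -c | tr -d ' ')
nul_utf8=$(printf '%s' "$text" | tr -cd '\000' | wc -c | tr -d ' ')

echo "UTF-16LE: $nul_utf16 NUL bytes, UTF-8: $nul_utf8 NUL bytes"
# prints: UTF-16LE: 16 NUL bytes, UTF-8: 0 NUL bytes
```

Every ASCII character picks up one NUL byte in UTF-16, so a C-style string function would stop after the first character, while the UTF-8 form contains none.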