I was using the regex package to format data read from an Excel spreadsheet. I wanted to collapse each run of contiguous whitespace into a single space, so I used the pattern "\\s{2,}" and replaced it with " ".
But I found that it only partially worked. While debugging I realised that some of the data used the Unicode "no-break space", whose char has the int value 160 and which the \s pattern does not match.
So I had to do something clumsy like char space = 160; pattern = "[\\s" + space + "]{2,}";
Now what I cannot understand is why the \s character class does not include this space (char 160). And how do I know that tomorrow, if I try to read data from another file system on another platform, I will not encounter a new space char? Does it not make my code platform dependent (otherwise \s should have handled all possible whitespace)?
just a thought, I am sure there is a better explanation
Originally posted by Stefan Wagner: Ascii(160) isn't always a kind of space. On DOS it is á AFAIK.
Does that mean that when I see a " " on screen on one platform, save it, and read it on another platform, I will see a "á"? Isn't that strange behaviour?
Does Java not promise platform independence? Is there no way to guarantee a common interpretation on all platforms?
PS: Forgive my ignorance, but I thought that in Java everything was converted to Unicode. Apparently I am wrong. [ May 16, 2005: Message edited by: Rajagopal Manohar ]
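For readers who want to reproduce the behaviour described in the question, here is a minimal sketch (the class and method names are hypothetical, not from the thread). It assumes a Java 7 or later runtime for the last check, since the `Pattern.UNICODE_CHARACTER_CLASS` flag did not exist when this thread was written. By default `\s` covers only the ASCII whitespace characters, and `Character.isWhitespace` deliberately excludes non-breaking spaces, while `Character.isSpaceChar` follows the Unicode space-separator category.

```java
import java.util.regex.Pattern;

class NbspCheck {
    // The no-break space: int value 160, Unicode code point U+00A0.
    static final char NBSP = '\u00A0';

    // Default \s is limited to space, tab, newline, vertical tab,
    // form feed and carriage return, so this is false for NBSP.
    static boolean matchedByDefaultWhitespaceClass() {
        return Pattern.matches("\\s", String.valueOf(NBSP));
    }

    // Assumes Java 7+: with this flag, \s follows the Unicode
    // White_Space property, which includes U+00A0.
    static boolean matchedByUnicodeWhitespaceClass() {
        return Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(String.valueOf(NBSP))
                .matches();
    }

    public static void main(String[] args) {
        System.out.println((int) NBSP);                        // 160
        System.out.println(matchedByDefaultWhitespaceClass()); // false
        System.out.println(Character.isWhitespace(NBSP));      // false
        System.out.println(Character.isSpaceChar(NBSP));       // true
        System.out.println(matchedByUnicodeWhitespaceClass()); // true
    }
}
```

The char value itself is the same on every platform once the data is inside a Java String; what can differ is how the bytes of a file are decoded into chars in the first place.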
Alan Moore
Ranch Hand
Joined: May 06, 2004
Posts: 262
Yes, Java uses Unicode internally, so character 160 will always be a non-breaking space (U+00A0) as far as Java is concerned. To match it, just add the Unicode escape for the character, \u00A0, to your character class.
If you're normalizing the whitespace, shouldn't you also be converting single linefeeds, tabs, NBSPs, etc. into space characters? That is, match any two or more consecutive whitespace characters, or any single whitespace character that isn't a space (ASCII 32).
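A minimal sketch of what such a normalization might look like in Java. The class name, method name, and the exact regex are assumptions written to fit the description above, not necessarily the patterns originally posted. The first alternative matches runs of two or more whitespace characters (with U+00A0 added to `\s`); the second uses character-class intersection (`&&`) to match a single whitespace character other than the plain space.

```java
import java.util.regex.Pattern;

class WhitespaceNormalizer {
    // Two or more whitespace characters (NBSP included), or one
    // whitespace character that is not the ordinary space (char 32).
    private static final Pattern MESSY_WHITESPACE =
            Pattern.compile("[\\s\\u00A0]{2,}|[\\s\\u00A0&&[^ ]]");

    // Replaces every match with a single ordinary space.
    static String normalize(String input) {
        return MESSY_WHITESPACE.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        String cell = "first\u00A0\u00A0second\tthird \n fourth";
        System.out.println(normalize(cell)); // first second third fourth
    }
}
```

A lone ordinary space matches neither alternative, so text that is already normalized passes through unchanged.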
Rajagopal Manohar
Ranch Hand
Joined: Nov 26, 2004
Posts: 183
If you're normalizing the whitespace, shouldn't you also be converting single linefeeds, tabs, NBSP's, etc. into space characters?