Ron McLeod wrote: Are you sure that the blank characters are actually spaces?
utf8Lines = Files.readAllLines(filePath, StandardCharsets.UTF_8)
        .stream()
        .filter(line -> !line.matches("^\\s*$")) // skip blank lines
        .collect(Collectors.toList());
Paul Clapham wrote: I recently spent quite a while trying to find out why two strings from two totally different external sources appeared to be identical but were actually not. The strings were only two words long; how could they be different? It turned out that one of them had a non-breaking space between the two words, and it took me a long time to find that.
Mike London wrote: Doesn't seem to be UTF-8, right?
More like UTF-16, with some kind of marker at the beginning.
Ron McLeod wrote: If you can determine which characters are causing you grief, you could remove them before using trim. U+00A0, U+2007, and U+202F are the likely ones.
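A minimal sketch of that suggestion, assuming the troublesome characters are exactly the three no-break spaces named above. The class and method names (NbspCleaner, trimWithNoBreakSpaces) are made up for illustration and are not from anyone's actual code.

```java
public class NbspCleaner {
    // U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+202F NARROW NO-BREAK SPACE
    private static final String NO_BREAK_SPACES = "[\\u00A0\\u2007\\u202F]";

    /** Replaces the three no-break spaces with ordinary spaces, then trims. */
    public static String trimWithNoBreakSpaces(String s) {
        return s.replaceAll(NO_BREAK_SPACES, " ").trim();
    }

    public static void main(String[] args) {
        String raw = "\u00A0\u2007 42 \u202F";
        // trim() alone leaves the no-break spaces in place
        System.out.println("[" + raw.trim() + "]");
        // after the replacement, trim() can do its job
        System.out.println("[" + trimWithNoBreakSpaces(raw) + "]"); // prints [42]
    }
}
```

Replacing rather than deleting keeps a no-break space between two words from gluing them together; whether that is what you want depends on the data.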
Ron McLeod wrote: Can you share a file (attachment to post, not text in post) which is problematic?
Ron McLeod wrote: It looks like that file is UTF-8 with a BOM sequence.
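If the file really is UTF-8 with a BOM, one possible approach is to drop the BOM right after reading, since Java's UTF-8 decoder passes it through as the character U+FEFF at the start of the first line. This is only a sketch; BomAwareReader, stripBom and readLines are hypothetical names, and the temporary file in main stands in for the real input.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BomAwareReader {
    private static final String BOM = "\uFEFF";

    /** Returns a copy of the lines with a leading U+FEFF removed from the first line, if present. */
    public static List<String> stripBom(List<String> lines) {
        List<String> result = new ArrayList<>(lines);
        if (!result.isEmpty() && result.get(0).startsWith(BOM)) {
            result.set(0, result.get(0).substring(1));
        }
        return result;
    }

    /** Reads a UTF-8 file and removes the BOM, if any. */
    public static List<String> readLines(Path path) throws IOException {
        return stripBom(Files.readAllLines(path, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Demo input: EF BB BF (UTF-8 BOM), then " 1" and "2" on separate lines
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        Files.write(tmp, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, ' ', '1', '\n', '2', '\n'});
        System.out.println(readLines(tmp));
        Files.delete(tmp);
    }
}
```

Once the BOM is gone, the remaining leading characters are ordinary spaces, so trim() on each line should behave as expected.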
I tried this and it did work. It helps show what the problem is, but is a bit of a hack and a better solution should be used.
Ron McLeod wrote: . . . remove them before using trim. U+00A0, U+2007, and U+202F are the likely ones. . . .
It is a long time since I used trim() and I might be mistaken, but I believe it doesn't remove such characters as hard space (\u00a0). Try String#strip() instead.
Mike London wrote: . . . a better solution should be used. . . .
Try the isBlank() method without bothering to trim or strip anything.
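One caveat worth checking before relying on strip() or isBlank(): as far as I can tell from the Javadoc, both are defined in terms of Character.isWhitespace(), which excludes the no-break spaces, and U+FEFF is not whitespace either. The small sketch below (WhitespaceCheck and stripsToEmpty are made-up names; Java 11 or later is assumed) prints how trim(), strip() and isBlank() treat a few single-character strings so you can see for yourself.

```java
public class WhitespaceCheck {
    /** True if strip() reduces the string to nothing. */
    public static boolean stripsToEmpty(String s) {
        return s.strip().isEmpty();
    }

    public static void main(String[] args) {
        // ordinary space, EM SPACE, NO-BREAK SPACE, BOM / ZERO WIDTH NO-BREAK SPACE
        String[] samples = {" ", "\u2003", "\u00A0", "\uFEFF"};
        for (String s : samples) {
            System.out.printf("U+%04X trim-empty=%b strip-empty=%b isBlank=%b%n",
                    (int) s.charAt(0), s.trim().isEmpty(), stripsToEmpty(s), s.isBlank());
        }
    }
}
```

So strip() does handle more than trim() (the EM SPACE, for instance), but a hard space or a BOM would still have to be removed explicitly.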
"Don't you love Encoding issues? . . ."
They are better than beer, aren't they?
I think you are right that the byte order mark is the real problem.
\ufeff \u0020 \u0020 \u0020 \u0020 \u0031
\u0032
\u0033
\u0034
Tried a few tweaks. All kludgy to the worst degree.
~/java$ java EncodingDemo.java /run/media/campbell/TOSHIBA/Trimming-issue/list1.txt
showChars() method
\ufeff \u0020 \u0020 \u0020 \u0020 \u0031
\u0032
\u0033
\u0034
Output unchanged.
Even better, guess what I got from JShell!
java$ java EncodingDemo.java /run/media/campbell/TOSHIBA/Trimming-issue/list1.txt
showChars() method
\u00a0 \u0020 \u0020 \u0020 \u0020 \u0031
\u0032
\u0033
\u0034
jshell> Character.isWhitespace((char)0xa0)
$1 ==> false
So the hard space doesn't seem to count as whitespace.
Now we have got what you want:
/java$ java EncodingDemo.java /run/media/campbell/TOSHIBA/Trimming-issue/list1.txt
showChars() method
\u0031
\u0032
\u0033
\u0034
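The source of EncodingDemo.java isn't shown in this thread, so the following is only a guess at what a showChars() method producing output in that format might look like. CharDump is a made-up class name, and the sample string in main stands in for the first line of the file.

```java
import java.util.StringJoiner;

public class CharDump {
    /** Renders each UTF-16 char of the line as a four-digit lowercase hex escape. */
    public static String showChars(String line) {
        StringJoiner joiner = new StringJoiner(" ");
        for (char c : line.toCharArray()) {
            joiner.add(String.format("\\u%04x", (int) c));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String firstLine = "\uFEFF    1"; // BOM, four spaces, digit one
        // Before: the BOM is visible at the front and blocks trim()
        System.out.println(showChars(firstLine));
        // After removing the BOM explicitly, trim() strips the spaces
        System.out.println(showChars(firstLine.replace("\uFEFF", "").trim()));
    }
}
```

Dumping the characters like this is a quick way to spot an invisible character that an editor or println won't show.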
Ron McLeod wrote: In the example file that ML posted, the BOM sequence was ef bb bf (UTF-8), not fe ff (UTF-16).
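To tell those byte sequences apart without a hex editor, a sketch along these lines could be used. BomSniffer and detectBom are hypothetical names; it only looks at the UTF-8 and UTF-16 marks and, as a simplification, ignores UTF-32, whose little-endian BOM also begins with ff fe.

```java
public class BomSniffer {
    /** Names the byte order mark at the start of the data, or "none". */
    public static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "none";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // mask to compare as unsigned byte values
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, ' ', '1'};
        System.out.println(detectBom(sample)); // prints UTF-8
    }
}
```

For a real file, the first few bytes from Files.readAllBytes() or an InputStream could be passed to detectBom.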