remove invisible control character from file

 
Chris Ernst
Ranch Hand
Hey there,

I need your help.
My XML file may contain invisible control characters in the ranges 0x00–0x1F and 0x7F–0x9F.
The XML file is like this:


There may also be some whitespace, empty lines or tabs (up to three in a row, like this > \t\t\t <).
It comes from the XML generator...
I need a solution to replace all the control chars. For this I wrote this:


So far so good, but now I have to add each char to the array.
Then I tried this one: private final String patternString = "[\\x{00}-\\x{1F}]"; which I found on https://stackoverflow.com/questions/26897810/using-java-regexes-to-match-a-range-of-unicode-code-points-outside-the-bmp-it?rq=1
But when I replace the match with patternString and remove the for-each loop, the result is that the whole document is full of control chars.
On https://www.regular-expressions.info/unicode.html I found the \p{Cc} pattern, which gives the same result...

Why? What is my mistake?

Cheers
Chris
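
For reference, here is a minimal sketch of how a \p{Cc} character class can be applied with java.util.regex. It is not the code from the post above, which was not preserved; the class name ControlCharRegex and the method strip are hypothetical, and keeping tab, line feed and carriage return is an assumption about what the output should look like.

```java
import java.util.regex.Pattern;

final class ControlCharRegex {
    // \p{Cc} is the Unicode "control" category: U+0000-U+001F and U+007F-U+009F.
    // The intersection (&&) with a negated class keeps tab, line feed and carriage return.
    private static final Pattern CONTROL_CHARS =
            Pattern.compile("[\\p{Cc}&&[^\\t\\n\\r]]");

    // Returns the input with every matched control character removed.
    static String strip(String input) {
        return CONTROL_CHARS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "<name>\tAda\u001F Lovelace\u0085</name>\n";
        System.out.println(strip(dirty));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every line or document.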
 
Campbell Ritchie
Marshal
There must be an easier way to do that, particularly if you consider that a char isn't a letter, but a number (unsigned 16‑bit integer). You can therefore apply operators like + - and < to it. Maybe < and > are the most useful at present. And let's remind ourselves of the meanings of the smaller Unicode characters. For chars > 0x80 try here. I cannot remember whether the left half of UTF‑8 code points does or does not fall in the range 7f...9f.
Please explain what you mean about the whole file being full of control characters. What sort of characters? How many? Remember that newline, return and tab are included in chars < 0x20.
Don't make the buffered reader a field. Make it a local variable. The class name regex isn't good because it doesn't represent a regex; try something like ControlCharacterFinder. Sorry it is a lot longer.

I suggest maybe the following to test your reading:-
  • 1: Pass a Charset object representing UTF‑8 to the FileReader's constructor.
  • 2: Record line numbers; there is a kind of Reader which automatically records line numbers.
  • 3: Divide each line into its individual chars. Strings have a method to do this.
  • 4: Iterate over the chars of each String, printing any chars in the ranges you want to monitor.
Something like this:-

Note the use of the unary + operator, which turns the char into an int. I wouldn't expect you to find any newline/return characters because they will be omitted by the reader.
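
The four steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the code originally posted; the class name ControlCharacterFinder is the one suggested in this post, while the method name find, the report format and the sample input are assumptions.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class ControlCharacterFinder {
    // Returns one entry per control character found, e.g. "line 2: 0x1f".
    static List<String> find(Reader source) {
        List<String> hits = new ArrayList<>();
        // LineNumberReader is the Reader that keeps track of line numbers (step 2).
        try (LineNumberReader reader = new LineNumberReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (char c : line.toCharArray()) {            // step 3
                    if (c < 0x20 || (c >= 0x7F && c <= 0x9F)) { // step 4
                        // Unary + promotes the char to an int, so it is treated as a number.
                        hits.add("line " + reader.getLineNumber()
                                + ": 0x" + Integer.toHexString(+c));
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        // For a real file (step 1): new FileReader(path, StandardCharsets.UTF_8), Java 11+.
        System.out.println(find(new StringReader("<a>\tok</a>\n<b>bad\u001F</b>\n")));
    }
}
```

Because readLine() strips the line terminators, only tabs and genuinely unexpected control characters show up in the report.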
     
    Campbell Ritchie
    Marshal

    Half an hour ago, I wrote: I cannot remember whether the left half of UTF‑8 code points does or does not fall in the range 7f...9f. . . .

    But I don't think that is going to be an issue.
    Once you have found the control characters, it shouldn't be too difficult to replace them.
     
    Tim Holloway
    Bartender
    You can use the Character.isISOControl() method to test for Unicode control characters. That includes all of the codes you mentioned.
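
A minimal sketch of that idea, assuming the whole text is already in a String. The class name IsoControlFilter and the method strip are hypothetical. Note that Character.isISOControl() is also true for tab, line feed and carriage return, so this version removes those as well.

```java
final class IsoControlFilter {
    // Character.isISOControl is true for U+0000-U+001F and U+007F-U+009F,
    // which includes tab, line feed and carriage return.
    static String strip(String input) {
        StringBuilder clean = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!Character.isISOControl(c)) {
                clean.append(c);
            }
        }
        return clean.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<a>one\u001Ftwo</a>"));
    }
}
```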
     
    Campbell Ritchie
    Marshal

    Tim Holloway wrote:. . . Character isISOControl() method . . . .

    Yes, that method does exactly what the OP wants
     
    Tim Holloway
    Bartender

    Campbell Ritchie wrote:

    Tim Holloway wrote:. . . Character isISOControl() method . . . .

    Yes, that method does exactly what the OP wants



    And, unless I miss my guess, it's an excellent excuse to code up a lambda.
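
One way such a lambda might look, as a sketch only: the class name LambdaControlFilter, the method strip and the choice to keep tabs are assumptions, not something stated in the thread.

```java
import java.util.function.IntPredicate;

final class LambdaControlFilter {
    // Removes every code point for which the predicate returns true.
    static String strip(String input, IntPredicate unwanted) {
        return input.codePoints()
                .filter(unwanted.negate())
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // The lambda decides what counts as unwanted: here, ISO controls except tab.
        IntPredicate unwanted = cp -> Character.isISOControl(cp) && cp != '\t';
        System.out.println(strip("<a>\tone\u001Ftwo</a>", unwanted));
    }
}
```

Passing the predicate in as a parameter makes it easy to change which characters are kept without touching the filtering code.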
     
    Paul Clapham
    Sheriff

    Chris Ernst wrote:in my XML file could be some invisible control character: 0x00–0x1F and 0x7F–0x9F.



    Well, you know, if that happens then you have a malformed XML file. The normal process would be to send the file back to whoever produced it, point out the error, and request that the production process be fixed. In the XML world it's not your responsibility to repair malformed XML.

    It's possible that you're getting the malformed XML from some customer who you have to be nice to. I've been in that position. But if it's politically possible to reject the malformed file then you should definitely do that.

    As for Campbell's solution: it's not necessary to read the file one line at a time. Lines have no meaning in the XML world. So reading the file one character at a time and discarding all of the bad characters should be sufficient. Seems to me that if you do it that way you don't need to mess about with complicated regexes, you just need something which tells you which characters aren't valid in XML documents.

    However... it looks to me like you're also wanting to remove tabs and line-feeds, which are valid XML characters. Those don't make the XML file malformed. So you should be able to deal with them in the code which deals with parsed XML, rather than writing special code to remove them in advance. Chances are that they are in the whitespace which you're ignoring anyway.
     
    Chris Ernst
    Ranch Hand
    Thank you!
    I'll try it all right now...
    I create the XML file with a program (it's an export from a database) to use it for an import into Apache Solr, which sends me an error when a control char is in the file.
    The main problem in my case is that it must be fast. The file is written to a temp folder and, after cleanup, it is sent to the right destination.
    The other problem is that we may not be able to see the chars (the biggest problem was that char 31 was in the text).

    Please explain what you mean about the whole file being full of control characters


    They can be in any of the "string" areas.

    What sort of characters? How many?


    I don't know, so I want to replace them all.

     
    Campbell Ritchie
    Marshal

    Paul Clapham wrote:. . . send the file back to whoever produced it

    Agree. I didn't realise the OP had been sent a malformed file.

    . . . reading the file one character at a time and discarding all of the bad characters should be sufficient.

    But wouldn't that mean using that abomination of a method, read(), or its overloaded brother? You are going to get the characters as ints, and you have to use while ((i = reader.read()) >= 0) ..., neither of which is really a problem, but you will get slower execution than reading the file buffered. That is why I thought of buffering the reading.
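
The two ideas can be combined: a BufferedReader serves read() from its internal buffer, so reading character by character does not have to mean unbuffered I/O. A minimal sketch, with the hypothetical class name StreamingControlFilter and method copyWithoutControls; keeping tab, line feed and carriage return is an assumption.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class StreamingControlFilter {
    // Copies in to out, dropping ISO control characters other than tab, LF and CR.
    // Both streams are closed when the copy finishes.
    static void copyWithoutControls(Reader in, Writer out) {
        try (BufferedReader reader = new BufferedReader(in);
             BufferedWriter writer = new BufferedWriter(out)) {
            int i;
            while ((i = reader.read()) >= 0) {   // read() returns -1 at end of stream
                char c = (char) i;
                if (!Character.isISOControl(c) || c == '\t' || c == '\n' || c == '\r') {
                    writer.write(c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter result = new StringWriter();
        copyWithoutControls(new StringReader("<a>\tone\u001Ftwo</a>\n"), result);
        System.out.print(result);
    }
}
```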

    . . . tabs and line-feeds . . . they are in the whitespace which you're ignoring anyway.

    Reading line by line will omit all the \r and \n sequences, but you will have to deal with the \t characters (\t is (char)0x0009, I think).
     
    Chris Ernst
    Ranch Hand
    I'll make it like this


    Now I only have to handle the double tabs in front of the XML element, but I think it's OK if the tabs are removed, because I hope that no special char is inside.
    But if one or more is inside, it has to be removed; the formatting is only there so that programmers can read the file clearly, and it looks great.

    Reading line by line will omit all the \r and \n sequences


    I don't create any of those, only the tabs.

    And again a big lovely THANK YOU.

    Now I'll run some benchmarks to see how fast it is.
     
    Campbell Ritchie
    Marshal

    A few minutes ago, I wrote: . . . I didn't realise the OP had been sent a malformed file. . . .

    And it now appears he wasn't. Sorry for not noticing that post sooner. There may be a fault in the program which reads the database, however. Please consider correcting that program. What is 31? Is that a decimal number? I always count in hexadecimal when I think of Unicode. If we look in the first Unicode chart I showed you, you will find this:-

    Unicode chart 0000...007f wrote:. . .
    001F <control>
    = INFORMATION SEPARATOR ONE
    = unit separator (US)
    . . .
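
If the 31 is decimal, it is the same code point as 0x1F in that chart. A small sketch to check the conversion; the class name UnitSeparatorDemo and the method describe are hypothetical.

```java
import java.util.Locale;

final class UnitSeparatorDemo {
    // Shows a code point in decimal and in Unicode hex notation,
    // together with the result of Character.isISOControl.
    static String describe(int codePoint) {
        return String.format(Locale.ROOT, "decimal %d = U+%04X, ISO control: %b",
                codePoint, codePoint, Character.isISOControl(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(31));
    }
}
```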

     
    Campbell Ritchie
    Marshal

    Chris Ernst wrote:I'll make it like this

    This would be better:

    Now I have only to handle the double tabs . . .

    What's a double tab? If it simply means \t\t, that's two characters. Your test will skip it and not change it.

    and again a big lovely THANK YOU . . .

    That's a pleasure

    Please consider what Tim H said about a λ.
     
    Tim Holloway
    Bartender
    There are 3 allowable control characters in XML 1.0: 0x0A (NL/LF), 0x0D (CR) and 0x09 (horizontal tab). You don't need to remove any of those to be valid XML.
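
A sketch of a check based on the Char production of the XML 1.0 specification, which allows those three control characters below 0x20. The class name XmlChars and both method names are hypothetical. Note that XML 1.0 formally allows 0x7F–0x9F (the specification only discourages most of them), so this check alone does not remove that range.

```java
final class XmlChars {
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes every code point that XML 1.0 does not allow.
    static String stripInvalid(String input) {
        return input.codePoints()
                .filter(XmlChars::isValidXml10)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalid("<a>\tone\u001Ftwo</a>"));
    }
}
```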
     
    Chris Ernst
    Ranch Hand
    There are only tabs inside, no CR or NL, but thanks for the hint, Tim.

    I have seen with the snippet from Campbell that there are only tabs inside.

    And it works now and I am very happy.
     
    Campbell Ritchie
    Marshal

    Chris Ernst wrote:There are only tabs inside no CR or NL . . .

    As I said earlier, if you use readLine(), you will remove all the \rs and \ns. Well done getting it to work.
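
A small sketch showing that behaviour of readLine(): the returned strings contain no line terminators, while tabs are left in place. The class name ReadLineDemo and the method lines are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class ReadLineDemo {
    // Splits text into lines the way BufferedReader.readLine() does:
    // \n, \r and \r\n end a line and are not part of the returned String.
    static List<String> lines(String text) {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lines("<a>\r\n\t<b/>\n</a>\n"));
    }
}
```

If the line breaks are needed in the output, they have to be written back explicitly after each line.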
     