Hi, I'm trying to read XML files generated by Microsoft Access. However, in one of the files there seems to be a problem parsing certain characters. Is it possible to specify the encoding somehow, or to avoid these problems another way? Code:
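The code the question refers to did not survive in this copy of the thread. A minimal sketch of what reading such a file might look like - the class name and the idea of forcing the encoding through a Reader are assumptions, not the original poster's code:

```java
import java.io.FileInputStream;
import java.io.InputStreamReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ReadAccessXml {
    // Parse an XML file, forcing a specific character encoding instead of
    // trusting the file's own (possibly wrong) XML declaration. Wrapping
    // the stream in a Reader makes the parser use our chosen charset.
    public static Document parse(String path, String encoding) throws Exception {
        InputSource source = new InputSource(
                new InputStreamReader(new FileInputStream(path), encoding));
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(source);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(args[0], "UTF-8");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```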
I'm thinking you need to find out if there really is an invalid XML character in the file. Try looking at the file using other XML-aware applications, like XMLSpy or even Internet Explorer - if these give you an error message, that will tell you the problem is with MS Access's XML conversion. You'll probably get more informed responses to this in the XML forum, so I'm moving this post there (follow the link at the top of the page.)
I recently had a similar problem trying to parse an XML file with "special" characters. Anyway, here is some code:
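The code itself is missing from this copy of the post. Based on the description later in the thread (converting characters to Unicode references), it would have been something along these lines - a sketch, not James's actual code:

```java
public class AsciiEscaper {
    // Sketch: replace every character above the 7-bit ASCII range with an
    // XML numeric character reference, so the output is pure ASCII and
    // immune to any confusion about the file's real encoding.
    public static String escapeNonAscii(String in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c > 127) {
                out.append("&#x").append(Integer.toHexString(c)).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```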
Hope this helps, James.
Joined: Jan 30, 2000
This thread was just referenced elsewhere, so I'll add some more now, months later.

I don't believe that character entities for illegal characters (e.g. control characters) are any more legal than the characters themselves. Some parsers may allow them, but they're illegal according to the XML spec. The list of allowed characters refers to parsed entities - meaning it tells you what's allowed after the character entities have already been interpreted as their equivalent Unicode characters. (Again, see the spec for a list of what's legal.) Note also that character values over Byte.MAX_VALUE are by no means illegal.

In James Swan's code, I suspect the problem he solved was that there was some confusion over what encoding was used in a file, so converting to Unicode references solved the problem. But Thomas Goorden's problem seems to be characters like a vertical tab (#x0B) or a start-of-text character (#x02). These are well under the Byte.MAX_VALUE limit, but quite illegal nonetheless. So the best solution is probably to replace them with spaces. From Thomas's comments here, it sounds like he's on the right track, but the problem is he can't successfully read the characters in the first place in order to replace them.

The exception UTFDataFormatException: invalid byte 3 of 3-byte UTF-8 sequence (0x3f) implies that Thomas has successfully created a reader that assumes UTF-8 encoding, but the data is not actually in UTF-8. Contrary to the Microsoft spec - what a surprise. The real problem seems to be finding out what encoding is really used. I recommend just concentrating on creating a reader that can read the whole file without throwing an exception - forget about parsing as XML until you can do that. Example:
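The example got separated from the post. A minimal sketch of the kind of reader described here - read the whole file with an explicit encoding and dump each character's numeric value so you can see exactly what was decoded (class and method names are made up):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class EncodingProbe {
    // Decode the entire file using the given charset and return the result.
    public static String readAll(String path, String encoding) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), encoding));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        in.close();
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] = file, args[1] = encoding to try (e.g. UTF-8, UTF-16, Cp1252)
        for (char c : readAll(args[0], args[1]).toCharArray()) {
            System.out.println(c + "\t0x" + Integer.toHexString(c));
        }
    }
}
```

Run it with each candidate encoding and look at the printed character values: if the accented characters come out as 0xe9 and friends, the decoding is right and only the display is wrong.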
Experiment with different encodings (UTF-16 is just one possibility to try) and see how your output looks. Another option is to make sure the file has a .xml extension, and then open it with Internet Explorer 5.50 or later. Go to View -> Encoding to specify a different encoding to use. (You may need to make sure you've got the appropriate fonts installed, if the file contains foreign characters.) You may not get as many encoding choices as are ultimately available in Java, but it's easy to use for a quick answer in many cases. Once you know what encoding is really used, try inserting a proper XML encoding declaration into the file, and try again to parse it. Your problems may suddenly go away. If not, go back to reading each char and replacing the illegal ones.
Joined: Aug 15, 2001
Jim,

First of all, thanks for the help - this is becoming quite the encoded nightmare (if you know what I mean). After my post I had already tried different encoding schemes for reading the file (partly thanks to your help), to the point where I can now just specify the encoding as a command-line option. However, none of these seem to help: Cp1251, UTF8 and UTF-16 all break on different characters (and judging from the hex content, I wouldn't think it's anywhere near UTF-16 - it also breaks on the very first character, which seems the worst to me). I'm now just trying to find out how they actually "tried" to encode it.

One weird thing lies in the accented characters (é, etc.), none of which get coded in their ASCII representation; instead é, for instance, appears as the byte sequence #xC3 #xA9. Other stuff, like "<", gets the hexadecimal #x.. treatment. UTF-8 only seems to _break_ on the VT characters (#x0B), but the accented characters don't get read properly either (although the parser keeps running) and stay in their "ugly" form.

IE 6.0 doesn't give me the option of changing the encoding through View -> Encoding, but it seems to indicate it's Unicode (but what kind? It doesn't specify). I've tried to read the files in one encoding and then store them in another (so I might get rid of the accents problem, after which - or before which - I could tackle the VT problem), but I don't get the characters out properly. To add insult to injury, Notepad as well as IE seem to render these accented characters properly, but are unwilling to save them in a proper manner. Aaargh.

I keep trying different combinations, but what would really interest me is what encoding would mess up the accents like that (for reference: UltraEdit, if you know it, doesn't render them properly, but Notepad does). Thanks!

Addendum: WordPad doesn't render the accents properly, but doesn't break on the file either...

[ April 19, 2002: Message edited by: Thomas Goorden ]
Thomas Goorden wrote: One weird thing lies in the accented characters (é, etc.), all of which don't get coded in their ASCII representation, but instead by a sequence of #xC3 #xA9 (just for the é).

That seems to be the correct UTF-8 representation of é. You can read a description of how UTF-8 works here. The Unicode value of é is 0xE9 - since it's > 0x7F, it uses the 2-byte format. Take 110xxxxx 10xxxxxx as the template in binary, and fill in the x's with the bits from E9 (11101001) - go right to left, and fill in 0 for the unused leftmost x's. You get 11000011 10101001, which is C3A9 in hex. Good evidence that you're dealing with UTF-8.

Thomas Goorden wrote: Other stuff, like "<", gets the hexadecimal #x.. treatment.

That's an additional level of encoding superimposed on the UTF-8. The XML parser will decode everything using UTF-8 first, then interpret the character entities. Not a problem.

Thomas Goorden wrote: UTF-8 only seems to _break_ on the VT characters (#x0B).

OK, track down whoever inserted vertical tabs into an XML doc, and shoot them. Repeatedly. You'll probably have to read all bytes using an InputStream (rather than a Reader) first, and nuke the vertical tabs - and any other illegal chars. Then pass the results on to an XML parser.

Thomas Goorden wrote: but the thing is that the accented characters don't get read properly either (although the parser keeps running) and stay in their "ugly" form.

I would guess that the problem here is that you're not looking at them in a medium capable of displaying them correctly. Write a loop to print the numeric Unicode value of each char after you decode it, and verify that you're getting a char value of E9 (233). If you are, then the problem isn't decoding, it's displaying.

Thomas Goorden wrote: IE6.0 doesn't give me the option of changing the encoding through "view -> encoding", but it seems to indicate it's Unicode (but what kind? It doesn't specify).

Strange - I use IE 6.0 at home, and I can change the encoding. Try changing the file extension to .html instead - that should be similar, but more flexible.
Also try looking for any encoding declaration, and deleting it. If IE finds an HTML file with no encoding declaration, it should let you change the assumed encoding.
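The InputStream-based clean-up described above can be sketched like this - a minimal version, with made-up names, that drops bytes which can never be part of a legal XML 1.0 character before any charset decoding happens (in UTF-8, all bytes of a multi-byte sequence are >= 0x80, so this never touches accented characters):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class IllegalByteFilter {
    // Remove control-code bytes (everything below 0x20 except tab,
    // newline and carriage return) from the raw byte stream.
    public static byte[] filter(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            boolean illegalControl =
                    b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
            if (!illegalControl) {
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

The cleaned byte array can then be handed to the XML parser via a ByteArrayInputStream.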
Jim, I can't thank you enough... You really got me out of this mess; I used a FileInputStream to "nuke" the VT characters (in fact, I had tried this with a Reader before, which didn't work, for which you gave me the solution right away). After this, I let the parser have a go at the result, with UTF8 encoding specified and now I am the happy owner of nicely parsed and rendered accented characters, with not one glitch... I really owe you one there. Did I mention how much I learned from this? A LOT!
Well, this is really well after the fact. But I sorted out my issues using this page, so I suppose other people may still be finding it as well. If so, here is the regex that I ended up using to screen for valid XML characters in my incoming file. I am checking character by character, using this pattern:
Pattern p = Pattern.compile("[\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]+");
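One way to use such a pattern in practice is to invert the character class and delete everything it matches; a sketch (the class name is made up - note that Java's \uXXXX regex escape takes exactly four hex digits, so this version simply allows the surrogate range through, which covers supplementary characters stored as surrogate pairs):

```java
import java.util.regex.Pattern;

public class XmlCleaner {
    // Everything OUTSIDE this class is illegal in XML 1.0 and gets removed.
    // Tab, LF, CR, the BMP minus control codes and FFFE/FFFF, and the
    // surrogate range (so supplementary characters survive as pairs).
    private static final Pattern INVALID = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uD800-\\uDFFF\\uE000-\\uFFFD]");

    public static String clean(String in) {
        return INVALID.matcher(in).replaceAll("");
    }
}
```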
Now, with the given solution, I'm able to get rid of these characters through pattern matching. But I don't want to lose these characters; I want to parse them somehow. Is there a way to do this? Please find below the code that I'm using:
File pdffile = new File(filePath);
// Set up output
FileOutputStream out = new FileOutputStream(pdffile);
As shown in tables of ASCII character values, those illegal characters are "control codes" - mostly left over from the days of 7-bit ASCII and teletypes. What rational character substitution could be made?
I would worry about how these characters got into your input text.
I'm not very sure how these characters came into my XML. But we deal with different languages, so I'm afraid removing such characters could mean a loss of data, as they might be relevant to other regions.
Do you know of some other parser that can parse such characters?
Those characters are NOT VALID in any XML document. If somebody put them in there, then they did it wrong. It's not your job to fix that -- not unless you know what they should have done instead, anyway. And XML parsers are required to follow the rules of XML, so no XML parser will accept those characters. Because they are against the rules of XML.
It's possible that those characters were put in there by some other software in the process of transferring the XML from wherever it was produced. Or perhaps they were put in there (deliberately or carelessly) when the XML was originally created. Either of those processes would produce XML which is not well-formed. So you should get the incorrect process fixed.
Or as Jim Yingst said ten years ago:
Jim Yingst wrote:OK, track down whoever inserted vertical tabs into an XML doc, and shoot them. Repeatedly.
Author and all-around good cowpoke
Joined: Mar 22, 2000
One likely source for characters illegal in XML is (wait for it)
yes! Microsoft WORD. If people edit in WORD and then cut and paste into a database entry, these nefarious characters lie in wait for the poor unsuspecting Java programmer trying to create XML from database entries.
I had to write the following to catch WORD "smart punctuation" - there is probably a better way but this worked for me.
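Bill's code didn't make it into this copy of the thread. A sketch of that kind of substitution, mapping the usual Windows-1252 "smart punctuation" characters to plain ASCII - which characters his version handled is an assumption:

```java
public class SmartPunctuation {
    // Flatten common Word "smart punctuation" to plain ASCII equivalents.
    public static String flatten(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '\u2018': case '\u2019': out.append('\''); break; // curly single quotes
                case '\u201C': case '\u201D': out.append('"');  break; // curly double quotes
                case '\u2013': case '\u2014': out.append('-');  break; // en/em dashes
                case '\u2026': out.append("...");               break; // ellipsis
                default:       out.append(c);
            }
        }
        return out.toString();
    }
}
```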
Joined: Feb 06, 2012
Thanks Bill and Paul.
Bill, the mentioned characters may be illegal for XML, but my problem was with characters such as 0xe, 0xf, etc.
What I found out is that there is a set of 32 control characters in ASCII which are not meant to be printed on screen, but only perform some special keyboard functions. You can find them here:
They're obsolete now and somehow coming into my XML. There are two solutions to it:
1. Remove them by pattern matching as mentioned in this thread. (best soln)
2. If you want to keep them (just like me), then change your XML version to 1.1 and escape the characters as:
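The escape expression itself seems to have been lost from the post; based on the surrounding description, it would look something like this sketch (XML 1.1, unlike 1.0, allows numeric character references to control characters other than U+0000):

```java
public class Xml11Escaper {
    // Turn an illegal control character ch into a numeric character
    // reference, which an XML 1.1 document may legally contain.
    public static String escape(char ch) {
        return "&#x" + Integer.toHexString(ch) + ";";
    }
}
```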
Here ch is the illegal character. After this you can unescape them using StringUtils unescapeNumericChar() method.
Hope this is useful to someone.
I can't imagine what useful purpose would be served by recreating these control codes in some future application reading your XML. The words "data time bomb" occur to me.
These are not just "keyboard functions" - many of them control printer output. Imagine sending a bunch of form feeds to a high-speed printer.
The issues brought up in this discussion are rather essential for me, and have been for quite a while.

As an example: how can you, as a legal authority, identify a person by a name, address, etc., if you meanwhile have to transpose between different character formats? Formats like Win 1251, EBCDIC, ASCII, Latin, Unicode. You really have to be strict and certain about the scope of the transposition.

For the time being I'm going to make a draft describing how to convert from Unicode to Latin and vice versa. I would be grateful for any links or hints that could help me with the work above.