XPath Encoding Problem

Paulo Carvalho
Ranch Hand

Joined: Nov 12, 2008
Posts: 57

I don't have much experience in this area, which is why I'm asking for your help.

I'm going to simplify my problem to explain it better:

I have an XML file with the following structure:

<?xml version="1.0" encoding="UTF-8"?>

In a Java class, I want to read the text of the "values" tags. Here is the Java method I use for that:

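// XPATHFACTORY, xpathQuery and doc are defined elsewhere in the class
final XPath xpath = XPATHFACTORY.newXPath();
String result = "";
final XPathExpression nodesXpath = xpath.compile(xpathQuery);

// Gets the first element matching the query
final Element nd =
    (Element) nodesXpath.evaluate(doc, XPathConstants.NODE);

// Reads its text content, if the element exists
if (nd != null) {
    result = nd.getTextContent();
}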

The obtained values are the following ones:

Value1: France
Value2: Grèce

As you can see, the 2nd one is garbled. What can I do to get it correctly?
(My XML file is already UTF-8 encoded, so I don't know what the problem is.)

Thanks in advance.
Best regards
Paul Clapham

Joined: Oct 14, 2005
Posts: 19973

When you say you "get" that value, exactly what do you mean by that? Show us how you are looking at it; it's possible you are using an incorrect charset in the process of looking.
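One way to take the console or editor out of the picture is to print the code points of the string rather than the string itself. Here is a minimal sketch; the class name CodePointDump and the method describe are made up for illustration, and the hard-coded string stands in for the value returned by getTextContent():

```java
public class CodePointDump {

    // Returns the Unicode code points of s in "U+XXXX" form, separated by spaces.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the value read from the document (assumption).
        String result = "Gr\u00e8ce";
        // The output is pure ASCII, so it looks the same in any console.
        System.out.println(describe(result));
    }
}
```

If the dump shows U+00E8 for the accented letter, the string inside the program is correct and only the display is wrong. If it shows U+00C3 followed by U+00A8, the damage happened earlier, when the document was read.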
g tsuji
Ranch Hand

Joined: Jan 18, 2011
Posts: 632
That looks very symptomatic of the application writing a character stream encoded in UTF-8 (e with grave accent is 0xC3 0xA8) which is then read either on a cp1252 console or in a text editor such as Notepad with "ANSI" encoding. If that's the case, the parsing and the output streaming are both working correctly. If instead the output stream had been using something other than UTF-8, such as cp1252, and you still got that on the console/Notepad, that would be a real problem: it would mean the original XML document was not read properly or was badly encoded. As I suspect the former, I would say it is good news: you simply need a UTF-8 console, or a text editor that supports UTF-8, to see the characters as they should look.
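To illustrate the effect, here is a small sketch that reproduces it. The class name MojibakeDemo and the method misread are hypothetical, and it assumes the JVM provides the windows-1252 charset (cp1252). It encodes the correct string as UTF-8, then decodes those bytes as cp1252, which is what a cp1252 console effectively does:

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes s as UTF-8, then decodes the bytes as windows-1252,
    // imitating a cp1252 console that receives UTF-8 output.
    public static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String correct = "Gr\u00e8ce";

        // Write to standard output as UTF-8 explicitly, instead of relying
        // on the platform default charset.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(misread(correct)); // the garbled form
        out.println(correct);          // correct only if the console decodes UTF-8
    }
}
```

The string held by the program is the same in both cases; only the charset used to display the bytes differs.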