

JavaRanch » Java Forums » Languages » Groovy

Wanted:Help converting utf-8 to XML entities

Siegfried Heintze
Ranch Hand

Joined: Aug 11, 2000
Posts: 388
I want to write a Groovy script (preferably a one-liner!) that reads its input from stdin, writes to stdout, and converts UTF-8 to XML entities in the &#dddddd; format. Optionally, it should perform the reverse operation too.

Support for UTF-16 and UTF-32 in addition to UTF-8 would be nice too.

Is this really a groovy scripting question or a "which JVM library do I use" question?

Thanks,
Siegfried
[ August 25, 2008: Message edited by: Siegfried Heintze ]
Gregg Bolinger
GenRocket Founder
Ranch Hand

Joined: Jul 11, 2001
Posts: 15300

Originally posted by Siegfried Heintze:
I want to write a groovy script (preferably a one-liner!)

I can write 10,000 lines of Java code on one line. That doesn't mean it's better. So let's worry about line counts when it matters, which is rarely.

Originally posted by Siegfried Heintze:
Is this really a groovy scripting question or a "which JVM library do I use" question?

Might not be either. Are you asking for help or for someone to do this for you? If it's the former, what have you tried so far? What specifically is giving you problems?


GenRocket - Experts at Building Test Data
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18987

Do you want everything converted to character entities, or only those characters outside of US-ASCII? The former seems improbable; the latter can be achieved by doing an identity XSL transformation and forcing the output encoding to be US-ASCII.

Not sure how you would do this in Groovy though.
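
For what it's worth, the identity-transformation idea can be tried from Groovy with the JAXP classes that ship with the JDK. The following is only a sketch under assumptions: the sample document is made up, and the exact form of the character references depends on the XSLT implementation on the classpath.

```groovy
import javax.xml.transform.OutputKeys
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

// Hypothetical sample document containing two non-ASCII characters (é and €).
def xml = '<?xml version="1.0" encoding="UTF-8"?><doc>caf\u00e9 \u20ac5</doc>'

// newTransformer() without a stylesheet returns an identity transformer.
def transformer = TransformerFactory.newInstance().newTransformer()

// Forcing US-ASCII output makes the serializer escape anything it cannot encode.
transformer.setOutputProperty(OutputKeys.ENCODING, 'US-ASCII')

def input = new ByteArrayInputStream(xml.getBytes('UTF-8'))
def output = new ByteArrayOutputStream()
transformer.transform(new StreamSource(input), new StreamResult(output))

def result = output.toString('US-ASCII')
println result
```

This only works when the input is a well-formed XML document, since it goes through a real XML parser. For a stdin-to-stdout filter, System.in and System.out could be passed to StreamSource and StreamResult instead of the byte-array streams.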
Matthew Taylor
Rancher

Joined: Jun 13, 2004
Posts: 110
I may be able to help you if you gave examples. What do you mean by UTF-8 to XML?


Grails Consultant
http://dangertree.net
Siegfried Heintze
Ranch Hand

Joined: Aug 11, 2000
Posts: 388
OK, here are some more specific questions:

(1) Is there a library function that will read UTF-8 (or UTF-16) from a disk file and convert each character, including the multi-byte sequences, into a single value (perhaps an integer, a Java char, or a Java String element, although the latter two would not work so well for UTF-32)? Ideally this function would read an entire record or file into a string, I think.

(2) How would I implement a regular expression search and replace that looks for all the characters between 0x7f and 0xffffff, casts them to an integer, uses toString() to get their representation in ASCII digits '0'-'9', and (finally) prepends "&#" and appends ";"?

(3) How would I write a script to perform the inverse operation: read an ASCII XML document, search for all the patterns "&#([0-9]+);", replace each match with the single wide character that the first group denotes, and write it to a UTF-8 or UTF-16 stream? Is there a library function for writing UTF-8 or UTF-16 representations of my wide character strings? Is there a library function to write UTF-32? I wonder what it would accept (an integer array, perhaps?).

Thanks!
Siegfried
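
On questions (1) and (3), the standard java.lang.String and Character methods already cover most of this. The sketch below uses a made-up sample string and works on in-memory bytes rather than a file; it shows decoding with an explicit charset, walking the text one code point at a time so that characters outside the 16-bit range come out as a single integer, and encoding the same text as UTF-16 or UTF-32.

```groovy
// Sample text: é (U+00E9), € (U+20AC) and U+1D11E, which needs a surrogate pair.
def text = 'caf\u00e9 \u20ac \uD834\uDD1E'

// Encode to UTF-8 bytes and decode them again with an explicit charset name.
byte[] utf8Bytes = text.getBytes('UTF-8')
def decoded = new String(utf8Bytes, 'UTF-8')

// Walk the string by code point, so a surrogate pair counts as one character.
def codePoints = []
int i = 0
while (i < decoded.length()) {
    int cp = decoded.codePointAt(i)
    codePoints << cp
    i += Character.charCount(cp)
}
println codePoints   // prints [99, 97, 102, 233, 32, 8364, 32, 119070]

// The same string can be encoded with other charsets for writing.
byte[] utf16Bytes = text.getBytes('UTF-16')
byte[] utf32Bytes = text.getBytes('UTF-32')
```

A Java char is a 16-bit UTF-16 unit, which is why the loop advances by Character.charCount(cp) instead of by one. The 'UTF-32' charset name is assumed to be available in the running JDK.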
Matthew Taylor
Rancher

Joined: Jun 13, 2004
Posts: 110
Here is an example of how to read a file directly into text:
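
A minimal sketch of reading a file into a String with the Groovy JDK, not necessarily the snippet originally posted. The temporary file and its contents are assumptions, there only to keep the example self-contained.

```groovy
// Hypothetical input: a temporary file written as UTF-8.
def file = File.createTempFile('sample', '.txt')
file.deleteOnExit()
file.write('caf\u00e9 \u20ac', 'UTF-8')

// getText(charset) reads the whole file and decodes it in one call.
String text = file.getText('UTF-8')
println text.length()   // prints 6: the file holds 9 bytes but 6 characters
```

Passing the charset explicitly avoids depending on the platform default encoding. For a stdin filter, System.in.getText('UTF-8') reads the whole input stream the same way.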



And here is how to replace regex matches in a string:
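
A minimal sketch of a regex replacement with a closure, modelled on the "pre-"/"-post" behaviour described later in the thread rather than on the exact code originally posted. The sample input string is made up.

```groovy
def input = 'abc 123 def 45'

// replaceAll(regex, closure) calls the closure once per match and
// substitutes whatever the closure returns for that match.
def result = input.replaceAll(/\d+/) { val -> "pre-${val}-post" }

println result   // prints abc pre-123-post def pre-45-post
```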



So you can combine these (in one line) like this:
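
A sketch of the combined form, again an approximation rather than the original snippet. A ByteArrayInputStream with made-up content stands in for the real input here so the example runs on its own.

```groovy
// Stand-in for System.in, holding hypothetical sample input.
def source = new ByteArrayInputStream('order 66, aisle 7'.getBytes('UTF-8'))

// Read everything, then replace every run of digits in a single expression.
def converted = source.getText('UTF-8').replaceAll(/\d+/) { "pre-${it}-post" }

println converted   // prints order pre-66-post, aisle pre-7-post
```

As a stdin-to-stdout script, the same expression would read from System.in instead of the stand-in stream and pass the result to print.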



Of course, I have left the real regex work for you. This will only find simple numbers, not hexadecimal numbers.

[ August 30, 2008: Message edited by: Matthew Taylor ]
Siegfried Heintze
Ranch Hand

Joined: Aug 11, 2000
Posts: 388
Thanks! That is close.

How do I search for all the characters whose ordinal values are greater than 128 and replace them with the digit sequence from toString?

The above code searches for ordinal values that are between 48 and 57, which is not quite what I want.

Thanks!
Siegfried
Matthew Taylor
Rancher

Joined: Jun 13, 2004
Posts: 110
Originally posted by Siegfried Heintze:
How do I search for all the characters whose ordinal values are greater than 128 ...


Like I said, I'll leave the regex to you. You need a regular expression (the first parameter to the Groovy replaceAll() method on String) that matches the hexadecimal values you want. This isn't a Groovy question, but a regex question, and I'm no regexpert (sorry about the bad pun).

Originally posted by Siegfried Heintze:
... and replace them with the digit sequence from toString?

The above code searches for ordinal values that are between 48 and 57, which is not quite what I want.


In the 'replaceAll()' method of the Groovy String, you specify within the closure what to do with the matched value. So when I have val = "pre-${val}-post" in the closure for replaceAll, that means I'll be taking every match, and adding "pre-" to the front and "-post" to the back of it before putting it back into the original String. There is no messing with toString() anywhere.

Maybe I should clarify that the above code assumes the input string has hexadecimal values representing UTF characters, not the characters themselves. So all this regex replacing assumes that you're looking for the actual '0xFFFFFF' type value in the input string and replacing it with something else like '&#0xFFFFFF;'.

Also, the regex in my code above matches any string of digits. I don't know where you are getting '48 and 57'.
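
For matching the characters themselves rather than digits in the text, one possible regex is a negated ASCII range. The sketch below assumes that "non-ASCII" means anything above 0x7F; the closure name toEntities is made up for illustration.

```groovy
// Replace every character outside US-ASCII with a decimal character reference.
def toEntities = { String s ->
    // [^\x00-\x7F] matches any code point above 127. Java regexes match by
    // code point, so a surrogate pair arrives in the closure as one match.
    s.replaceAll(/[^\x00-\x7F]/) { String m -> "&#${m.codePointAt(0)};" }
}

println toEntities('caf\u00e9 \u20ac')   // prints caf&#233; &#8364;
println toEntities('\uD834\uDD1E')       // prints &#119070;
```

The digits 48 to 57 mentioned earlier are the ASCII codes of '0' through '9', which is what a digit-matching regex effectively selects. Note that ASCII ends at 127, so this version also converts character 128, which a "greater than 128" test would leave alone.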
Marc Peabody
pie sneak
Sheriff

Joined: Feb 05, 2003
Posts: 4727

Code with test data:
String test = (('!'..140) as Character[]).join()
test = test.collect{ (it>128)?"&#${it as Integer};":it }.join()

Result: "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€&#129;&#130;&#131;&#132;&#133;&#134;&#135;&#136;&#137;&#138;&#139;&#140;"

*blows smoke from gun*
[ September 03, 2008: Message edited by: Marc Peabody ]
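
The reverse direction asked about earlier in the thread can be sketched the same way. This assumes only decimal references of the form &#ddd; appear in the input; the closure name fromEntities is made up for illustration.

```groovy
// Turn decimal character references back into the characters they denote.
def fromEntities = { String s ->
    // With one capture group, the closure receives the whole match and the group.
    s.replaceAll(/&#(\d+);/) { String whole, String digits ->
        // toChars yields one char, or a surrogate pair for code points above 0xFFFF.
        new String(Character.toChars(Integer.parseInt(digits)))
    }
}

def restored = fromEntities('caf&#233; &#8364; &#119070;')
println restored.length()   // prints 9: the last character takes two Java chars
```

The resulting String can then be written in any encoding, for example with file.write(restored, 'UTF-8') or restored.getBytes('UTF-16'). Character.toChars throws IllegalArgumentException for a number that is not a valid code point, so real input may need a guard. Decoding references such as &#38; also produces a bare & that is no longer escaped, which matters if the output must remain well-formed XML.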
