Special characters parsing in Java

N Naresh
Ranch Hand

Joined: Nov 04, 2008
Posts: 66
Hi, I have the following program.

When we run the above, we get the following output, in which an extra character (such as Â) is prepended to each special character.

<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title>Character Set Test</title>
</head>
<body>
<h2>Character Set Test</h2>

<table border="1" >
<caption>Character entity references in HTML 4.0</caption>

<col align="LEFT" />
<col align="CENTER" style="font-size:1.4em" />
<col align="LEFT" />
<col align="RIGHT" />
<col align="RIGHT" />
<col align="LEFT" />
<tr>
<th class="main" ><em>ISO 8859-1
characters</em><br />
</th>
</tr>

<tr>
<td>nbsp</td>
<td> </td>
<td>no-break space (non-breaking space)</td>
<td>160</td>
</tr>

<tr>
<td>iexcl</td>
<td>¡</td>
<td>inverted exclamation mark</td>
<td>161</td>
</tr>

<tr>
<td>cent</td>
<td>¢</td>
<td>cent sign</td>
<td>162</td>
</tr>

<tr>
<td>pound</td>
<td>£</td>
<td>pound sign</td>
<td>163</td>
</tr>

<tr>
<td>curren</td>
<td>¤</td>
<td>currency sign</td>
<td>164</td>
</tr>

<tr>
<td>yen</td>
<td>Â¥</td>
<td>yen sign (yuan sign)</td>
<td>165</td>
</tr>

<tr>
<td>brvbar</td>
<td>¦</td>
<td>broken bar (broken vertical bar)</td>
<td>166</td>
</tr>

<tr>
<td>sect</td>
<td>§</td>
<td>section sign</td>
<td>167</td>
</tr>

<tr>
<td>uml</td>
<td>¨</td>
<td>diaeresis (spacing diaeresis)</td>
<td>168</td>
</tr>

<tr>
<td>copy</td>
<td>©</td>
<td>copyright sign</td>
<td>169</td>
</tr>

<tr>
<td>ordf</td>
<td>ª</td>
<td>feminine ordinal indicator</td>
<td>170</td>
</tr>

<tr>
<td>laquo</td>
<td>«</td>
<td>left-pointing double angle quotation mark (left pointing
guillemet)</td>
<td>171</td>
</tr>

<tr>
<td>not</td>
<td>¬</td>
<td>not sign</td>
<td>172</td>
</tr>

<tr>
<td>shy</td>
<td>­</td>
<td>soft hyphen (discretionary hyphen)</td>
<td>173</td>
</tr>

<tr>
<td>reg</td>
<td>®</td>
<td>registered sign (registered trade mark sign)</td>
<td>174</td>
</tr>

<tr>
<td>macr</td>
<td>¯</td>
<td>macron (spacing macron, overline APL overbar)</td>
<td>175</td>
</tr>

<tr>
<td>deg</td>
<td>°</td>
<td>degree sign</td>
<td>176</td>
</tr>

<tr>
<td>plusmn</td>
<td>±</td>
<td>plus-minus sign (plus-or-minus sign)</td>
<td>177</td>
</tr>

<tr>
<td>sup2</td>
<td>²</td>
<td>superscript two (superscript digit two, squared)</td>
<td>178</td>
</tr>

<tr>
<td>sup3</td>
<td>³</td>
<td>superscript three (superscript digit three, cubed)</td>
<td>179</td>
</tr>

<tr>
<td>acute</td>
<td>´</td>
<td>acute accent (spacing acute)</td>
<td>180</td>
</tr>

<tr>
<td>micro</td>
<td>µ</td>
<td>micro sign</td>
<td>181</td>
</tr>

<tr>
<td>para</td>
<td>¶</td>
<td>pilcrow sign (paragraph sign)</td>
<td>182</td>
</tr>

<tr>
<td>middot</td>
<td>·</td>
<td>middle dot (Georgian comma, Greek middle dot)</td>
<td>183</td>
</tr>

<tr>
<td>cedil</td>
<td>¸</td>
<td>cedilla (spacing cedilla)</td>
<td>184</td>
</tr>

<tr>
<td>sup1</td>
<td>¹</td>
<td>superscript one (superscript digit one)</td>
<td>185</td>
</tr>

<tr>
<td>ordm</td>
<td>º</td>
<td>masculine ordinal indicator</td>
<td>186</td>
</tr>

<tr>
<td>raquo</td>
<td>»</td>
<td>right-pointing double angle quotation mark (right pointing
guillemet)</td>
<td>187</td>
</tr>

<tr>
<td>frac14</td>
<td>¼</td>
<td>vulgar fraction one quarter (fraction one quarter)</td>
<td>188</td>
</tr>

<tr>
<td>frac12</td>
<td>½</td>
<td>vulgar fraction one half (fraction one half)</td>
<td>189</td>
</tr>

<tr>
<td>frac34</td>
<td>¾</td>
<td>vulgar fraction three quarters (fraction three quarters)</td>
<td>190</td>
</tr>

<tr>
<td>iquest</td>
<td>¿</td>
<td>inverted question mark (turned question mark)</td>
<td>191</td>
</tr>

<tr>
<td>Agrave</td>
<td>À</td>
<td>Latin capital letter A with grave (Latin capital letter A
grave)</td>
<td>192</td>
</tr>

<tr>
<td>Aacute</td>
<td>Ã?</td>
<td>Latin capital letter A with acute</td>
<td>193</td>
</tr>

<tr>
<td>Acirc</td>
<td>Â</td>
<td>Latin capital letter A with circumflex</td>
<td>194</td>
</tr>

<tr>
<td>Atilde</td>
<td>Ã</td>
<td>Latin capital letter A with tilde</td>
<td>195</td>
</tr>

<tr>
<td>Auml</td>
<td>Ä</td>
<td>Latin capital letter A with diaeresis</td>
<td>196</td>
</tr>

<tr>
<td>Aring</td>
<td>Ã…</td>
<td>Latin capital letter A with ring above (Latin capital letter A
ring)</td>
<td>197</td>
</tr>

<tr>
<td>AElig</td>
<td>Æ</td>
<td>Latin capital letter AE (Latin capital ligature AE)</td>
<td>198</td>
</tr>

<tr>
<td>Ccedil</td>
<td>Ç</td>
<td>Latin capital letter C with cedilla</td>
<td>199</td>
</tr>

<tr>
<td>Egrave</td>
<td>È</td>
<td>Latin capital letter E with grave</td>
<td>200</td>
</tr>
</table>
</body>
</html>

whereas the characters on the original web page are different, as follows.

Character Set Test
Character entity references in HTML 4.0 ISO 8859-1 characters
nbsp no-break space (non-breaking space) 160
iexcl ¡ inverted exclamation mark 161
cent ¢ cent sign 162
pound £ pound sign 163
curren ¤ currency sign 164
yen ¥ yen sign (yuan sign) 165
brvbar ¦ broken bar (broken vertical bar) 166
sect § section sign 167
uml ¨ diaeresis (spacing diaeresis) 168
copy © copyright sign 169
ordf ª feminine ordinal indicator 170
laquo « left-pointing double angle quotation mark (left pointing guillemet) 171
not ¬ not sign 172
shy ­ soft hyphen (discretionary hyphen) 173
reg ® registered sign (registered trade mark sign) 174
macr ¯ macron (spacing macron, overline APL overbar) 175
deg ° degree sign 176
plusmn ± plus-minus sign (plus-or-minus sign) 177
sup2 ² superscript two (superscript digit two, squared) 178
sup3 ³ superscript three (superscript digit three, cubed) 179
acute ´ acute accent (spacing acute) 180
micro µ micro sign 181
para ¶ pilcrow sign (paragraph sign) 182
middot · middle dot (Georgian comma, Greek middle dot) 183
cedil ¸ cedilla (spacing cedilla) 184
sup1 ¹ superscript one (superscript digit one) 185
ordm º masculine ordinal indicator 186
raquo » right-pointing double angle quotation mark (right pointing guillemet) 187
frac14 ¼ vulgar fraction one quarter (fraction one quarter) 188
frac12 ½ vulgar fraction one half (fraction one half) 189
frac34 ¾ vulgar fraction three quarters (fraction three quarters) 190
iquest ¿ inverted question mark (turned question mark) 191
Agrave À Latin capital letter A with grave (Latin capital letter A grave) 192
Aacute Á Latin capital letter A with acute 193
Acirc Â Latin capital letter A with circumflex 194
Atilde Ã Latin capital letter A with tilde 195
Auml Ä Latin capital letter A with diaeresis 196
Aring Å Latin capital letter A with ring above (Latin capital letter A ring) 197
AElig Æ Latin capital letter AE (Latin capital ligature AE) 198
Ccedil Ç Latin capital letter C with cedilla 199
Egrave È Latin capital letter E with grave 200

Can anybody please suggest how to solve this problem? It is urgent.

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41604
Does the console (or wherever you're printing this) support those characters?

N Naresh
Ranch Hand

Joined: Nov 04, 2008
Posts: 66
The output is the same in any browser as well as in the Eclipse console.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41604
N Naresh wrote: The output is the same in any browser as well as in the Eclipse console.

What do you mean by "in any browser"? The Java code runs on the command line, not in a browser, right? And again, does the Eclipse console support those characters?

Also, while I'm not sure what "tidy.setCharEncoding(org.w3c.tidy.Configuration.UTF8)" does, if the numerical entities are replaced by their corresponding characters (is that what "tidy.setNumEntities(true)" does?), then "stream.toString()" uses the platform default encoding, which is likely neither ISO-8859-1 nor UTF-8.
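
To illustrate the point about encodings, here is a minimal sketch of how the extra "Â" can appear. It is not the original program (which is not shown in this thread); the class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It writes UTF-8 bytes into a ByteArrayOutputStream and then decodes them with a different charset, ISO-8859-1 here, standing in for whatever the platform default happens to be.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /**
     * Encodes the text as UTF-8 into a ByteArrayOutputStream, then decodes
     * those bytes as ISO-8859-1, i.e. with the wrong charset on purpose.
     */
    static String decodeUtf8BytesAsLatin1(String text) {
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        stream.write(utf8, 0, utf8.length);
        return new String(stream.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00A5 is the yen sign; in UTF-8 it is the two bytes 0xC2 0xA5.
        String garbled = decodeUtf8BytesAsLatin1("\u00A5");

        // Each byte becomes its own character: U+00C2 (Â) followed by U+00A5 (¥).
        System.out.println(garbled.length());               // 2
        System.out.println(garbled.equals("\u00C2\u00A5")); // true
    }
}
```

The characters are written as Unicode escapes so that the result does not depend on the encoding of the source file or the console.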
N Naresh
Ranch Hand

Joined: Nov 04, 2008
Posts: 66
I took that code from my original web project and tried to test it in the Eclipse console; in both cases the output I get is the same.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19682

N Naresh wrote:

You are using UTF-8 when reading the contents from the InputStream and writing to the ByteArrayOutputStream, but then you're using the system default encoding to convert that byte[] into a String. Try using this:
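
The snippet that belonged here is missing from this copy of the thread. The following is only a sketch of the kind of fix described, not the original code: it assumes `stream` is a ByteArrayOutputStream holding UTF-8 bytes, the class and method names are hypothetical, and StandardCharsets needs Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class Utf8StreamToString {

    /**
     * Decodes the stream's bytes explicitly as UTF-8 instead of relying on
     * the platform default charset used by the no-argument toString().
     * On older Java versions, stream.toString("UTF-8") does the same thing
     * but declares UnsupportedEncodingException.
     */
    static String toUtf8String(ByteArrayOutputStream stream) {
        return new String(stream.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        // Yen sign, space, capital E with grave, encoded as UTF-8.
        byte[] utf8 = "\u00A5 \u00C8".getBytes(StandardCharsets.UTF_8);
        stream.write(utf8, 0, utf8.length);

        String text = toUtf8String(stream);
        System.out.println(text.equals("\u00A5 \u00C8")); // true
        System.out.println(text.length());                // 3
    }
}
```

If the decoded String is later printed or sent in a response, that output step needs a matching encoding as well, otherwise the same symptom can reappear there.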


N Naresh
Ranch Hand

Joined: Nov 04, 2008
Posts: 66
Thank you, it seems to be working fine.
 