
Regex with Unicode Text (Devanagri Script)

Jasmine Arondekar
Greenhorn

Joined: Aug 27, 2009
Posts: 1
I am posting this topic in the "Java in General" forum; I don't know if this is the right place for it.

I am using regular expressions on Devanagari script files (Unicode text).
Here is the program:

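import java.awt.Container;
import java.awt.FlowLayout;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JTextField;

public class KonRegex extends JFrame implements ActionListener {

    Container cp;
    JTextField itxt;
    String kip = null; // regex typed into the text field

    public KonRegex() {

        cp = getContentPane();
        cp.setLayout(new FlowLayout());

        itxt = new JTextField(15);
        cp.add(itxt);

        JButton b1 = new JButton("View");
        b1.addActionListener(this);
        b1.setActionCommand("View");
        cp.add(b1);

        addWindowListener(new WindowAdapter() {
            public void windowClosing(WindowEvent e) {
                setVisible(false);
                dispose();
                System.exit(0);
            }
        });

        setSize(500, 400);
        setVisible(true);
    }

    public void actionPerformed(ActionEvent e) {
        if ("View".equals(e.getActionCommand())) {
            kip = itxt.getText();
            // String has no getCharacterEncoding() method (Java strings are
            // always UTF-16 in memory), so print the pattern text itself.
            System.out.println(kip);
            RegexMatch();
        }
    }

    // Find every match of the typed regex in Kon.txt and show each as a label
    public void RegexMatch() {
        String value = null;
        try {
            Pattern pat = Pattern.compile(kip, Pattern.CANON_EQ);
            Matcher match = pat.matcher(fileContent("Kon.txt"));
            while (match.find()) {
                value = match.group();
                cp.add(new JLabel(value));
            }
            validate();
        } catch (IOException ioe) {
            System.out.println("Error in io");
        }
    }

    // Read the whole file and decode it as UTF-16 into a CharSequence
    public CharSequence fileContent(String fname) throws IOException {

        FileInputStream f = new FileInputStream(fname);
        FileChannel fc = f.getChannel();

        ByteBuffer buf = fc.map(FileChannel.MapMode.READ_ONLY, 0, (int) fc.size());
        CharBuffer cbuf = Charset.forName("UTF-16").newDecoder().decode(buf);

        f.close();
        fc.close();
        return cbuf;
    }

    public static void main(String args[]) {
        KonRegex kt = new KonRegex();
    }
}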

1. The program throws a PatternSyntaxException for conjuncts made of 3 or more Devanagari characters, e.g. ध्वं, ल्ल्य
Am I doing something wrong in the program for this to happen?

2. Combined characters represented by multiple code points do not match certain regexes. For example, the expression स.र does not match प्रसार, as सा is a combination letter. Is there a way to handle these types of cases?

3. Any input on the above code, or any suggestions for related reading material, would be appreciated.
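
On question 1, the thread does not establish the cause, but one thing worth trying is to drop Pattern.CANON_EQ and normalize both the pattern and the file text yourself before matching. The sketch below assumes the exception comes from how CANON_EQ handles runs of combining marks (that is a guess, not something confirmed here). The class name NormalizedMatch and the helper countMatches are made up for illustration; java.text.Normalizer is the standard API (Java 6 and later).

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original program.
class NormalizedMatch {

    // Counts matches of regex in text after normalizing both to NFC,
    // compiling the pattern WITHOUT Pattern.CANON_EQ.
    static int countMatches(String regex, CharSequence text) {
        String nRegex = Normalizer.normalize(regex, Normalizer.Form.NFC);
        String nText = Normalizer.normalize(text, Normalizer.Form.NFC);
        Matcher m = Pattern.compile(nRegex).matcher(nText);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // U+0927 U+094D U+0935 U+0902 is the conjunct from question 1 (dha, virama, va, anusvara)
        String conjunct = "\u0927\u094D\u0935\u0902";
        String text = "x" + conjunct + "y" + conjunct;
        System.out.println(countMatches(conjunct, text));
    }
}
```

Normalizing the regex string is only safe when it contains no combining mark placed directly after a regex metacharacter; for plain Devanagari literals like the ones above that is not a concern.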
Sagar Rohankar
Ranch Hand

Joined: Feb 19, 2008
Posts: 2896

Welcome to JavaRanch!

First, please UseCodeTags when posting code.

Second, you're asking about a regex, but I don't see one in your post.

1. The program throws a PatternSyntaxException for conjuncts made of 3 or more Devanagari characters, e.g. ध्वं, ल्ल्य
Am I doing something wrong in the program for this to happen?

We need to see the regex that triggers it.

2. Combined characters represented by multiple code points do not match certain regexes. For example, the expression स.र does not match प्रसार, as सा is a combination letter. Is there a way to handle these types of cases?

Why? सा and स are two different letters, right? Why do you want to match them?
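
For what it's worth, at the code point level सा is स (U+0938) followed by the vowel sign ा (U+093E), so a plain java.util.regex pattern compiled without CANON_EQ treats the vowel sign as just another character. A minimal sketch of what that means for स.र against प्रसार is below. The class AksharaMatch and the helper firstMatch are invented for this example, and whether either pattern is what the original poster actually wants is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original program.
class AksharaMatch {

    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The word from question 2: U+092A U+094D U+0930 U+0938 U+093E U+0930
        String word = "\u092A\u094D\u0930\u0938\u093E\u0930";

        // '.' matches any single char, including the vowel sign U+093E,
        // so this finds U+0938 U+093E U+0930 inside the word.
        String dot = firstMatch("\u0938.\u0930", word);

        // \p{M}* allows any number of combining marks after the base letter,
        // which treats the base letter and its marks as one unit.
        String marks = firstMatch("\u0938\\p{M}*\u0930", word);

        System.out.println(dot != null && dot.equals(marks));
    }
}
```

If the goal is for `.` to stand for a whole base letter plus its marks, writing that unit as \P{M}\p{M}* (one non-mark followed by any marks) is a common workaround; it does not treat a virama-joined conjunct as a single unit.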

any suggestions for related reading material appreciated.

RE & Unicode
