catching regex group in repetition

 
Ranch Hand
Posts: 188

I have some text, say "hallo a s d java world". Here "a s d" is an acronym written with whitespace between its letters; it could just as well have been "a. s. d.". Whitespace between the individual letters of an acronym causes a lot of trouble in my application. An acronym has at least two letters, but there is no upper limit. I want the output to be "hallo asd java world". What regex should I use, and what should the capturing group be? I have tried "(\\s+)(([a-z](?:\\s+)){2,})". The first group is meant to preserve the leading whitespace; the second should concatenate everything matched across the repetitions, and that is the part I am struggling with. In the example above it retains only the last letter, 'd', whereas I want to retain 'asd'. Please help me work this out.
 
Marshal
Posts: 79177
Welcome to JavaRanch

Please supply more details, including the code you are using to parse that String.

By the way: if asd is an acronym, then "a s d" isn't an acronym. In which case it should not be possible to parse "a s d" or "a. s. d."

Acronyms consist of a single word without spaces or full stops in it. Really, "asd" isn't an acronym because it is awkward to pronounce as a word. "NATO" is an acronym because it is formed from the initials of 4 words and is pronounced Nay-to, not enn-a-tee-o.
 
Rahul P Kumar
Ranch Hand
Posts: 188
The idea of "a s d" as an acronym was just an example; you can take "NATO" instead. Suppose someone writes "hallo N. A. T. O. java world" somewhere in the text when they meant "N.A.T.O.". In such cases I want to find the pattern and recover the whole acronym, "N.A.T.O.". In my earlier example I had removed the dots for simplicity. With dots, my pattern looks like "(\\s+)(([A-Z](?:\\.?)(?:\\s+)){2,})". The intent is: one or more whitespace characters, followed by two or more repetitions of a letter, an optional dot, and one or more whitespace characters. The last two (dot and whitespace) are non-capturing groups.

NB: please do not focus on the fact that my previous post did not handle dots or upper/lower case; those are not the issue.

My code looks like this:

// requires java.util.regex.Pattern and java.util.regex.Matcher
private static final String COMPACT_ACRONYMS = "(\\s+)(([A-Z](?:\\.?)(?:\\s+)){2,})";

public static String compactSpacedAcronyms(String text) {
    Pattern p = Pattern.compile(COMPACT_ACRONYMS);
    Matcher m = p.matcher(text);
    // $1 is the leading whitespace; $3 is the repeated inner group
    text = m.replaceAll("$1$3");
    return text;
}


This code matches the pattern correctly, but for the replacement I need some trick to compact the acronym. I understand that it finds 'N.', 'A.', 'T.' and 'O.' individually, but each one overwrites the previous finding, so in the end '$3' prints only 'O.'. Is there any way to print 'N.A.T.O.', so that my final text becomes "hallo N.A.T.O. java world"?
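
The behaviour described above can be reproduced in isolation. The following is only a minimal sketch: the class name RepeatedGroupDemo and the helper groupOfFirstMatch are made up for illustration and are not part of the original code. It prints group 2 and group 3 of the first match, in square brackets so that trailing whitespace is visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Hypothetical helper: returns the text of the given group for the
    // first match of regex in text, or null if there is no match.
    static String groupOfFirstMatch(String regex, String text, int group) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String text = "hallo N. A. T. O. java world";
        String regex = "(\\s+)(([A-Z](?:\\.?)(?:\\s+)){2,})";

        // Group 2 wraps the whole repetition, so it holds the full run.
        System.out.println("[" + groupOfFirstMatch(regex, text, 2) + "]"); // [N. A. T. O. ]

        // Group 3 is the group being repeated, so it holds only the last iteration.
        System.out.println("[" + groupOfFirstMatch(regex, text, 3) + "]"); // [O. ]
    }
}
```

A capturing group inside a quantifier keeps only the text of its last iteration, which is why '$3' yields just the final letter. The outer group 2 still contains the complete spaced-out run, so it can be used as the starting point for compacting.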

 
Rahul P Kumar
Ranch Hand
Posts: 188
OK, it's done. I had to use a workaround.

Leaving aside upper/lower case and the dots, the code looks like this:

private static final String COMPACT_ACRONYMS = "(\\s+)(([a-z])(?:\\.?)(?:\\s+)){2,}";

public static String compactSpacedAcronyms(String text) {
    Pattern p = Pattern.compile(COMPACT_ACRONYMS);
    Matcher m = p.matcher(text);
    Pattern p1 = Pattern.compile("((?:\\s*)([a-z])(?:\\s*))");
    String tempText = null;
    if (m.find()) {
        tempText = m.group(); // capture the spaced acronym in a temp string
        System.out.println(tempText);
        Matcher m1 = p1.matcher(tempText);
        tempText = m1.replaceAll("$2"); // process this temp string further
        System.out.println(tempText);
    }
    text = m.replaceAll("$1" + tempText + "$1"); // replace original patterns with this temp string
    return text;
}
 
Rahul P Kumar
Ranch Hand
Posts: 188
Sorry, this has a serious bug: the first match found becomes the replacement for every occurrence, because the temp string is not recomputed for each match. How can I do that?
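
One possible way to recompute the replacement for each match is the Matcher.appendReplacement / appendTail loop. The following is a sketch under assumptions, not the fix that was eventually adopted: the class name AcronymCompactor and the modified pattern are illustrative. The pattern is changed so that the last letter does not swallow the trailing whitespace, and a lookahead checks that the last letter stands alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AcronymCompactor {
    // Group 1: leading whitespace.
    // Group 2: the whole spaced-out run, e.g. "N. A. T. O." or "a s d".
    // The lookahead requires whitespace or end of input after the last letter,
    // so the first letter of a following word is not taken as part of the run.
    private static final Pattern SPACED_ACRONYM =
            Pattern.compile("(\\s+)((?:[A-Za-z]\\.?\\s+)+[A-Za-z]\\.?)(?=\\s|$)");

    public static String compactSpacedAcronyms(String text) {
        Matcher m = SPACED_ACRONYM.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Recomputed for every match, so nothing carries over between matches.
            String compact = m.group(2).replaceAll("\\s+", "");
            // quoteReplacement guards against '$' or '\' in the replacement text.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + compact));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(compactSpacedAcronyms("hallo N. A. T. O. java world")); // hallo N.A.T.O. java world
        System.out.println(compactSpacedAcronyms("hallo a s d java world"));       // hallo asd java world
        System.out.println(compactSpacedAcronyms("the N. A. T. O. and u s a groups")); // the N.A.T.O. and usa groups
    }
}
```

Note that this sketch, like the original pattern, also joins genuine one-letter words that happen to sit next to each other, and it does not match a run at the very start of the text because the leading whitespace is required.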
 
Sheriff
Posts: 22783
Please Use Code Tags instead of colouring.
 
author
Posts: 23951
Topic restarted here...

https://coderanch.com/t/464471/Java-General/java/catching-regex-group-repitition


This topic will be locked.

Henry
 