Extract text lying between two patterns

 
Abhinav Srivastava
Ranch Hand
Posts: 354
Eclipse IDE Java Oracle
I have one file with content like



I need to create a new file which would look like



Basically, I need to find all occurrences of the text between <?xml> and </Product>.

I tried sed -n and awk range commands but they don't seem to give the desired output.


Any ideas?
 
Tim Holloway
Saloon Keeper
Pie
Posts: 18098
50
Android Eclipse IDE Linux
That's not going to produce a valid XML file. The <?xml ...?> declaration can appear only once in an XML stream, and only at the very start.

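As for the rest of it, the main reason your match fails is that "?" is a match control character in extended regular expressions, which is what AWK uses. So instead of matching "<?xml", it's looking for "xml" with an optional "<" in front of it. There you actually need to match "<\?xml". Sed's default basic syntax works the other way around: a plain "?" is literal, and in GNU sed it's "\?" that means "optional".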
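To see the difference for yourself, here is a small sketch using grep, which shares the basic/extended split with sed. The two sample lines are made up for illustration; they are not from the original file.

```shell
#!/bin/sh
# Hypothetical sample input: one plain line and one XML declaration.
sample='xml only
<?xml version="1.0"?>'

# ERE, unescaped "?": the "<" becomes optional, so both lines match.
ere_unescaped=$(printf '%s\n' "$sample" | grep -cE '<?xml')

# ERE, escaped "\?": a literal question mark, so only the declaration matches.
ere_escaped=$(printf '%s\n' "$sample" | grep -cE '<\?xml')

# BRE (the default for grep and sed): a bare "?" is already literal.
bre_plain=$(printf '%s\n' "$sample" | grep -c '<?xml')

echo "$ere_unescaped $ere_escaped $bre_plain"   # prints: 2 1 1
```

The counts are the number of matching lines in each mode.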
 
Abhinav Srivastava
Ranch Hand
Posts: 354
Eclipse IDE Java Oracle
I don't want it to be an XML doc, just a text file containing XML fragments. Actually it's not about XML at all, just the text.
My problem is that sed is spitting out the entire line where it finds the match, not just the text lying between the two patterns.
 
Tim Holloway
Saloon Keeper
Pie
Posts: 18098
50
Android Eclipse IDE Linux
Abhinav Srivastava wrote: I don't want it to be an XML doc, just a text file containing XML fragments. Actually it's not about XML at all, just the text.
My problem is that sed is spitting out the entire line where it finds the match, not just the text lying between the two patterns.


You can use parentheses to delimit match groups, like so:

<Product>(.*)</Product>

Then you can reference the match group by its group number. It's usually something like "$1" for the first group, "$2" for the second group - if you have multiple group patterns - and so forth. The exact form varies depending on the app/library doing the matching.
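As a rough illustration of match groups in sed, where the back-reference is spelled "\1" rather than "$1": the sketch below assumes both tags sit on the same line, and the sample line is invented. The -E switch selects extended syntax so the parentheses don't need backslashes, and "#" is used as the s-command delimiter so the slash in the closing tag needs no escaping.

```shell
#!/bin/sh
# Hypothetical one-line input with text on both sides of the tags.
line='noise <Product>Widget</Product> more noise'

# Replace the whole line with just the captured group, then print it.
# -n suppresses the default output; the trailing "p" prints only lines
# where the substitution actually happened.
inner=$(printf '%s\n' "$line" | sed -n -E 's#.*<Product>(.*)</Product>.*#\1#p')

echo "$inner"   # prints: Widget
```

This only helps when the start and end patterns are on one line; once the text spans several lines, sed needs the hold-buffer tricks.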

AWK is probably better for this than sed. Sed can be programmed to do it, but it requires various buffer tricks. AWK would be much simpler. Something vaguely like the following:



I'm out of practice with AWK, though, so expect to do some heavy tweaking to make it work.
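For anyone landing here later, one possible shape for the AWK approach is sketched below. It is not the code from the post above. It assumes the fragments start at a literal "<?xml" and end at a literal "</Product>", that either marker may sit in the middle of a line, and that fragments don't nest. The function name extract_fragments is made up. It uses index() for plain substring search, which sidesteps the "?" escaping problem entirely.

```shell
#!/bin/sh
# Print only the text from each "<?xml" through the next "</Product>".
extract_fragments() {
  awk '
    {
      line = $0
      # Keep blank lines that fall inside a fragment.
      if (inside && line == "") { print ""; next }
      while (line != "") {
        if (!inside) {
          start = index(line, "<?xml")
          if (start == 0) break          # nothing starts on the rest of this line
          line = substr(line, start)     # drop text before the start marker
          inside = 1
        }
        stop = index(line, "</Product>")
        if (stop == 0) {                 # fragment continues on the next line
          print line
          break
        }
        len = stop + length("</Product>") - 1
        print substr(line, 1, len)       # emit up to and including the end marker
        line = substr(line, len + 1)     # keep scanning the remainder of the line
        inside = 0
      }
    }
  ' "$@"
}
```

Usage would be something like `extract_fragments input.txt > fragments.txt`, where both file names are placeholders.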
 
rajesh thiru
Greenhorn
Posts: 2
Did you get this puzzle worked out?
 