[Gllug] Extracting substrings in bash script

Wed Nov 7 00:30:16 UTC 2007

On 6 Nov 2007, dylan at dylan.me.uk spake thusly:

> Hi all,
>
> I'm writing a script to re-organise a pile of mp3 files. Basically, I need to 
> distribute them through an alphabetical directory tree. In a terminal, I get:
>
> dylan at hal:~> tagmp3 show test.mp3
> Test.mp3
>     Artist : The Streets
>     Title  : It Was Supposed to Be So Easy
>     Album  : A Grand Don't Come for Free
>     Track  : 1
>     Year   : 2004
>     Genre  : Other
>     Comment:
>
> But when I try:
>
> t=`tagmp3 show test.mp3`
> echo $t
>
> I get:
>
> Test.mp3 Artist : The Streets Title : It Was Supposed to Be So Easy Album : A 
> Grand Don't Come for Free Track : 1 Year : 2004 Genre : Other Comment:
>
> So, what are the many and various possibilities for extracting the Artist, 
> Title, Album and Track from that string?

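The trick is to do the extraction before the output gets flattened.
Backticks (or their more recent nestable incarnation, $()) only strip
trailing newlines; it's the unquoted $t in `echo $t' that gets
word-split, so every \n comes out as a space. (echo "$t" would have
kept the line breaks.)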
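
To check where the newlines actually disappear, here is a small sketch.
The tagmp3 function below is only a stand-in that prints the sample
output from the question, so the example runs without the real tool;
flat_lines and kept_lines are names made up for the illustration.

```shell
# Stand-in for the real tagmp3: prints the sample output quoted above.
tagmp3() {
    cat <<'EOF'
Test.mp3
    Artist : The Streets
    Title  : It Was Supposed to Be So Easy
    Album  : A Grand Don't Come for Free
    Track  : 1
    Year   : 2004
    Genre  : Other
    Comment:
EOF
}

t=$(tagmp3 show test.mp3)

# Unquoted: the shell splits $t into words and echo joins them with spaces.
flat_lines=$(echo $t | wc -l)
# Quoted: the newlines stored in $t are passed through untouched.
kept_lines=$(echo "$t" | wc -l)

echo "unquoted:" $flat_lines "line(s), quoted:" $kept_lines "line(s)"
```

The variable holds all the lines either way; only the unquoted
expansion loses them.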

I'd recommend:

artist="$(tagmp3 show test.mp3 | grep '^[[:space:]]*Artist *: ' | sed 's,^[^:]*: ,,')"
title="$(tagmp3 show test.mp3 | grep '^[[:space:]]*Title *: ' | sed 's,^[^:]*: ,,')"
album="$(tagmp3 show test.mp3 | grep '^[[:space:]]*Album *: ' | sed 's,^[^:]*: ,,')"
track="$(tagmp3 show test.mp3 | grep '^[[:space:]]*Track *: ' | sed 's,^[^:]*: ,,')"

(The ` *' before the colon matters: tagmp3 pads the field names, so the
real output has `Title  :' with two spaces, not one.)
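
If four near-identical lines grate, the same idea can be wrapped in a
function. This is only a sketch: tag_field is a made-up name, the grep
and sed are folded into a single `sed -n', and the tagmp3 function is a
stand-in that prints the sample output so the example is runnable
without the real tool.

```shell
# Stand-in for the real tagmp3: prints the sample output quoted above.
tagmp3() {
    cat <<'EOF'
Test.mp3
    Artist : The Streets
    Title  : It Was Supposed to Be So Easy
    Album  : A Grand Don't Come for Free
    Track  : 1
    Year   : 2004
    Genre  : Other
    Comment:
EOF
}

# tag_field FILE FIELD: print the value of FIELD from `tagmp3 show FILE`.
# sed -n ... p prints only lines where the substitution succeeded, so it
# selects the line and strips the "Field : " prefix in one step.
tag_field() {
    tagmp3 show "$1" | sed -n "s,^[[:space:]]*$2 *: ,,p"
}

artist=$(tag_field test.mp3 Artist)
title=$(tag_field test.mp3 Title)
album=$(tag_field test.mp3 Album)
track=$(tag_field test.mp3 Track)

printf '%s / %s / %s / %s\n' "$artist" "$album" "$track" "$title"
```

This still runs tagmp3 once per field, so it tidies the code without
making it any faster.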

If you want to avoid running tagmp3 four times (parsing some kinds of
tag is slow, although I think this is not true of media files), you
could do something like this (in bash):

shopt -s extglob    # *( ) patterns need this in a script
eval "$(tagmp3 show test.mp3 |
       while read -r LINE; do
           value="$(printf '%s\n' "$LINE" | sed 's,^[^:]*: ,,')"
           case $LINE in
               *( )"Artist"*( )": "*) printf 'artist=%q\n' "$value";;
               *( )"Title"*( )": "*) printf 'title=%q\n' "$value";;
               *( )"Album"*( )": "*) printf 'album=%q\n' "$value";;
               *( )"Track"*( )": "*) printf 'track=%q\n' "$value";;
           esac
       done)"

This is quite a lot more elaborate, but also a lot more efficient,
especially if there are many lines of output or you want to carry out a
lot of transformations (it's a bit over the top for just four but I
wanted to show the technique).

It runs the tagmp3 command exactly once, and loops over every line of
the result in turn, storing that line in the variable LINE. (The -r
stops read from treating any backslashes in the line as escape
characters, so we get it unchanged apart from leading and trailing
whitespace, which read strips.)
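
A small sketch of what -r changes, using a made-up tag line with a
backslash in it (the sample value `AC\DC' is invented for the
illustration and is not from the tagmp3 output above):

```shell
line='    Title  : AC\DC'

# With -r the backslash is kept as-is.
raw=$(printf '%s\n' "$line" | { read -r x; printf '%s' "$x"; })
# Without -r, read treats the backslash as an escape character and drops it.
cooked=$(printf '%s\n' "$line" | { read x; printf '%s' "$x"; })

printf 'with -r:    [%s]\n' "$raw"
printf 'without -r: [%s]\n' "$cooked"
```

In both cases the leading spaces are gone, because read strips leading
and trailing whitespace under the default IFS.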

It rips the value we want to keep off that line with the same sed
expression used above, and then matches each element we're interested
in in turn. We use case for this because, when you can use it, it is
one of the most efficient possible ways to match stuff: it's
shell-internal and avoids forking. The *( ) is a useful bashism that
matches 'zero or more spaces' (+( ) is 'one or more'); in a script it
needs `shopt -s extglob' first. You can do it POSIXly by just typing in
that fixed set of spaces, in quotes, but this is more robust if you can
rely on using bash. (zsh of course has about a million ways to do this;
ksh has yet other ways.)
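
A minimal sketch of how those extended patterns behave in a case
statement, assuming bash. The function names leading_star and
leading_plus are invented for the example.

```shell
# extglob has to be switched on before bash parses the patterns below.
shopt -s extglob

# *( ) matches zero or more spaces before the field name.
leading_star() {
    case $1 in
        *( )"Artist : "*) echo yes ;;
        *)                echo no ;;
    esac
}

# +( ) insists on at least one space.
leading_plus() {
    case $1 in
        +( )"Artist : "*) echo yes ;;
        *)                echo no ;;
    esac
}

for s in 'Artist : x' '    Artist : x'; do
    printf '[%s]  *( ): %s  +( ): %s\n' "$s" "$(leading_star "$s")" "$(leading_plus "$s")"
done
```

Since read has already stripped the indentation by the time case sees
the line, the zero-or-more form is the one that matches in the loop.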

Note that once it's got the value, it doesn't set a variable directly.
That won't do, because the entire while loop is on the right hand side
of a pipeline, and that means it's executed in a subshell. A subshell
gets a copy of its parent's variables, but nothing it sets is passed
back, so if we just set a variable it would get thrown away as soon as
we left the loop, which wouldn't be much use.
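
The effect is easy to reproduce. A minimal sketch, assuming bash with
its default settings (the input line is made up):

```shell
artist=unset

# The while loop is the last stage of a pipeline, so bash runs it in a
# subshell; the assignment happens there and is lost when the loop ends.
printf 'Artist : The Streets\n' | while read -r LINE; do
    artist=${LINE#*: }
done

echo "after the pipeline: $artist"
```

The parent's copy of the variable never changes, which is exactly the
problem described above.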

So instead we print *the text of a variable assignment* (properly
quoted), and run the entire pipeline inside an `eval', which is a
command which takes that output and executes it as if it were typed at
the shell.
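
For comparison, bash offers another way around the subshell that needs
no eval at all: feed the loop from a process substitution instead of a
pipe. This is a different technique from the one described above, shown
only as a sketch; the tagmp3 function is again a stand-in that prints
the sample output.

```shell
# Stand-in for the real tagmp3: prints the sample output quoted above.
tagmp3() {
    cat <<'EOF'
Test.mp3
    Artist : The Streets
    Title  : It Was Supposed to Be So Easy
    Album  : A Grand Don't Come for Free
    Track  : 1
    Year   : 2004
    Genre  : Other
    Comment:
EOF
}

# The loop reads from <(...) via redirection, so it runs in the current
# shell and plain assignments survive after "done".
while read -r LINE; do
    case $LINE in
        Artist*": "*) artist=${LINE#*: } ;;
        Title*": "*)  title=${LINE#*: } ;;
        Album*": "*)  album=${LINE#*: } ;;
        Track*": "*)  track=${LINE#*: } ;;
    esac
done < <(tagmp3 show test.mp3)

printf '%s / %s / %s / %s\n' "$artist" "$album" "$track" "$title"
```

The trade-off is portability: <(...) is not POSIX, whereas the eval
approach works with minor changes in other shells.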

But why the hell would anyone want to do this? It's a lot more complex
than the first example and depends on a lot more subtle shell behaviour.
There are two reasons.

Firstly, if you're doing a lot of processing for each line, that loop can
be much simpler than the alternative. I wrote something once which parsed
the output of an ftp session in a `while read' loop like that, handling
different output from a dozen varieties of FTP server.

Secondly, it's *fast*, or can be. As written, the loop still forks on
every iteration to execute sed: that slows it down more than anything
else.

Let's speed it up, and eliminate sed using ${VAR#...} prefix
elimination:

shopt -s extglob    # *( ) patterns need this in a script
eval "$(tagmp3 show test.mp3 |
       while read -r LINE; do
           case $LINE in
               *( )"Artist"*( )": "*) printf 'artist=%q\n' "${LINE#*( )Artist*( ): }";;
               *( )"Title"*( )": "*) printf 'title=%q\n' "${LINE#*( )Title*( ): }";;
               *( )"Album"*( )": "*) printf 'album=%q\n' "${LINE#*( )Album*( ): }";;
               *( )"Track"*( )": "*) printf 'track=%q\n' "${LINE#*( )Track*( ): }";;
           esac
       done)"

I ran that (with fake output) a thousand times. It took two seconds.

Beat *that*. :)

-- 
`Some people don't think performance issues are "real bugs", and I think 
such people shouldn't be allowed to program.' --- Linus Torvalds