[dundee] Parse text file into CSV

Gavin Carr gavin at openfusion.com.au
Thu Dec 10 15:28:51 UTC 2009


On Thu, Dec 10, 2009 at 02:56:31PM +0000, Kris Davidson wrote:
> Okay I'm looking to turn a directory of text files into a CSV file.
> The CSV needs to have 3 columns: title, paragraph, text - with either
> each new file being added to the row or getting a newline.
> 
> An example text file would look like this (the number of paragraphs is
> variable depending on the text file)
> 
> -----------------------------------------------------------------------
> Title
> 
> first paragraph first paragraph first paragraph first paragraph first
> paragraph first paragraph etc
> 
> second paragraph second paragraph second paragraph second paragraph
> second paragraph etc
> 
> third paragraph third paragraph third paragraph third paragraph third
> paragraph third paragraph third paragraph
> -----------------------------------------------------------------------
> 
> Column 1 needs no explanation, column 2 will be just the first
> paragraph in the file, column 3 will be all of the paragraphs with
> linebreaks being replaced with something, say <BR>.
> 
> Anyway, does anyone have any ideas, command strings or code?

I'm procrastinating. :-) 20 lines of perl later ...

--------------------------
#!/usr/bin/perl

use strict;
use warnings;
use Config::Directory;
use Text::CSV_XS;

die "usage: dir2csv <directory>\n" unless @ARGV == 1;
die "'$ARGV[0]' is not a directory\n" unless -d $ARGV[0];

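# Load the directory as a hash of filename => file contents.
# chomp and trim are turned off so the text is not altered on load.
my $dir = Config::Directory->new( $ARGV[0], { chomp => 0, trim => 0 })
  or die "Load of directory '$ARGV[0]' failed: $!";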

my $csv = Text::CSV_XS->new;

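for my $file (sort keys %$dir) {
  # Blank lines separate paragraphs; the first chunk is the title.
  # Line breaks inside a chunk become <br>.
  my @paragraphs = map { s/\n/<br>/g; $_ } split /\n\n+/, $dir->{$file};
  # Columns: title, first paragraph, all paragraphs joined with <br><br>.
  $csv->combine( $paragraphs[0], $paragraphs[1], join("<br><br>", @paragraphs[1..$#paragraphs]) )
    or die "combine failed: " . $csv->error_input;
  print $csv->string . "\n";
}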
--------------------------

Produces output like:

Title,"first paragraph first paragraph first paragraph first paragraph first<br>paragraph first paragraph etc","first paragraph first paragraph first paragraph first paragraph first<br>paragraph first paragraph etc<br><br>second paragraph second paragraph second paragraph second paragraph<br>second paragraph etc<br><br>third paragraph third paragraph third paragraph third paragraph third<br>paragraph third paragraph third paragraph"
"Now is the Time","Now is the time for all good men to come to the aim of the party.","Now is the time for all good men to come to the aim of the party.<br><br>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor<br>incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis<br>nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.<br>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu<br>fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in<br>culpa qui officia deserunt mollit anim id est laborum.  <br><br>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor<br>incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis<br>nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.<br>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu<br>fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in<br>culpa qui officia deserunt mollit anim id est laborum."

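If those two CPAN modules aren't installed, the same idea can be sketched with core Perl only. This is a rough variant under a few assumptions, not a drop-in replacement: the helper names (text_to_row, csv_line, dir_to_csv) are made up for illustration, leading and trailing whitespace is stripped from each file before splitting (the script above leaves it alone), and every field is always quoted, so a bare title comes out as "Title" rather than Title.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn one file's text into the three columns
# (title, first paragraph, all paragraphs joined with <br><br>).
sub text_to_row {
    my ($text) = @_;
    $text =~ s/\A\s+//;                      # assumption: ignore leading blank lines
    $text =~ s/\s+\z//;                      # assumption: ignore trailing newlines
    my @paragraphs = split /\n\n+/, $text;   # blank lines separate paragraphs
    s/\n/<br>/g for @paragraphs;             # line breaks inside a paragraph
    my $title = shift @paragraphs;           # first chunk is the title
    $title = '' unless defined $title;       # empty file
    my $first = @paragraphs ? $paragraphs[0] : '';
    return ($title, $first, join('<br><br>', @paragraphs));
}

# Hypothetical helper: minimal CSV quoting. Always quote each field
# and double any embedded double quotes.
sub csv_line {
    my @fields = @_;
    s/"/""/g for @fields;
    return join ',', map { sprintf '"%s"', $_ } @fields;
}

# Hypothetical helper: one CSV line per plain file in the directory,
# in sorted filename order.
sub dir_to_csv {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open '$dir': $!\n";
    my @files = sort grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;
    my @lines;
    for my $file (@files) {
        open(my $fh, '<', "$dir/$file") or die "Cannot read '$dir/$file': $!\n";
        local $/;                            # slurp the whole file at once
        my $text = <$fh>;
        close $fh;
        push @lines, csv_line(text_to_row($text));
    }
    return @lines;
}

# Only run when given a directory argument.
if (@ARGV == 1) {
    print "$_\n" for dir_to_csv($ARGV[0]);
}
```

Run it the same way as the script above, with the directory as the only argument. For anything beyond simple text, Text::CSV_XS is still the safer choice for the quoting.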

Cheers,
Gavin



