[dundee] Parse text file into CSV

Kris Davidson davidson.kris at gmail.com
Thu Dec 10 21:37:24 UTC 2009


Actually, never mind the UTF-8 stuff. I was hoping that would strip out
the smart quotes, but no luck.
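
For the smart quotes, something along these lines might do it. This is
only a sketch under a couple of assumptions: the files are taken to be
UTF-8 (if they came from Windows they may be CP1252, in which case the
decode call would need 'cp1252' instead), and straighten_quotes is a
made-up helper name, not part of the dir2csv script quoted below.

```perl
#!/usr/bin/perl
# Sketch only: map common "smart" quotes to their plain ASCII equivalents.
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original script).
# Expects a decoded character string, not raw bytes.
sub straighten_quotes {
    my ($text) = @_;
    # U+2018/U+2019 are curly single quotes, U+201C/U+201D curly double quotes.
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}

# Assumption: the input is UTF-8. These bytes spell Hello in curly double quotes.
my $bytes = "\xE2\x80\x9CHello\xE2\x80\x9D";
my $text  = straighten_quotes( decode('UTF-8', $bytes) );
print "$text\n";    # prints "Hello" (with straight double quotes)
```

In dir2csv the same call could presumably be applied to $dir->{$file}
before the split, but that placement is an assumption.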

2009/12/10 Gavin Carr <gavin at openfusion.com.au>:
> On Thu, Dec 10, 2009 at 02:56:31PM +0000, Kris Davidson wrote:
>> Okay, I'm looking to turn a directory of text files into a CSV file.
>> The CSV needs to have 3 columns: title, paragraph, text - with each
>> new file either being added to the row or getting a newline.
>>
>> An example text file would look like this (the number of paragraphs is
>> variable depending on the text file)
>>
>> -----------------------------------------------------------------------
>> Title
>>
>> first paragraph first paragraph first paragraph first paragraph first
>> paragraph first paragraph etc
>>
>> second paragraph second paragraph second paragraph second paragraph
>> second paragraph etc
>>
>> third paragraph third paragraph third paragraph third paragraph third
>> paragraph third paragraph third paragraph
>> -----------------------------------------------------------------------
>>
>> Column 1 needs no explanation, column 2 will be just the first
>> paragraph in the file, and column 3 will be all of the paragraphs with
>> line breaks replaced with something, say <BR>.
>>
>> Anyway, does anyone have any ideas, command strings or code?
>
> I'm procrastinating. :-) 20 lines of perl later ...
>
> --------------------------
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use Config::Directory;
> use Text::CSV_XS;
>
> die "usage: dir2csv <directory>\n" unless @ARGV == 1;
> die "'$ARGV[0]' is not a directory\n" unless -d $ARGV[0];
>
> my $dir = Config::Directory->new( $ARGV[0], { chomp => 0, trim => 0 })
>  or die "Load of directory '$ARGV[0]' failed: $!";
>
> my $csv = Text::CSV_XS->new;
>
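> for my $file (sort keys %$dir) {
>  # Split on blank lines: element 0 is the title, the rest are paragraphs.
>  # Line breaks inside each chunk become <br>.
>  my @paragraphs = map { s/\n/<br>/g; $_ } split /\n\n+/, $dir->{$file};
>  # Columns: title, first paragraph, all paragraphs joined by <br><br>.
>  $csv->combine( $paragraphs[0], $paragraphs[1], join("<br><br>", @paragraphs[1..$#paragraphs]) )
>    or die "combine failed: " . $csv->error_input;
>  print $csv->string . "\n";
> }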
> --------------------------
>
> Produces output like:
>
> Title,"first paragraph first paragraph first paragraph first paragraph first<br>paragraph first paragraph etc","first paragraph first paragraph first paragraph first paragraph first<br>paragraph first paragraph etc<br><br>second paragraph second paragraph second paragraph second paragraph<br>second paragraph etc<br><br>third paragraph third paragraph third paragraph third paragraph third<br>paragraph third paragraph third paragraph"
> "Now is the Time","Now is the time for all good men to come to the aim of the party.","Now is the time for all good men to come to the aim of the party.<br><br>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor<br>incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis<br>nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.<br>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu<br>fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in<br>culpa qui officia deserunt mollit anim id est laborum.  <br><br>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor<br>incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis<br>nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.<br>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu<br>fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in<br>culpa qui officia deserunt mollit anim id est laborum."
>
>
> Cheers,
> Gavin
>
