[sclug] performance puzzle

Tom Dawes-Gamble tmdg at weardale.cl
Sat Jul 9 21:10:34 UTC 2005


On Sat, 2005-07-09 at 13:27 +0100, Alex Butcher wrote:
> On Sat, 9 Jul 2005, Tom Dawes-Gamble wrote:
> 
> > On Wed, 2005-01-19 at 22:57 +0000, Tom Dawes-Gamble wrote:
> >> Hi
> >>
> >> Here is an interesting situation.
> >>
> >> # time dd if=/mnt/tmp/foo of=/dev/null bs=1024
> >> 1048577+0 records in
> >> 1048577+0 records out
> >>
> >> real    0m8.793s
> >> user    0m2.346s
> >> sys     0m6.422s
> >> # time dd if=/mnt/tmp/foo of=/mnt/tmp/bar bs=1024
> >> 1048577+0 records in
> >> 1048577+0 records out
> >>
> >> real    0m27.614s
> >> user    0m3.117s
> >> sys     0m14.983s
> 
> Not grossly poor performance. After all, on the first run the data was read
> and thrown away; the second time it was read and then had to be written to
> disc (and, what's more, to the same filesystem from which the data was being
> read). I'd therefore expect at least a doubling of 'user' and 'sys' times,
> and the results you obtained are consistent with that.
> 

Yes, but it's the 'real' time I'm asking about.  Why does foo always come
out so much better in 'real' time?
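
Just to put numbers on it: reading foo to /dev/null takes 8.793s real
against 2.346 + 6.422 = 8.768s of CPU time, so the process is barely
waiting on the disc at all, whereas the foo -> bar copy takes 27.614s real
against 3.117 + 14.983 = 18.100s of CPU, leaving roughly 9.5 seconds spent
blocked on I/O.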

> I'd also query your experimental method; did you run these tests multiple
> times and take an average in order to limit experimental error? did you run
> this on a system in run level 1 (i.e. with nothing else running)? The latter
> would appear not to be the case, since the difference between 'real' time and
> user+sys suggests to me that the dd process was pre-empted by something
> else.
> 

The results are always consistent.  Sure, I was running at run level 5, but
I didn't pick quiet moments for the runs, so I think you can discount the
idea that pre-emption by other processes explains the difference.

I just re-ran the tests a moment ago and they still perform the same, again
at run level 5.
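
For what it's worth, repeating a run a few times is easy enough; something
like this would do, though the later passes will mostly be served from the
cache unless it is flushed in between:

# repeat the read three times to rule out one-off interference
for n in 1 2 3
do
time dd if=/mnt/tmp/foo of=/dev/null bs=1024
done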

> >> # time dd if=/mnt/tmp/bar of=/mnt/tmp/baz bs=1024
> >> 1048577+0 records in
> >> 1048577+0 records out
> >>
> >> real    0m54.358s
> >> user    0m3.076s
> >> sys     0m14.857s
> 
> This appears more interesting on the surface, but looking a little closer,
> the big increase between the last run and this run is the increase of 'real'
> time, rather than 'user' or 'sys'. I wonder if this result would have been
> different if you ran 'sync' or rebooted (and ran the dd of=/dev/null again,
> to be fair) between the second and third runs.
> 

Well, here are the results of doing that:

# time dd if=/mnt/tmp/bar of=/dev/null bs=1024
1048577+0 records in
1048577+0 records out

real    0m21.325s
user    0m0.427s
sys     0m2.392s
# time dd if=/mnt/tmp/bar of=/mnt/tmp/bar2 bs=1024
1048577+0 records in
1048577+0 records out

real    0m58.257s
user    0m0.611s
sys     0m8.953s

And now copying foo, without a dd to /dev/null first:

# time dd if=/mnt/tmp/foo of=/mnt/tmp/bar3 bs=1024
1048577+0 records in
1048577+0 records out

real    0m39.741s
user    0m0.639s
sys     0m10.338s
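
For future runs it would be quicker than rebooting to just flush and
remount the filesystem; I'd expect something like this to evict its cached
pages:

# flush dirty data, then drop the cached pages for /mnt/tmp by remounting
sync
umount /mnt/tmp && mount /mnt/tmp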



> >> # for i in /mnt/tmp/???
> >>> do
> >>> time md5sum $i
> >>> done
> >> 4763dafae2d293e98b62979a00308158  /mnt/tmp/bar
> >>
> >> real    0m21.199s
> >> user    0m3.679s
> >> sys     0m2.155s
> >> 4763dafae2d293e98b62979a00308158  /mnt/tmp/baz
> >>
> >> real    0m21.194s
> >> user    0m3.667s
> >> sys     0m2.163s
> >> 4763dafae2d293e98b62979a00308158  /mnt/tmp/foo
> >>
> >> real    0m5.770s
> >> user    0m3.685s
> >> sys     0m2.005s

These are the timings that, IMHO, really give the clue.  I've just re-run
the for loop and this is the result:

# for i in * ; do  time md5sum $i; done
4763dafae2d293e98b62979a00308158  bar

real    0m21.416s
user    0m3.006s
sys     0m1.247s
4763dafae2d293e98b62979a00308158  bar2

real    0m21.432s
user    0m3.022s
sys     0m1.241s
4763dafae2d293e98b62979a00308158  bar3

real    0m24.970s
user    0m2.986s
sys     0m1.281s
4763dafae2d293e98b62979a00308158  baz

real    0m21.172s
user    0m3.010s
sys     0m1.225s

4763dafae2d293e98b62979a00308158  foo

real    0m4.554s
user    0m3.201s
sys     0m1.341s
# 
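
Next time I'll probably leave vmstat running in another terminal while the
loop runs, to confirm that the extra 'real' time really is time spent
waiting on the disc; something like:

# sample CPU, I/O and wait figures once a second during the test
vmstat 1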

> >> # ll /mnt/tmp/???
> >> -rw-r--r--  1 root root 1073742848 Jan 19 16:33 /mnt/tmp/bar
> >> -rw-r--r--  1 root root 1073742848 Jan 19 16:34 /mnt/tmp/baz
> >> -rw-r--r--  1 root root 1073742848 Jan 19 16:31 /mnt/tmp/foo
> >>
> >>
> >> Why does foo perform so much better than bar or baz?
> 
> In this case, I'd suggest that foo is stored in a faster or more-dense zone
> of the disc. Is /mnt/tmp on a logical volume?
> 

Yes it is.  
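
If it would help settle that, I can compare where the files actually sit.
I assume something along these lines would show each file's block numbers
(paths are relative to the filesystem root) and which physical extents the
LV occupies:

# dump each file's inode details, including its block numbers
debugfs -R "stat /foo" /dev/mapper/VolGroup00-lvol0
debugfs -R "stat /bar" /dev/mapper/VolGroup00-lvol0
# show where the LV sits within the volume group
lvdisplay --maps /dev/VolGroup00/lvol0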

> > I didn't ever get a correct answer to this little puzzle I set in
> > January.  A number of people did offer the 'disk fragmentation' solution
> > but that is not the case here.
> 
> I see no evidence indicating that that is definitely not the case. On what
> basis are you drawing that conclusion?
> 

Well, the filesystem is brand new:
# df /mnt/tmp
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-lvol0
                      16126920   4271260  11036460  28% /mnt/tmp


I've not deleted any files from it, so it can't have become fragmented, and
as you can see there is plenty of free space.
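
If anyone doubts that, I believe filefrag would confirm how many extents
each file is split into:

# report the number of (dis)contiguous extents per file
filefrag /mnt/tmp/foo /mnt/tmp/bar /mnt/tmp/baz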

> > To add a little more information.
> >
> > The system has 1 Gig of ram.
> > The file system is an ext3 file system.
> > The file system was created mounted and then the program that originally
> > creates foo was run. The system was then rebooted so that no data would
> > be in the buffer cache. The file system was remounted and the commands
> > above run.
> >
> > So why do the files bar and baz perform so badly?
> 
> Note also that there appears to be some weirdness going on with various parts
> of the Linux kernel these days. Performance regressions have been reported
> with the IDE drivers, and I've personally noticed some very poor performance
> on RAID10 arrays hosted on LSI MegaRaid320 controllers that was only fixed
> by a) disabling the controller's built-in read-ahead and b) using blockdev
> --setra to crank the kernel's own read-ahead on /dev/sda, /dev/md? and all
> LVM LVs up to 4096 or 8192 or so.
> 

I can reproduce this problem on any kernel you like and any disk you like.
I've only used ext3 and ext2 on Linux, so I can't say whether the effect
shows up on other filesystem types.

The disk is not RAID or md, just a single partition on an ATA drive.  I've
done nothing to influence where the partition sits on the disk or where the
LV sits in the volume group.
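
Still, if I get a chance I'll try the read-ahead tweak; I assume it would
look something like this (guessing /dev/hda for the ATA disk):

# check the current read-ahead (in 512-byte sectors)
blockdev --getra /dev/hda /dev/mapper/VolGroup00-lvol0
# and crank it up to 4096 sectors as suggested
blockdev --setra 4096 /dev/mapper/VolGroup00-lvol0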

Best regards,
Tom.

-- 
There are 10 types of people in the world.
Those that understand Binary and those that don't.