[Gllug] Disk performance
Christian Smith
csmith at thewrongchristian.org.uk
Fri Nov 4 16:18:35 UTC 2011
On Thu, Nov 03, 2011 at 12:49:35PM +0000, Alain Williams wrote:
> A system that I look after: the dozen or so users report occasional 'slowdowns' of several seconds
> where there is little response. The work is a stock control/invoicing type app written using C-ISAM
> databases. 8GB RAM, 2 dual core AMD Opterons.
>
> The disks seem to be the bottleneck. If I pick a bad time, I see the iostat & dstat output shown below.
> I have straced a few applications and not seen fsync(), so there is no real excuse for slowing
> down on writes - there is plenty of RAM for buffer cache.
No fsync()s at all? I would hope your database applications are doing some sort of sync on transaction commits.
How about msync() calls?
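One way to check, assuming you can attach to a running process (the PID below is a placeholder):

    # trace only the sync-related system calls of a running process
    strace -f -e trace=fsync,fdatasync,msync,sync -p <PID>

If transactions commit without any of these showing up, the application is relying on the kernel to flush dirty pages whenever it sees fit.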
>
> I wonder if increasing the value in /sys/block/sda/device/queue_depth (currently 31) would help.
> queue_type contains 'simple' - worth tweaking?
What about sdb and sdc? Do they have the same queue_depth?
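A quick way to compare all three (a minimal sketch; adjust the device names if yours differ):

    # show the current queue depth and queue type for each disk
    for d in sda sdb sdc; do
        echo "$d: $(cat /sys/block/$d/device/queue_depth) $(cat /sys/block/$d/device/queue_type)"
    done

You can echo a new value into queue_depth to experiment, but whether the controller honours depths above 31 depends on the hardware.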
>
> iostat output:
>
> Time: 11:10:07
> Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 62.00 0.00 369.00 0.00 3456.00 9.37 131.06 343.04 2.71 100.10
> sdb 0.00 53.00 0.00 417.00 0.00 3912.00 9.38 90.88 219.61 2.40 100.10
> sdc 0.00 106.00 0.00 344.00 0.00 3800.00 11.05 144.33 402.43 2.91 100.10
>
> Time: 11:10:08
> Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 6.00 0.00 354.00 0.00 3000.00 8.47 136.61 417.13 2.83 100.10
> sdb 0.00 5.00 0.00 368.00 0.00 3008.00 8.17 50.60 156.20 2.72 100.10
> sdc 0.00 10.00 0.00 328.00 0.00 2864.00 8.73 140.25 417.20 3.05 100.10
>
> Time: 11:10:10
> Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 24.55 0.00 178.57 0.00 1639.29 9.18 59.50 406.08 3.73 66.56
> sdb 0.00 24.11 0.00 138.84 0.00 1310.71 9.44 11.31 89.51 2.68 37.23
> sdc 0.00 25.45 0.00 180.36 0.00 1660.71 9.21 55.93 404.14 3.10 55.98
>
>
> dstat:
>
> -----time----- --dsk/sda-- --dsk/sdb-- --dsk/sdc-- ----total-cpu-usage---- ----------------interrupts--------------- ---load-avg--- ---procs--- ---system--
> date/time | read writ: read writ: read writ|usr sys idl wai hiq siq| 1 4 12 50 169 209 225 | 1m 5m 15m |run blk new| int csw
>
> 02-11 11:10:07| 0 1700k: 0 1968k: 0 1880k| 0 0 27 72 1 0| 0 0 0 35 0 2296 1089 | 0.6 1 0.7| 0 4 0|4423 213
> 02-11 11:10:08| 0 1436k: 0 1492k: 0 1416k| 6 3 25 65 0 1| 0 0 0 20 0 2071 1006 | 0.6 1 0.7| 1 6 0|4096 145
> 02-11 11:10:10| 0 1752k: 0 1360k: 0 1736k| 1 1 50 48 0 0| 0 0 0 47 0 1965 1137 | 1.1 1.1 0.7| 0 0 0|5317 353
>
>
>
> The 3 disks are set up as RAID 1 (ie a 3 way mirror - partly due to history of disc failure due to SATA speed
> negotiation errors [resulting in too high a speed] that meant data timeouts -- all now fixed).
In RAID 1, writes will take as long as the slowest device, so any device servicing a read at the time will hold up writes. From your output above, though, nothing is reading at the time of the slowdowns. Perhaps it'd be worth removing a drive from the RAID and configuring it as a hot spare instead? That may make writes quicker by requiring only 2 drives to complete instead of 3; a sketch of how that might look is below.
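Something like this, assuming an md array at /dev/md0 and sdc1 as the member you drop (device names are placeholders):

    # fail and remove one mirror, shrink the array to a 2-way mirror,
    # then re-add the disk so it sits as a hot spare
    mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
    mdadm --grow /dev/md0 --raid-devices=2
    mdadm /dev/md0 --add /dev/sdc1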
Another good tool to try is vmstat, which reports swap activity. Perhaps the system is swapping periodically (yes, the system may swap even if plenty of memory is still available).
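For example:

    # one-second samples; non-zero si/so columns mean pages are
    # being swapped in/out
    vmstat 1

Bursts in the so column coinciding with your slowdowns would point at swap rather than the database I/O itself.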
From this page:
http://cherry.world.edoors.com/Cdx6RbRHlOjQ
There is a formula at the bottom which calculates the number of concurrent requests processed by a device as:
concurrency = (r/s + w/s) * (svctm/1000)
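For example, taking sda from your 11:10:07 sample: (0 + 369) * (2.71 / 1000) = 1.0.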
In your case every disk comes out at around 1 or less, so each disk appears to be serialising requests. However, this page:
http://www.xaprb.com/blog/2010/09/06/beware-of-svctm-in-linuxs-iostat/
implies svctm is unreliable, so treat that figure with some caution.
What is plain to see is that the devices are saturated with write requests, and from the figures it looks like they are random writes. The key figure is wrqm/s, the number of write requests merged per second before being issued to the drive (the kernel merges operations to adjacent sectors). There's lots of merging going on, so I would guess that several files are being sync'd at the same time, perhaps via msync(). Inode changes also go via the journal, which may be far from the file data being sync'd, resulting in lots of head movement; blktrace can confirm this, as sketched below.
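If you want to see the actual write pattern on one of the disks, blktrace will show every request with its sector number (run it during a slowdown; the device name is an example):

    # dump the request stream for sda to stdout and decode it
    blktrace -d /dev/sda -o - | blkparse -i -

Widely scattered sector numbers on the W (write) lines would confirm the random-write / journal-vs-data head movement theory.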
One more thing you might want to try is the "data=journal" ext3/4 filesystem option. That way, any synchronous data and its associated inode updates will be written in one big sequential journal write, deferring the in-place updates until later. You may end up writing more data in total, but it will be written quicker when it really matters. I've done this to great effect with SQLite databases (fsync heavy), practically doubling the transaction rate. A sketch of how to enable it is below.
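Assuming the databases live on their own filesystem (the device and mount point below are placeholders), either mount with the option:

    # /etc/fstab entry
    /dev/md0  /srv/data  ext3  data=journal  0  2

or bake it into the superblock as a default mount option:

    tune2fs -o journal_data /dev/md0

Note that the data= mode generally can't be changed on a live remount, so schedule a proper unmount and mount.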
Hope that helps,
Christian
--
Gllug mailing list - Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug