[Sussex] Re: C programming help

Geoffrey J. Teale gteale at cmedltd.com
Wed Apr 27 09:01:07 UTC 2005


Captain Redbeard <hairy.one at virgin.net> writes:
>
> Probably true, but I don't know any other way of doing it.  I am not a trained programmer, I don't know
> any trained programmers, books only go so far and I don't currently have the resources to train
> professionally, so other than "hacking it" I don't see any other way
> of learning it.  

Well, now you know _several_ trained programmers.  It seems we have
at least a handful of experienced C programmers here.

I've mentioned it before, but if you're serious about learning C or
C++ to a professional standard then you could do a lot worse than join
the ACCU (the Association of C and C++ Users), read the magazines they
produce and take part in the mentored development programs and code clinics.

See http://www.accu.org


>True, designing
> and debugging "on the fly" as I am doing is not a very good way of writing software but I know a lot
> more about C today than I did five days ago and my code, messy as it may be, is a damn sight cleaner than
> what I had last week.  

This is testament to the fact that you can't learn something as
complex as programming without actually _doing_ it.  The point Thomas
was making (if I understood him correctly) is that just fiddling till
it works doesn't teach you anything unless you understand exactly
_why_ it worked.  

Unfortunately, not every situation allows you to understand exactly
what you're doing, but you should strive for as much understanding as
possible - the temptation to fiddle till it's fixed is a bad one.

Here's an example from my day to day existence right now.

The Dell Latitude X300 model seems to be unstable with kernel 2.6.10
in our Ubuntu-based build.  This seemed to be related to the
OOM-Killer.  I did some research and found out that ACPI modules
combined with certain hardware could cause the OOM-Killer to start
killing processes when it shouldn't have needed to run at all
(i.e. when plenty of VM was still available).

So we merrily shut off ACPI and handed the build back to testing.
Testing didn't see the bug for a couple of days, so we thought there
was a good chance that ACPI was the root of the problem.

However, we _need_ ACPI support for some of the functionality we
have, so it was suggested that we revert to an earlier kernel build
in which we hadn't seen these OOM-Killer messages but did have ACPI
support.

Now, as it turns out, we no longer see OOM-Killer messages, but the
machines become unresponsive and their hard disk light locks on.  In
this situation there is nothing at all in the logs to indicate an
error (not an OOPS or an ERROR or even a WARN).

The problem here is that we didn't really understand what the first
problem was.  We've tested by exception (i.e. if we don't see the
problem for /n/ days we believe it is gone), but that never gives us
confidence.  Is this the same crash?  Is the OOM-Killer a completely
innocent bystander?  We simply don't know, and we can't say with
confidence that we've made any progress.

So, I've been investigating again and now I have a theory.  Under
certain circumstances, setting:

/usr/sbin/hdparm -q -S 12 /dev/hda   # spin down after 60s idle (12 x 5s units)
/usr/sbin/hdparm -q -B 1 /dev/hda    # most aggressive APM power-saving level

... can cause a race condition that leaves the disk not spinning when
the kernel thinks it should be.  We can reproduce this by repeatedly
issuing:

/usr/sbin/hdparm -Y /dev/hda   # force the drive to sleep immediately

... which is effectively the same as what will happen after 60
seconds of disk inactivity once the other two commands have been
issued.  When ACPI support is turned on, the machine can issue those
hdparm commands in quite a number of power-saving scenarios.
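
As an aside for the C learners on the list: a "race condition" just
means two pieces of code whose combined result depends on timing.
Here's a toy sketch - nothing to do with the real kernel internals,
all the names are made up - that shows the shape of the problem in
plain C with POSIX threads:

#include <pthread.h>
#include <stdio.h>

static int disk_spinning = 1;   /* shared state, no lock protecting it */

static void *power_saver(void *arg)
{
    (void)arg;                  /* unused */
    disk_spinning = 0;          /* "spin the disk down" */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, power_saver, NULL);

    /* This read races with the write above: which one happens
       first depends entirely on scheduling, so the behaviour is
       not deterministic. */
    if (disk_spinning)
        printf("we think the disk is spinning\n");
    else
        printf("the disk has been spun down\n");

    pthread_join(t, NULL);
    return 0;
}

Compile it with gcc -pthread.  Whether the check or the update runs
first is entirely down to the scheduler, which is exactly what makes
bugs like ours so hard to reproduce.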

This time we test a specific fix and a situation that is better
understood, but still hard to reproduce (we're also testing some other
theories on different instances of hardware, but I'll leave that
complexity aside).

We still have to test by exception (how else can we test a situation
that cannot reliably be reproduced?), but if the machines survive for
a period that is many times longer than the typical time to failure
then we will feel more confident.


> Once I "hack" something together that actually works, no matter how crude it may
> be, at least I will have something, THEN I can work on a cleaner interface, better design, more robust
> code, etc.  At the moment I am at the stage where I don't know what I don't know, but putting together
> something that *should* work, per my understanding, only to find out that it doesn't, though it may not
> be the best way, is still a valid way of learning.  Sure, so I spent two hours swearing at my monitor
> because my program crashed randomly when I tried to run it, only to discover (duh!) that I hadn't thought
> of setting each element in an array of structures to NULL before using the array but (a) the techniques
> I used to discover that will be useful to me in future now that I know how to look for this kind of error
> and (b) I won't be making that particular error again.  Inefficient yes, but I don't know any other way
> of doing so short of a professional course and right now, as much as I would kill for the opportunity, I
> won't be doing one.
>
> Of course, if you do know a better way, ignore everything I've just said and tell me about it :)

See my note about the ACCU above.  Otherwise, keep reading, keep trying and keep asking questions.
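
Incidentally, on the crash you mention: automatic (stack) variables
in C are never zeroed for you, so an array of structures containing
pointers starts life full of garbage.  A minimal sketch of the idiom
(the names here are made up for illustration):

#include <stdio.h>

struct widget {
    char *name;               /* garbage unless you initialise it */
    int   count;
};

#define NWIDGETS 8

int main(void)
{
    /* Zero the whole array up front: the first element is set
       explicitly and the remaining elements are zero-initialised,
       so every name pointer starts out NULL. */
    struct widget table[NWIDGETS] = { { NULL, 0 } };
    int i;

    for (i = 0; i < NWIDGETS; i++)
        if (table[i].name == NULL)
            printf("slot %d is empty\n", i);

    return 0;
}

memset(table, 0, sizeof(table)) is the other common way to do the
same thing, and either one would have saved you those two hours of
swearing at the monitor.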


-- 
Geoff Teale
CMed Technology            -   gteale at cmedresearch.com
Free Software Foundation   -   tealeg at member.fsf.org



