[Sussex] Additional to "John's stupid question"

Sun Oct 12 15:58:26 UTC 2003

John,

OK - let's take this from the beginning.

There are two different things we need to do.

As is typical with IT, in order to do anything useful, we need to do both,
and we can only do the second once we've done the first... however, we need
to make decisions about HOW we're going to do the second thing now, since
those decisions will influence exactly WHAT we do in the first.

Step 1: Convert the articles to something readable on computer
Step 2: Work out how to search those articles so that we are able to
retrieve the pertinent ones

The issue is that there are three different ways of doing Step 1. I'll start
with the simplest:

Step 1: Option 1, Tag and Index

This is one of Tony's suggestions. You take one article at a time and give
it a simple reference number. You then read the article, type some basic
keywords into a database, and put the paper copy of the article into a
filing cabinet. When you then need to search for an article about Aardvarks,
your database tells you that you need to go to the file and pull out
articles 79 and 136. The key advantage - it's VERY easy to do. The key
disadvantages - 1: the keywording is a manual process. 2: you still need to
manually store the articles. Normal application: Something like a library,
where keywords are entered into a computer and the books themselves are kept
on paper on the shelf.
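
To make that concrete, here is a rough sketch of the keyword database in
Python using its built-in sqlite3 module. The table name, column names and
the tag()/find() helpers are made up for illustration, and the keywords
other than "aardvarks" are invented sample data:

```python
import sqlite3

# Hypothetical schema: one row per (article number, keyword) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (article INTEGER, keyword TEXT)")

def tag(article, words):
    """Record the keywords typed in for one paper article."""
    conn.executemany(
        "INSERT INTO keywords (article, keyword) VALUES (?, ?)",
        [(article, w.lower()) for w in words],
    )

def find(keyword):
    """Return the reference numbers of articles tagged with keyword."""
    rows = conn.execute(
        "SELECT DISTINCT article FROM keywords "
        "WHERE keyword = ? ORDER BY article",
        (keyword.lower(),),
    )
    return [row[0] for row in rows]

tag(79, ["aardvarks", "burrows"])
tag(136, ["aardvarks", "termites"])
tag(80, ["zebras"])
print(find("Aardvarks"))  # the numbers to pull from the filing cabinet
```

Any database would do the same job; sqlite3 is used here only because it
needs no server to be set up.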

Step 1: Option 2, Scan, Tag and Index

This is the next step up. Instead of putting the article on a shelf, you
scan it in, and save the file in some graphics format for later use. The bad
news is that simply scanning saves the article just as if it were a big
picture - it's not intelligent, so you still need to read through it and add
keywords. When you search for Aardvarks, it'll tell you that you need
articles 79 and 136, and offer to bring them up on screen. The key
advantages - 1: relatively easy to set up a system to do this, 2: no need to
keep reams of paper. The key disadvantage - the keywording is still a manual
process.
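
As a sketch of what each record might hold in this option (the Article
class, the file paths and the search() helper are all hypothetical, and the
sample keywords are invented), the only change from Option 1 is that each
reference number now also points at a scanned image file:

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    number: int        # reference number given to the article
    image_file: str    # hypothetical path to the scanned image
    keywords: set = field(default_factory=set)  # still typed in by hand

articles = [
    Article(79, "scans/0079.png", {"aardvarks", "burrows"}),
    Article(136, "scans/0136.png", {"aardvarks", "termites"}),
    Article(80, "scans/0080.png", {"zebras"}),
]

def search(keyword):
    """Return (number, image file) for every article tagged with keyword."""
    key = keyword.lower()
    return [(a.number, a.image_file) for a in articles if key in a.keywords]

for number, image_file in search("Aardvarks"):
    print(number, image_file)  # a real front end would open the image
```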

Step 1: Option 3, Scan, OCR and Index

The most complex of the three. You scan the article, THEN you pass it
through OCR (optical character recognition) software which turns it back
into words (with a greater or lesser degree of accuracy.) You can then use
"Full Text Indexing" so that every word becomes a searchable term. When you
search for Aardvarks, it brings up articles 79 and 136, but also 36, 54, 97,
104, and 156, all of which happened to mention them in passing. Key
advantages - 1: less human-intensive to get the information into the system,
2: picks up more references since it indexes every word, not just keywords.
Key disadvantages - 1: Still requires a reasonable amount of human input,
because articles seldom fill a whole page, hence you have to scan, and then
tell your software which areas to OCR (to avoid adverts etc.) 2: Integrating
with the OCR software is complex, 3: Can turn up far too many references.
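
For what "Full Text Indexing" means in practice, here is a minimal sketch.
It assumes the OCR step has already produced plain text for each article;
the sample texts and the build_index() name are invented for illustration.
Every word gets mapped to the set of articles it appears in:

```python
import re
from collections import defaultdict

def build_index(texts):
    """Map every word to the set of article numbers that contain it."""
    index = defaultdict(set)
    for number, text in texts.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(number)
    return index

# Invented sample OCR output, keyed by article reference number.
texts = {
    36: "The zoo also keeps aardvarks near the entrance.",
    79: "Aardvarks: a field guide to their burrows.",
    80: "Zebras of the savannah.",
}

index = build_index(texts)
print(sorted(index["aardvarks"]))  # includes the passing mention in 36
```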

Step 2: Index....

Rather hinges on step 1... If you are just indexing keywords, then any
number of databases will do, and you can easily put together a database
structure that maps article number to keywords... If, however, you are
OCRing the whole text, you need to be cleverer - indexing every word would
make the database too big, so you need to strip out common words like "the",
"a", "if", "then", "etc." and so on (usually called stop words). Also, you
need to think about whether a word appearing in a title is given a higher
weight than a word appearing in the main text, and other issues like that.
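
A small sketch of both ideas, stripping the common words and weighting
title words more heavily. The stop-word list, the weights of 3 and 1, and
the function names are arbitrary assumptions for illustration, not a
recommendation:

```python
import re
from collections import Counter

# Assumed values: a real system would use a longer list and tuned weights.
STOP_WORDS = {"the", "a", "if", "then", "and", "of", "to", "in"}
TITLE_WEIGHT = 3
BODY_WEIGHT = 1

def words(text):
    """Split text into lower-case words, dropping the stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def score(title, body):
    """Give each remaining word a weight; title words count for more."""
    scores = Counter()
    for w in words(title):
        scores[w] += TITLE_WEIGHT
    for w in words(body):
        scores[w] += BODY_WEIGHT
    return scores

s = score("The Aardvark", "If the aardvark digs, then the termites move.")
print(s.most_common())
```

Here "aardvark" scores highest because it appears in both the title and the
body, while "the", "if" and "then" never reach the index at all.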

The first thing you need to do is work out which approach to Step 1 you're
taking. I've listed three - there are many other variants, such as OCRing
and ALSO keyword tagging.

M.

----- Original Message ----- 
From: "John D." <big-john at dsl.pipex.com>
To: "LUG email list for the Sussex Counties" <sussex at mailman.lug.org.uk>
Sent: Sunday, October 12, 2003 4:22 PM
Subject: [Sussex] Additional to "John's stupid question"

>
> Erm,
>
> because I can't find anything about databases that makes sense to me, and
> notwithstanding what Tony posted in reply to my first Q,
>
> perhaps someone could enlighten me.
>
> My aim, is to get Clare's (my partner) pile of magazines, work sheets and
> other paper media that she uses for work and store it somehow on disc.
>
> As there's quite a lot of it, and especially with the magazines, it would
> be rather convenient to be able to vet it, i.e. loose the ad's and so on,
> so we are left with just the pertinent material.
>
> So, with that in mind, it seems that a database type "thing" would be the
> elegant solution.
>
> But when ever I surf for database stuff, I get lots of hits about MySQL
> and PostgreSQL that I don't follow. All sorts of things about how the two
> apps can do this or that, but as I don't think that I've ever seen
> database operations "in anger", I'm not sure if that's really what I need
> to do.
>
> Yes, I have seen some of the stuff that they do at my work, using MS
> access, but as that is all table related, I have no idea how, or if
> scanning Clares paperwork in would be a suitable thing to be able to
> "database".
>
> I have had one suggestion that I could possibly be able to scan stuff in
> as pdf, and while I know this is commonplace with documents, I haven't
> seen it done with images.
>
> I also understand that it could be just as easy to scan the "stuff" in
> and just burn it to cdr. Again, possible though it would mean having
> someway of checking an index to see what is on each disc. I would rather
> be able to open an application, enter a searchable keyword, and possibly
> see thumbnails of what is "thrown up".
>
> So is it some sort of "database" that I need? (also, this maybe because I
> don't understand what some/most/all of you understand by the term
> database).
>
> Hence the request for enlightenment.
>
> regards
>
> John
>
> p.s. sorry if it seems that I'm just repeating things that I've posted
> earlier, but the more I investigate this, the more confused I become.