Packrat - Information Storage

Packrat is a project of mine, initially intended to store all my email in one big, flat, searchable database. 'Virtual folders' will let me reuse searches easily, but their contents are built dynamically each time they are opened, which eliminates the need to sort data by hand. Data can of course be given additional sort criteria at any time, and it loses none of its original metadata when those criteria are added.
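
A minimal sketch of the virtual-folder idea, in Python. The Message fields, the VirtualFolder class, and the example.com addresses are all hypothetical, chosen only for illustration; this is not Packrat's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    sender: str
    subject: str
    tags: set = field(default_factory=set)  # extra sort criteria, added later


# The flat store: every message lives in one list, never in a folder.
store = [
    Message("paul@example.com", "killer app"),
    Message("alan@example.com", "personal proxy server"),
    Message("paul@example.com", "Re: killer app"),
]


@dataclass
class VirtualFolder:
    name: str
    query: Callable[[Message], bool]  # the saved search

    def open(self, messages):
        # Contents are computed when the folder is opened, never stored.
        return [m for m in messages if self.query(m)]


from_paul = VirtualFolder("From Paul", lambda m: m.sender == "paul@example.com")
print([m.subject for m in from_paul.open(store)])

# Adding a tag later leaves the original metadata untouched.
store[0].tags.add("packrat")
tagged = VirtualFolder("Packrat", lambda m: "packrat" in m.tags)
print([m.subject for m in tagged.open(store)])
```

Because a folder is only a stored query, a message can show up in any number of folders without being copied or moved.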

This is my response to a friend's questions about Packrat.

From jeske@... Mon Nov 30 16:50:25 1998
Date: Mon, 30 Nov 1998 16:50:25 -0800
From: David Jeske 
To: Paul Bleisch 
Subject: Re: killer app
Message-ID: <19981130165025.O5324@home.chat.net>
References: <8744DF3002FBD011BDDF000092970B465B95E8@iron.digitalanvil.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.94.13i
In-Reply-To: <8744DF3002FBD011BDDF000092970B465B95E8@iron.digitalanvil.com>; from Paul Bleisch on Mon, Nov 30, 1998 at 04:51:50PM -0600
Status: RO
Content-Length: 7331

On Mon, Nov 30, 1998 at 04:51:50PM -0600, Paul Bleisch wrote:
> I have been thinking about the work you did on your 
> uber-information manager (packrat).  Then, the other
> day I was reading something in some magazine where
> some columnist predicts that end users will want 
> personal firewalls in the next year.  I don't doubt
> that, but it made me think more.  On top of that,
> Oracle's Internet Server 8i (or whatever it is called)
> has moved to replace file system services, web, and
> application serving services from the OS to the DB.
> (Basically, there is a built in JavaVM, web server,
> and you can store java applets in the DB.)

Yeah, that new Oracle DB/filesystem thing is interesting. It can do
more than serve Java applets; it can act as a versioned NFS drive
with database-like searching capabilities.

> I really think that PackRat is the ultimate killer
> app for the wired.  If someone could set up a package
> that contained a good collection of goodies and a
> nice API to access the DB easily, information management
> becomes very easy. 

Agreed, and that was my motivation... I'd really like to see the
'hierarchical filesystem' go by the wayside altogether. After all,
everything in the world is relative to something else, not to "/". I'm
talking with someone at Be who is looking over their installation and
package management, and I'm going to see what they think of dropping
the hierarchy. (i.e. Be is a hierarchy of files which can have
optional attributes, whereas I'd like it to be a collection of files
with attributes, any of which could be a hierarchy)

> Specifically, I find myself wanting to do the following tasks daily.
> Some of these are the same as you had, but some are more complex.
> All of them seem to be solved problems, just not integrated.
> 
> o  e-mail.  get it.  sort it.  catalog it.  grok it for
>    'important content'.  The use here is obvious.  The 
>    hard part is overcoming the nice features of most
>    mailers.  Then again, most mailers suck.

Most mailers suck bigtime. It surprised me how quick it was for me to
whack up a basic mailreader with server-side HTML generation. The
biggest impediment to me using it is that there is no way for me to
get it to let me edit my mail in emacs.

> o  usenet.  get what i want.  catalog it.  The use here
>    is to build my knowledge base.  This could be something
>    as simple as an app that connects to Deja News and
>    pulls down useful articles based on keywords and then
>    prunes.

agreed... and IMO this app should be the same as the mailreader. 

> o  web.  basically the same thing as the usenet.  Added
>    functionality of sticking a page into packrat while
>    browsing.

I'd prefer it just to always stick every page I ever see into
packrat. Derived from Alan's ideas about "personal proxy server".

> o  digital library.  this is the most recent addition.  I
>    now have over 500 megs of papers/documentation/whatever
>    that I need to keep track of...  pain in the ass.

yup.... and they are all just filenames and completely out of context.

> o  scheduling/to-do.  obvious.

Yeah, although I tend not to use electronic schedulers or todo lists,
paper is more obtrusive for me, and that's what I need out of a
scheduler.

> o  packrat replicator.  allow packrat to travel easily.  
>    this should be as easy as connecting and hitting replicate.

yeah... I would really like an easy way to put all my information into
one big information store (I'm thinking mostly contact information,
but anything applies) and flag only certain things to sync with my
pilot, but have it be the same information store. Then I could make
some password protected webpage to access the same information, and
have it sync to my pilot automatically.

> o  personal information management.  e-wallet management.  
>    purchase tracking, etc.  (I've bought alot of books and
>    stuff lately and would like to keep track of it in 
>    one place.

Yeah... I haven't thought much about this. I don't use Quicken or
anything yet, I guess I don't pay much attention to my personal
finances. However, it would be nice if there was one place to put it
all.

> o  publishing.  publish data to friends and coworkers.

Yes... we could have the 'tuna contact publishing link' and everyone's
information would be available and up to date. There is really no
reason that allowing group publishing of data like this should have
anything specific to do with contact information either. Just a
network replication strategy for stored data.

> Along with the data, there would need to be access (remote).
> Enter the 'attached' (builtin, whatever) webserver.  Which
> brings up firewalls.  Enter the personal firewall.

ahh... gotcha... 

> Hmm...  anyway... I am just rambling.

Sounds like you're rambling along the same lines I've been thinking.

I see this really as a movement from storing unstructured data
(files) in an unstructured world (pathnamespace) to storing structured
data (records) in a structured world (database). My interesting
thought about this is:

  - traditional databases impose the structure. When a client asks
    for a column (field), the database already knew about it, and 
    it spits up the information passively out of its datastore.
  - Packrat shouldn't 'impose' the structure. Data is structured
    whether we recognize it or not. Today's systems don't
    have mechanisms in place to remember information about data-structure
    and uniqueness. Today's systems also don't have a mechanism to connect
    questions to answers.

So what I'd like to do is set up a 'type-relationship' system. If you
get a jpeg file, the system can do work to make a guess at the
filetype. If it finds something which makes sense, it can remember
it. If you then ask for all the 'pictures' in the system, it should
easily be able to bring up this jpeg file. If you ask for all the
pictures which are at least 240x200 big, it should be able to run the
appropriate software to figure out the dimensions of the jpeg file to
decide if it meets the criteria. If you either (a) do searches on a
field more often than you add records or (b) care about search speed
more than storage space, it should derive these fields and store them
in the cache when you insert the data in the system.
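
A rough sketch of that derive-on-demand idea, in Python. Everything here is an assumption made for illustration: the Store class, the eager_fields option, and the fake_jpeg helper (which builds a header-only fragment, not a viewable image) are hypothetical names, and the JPEG scanner is simplified to handle only well-formed segment headers.

```python
import struct


def guess_type(data):
    """Guess a type from leading magic bytes (only JPEG is recognized here)."""
    return "image/jpeg" if data.startswith(b"\xff\xd8\xff") else "unknown"


def jpeg_dimensions(data):
    """Scan JPEG segments for a start-of-frame marker; return (width, height)."""
    pos = 2  # skip the SOI marker (FF D8)
    while pos + 9 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return (width, height)
        pos += 2 + length  # marker bytes plus the segment itself
    return None


class Store:
    def __init__(self, eager_fields=()):
        self.records = []  # raw data, flat, no hierarchy
        self.cache = {}    # (record id, field name) -> derived value
        self.derivers = {"type": guess_type, "dimensions": jpeg_dimensions}
        self.eager_fields = eager_fields  # fields to derive at insert time

    def insert(self, data):
        self.records.append(data)
        rid = len(self.records) - 1
        for name in self.eager_fields:
            self.field(rid, name)
        return rid

    def field(self, rid, name):
        # Derive the field the first time it is asked for, then remember it.
        key = (rid, name)
        if key not in self.cache:
            self.cache[key] = self.derivers[name](self.records[rid])
        return self.cache[key]

    def pictures(self, min_width=0, min_height=0):
        hits = []
        for rid in range(len(self.records)):
            if self.field(rid, "type") != "image/jpeg":
                continue
            dims = self.field(rid, "dimensions")
            if dims and dims[0] >= min_width and dims[1] >= min_height:
                hits.append(rid)
        return hits


def fake_jpeg(width, height):
    """Hypothetical test helper: SOI, an APP0 stub, and one SOF0 header."""
    app0 = b"\xff\xe0" + struct.pack(">H", 4) + b"\x00\x00"
    sof0 = b"\xff\xc0" + struct.pack(">HBHH", 11, 8, height, width) + b"\x01\x01\x11\x00"
    return b"\xff\xd8" + app0 + sof0


store = Store()  # lazy: nothing is derived until a search asks for it
small = store.insert(fake_jpeg(100, 80))
big = store.insert(fake_jpeg(320, 240))
note = store.insert(b"just some text")
print(store.pictures())          # every picture
print(store.pictures(240, 200))  # pictures at least 240x200
```

Passing eager_fields=("type", "dimensions") instead trades storage for search speed by filling the cache as each record is inserted.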

The hope is that as the data-mining capabilities of a system like this
demonstrated their worth, applications would expose their data in more
interesting (i.e. more structured) ways.

Some of this has already begun on BeOS. When you download files, it
attaches the source-URL to them as an attribute. Email messages are
stored with attributes on them for the important headers. However,
BeOS missed quite a few things: (a) there is no ownership or identity
information for attribute names/types themselves, so the information
is only marginally reliable, (b) files are still primarily stored in a
hierarchy and only secondarily have attributes, (c) you have to
pre-dictate the 'structure', because you have to tell the FS mechanism
which attributes to index, and (d) last I checked, there wasn't a way to
write software which could access the index information so you could
create your own types of searches, and their searches are pretty
limited.

-- 
David Jeske (N9LCA) + http://www.chat.net/~jeske/ + jeske@...

From jeske@home.chat.net Mon Nov 30 17:16:39 1998
Date: Mon, 30 Nov 1998 17:16:39 -0800
From: David Jeske 
To: Paul Bleisch 
Subject: Re: killer app
Message-ID: <19981130171639.R5324@home.chat.net>
References: <8744DF3002FBD011BDDF000092970B465B95E9@iron.digitalanvil.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.94.13i
In-Reply-To: <8744DF3002FBD011BDDF000092970B465B95E9@iron.digitalanvil.com>; from Paul Bleisch on Mon, Nov 30, 1998 at 06:57:04PM -0600
Status: RO
Content-Length: 3198

On Mon, Nov 30, 1998 at 06:57:04PM -0600, Paul Bleisch wrote:
> The viewer should be the same, but the 'groper' 
> (the app that inputs news into the db) is obviously
> different.

yeah... although I'm beginning to think of most of the parts of this
as little mini-data-handling-components, not really
applications. Something would go out and understand how to talk to
deja-news, and "inject" information with whatever type information it
could attach, then another collection of little data-mining scripts
would come by and collect information from the text. Some of them
would be specific (like something made to deal with news headers),
some of them would be generic (like a text indexer). 

I think it's really important to separate the injection from the
data-mining, because information stored at inject time can't be
'recovered' but information derived from the data itself can easily be
discarded and reproduced as often as necessary.
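
A small sketch of that separation, in Python. The Record class, the inject/mine_all/discard_derived functions, and the sample article are all hypothetical, assumed only for illustration: injected facts are stored once and kept, while mined fields live in a separate dictionary that can be emptied and rebuilt from the body at any time.

```python
import re


class Record:
    def __init__(self, body, injected):
        self.body = body
        self.injected = dict(injected)  # known only at inject time; must be kept
        self.derived = {}               # recomputable from body; safe to discard


def inject(store, body, **injected):
    store.append(Record(body, injected))


def mine_headers(record):
    """Specific miner: pull 'Name: value' lines from the top of an article."""
    headers = {}
    for line in record.body.split("\n\n", 1)[0].splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            headers[name.lower()] = value
    record.derived["headers"] = headers


def mine_words(record):
    """Generic miner: the set of lowercase words, for text indexing."""
    record.derived["words"] = set(re.findall(r"[a-z0-9]+", record.body.lower()))


MINERS = [mine_headers, mine_words]


def mine_all(store):
    for record in store:
        for miner in MINERS:
            miner(record)


def discard_derived(store):
    for record in store:
        record.derived.clear()


store = []
inject(store,
       "Subject: killer app\nFrom: paul\n\nPackrat could index usenet.",
       source="deja-news", fetched="1998-11-30")
mine_all(store)
discard_derived(store)  # e.g. to reclaim space
mine_all(store)         # derived data comes back; injected data was never touched
print(store[0].injected, store[0].derived["headers"])
```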

> >I'd prefer it just to always stick every page I ever see into
> >packarat. Derived from Alan's ideas about "personal proxy server".
> 
> Hmm... that is interesting.  It would have to
> auto prune older data or something??

The first big point of my whole packrat thing was that I wanted to
never (manually) delete anything. I wanted to set it up to use a
collection of auto-prune and auto-backup to get rid of older data to
make space for new data.

> Part of my scheduling is done by taking my work log (a text file
> that I do syslog style work logging to).  I want all of this in one
> place.  Currently, I have to take this text file wherever I go. :(

gotcha..

>>  - Packrat shouldn't 'impose' the structure. Data is structured
>>  whether we recognize it or not. Today's systems don't have
>>  mechanisms in place to remember information about data-structure
>>  and uniqueness. Today's systems also don't have a mechanism to
>>  connect questions to answers.
> 
> Hmm.. interesting.

FWIW, this thinking is along the same lines as my thoughts on language
typing. That is, 'code has static types whether we like it or
not'. Using dynamic languages is just ignoring the static type
relationships which do exist. In addition, dynamic typing is really
just conforming to a statically typed object reflection interface. Worse,
when you bury code one level 'behind' that object reflection
interface, you often lose access to the 'second order statics' of the
code. I want not to lose the ability to record these static 'type
requirements', even if they are second or third order.

The packrat problem is the same data-keeping in the opposite
direction. Whereas in languages, you take source and compile it down,
losing information as you go, in packrat you start with a data-source,
and try to data-mine 'up' looking for (a) specific data points, and
(b) connections to other data.

> I am slowly picking up DB skills... slooooowly.  Too busy
> really to do much work on this stuff.

I have this feeling that relational databases may be a good place to
explore this stuff, but that they are far more bloated with things
required to implement the 'SQL Standard' than packrat needs.

-- 
David Jeske (N9LCA) + http://www.chat.net/~jeske/ + jeske@...