NOTES

(Note to developers: a lot of these notes may be out of date.)

For a filesystem with writing you must have truncate as well as write.
I'm not sure if flush is necessary (since it doesn't do anything) but I have
it in there anyway. There also seems to be a weird problem associated with
disk monitors. Whenever I had gvfs monitor running it would look for some 
info and autorun files one my newly mounted directory. I'm not sure why this
would cause the filesystem to crash (it didn't *seem* to be matter of
recursion since there were only a few getattr calls and none of them would
give a directory name. Anyway.

There are some tags (like file type tags) that you might not want to see
in every directory because they would be distracting/uneccesary. I think a
suitable alternative is to have those directories list as dot directories
so that they are hidden while you can still cd to those directories without
the dot. It shouldn't create any conflicts since tags with filetype names
should refer to files with that type. That all is ultimately up to the user
though.
%%%
If there are such tags, and they shouldn't be visible in listings, then it is
the responsibility of the user to name the tags appropriately.
-- Mon Dec  1 04:05:57 CST 2014

Tag Types 
----------
(* This is entirely invalid now. All files have the same "type" which is just a character
string. The types may be re-expanded later to include integers and floats as well, but the
list type probably won't return, and the dictionary type definitely won't.
-- Mon Dec  1 03:57:47 CST 2014 *)
The tag types are given in the name.types file corresponding to the name.db file for a tagdb.
A tag's type is created when the tag is created and saved when the database is saved. No code
other than that dealing with the actual storage and retrieval of the database files
cares about the inclusion of types. However, including them allows us to do some pretty fancy
stuff without knowing about the type of a tag beforehand.

[This is now out of date, see the code for actual implementation]
In terms of actual code, the tagdb struct gets a new member tag_types a pointer to GHashTable.
This table has the form
{ tag_code_1=>value_type_1,
    tag_code_2=>value_type_2, ...}
Where
    tag_code_n is an integer tag_code which must be the same as those for the other tables--
    that is, it uses tag_codes for the "tag_code" s :p
and 
    value_type_n is an integer type code corresponding to those in types.h. These are also
    the ones used for query results. Note that the types include record types which may not
    be translatable into persistant disk storage easily.

The method that fills this data structure from name.types must skip the comments in this file
(they start with #) and for every non-comment line, 
    1) read in the tag name (all chars upto ':') and associate it with a code
    2) read in the type (everything after the ':') and convert it to an integer (like atoi)
    3) Store the tag name and type into the tag_types structure
There isn't a real sanity check for this process, but if a tag shows up twice, we just
give it the type of the last entry.

Once we have the tag_types structure we can dispatch on tag type like:
    switch (tagdb_get_tag_type(db, tag))
    {
        case (tagdb_dict_t):
            //do dict stuff
        case (tagdb_str_t):
            //do string stuff
    }
or use a function table since switch statements are EVIL (yeah, right). Also, when reading in
the database, reading in types should obviously be done before reading in the tags so we know
how to store the data we get in there.

Sending commands/queries to the tag db
--------------------------------------
(* This is entirely invalid now. -- Mon Dec  1 03:57:47 CST 2014 *)
To communicate directly with the tag database, we have a special file called #LISTEN# that
clients can write to from any location in the filesystem. The #LISTEN# file is not listed
in any readdir calls, but the name is hard coded into the filesystem but may be moved to a
configuration file in the future. Clients write to #LISTEN# with a query among those listed
in the cmd_query function. A message queue is created, named with the <process_id> of the
calling process. To get the result of the query back, the client must read the file
#QREAD-<process_id>#. The content of this file is a string representation of the result
which, if it is a list, will be a list of comma-separated values or if it is a dict, it will
be the same but with each item in the list being a colon-separated pair for easy processing
with a split command.

The typical flow for working with queries on files is to first do a SEARCH query to get a 
dict of file data objects, with file IDs as the keys. These objects can then be used in 
subsequent queries by providing the file ID extracted from the object as the first argument
to the query. Using file names directly wouldn't be feasible because we allow multiples files 
to have the same name within the file system. You can, however, specify the file names or file
ids in a search query as you would any other tags.

Example (-> denotes a query sent and <- denotes a result returned) :

-> FILE SEARCH tag1 AND tag2>9000
<- <id1> <file_data1> <id2> <file_data2> <id3> <file_data3> ...
[Examine file data to get the desired files]
for i in selected_ids: 
    -> FILE ADD_TAGS i tag3:"value"
    <- (SUCCESS|FAILURE)
-> TAG COPY tag3 tag4
<- (<tag4_id> <tag4_type>|FAILURE)
[etc...]

File Collisions
---------------
How do we handle mulitple files with identical names that share some tag?
Currently, if a new file with the same name is inserted into a given file
drawer, then the old file is replaced with the new. This isn't a problem in some
cases:
    1. If the two files in question each have tags that are exclusive to one
    another, then they will be unique in those respective tag-contexts.
    2. If the files share all of the same tags, then the new file must be
    intended as a replacement for the old
Still, there is a problem with systems like makefiles where a given file might
be positioned as a node for executing makefiles in further sub-contexts. If the
sub-context makefiles were written after the parent, the parent would get
overwritten:
    parent : main.c Makefile
    child1 : foo.c Makefile
    child2 : bar.c Makefile
We could even have multiple source files with the same name (not saying this is
a good idea, but it could happen) or, in the case of some version control
schemes, a special kind of directory that appears in every project directory.
There's even a problem with plain-old replacement of a file: we count a new
entry for the tag union every time a file is inserted into a file drawer, but
don't check to see if that file is actually new. It wouldn't be too hard to do
that, but to do a check wouldn't solve the whole problem. Instead, what I could
do is have 
%%%
See the README for how this is handled now. Currently, you can't rename a file
onto another file. The rename operation succeeds, but only the first file will be
listed under the name and both the just-renamed file and the other file will be
listed as <id>#<file-name>.
-- Mon Dec  1 03:56:47 CST 2014

Transactions and durability
---------------------------
(* This has been partially addressed by using a sqlite3 database for storage.
My initial thinking was that the transactions should be scoped to file system
operations, but I was disappointed by the lack of nested transactions in 
sqlite3. I'll either write a wrapper that checks if a transaction is in 
progress (not sure that's possible) or just accept database-scoped transactions
instead. Incidentally, SQLite supports the kind of online backups I talk about
here: https://www.sqlite.org/capi3ref.html#sqlite3backupinit
-- Mon Dec  1 03:34:19 CST 2014 *)
One of the big problems in this alpha stage is that during debugging there 
often come times when the simplest thing to do is kill TagFS. However, without
following the normal unmount procedure, the database file doesn't get written.
This is obviously annoying. My first idea for this was to have a 'back-up'
thread running in the background periodically saving the database state when
the rest of the system was inactive (or after a time-out which forced the back
up). This is, however, a complicated solution because of requirements on 
locking, detecting "low activity" across multiple threads. A better solution
is to use transactions in each of the file operations, writing out changes to
the database before transaction commit.

File storage and access in the run-time
---------------------------------------
(* This idea was, essentially, rolled into the sqlite3 database in the file_tag
table. The performance questions I was considering have been deferred until the
sqlite3 database shows significant slowdown for the relevant file creation and
deletion events. -- Mon Dec  1 03:15:50 CST 2014 *)
During normal operation, files are stored in heap memory and accessed through 
either a file-id indexed table or a tag-set indexed table (called FileCabinet).
In order to link an in-memory file back to its disk record, we keep a reference
to the location in the database file in the AbstractFile record. This reference
corresponds to a location in a memory-mapped file, specifically for a reference
`x':

    FILE_BASE + FILE_SIZE*x

where `FILE_BASE' is either the base address for tags or files (recorded at 
startup) and FILE_SIZE is the size of the file or tag record as it is stored in
the file.
The FileCabinet relation between files and tags is stored in yet another table 
and is just an array of 64-bit ints every two corresponding to a file and one of 
its tags:
    
    +-----------+-----------+
    |FILE_ID(64)|TAG_ID(64) |
    +-----------+-----------+

For performance reasons (I haven't measured this yet, but it seems reasonable)
the array only grows within a session--deleted tags are set to (NULL,NULL) in 
the table and cleaned out when the filesystem is unmounted. Additionally, the
array is unsorted. This is intended to prevent us from needing to relocate a
bunch of records when a tag is deleted or inserted. To make deletion quick, we
also store the locations of these records in the TagTable off of the File data 
structure as byte offsets from the base address of the mapped file.

Finally, tags have a tagdb tagdb_value_t associated with them and regular 
files have zero or more values associated with them (one for each tag on the 
file). A tagdb_value_t can be of variable size and, like the file-tag relation,
entries are only added while the file system is mounted. 

Tag Hierarchy
-------------
The tag hierarchy is intended to be a way of organizing tags where the rest of
the system is strictly tag-based. Super-tags serve the function of namespaces
allowing a user to distinguish tags that should have the same name, but which
have different semantics (e.g., "animal::bat", "baseball::bat", "wom::bat").

Deleting a tag will promote all of its children to the level of the tag being
deleted. For example, if a tag named "a" had a child "b", and the tag named "b"
had a child named "c", then, if the "b"-tag was deleted, the "c" tag becomes a
direct child of "a". Using shell file utilities, this would look like:

    $ ls
    a a::b a::b::c
    $ rmdir a::b
    $ ls
    a a::c

(* The things below here haven't been coded yet.
-- Mon Dec  1 09:14:50 CST 2014 *)
(* I opted to simply fail the operation rather than do the renaming. It's
easier and doesn't have the potential problems with additional conflicts and
unreasonably long names.
-- Sun Jan 11 16:59:02 CST 2015 *)
If "a" also had a child "c" in the situation described above, then there would
be a conflict between the child of "b" and that. We use two strategies to deal
with this. First, to allow all conflicted files to be recovered using info used
in the 'remove' operation, we prefix the deleted tag's name on to the child's
with an underscore character between. Although this doesn't completely respect
the user's attempt to destroy the semantics of the super-tag, it has the
advantage that we can list all of the collided files of a tag named "tag" by
doing "ls tag_*". Using shell utilities:

    $ ls 
    a a::b a::c a::b::c
    $ rmdir a::b
    $ ls
    a a::c a::b_c

Of course, this first correction may be insufficient if "a" also has a child
with the name "b_c". To resolve this, we append an underscore character to the
end of the new name:

    $ ls 
    a a::b a::b_c a::c a::b::c
    $ rmdir a::b
    $ ls
    a a::c a::b_c a::b_c_

Additional conflicts are addressed by appending more underscores until there is
no longer a conflict or the max file name size is reached. If the max file name
size is reached, then the tag-remove operation will fail completely, the tag
will still be there, and all of its children will have their original names.

Removing a sub-tag relationship does not do any promotion. For the "a::b::c"
example above, removing the "b::c" relationship would not then establish a new
"a::c" relationship. In the TagFS, you would remove a sub-tag relationship by
simply renaming the tag like this:

    $ ls
    a a::b a::b::c a::b::c::d
    $ mv a::b::c c
    $ ls
    a a::b c c::d

As demonstrated in the example above, a 'remove' is necessarily combined with an
'insert', so the removal of the "b::c" relationship is paired with the promotion 
of "c" to a root position.

It is appropriate to think of there being a single root tag that sits at the top
of the hierarchy, although I have discussed things as though there were multiple
'root' tags. TagDB is currently structured so that there are several 'root' tags
but that might change, and in any case it won't affect the description above.
-- Mon Dec  1 09:29:43 CST 2014