Is there a better way to store this data?

I’m beginning a Proof of Concept that will store a checksum of each and every file on each and every server on my LAN. I want to keep repeated data to a minimum, and still be able to “reconstruct” and search for files based on file size, names, locations, and even the checksums (hashes) themselves. I’ve attached a CSV that should give you an indication of how I envisioned the DB being laid out. What I am curious about is whether it can be improved, and whether I can make any other optimizations to keep the DB as “light” as possible.
The reason it’s designed the way it is should be obvious; however, the “Path-to-hash” section still troubles me, since there will be a lot of data repeated over and over in all of its fields.
I have hundreds of workstations and servers, so keeping repeated data and overall DB size to a minimum is an obvious design goal. As such, filenames are independent of file hashes (checksums), and ID numbers (in base36) are used instead. Exact copies of a file hash to the same value even if the name is changed, so again, hashes are independent of names. Roots, such as partitions and drive letters, are also independent of the rest of the path, and paths are likewise reduced to base36 ID numbers. All of that seems well and good, but when cataloging which file names, hashes, and paths are on which computer, it seems like I could probably reduce this further. I’m wondering if I’ve already crossed the point of diminishing returns?

Suggestions, questions, and people telling me I’m crazy are all welcome. :)
(The CSV opened in Excel or Calc (OpenOffice) looks much better than this.)
Filename-to-hash
  file_to_hash_ID   file_ID   hash_ID
  908               144       543
  909               145       543

Filenames
  file_ID   file_name
  144       Calc.exe
  145       Copy of Calc.exe

hashes
  hash_ID   MD5       SHA-1    SHA-256
  543       8144813   F0483F   etc…

machine
  mach_ID   mach_serial   mach_name    mach_UUID
  99        7xkfs0        Laptop-007   4C4C4544…
  100       5k1p13        Dev-31       FABC73F0…

Path-to-hash
  machine_ID   root_ID   path_ID   file_ID
  99           3         12        144
  100          3         12        144
  99           3         12        145

paths
  path_ID   path_name
  12        windows\system32

root
  root_ID   root_name
  1         a:
  2         b:
  3         c:
  4         /root
  5         /boot
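
If it helps, here is the same layout written out as rough CREATE TABLE statements. This is only a sketch in MySQL-style syntax: the column types and sizes are guesses, the hyphenated table names are swapped for underscores, and I’m assuming the base36 IDs are just integers that happen to be displayed in base36, so they become plain INTs here.

    -- Sketch only: types and sizes are assumptions, not part of the design.
    CREATE TABLE hashes (
        hash_ID INT UNSIGNED NOT NULL PRIMARY KEY,
        md5     BINARY(16),  -- raw 128-bit digest
        sha1    BINARY(20),  -- raw 160-bit digest
        sha256  BINARY(32)   -- raw 256-bit digest
    );

    CREATE TABLE filenames (
        file_ID   INT UNSIGNED NOT NULL PRIMARY KEY,
        file_name VARCHAR(255) NOT NULL
    );

    CREATE TABLE machine (
        mach_ID     INT UNSIGNED NOT NULL PRIMARY KEY,
        mach_serial VARCHAR(32),
        mach_name   VARCHAR(64),
        mach_UUID   CHAR(36)
    );

    CREATE TABLE root (
        root_ID   INT UNSIGNED NOT NULL PRIMARY KEY,
        root_name VARCHAR(64) NOT NULL    -- 'c:', '/boot', ...
    );

    CREATE TABLE paths (
        path_ID   INT UNSIGNED NOT NULL PRIMARY KEY,
        path_name VARCHAR(1024) NOT NULL  -- 'windows\system32', ...
    );

    CREATE TABLE filename_to_hash (
        file_to_hash_ID INT UNSIGNED NOT NULL PRIMARY KEY,
        file_ID INT UNSIGNED NOT NULL,
        hash_ID INT UNSIGNED NOT NULL,
        FOREIGN KEY (file_ID) REFERENCES filenames (file_ID),
        FOREIGN KEY (hash_ID) REFERENCES hashes (hash_ID)
    );

    CREATE TABLE path_to_hash (
        machine_ID INT UNSIGNED NOT NULL,
        root_ID    INT UNSIGNED NOT NULL,
        path_ID    INT UNSIGNED NOT NULL,
        file_ID    INT UNSIGNED NOT NULL,
        FOREIGN KEY (machine_ID) REFERENCES machine (mach_ID),
        FOREIGN KEY (root_ID)    REFERENCES root (root_ID),
        FOREIGN KEY (path_ID)    REFERENCES paths (path_ID),
        FOREIGN KEY (file_ID)    REFERENCES filenames (file_ID)
    );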

Your example is too complex for me to take the time to understand fully, but I think you are asking the wrong question. Start with a fully normalized design, and if you need to for performance, then denormalize it – but not until you prove the need.

I agree with xaprb. With decent/proper indexes I’m not sure you’re going to run into problems. Perhaps it’s as simple as 1 table:
file_hash, file_name, file_path (machine, root, full path)
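Roughly like this, as a sketch (the table name, column names, and types are just for illustration):

    -- One flat table, one row per file instance; indexes on the search keys.
    CREATE TABLE files (
        file_hash CHAR(40),              -- e.g. a hex SHA-1
        file_name VARCHAR(255),
        machine   VARCHAR(64),
        root      VARCHAR(64),
        full_path VARCHAR(1024),
        INDEX (file_hash),
        INDEX (file_name)
    );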

Obviously, pulling the machine info and file hash out into their own tables would be a smart move, since multiple columns define each.

However, the duplicated paths would probably bug me too: way too much redundant data.

I guess it depends on how you’re using the data. If you always search by file name or file hash, the path is just a lookup value. If, on the other hand, you want to see everything in a directory on a specific machine, or to compare the same directory across multiple machines, then your version seems like it would work better.
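For example, listing one directory on one machine with your layout is a multi-way join; here’s a sketch using the table names from the DDL above, with the IDs taken from your sample rows:

    -- Everything in windows\system32 (path 12) on machine 99, with its MD5.
    SELECT f.file_name, h.md5
    FROM path_to_hash pth
    JOIN filenames        f   ON f.file_ID   = pth.file_ID
    JOIN filename_to_hash fth ON fth.file_ID = pth.file_ID
    JOIN hashes           h   ON h.hash_ID   = fth.hash_ID
    WHERE pth.machine_ID = 99
      AND pth.root_ID    = 3
      AND pth.path_ID    = 12;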

I don’t think I understand the “Filename-to-hash” table. I would expect some table to tie together the IDs from “Path-to-hash” (where you define the properties of each file on each system) and “hashes” (where you define a unique file across the network). Honestly, you could just add the hash ID column to “Path-to-hash”, which would give you the entire path to each file and its hash in one spot. The “Filename-to-hash” table would go away, queries could be on any key, and lookups would be simple.
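
As a sketch (column types assumed, keys just to show the idea), that would look something like:

    -- Path-to-hash with the hash ID folded in; Filename-to-hash goes away.
    CREATE TABLE path_to_hash (
        machine_ID INT UNSIGNED NOT NULL,
        root_ID    INT UNSIGNED NOT NULL,
        path_ID    INT UNSIGNED NOT NULL,
        file_ID    INT UNSIGNED NOT NULL,
        hash_ID    INT UNSIGNED NOT NULL,
        PRIMARY KEY (machine_ID, root_ID, path_ID, file_ID),
        INDEX (hash_ID),
        INDEX (file_ID)
    );

    -- Example: find every location of one file's content across the network.
    SELECT m.mach_name, r.root_name, p.path_name, f.file_name
    FROM path_to_hash pth
    JOIN machine   m ON m.mach_ID = pth.machine_ID
    JOIN root      r ON r.root_ID = pth.root_ID
    JOIN paths     p ON p.path_ID = pth.path_ID
    JOIN filenames f ON f.file_ID = pth.file_ID
    WHERE pth.hash_ID = 543;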

Troy