Is there a better way to store this data?

I’m beginning a proof of concept that will store a checksum of every file on every server on my LAN. I want to keep repeated data to a minimum while still being able to “reconstruct” and search for files by size, name, location, and even the checksums (hashes) themselves. I’ve attached a CSV that should give you an indication of how I envisioned the DB being laid out. What I’m curious about is whether it can be improved, and whether I can make any other optimizations to keep the DB as “light” as possible.
The reason it’s designed the way it is should be obvious; however, the “path-to-hash” section still troubles me, because there will be a lot of data repeated over and over in all of its fields.
I have hundreds of workstations and servers, so keeping repeated data and the overall DB size to a minimum is an obvious design goal. As such, filenames are independent of the file hash (checksum), and ID numbers (in base 36) are used instead. Exact copies of a file hash the same even if the name is changed, so hashes are also independent of names. Roots, like partitions and drive letters, are independent of the rest of the path, and paths are likewise reduced to base-36 ID numbers. All of that seems well and good, but when cataloging which file names, hashes, and paths are on which computer, it seems like I could probably reduce this further. I’m wondering, though, whether I’ve already crossed the point of diminishing returns.
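To make the layout concrete, here is a rough sketch of the ID-based design described above. The table and column names are my own placeholders, not a final schema, and SQLite is only standing in for whatever engine would actually be used:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE machines  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE roots     (id INTEGER PRIMARY KEY, root TEXT UNIQUE);  -- drive letter / partition
CREATE TABLE paths     (id INTEGER PRIMARY KEY, path TEXT UNIQUE);  -- path below the root
CREATE TABLE filenames (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE hashes    (id INTEGER PRIMARY KEY, hash TEXT UNIQUE);  -- one row per unique content

-- One row per file instance; every repeated value is stored once and
-- referenced by ID from here.
CREATE TABLE files (
    machine_id  INTEGER REFERENCES machines(id),
    root_id     INTEGER REFERENCES roots(id),
    path_id     INTEGER REFERENCES paths(id),
    filename_id INTEGER REFERENCES filenames(id),
    hash_id     INTEGER REFERENCES hashes(id)
);
""")

def intern(table, col, value):
    """Insert value once and return its ID on every subsequent call."""
    con.execute(f"INSERT OR IGNORE INTO {table} ({col}) VALUES (?)", (value,))
    return con.execute(f"SELECT id FROM {table} WHERE {col} = ?",
                       (value,)).fetchone()[0]

# Two copies of the same content, different names, different machines:
for machine, root, path, name in [("ws-001", "C:", "/tools", "calc.exe"),
                                  ("srv-07", "D:", "/backup", "Copy of Calc.exe")]:
    con.execute("INSERT INTO files VALUES (?,?,?,?,?)",
                (intern("machines", "name", machine),
                 intern("roots", "root", root),
                 intern("paths", "path", path),
                 intern("filenames", "name", name),
                 intern("hashes", "hash", "deadbeef")))  # placeholder hash
```

The hash string is stored once no matter how many machines hold a copy, which is the whole point of the layout; the open question is whether the `files` table itself can be squeezed any further.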

Any suggestions? Anyone think I’m crazy? :)
(The CSV looks much better in Excel or Calc (OO) than it does here.)

Your example is too complex for me to take the time to understand fully, but I think you are asking the wrong question. Start with a fully normalized design, and if you need to for performance, then denormalize it – but not until you prove the need.

I agree with xaprb. With decent/proper indexes I’m not sure you’re going to run into problems. Perhaps it’s as simple as 1 table:
file_hash, file_name, file_path (machine, root, full path)
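A minimal sketch of that one-table idea, with indexes on the columns you’d search by (column names and sample values are illustrative, and SQLite again stands in for the real engine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE files (
    file_hash TEXT,
    file_name TEXT,
    machine   TEXT,
    root      TEXT,
    file_path TEXT
);
-- Indexes on the search columns are what keeps this viable:
CREATE INDEX idx_files_hash ON files (file_hash);
CREATE INDEX idx_files_name ON files (file_name);
""")

con.execute("INSERT INTO files VALUES (?,?,?,?,?)",
            ("deadbeef", "Copy of Calc.exe", "ws-001", "C:", "/tools"))

# Any search is a single indexed lookup, no joins needed:
rows = con.execute("SELECT machine, root, file_path FROM files "
                   "WHERE file_name = ?", ("Copy of Calc.exe",)).fetchall()
```

The trade-off is storage: every row repeats the full machine, root, and path strings, which is exactly the redundancy the original design was trying to avoid.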

Obviously pulling the machine info and file hash out to new tables would be a smart move since multiple columns define each.

However, the duplicated paths would probably bug me too - way too much redundant data.

I guess it depends on how you’re using the data. If you always search by file name or file hash, the path is just a lookup value. If, on the other hand, you want to see everything in a directory on a specific machine, or to compare the same directory across multiple machines, then your version seems like it would work better.

I don’t think I understand the “Filename-to-hash” table. I’d expect some table to contain the ID from “Path-to-hash” (where you define the properties of each file on each system) and the ID from “hashes” (where you define a unique file across the network). Honestly, you could add the hash ID column to “Path-to-hash” itself, which would give you the entire path to each file and its hash in one spot. The “Filename-to-hash” table would go away, queries could use any key, and lookups would be simple.
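Roughly what I mean, with the hash ID folded into “Path-to-hash” so one join answers “where does this content live?” (names here are placeholders for the question’s ID-based tables, and the path columns are left as plain text just to keep the example short):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hashes (id INTEGER PRIMARY KEY, hash TEXT UNIQUE);
-- "Path-to-hash" with the hash ID added directly, so no separate
-- "Filename-to-hash" table is needed:
CREATE TABLE path_to_hash (
    machine TEXT,
    path    TEXT,
    name    TEXT,
    hash_id INTEGER REFERENCES hashes(id)
);
""")

con.execute("INSERT INTO hashes (id, hash) VALUES (1, 'deadbeef')")
con.executemany("INSERT INTO path_to_hash VALUES (?,?,?,?)",
                [("ws-001", "C:/tools",  "calc.exe",         1),
                 ("srv-07", "D:/backup", "Copy of Calc.exe", 1)])

# Every copy of one piece of content, regardless of file name:
copies = con.execute("""
    SELECT p.machine, p.path, p.name
    FROM path_to_hash AS p
    JOIN hashes AS h ON h.id = p.hash_id
    WHERE h.hash = 'deadbeef'
""").fetchall()
```

The same table joins just as easily in the other direction, e.g. starting from a file name and returning every machine and hash it appears under.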