Mark Atwood
Idea: Design for a GFS on top of AWS S3
I know that someone has already written a distributed multimountable filesystem for S3. But it's commercial and closed source.

I've not looked at how it works. But I've been thinking...

There already exist filesystems that are based on preallocated extents, and filesystems that are based on immutable extents. One can combine the two, and build a filesystem on top of S3, like so...

Each inode structure is an S3 item. Also, each extent is likewise an S3 item. Actually, they are a sequence of S3 items, because they will be versioned. Every time an inode is changed, or an extent is modified, what actually happens is a new one gets written to S3, and the item names for them have a delimited suffix with the version number.
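A minimal sketch of the versioned naming, with a plain dict standing in for the S3 bucket. The `kind/id.vNNNNNNNNNN` layout, the `.v` delimiter, the zero-padding, and all function names are assumptions made for illustration; the post only says the suffix is delimited and carries the version number.

```python
# Hypothetical naming scheme: "<kind>/<object id>.v<version>".
# The ".v" delimiter and the zero-padding are assumptions, not from the post.
SEP = ".v"

def versioned_key(kind: str, obj_id: str, version: int) -> str:
    """Build the item name for one version of an inode or extent."""
    # Zero-padding keeps lexicographic listing order equal to version order.
    return f"{kind}/{obj_id}{SEP}{version:010d}"

def parse_key(key: str) -> tuple[str, str, int]:
    """Split an item name back into (kind, object id, version)."""
    kind, rest = key.split("/", 1)
    obj_id, version = rest.rsplit(SEP, 1)
    return kind, obj_id, int(version)

def write_new_version(bucket: dict, kind: str, obj_id: str, data: bytes) -> str:
    """Never overwrite: store the data under the next unused version number."""
    versions = [parse_key(k)[2] for k in bucket
                if parse_key(k)[:2] == (kind, obj_id)]
    key = versioned_key(kind, obj_id, max(versions, default=0) + 1)
    bucket[key] = data  # stand-in for an S3 PUT
    return key
```

Note that two hosts could both pick the same "next" number at the same moment, which this sketch does nothing to prevent.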

This allows multiple hosts to mount the filesystem readwrite, without being incoherent, and without needing a "live" distributed lock manager. If a host has it mounted, and is reading from some extent, and some other host writes to that extent, the first host will keep reading from the old one.

On a regular basis, such as on sync, a host will issue a list request against all the extents and inodes it is using. It will thus discover any updated ones, and act accordingly.
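The sync-time check could look roughly like this. It assumes the list request has already returned item names of the form `name.v<version>` (a hypothetical format), and that the host keeps a map of the version it is currently using for each item; `find_updates` is an illustrative name.

```python
def find_updates(listing: list[str], in_use: dict[str, int]) -> dict[str, int]:
    """Return {item name: newest version} for items that moved past in_use."""
    newest: dict[str, int] = {}
    for key in listing:
        # Assumed key format: "<name>.v<version>"
        name, _, version = key.rpartition(".v")
        if name in in_use:  # ignore items this host is not using
            newest[name] = max(newest.get(name, 0), int(version))
    return {n: v for n, v in newest.items() if v > in_use[n]}
```

For example, a host using version 1 of `extent/7` that sees `extent/7.v3` in the listing learns it should switch to version 3.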

Also, each host will write a "ping" item, probably at every such sync. Something can monitor the bucket, and delete all obsoleted extents and inodes that are older than the newest ping of the farthest-behind mounting host.
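Here is one conservative reading of that reaper rule, as a sketch. It assumes each version's write time is known, that `pings` maps each mounted host to the time of its newest ping, and that a version is only safe to delete once its replacement was written before every host's last sync. The data shapes and the name `reapable` are made up for illustration.

```python
def reapable(versions: dict[str, list[tuple[int, float]]],
             pings: dict[str, float]) -> list[tuple[str, int]]:
    """List (item, version) pairs that no mounted host should still need.

    versions maps item name -> [(version number, time written), ...]
    pings maps host id -> time of that host's newest ping
    """
    if not pings:
        return []  # no known hosts: be conservative and delete nothing
    cutoff = min(pings.values())  # newest ping of the farthest-behind host
    doomed = []
    for name, history in versions.items():
        ordered = sorted(history)  # by version number
        # Pair each version with its successor; the newest is never paired.
        for (ver, _), (_, next_written) in zip(ordered, ordered[1:]):
            if next_written < cutoff:  # every host has synced since then
                doomed.append((name, ver))
    return doomed
```

The newest version of each item is never deleted, and neither is a version whose replacement some host may not have seen yet.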

If instead old extents are not deleted after they are obsoleted, it would in fact be possible to mount the filesystem readonly as it appeared at time X, for any arbitrary time between "now" and "just ahead of the reaper process".
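Mounting "as of time X" then comes down to picking, for each item, the newest version written at or before X. A small sketch, assuming write times are available for every version (for instance from the S3 listing); `snapshot_at` and the data layout are hypothetical.

```python
def snapshot_at(versions: dict[str, list[tuple[int, float]]],
                x: float) -> dict[str, int]:
    """Map each item name to the version a readonly mount at time x would see.

    versions maps item name -> [(version number, time written), ...]
    """
    view = {}
    for name, history in versions.items():
        visible = [ver for ver, written in history if written <= x]
        if visible:  # items created after x do not exist in this view
            view[name] = max(visible)
    return view
```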

Current Location: Victrola Cafe, Capitol Hill, Seattle WA

6 comments or Leave a comment
From: loganb Date: September 12th, 2007 02:36 am (UTC)
Interesting... How would you handle concurrent writes to the same extent from two different mounted hosts without a lock manager?
From: fallenpegasus Date: September 12th, 2007 05:30 am (UTC)
Y'know, I was just noticing that when rereading my text as part of looking at the email that was sent to me from LJ as a result of you posting that comment.

I suppose I could make one of the writers block, waiting for the one in front to get done.

Or I could code the writer's id into the versioning suffix, and then have the reaper do a 3 way forced merge of the original extent and the two modified extents (generalize to N).
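A rough sketch of what such a forced merge might do, generalized to N writers. It assumes extents are fixed-size, so all copies have the same length, and it resolves conflicting bytes by letting the highest writer id win, which is an arbitrary tie-break picked for this example. `merge_extents` is an illustrative name.

```python
def merge_extents(base: bytes, edits: dict[str, bytes]) -> bytes:
    """Forced N-way merge of same-length extents, byte by byte.

    edits maps writer id -> that writer's modified copy of base.
    Where several writers changed the same byte, the highest writer id
    wins (an arbitrary tie-break chosen for this sketch).
    """
    merged = bytearray(base)
    for writer in sorted(edits):  # higher ids are applied later and win
        modified = edits[writer]
        if len(modified) != len(base):
            raise ValueError("extents must be the same length")
        for i, (old, new) in enumerate(zip(base, modified)):
            if new != old:  # this writer changed byte i
                merged[i] = new
    return bytes(merged)
```

Non-overlapping changes from different writers all survive; overlapping ones are silently resolved, which is the "forced" part.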
From: loganb Date: September 12th, 2007 04:32 pm (UTC)
Last I checked, S3 had no locking/concurrency primitives, so the clients would have to be directly connected to each other or a lock server. A reaper could merge most file operations, but it wouldn't be able to support a POSIX filesystem since file locking would be non-existent and some errors wouldn't surface synchronously (e.g. two clients create the same file simultaneously).

I was thinking about this stuff several months ago when planning a storage solution for my new venture. I prototyped a file store over HTTP similar to S3 or MogileFS. The key differences are that files are mutable, albeit append-only like GoogleFS, they have key-value metadata that is exchanged via the HTTP headers, and mutation operations are atomic with test-and-set style metadata checks. I believe those modifications make it possible to implement a Log-Structured filesystem on top quite efficiently. Although, I'd implement an indexing layer (à la GiST) instead of a filesystem; both have the same base requirements. It's a really primitive implementation atm, but I've almost got it to the point of open-sourcing (gotta fix the damn wiki).
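To make the test-and-set idea concrete, here is a toy in-memory sketch. It is not the prototype described above; the class name, method signatures, and the use of a `seq` metadata field are all assumptions for illustration.

```python
import threading

class AppendOnlyStore:
    """Toy append-only store with test-and-set checks on key-value metadata."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._files: dict[str, bytes] = {}
        self._meta: dict[str, dict[str, str]] = {}

    def append(self, key: str, chunk: bytes,
               expect: dict[str, str], update: dict[str, str]) -> bool:
        """Append chunk only if every expected metadata value still matches."""
        with self._lock:  # makes the check and the append one atomic step
            meta = self._meta.setdefault(key, {})
            if any(meta.get(k) != v for k, v in expect.items()):
                return False  # another writer got there first
            self._files[key] = self._files.get(key, b"") + chunk
            meta.update(update)
            return True

    def read(self, key: str) -> bytes:
        with self._lock:
            return self._files.get(key, b"")
```

A writer that last saw `seq` equal to `"1"` would call `append(key, chunk, {"seq": "1"}, {"seq": "2"})` and retry after re-reading if it gets `False` back.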
From: fallenpegasus Date: September 12th, 2007 05:42 pm (UTC)
S3 also has key-value metadata on HTTP headers, and I'm hearing rumors of rumors that they will add append, and maybe even ranged PUT.

The Nirvanix HTTP store may be rich enough to support your design.

And as for the non-existence of POSIX file locking, I've learned thru sad experience to never trust file locking. The OS may claim it supports it, it may make a best effort to support it, but every filesystem bug breaks file locks first. Or you'll have a user who mounts something over NFS anyway. (And if you tell me about the NFS lock manager, I will point and laugh.)
From: fallenpegasus Date: September 12th, 2007 05:32 am (UTC)
So, Hi!

Who are you, and how did you find me?
From: loganb Date: September 12th, 2007 04:13 pm (UTC)
Hello! I'm a local software engineer; I used to work at Zillow, although several months ago I co-founded a startup (exciting times!).

Your presentation on your MySQL <-> S3 connector came up in conversation when a friend asked about using S3 for DB storage from EC2 (it's a pretty cool concept although risky in production). I think I then googled you and found your blog.

I'm almost certain, however, that I've seen your name floating around town before. Maybe SeattleWireless, STS, Ignite, or dBUG? I can't say for sure...