team: UW Systems Lab:Research Projects
How can we ensure data consistency?
This is very important in data warehousing, and major corporations are spending millions of dollars to advance the field.
What architecture should we use to make the system scale linearly with a reasonable coefficient?
Nodes go down all the time, so how do we ensure that critical data will not be lost, especially in multi-purpose stream processing frameworks?
There are a lot of hot topics in distributed systems such as:
Achieving Consistency and Availability together (CAP theorem)
Designing a low-latency consistency protocol
Solving Byzantine fault tolerance with fewer machines and fewer round trips
Measuring the consistency requirement in transactional models.
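To make a couple of these topics concrete, here is a minimal sketch of the textbook baselines they are measured against: the quorum-overlap rule (r + w > n) behind many replicated stores, and the classical replica counts for crash faults (2f + 1) and Byzantine faults (3f + 1). The function names are made up for illustration, and the bounds are the standard ones under the usual assumptions, not results from any particular system.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum of size r intersects every write quorum
    of size w among n replicas (the r + w > n rule)."""
    return r + w > n


def min_replicas_crash(f: int) -> int:
    """Replicas needed by majority-based protocols to tolerate f crash faults."""
    return 2 * f + 1


def min_replicas_byzantine(f: int) -> int:
    """Classical replica count for tolerating f Byzantine faults."""
    return 3 * f + 1


if __name__ == "__main__":
    # Reading 2 and writing 2 out of 3 replicas always overlaps.
    print(quorums_overlap(n=3, r=2, w=2))
    # Reading 1 and writing 1 out of 3 can miss the latest write.
    print(quorums_overlap(n=3, r=1, w=1))
    print(min_replicas_crash(1), min_replicas_byzantine(1))
```

"Fewer machines" in the Byzantine item means getting below the 3f + 1 baseline, which generally requires extra assumptions of some kind.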
It helps to know some older projects:
NFS (including variants such as NQNFS and Spritely NFS)
AFS (and Coda)
Berkeley XFS (not the same as SGI XFS)
For conferences, you'd probably want to follow (and dig into the archives for) FAST, PDOS, OSDI, and HotStorage. It also wouldn't hurt to read up on related stuff in databases and general distributed systems - e.g. Dynamo, BigTable, PNUTS, everything Lamport ever wrote.
Here are some ideas.
Placing and replicating data between storage with wildly different latency patterns. We already have SSD vs. spinning disk, but things like phase-change memory on one end and shingled disks on the other are going to increase that diversity even further.
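As a rough illustration of why placement matters, here is a small sketch that puts the most frequently read items on the fast tier and computes the resulting average read latency. Everything in it is an assumption for the sake of the example: the function names, the "hottest first" policy, and the latency numbers. Real placement policies also have to account for writes, item sizes, and migration cost.

```python
def place_hottest_first(access_counts: dict, fast_capacity: int) -> set:
    """Pick the fast_capacity most frequently accessed items for the fast tier."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    return set(ranked[:fast_capacity])


def expected_read_latency(access_counts: dict, fast_items: set,
                          fast_ms: float, slow_ms: float) -> float:
    """Average read latency, weighting each tier by its share of accesses."""
    total = sum(access_counts.values())
    if total == 0:
        return 0.0
    fast_hits = sum(c for item, c in access_counts.items() if item in fast_items)
    hit_ratio = fast_hits / total
    return hit_ratio * fast_ms + (1.0 - hit_ratio) * slow_ms


if __name__ == "__main__":
    counts = {"a": 90, "b": 9, "c": 1}  # hypothetical access counts
    fast = place_hottest_first(counts, fast_capacity=1)
    # Assumed latencies: 0.1 ms for the fast tier, 10 ms for the slow tier.
    print(fast, expected_read_latency(counts, fast, fast_ms=0.1, slow_ms=10.0))
```

Even with 90% of reads hitting the fast tier, the slow tier dominates the average here, and that effect gets more pronounced as the latency gap between tiers grows.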
Maintaining data integrity across large systems and long time periods. With thousands of disks and a period of years, data loss at the component level is not a possibility you need to deal with. It's a certainty you need to deal with. Doing it space-efficiently and/or with minimal impact on performance is still a challenge.
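The "certainty" claim is easy to check with back-of-the-envelope arithmetic. The sketch below assumes independent failures and a constant annual failure rate. The 2% figure is a placeholder for illustration, not a measured value, and the function names are made up.

```python
def prob_at_least_one_failure(num_disks: int, years: float, afr: float) -> float:
    """Probability that at least one disk fails within `years`, assuming
    independent failures at a constant annual failure rate `afr`."""
    survive_one = (1.0 - afr) ** years
    return 1.0 - survive_one ** num_disks


def expected_failed_disks(num_disks: int, years: float, afr: float) -> float:
    """Expected number of the original disks that fail within `years`."""
    return num_disks * (1.0 - (1.0 - afr) ** years)


if __name__ == "__main__":
    ASSUMED_AFR = 0.02  # placeholder annual failure rate
    print(prob_at_least_one_failure(1000, 1, ASSUMED_AFR))
    print(expected_failed_disks(1000, 5, ASSUMED_AFR))
```

Under these assumptions, a 1000-disk system is all but guaranteed to lose at least one disk in its first year, and roughly 96 of the original disks can be expected to fail within five years.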
Security. How do we ensure that only the "right people" can access each piece of data, guarantee that it is (or is not) deleted at a specific time, and so on? This is particularly important as storage becomes more distributed.
Classification and indexing. It can be hard to find things in a multi-billion-file archive, and the hierarchical directory structure might not be up to the task. What should replace it, and how do storage systems deal with more complex naming without introducing even more consistency/performance issues?
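One direction people often reach for is attribute- or tag-based naming instead of a single hierarchy. Here is a minimal in-memory sketch of that idea, just to show the shape of the lookup. The class and method names are hypothetical, and it ignores the hard parts the question is really about: keeping such an index consistent and fast when it is distributed across many nodes.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index mapping tags to the set of files that carry them."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, path, tags):
        """Register a file under each of its tags."""
        for tag in tags:
            self._by_tag[tag].add(path)

    def query(self, *tags):
        """Return the files that carry all of the given tags."""
        if not tags:
            return set()
        # .get avoids creating empty entries for unknown tags.
        matches = [self._by_tag.get(tag, set()) for tag in tags]
        return set.intersection(*matches)


if __name__ == "__main__":
    index = TagIndex()
    index.add("/archive/a/report.pdf", ["2019", "finance"])
    index.add("/archive/b/photo.jpg", ["2019"])
    print(index.query("2019", "finance"))
```

A file can be found through any combination of its attributes regardless of where it lives in the directory tree, at the cost of one more structure that has to be updated on every create, rename, and delete.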