6 RESYNC: MAGIC HEALING
Two questions to be answered:
• which blocks, and
• which direction?
6.1 network hiccup
Let's keep it simple: one Primary, one Secondary, and it was just a network hiccup with no node-role changes
involved. We know the direction: Primary -> Secondary. We know which blocks to transfer, because the
Primary keeps an in-memory bitmap, in which a bit is dirtied whenever a WRITE is completed to the upper layers
without having been acknowledged by the Secondary.
This may transfer more blocks than strictly necessary, both because of the granularity of the bitmap and
because for some blocks only the ack was lost while the data had been written correctly.
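To illustrate the granularity effect, here is a minimal sketch of such a dirty bitmap. The class name, the method names and the sizes are made up for this example and are not taken from the actual implementation; the only assumption is that one bit covers a fixed number of blocks.

```python
class DirtyBitmap:
    """Hypothetical dirty bitmap: one bit covers `blocks_per_bit` blocks."""

    def __init__(self, device_blocks, blocks_per_bit):
        self.device_blocks = device_blocks
        self.blocks_per_bit = blocks_per_bit
        nbits = -(-device_blocks // blocks_per_bit)  # ceiling division
        self.bits = [False] * nbits

    def mark_dirty(self, block):
        # Called when a WRITE completed locally without an ack from the peer.
        self.bits[block // self.blocks_per_bit] = True

    def blocks_to_resync(self):
        # Every block covered by a dirty bit is resent, changed or not.
        out = []
        for i, dirty in enumerate(self.bits):
            if dirty:
                start = i * self.blocks_per_bit
                end = min(start + self.blocks_per_bit, self.device_blocks)
                out.extend(range(start, end))
        return out


bitmap = DirtyBitmap(device_blocks=64, blocks_per_bit=8)
bitmap.mark_dirty(10)              # a single unacknowledged write
resync = bitmap.blocks_to_resync()
print(resync)                      # [8, 9, 10, 11, 12, 13, 14, 15]
```

One unacknowledged write to block 10 causes all eight blocks covered by that bit to be transferred, which is the "more than necessary" mentioned above.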
6.2 node crash
If a Secondary node crashed and was revived, the procedure is just the same as above.
If the Primary were rebooted while the Secondary was down, we would lose the information stored in the
in-memory dirty bitmap. So we keep a copy of it in a reserved meta-data area on disk, from which we can
initialize the in-memory bitmap once we are configured again.
If a Primary node had crashed, we have a different problem.
There could have been in-flight IO, and we have no idea whether it made it to disk, to the network,
or to both. Even though only very few blocks will differ, we do not know which ones, so we have to
assume that any block might be different.
For the sake of data integrity, we would have to retransmit the entire disk, just to be sure...
6.3 Full Sync? No Way.
To avoid this, we could dirty the on-disk bitmap with each incoming write request, submit the write, and
clear the bit after the write has completed successfully.
This would make three requests out of one. Worse, to be correct, the write that sets the dirty bit would have
to be synchronous: we’d have to wait for it to complete before we could submit the application write.
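The cost of this naive scheme can be made concrete with a small sketch. `CountingDisk` and `naive_tracked_write` are hypothetical names used only for illustration; the "disk" merely records the requests it receives, and every write is treated as synchronous.

```python
class CountingDisk:
    """Hypothetical disk that only records the requests it is given."""

    def __init__(self):
        self.requests = []

    def write(self, kind, block):
        # Synchronous in this sketch: the request is done when we return.
        self.requests.append((kind, block))


def naive_tracked_write(disk, block):
    disk.write("bitmap-set", block)    # must finish before the data is submitted
    disk.write("data", block)          # the actual application write
    disk.write("bitmap-clear", block)  # housekeeping after completion


disk = CountingDisk()
for block in (3, 4, 5):
    naive_tracked_write(disk, block)
print(len(disk.requests))              # 9
```

Three application writes turn into nine disk requests, and each data write has to wait for the bitmap update that precedes it.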
We are smarter than that.
6.4 Peanuts ...
To reliably keep track of the target blocks of in-flight IO, while minimizing the additional IO requests
required for this housekeeping, we came up with the concept of the "Activity-Log".
Think of your storage as a huge heap of peanuts. Sisyphus has tagged each of them with a distinct block
number. There are many people running around, taking some of the peanuts in their pockets (that is the
in-flight IO), and throwing them back on the heap (that is the IO completion). Painting them blue is allowed;
these are the WRITEs for which we are still missing the acknowledgment of the other node (dirty bits). Eating
peanuts is strictly forbidden, as is re-tagging.[4]
Blocks corresponding to the in-pocket peanuts have to be retransmitted; those corresponding to peanuts on the
heap need not be (but it would do no harm if some of them were).
Our mission is to know at any given moment, as precisely as possible, which peanuts are NOT in the
pockets of those people (and not yet painted blue), because if we know that, we can avoid retransmitting
the corresponding blocks after a Primary crash.
First, we get control of the situation. We structure the heap and put the peanuts, in order, into boxes
(activity-log extents), which in turn are numbered. We draw a line in the sand.
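As a rough sketch of the boxes idea: if we only remember which boxes may hold in-pocket peanuts, then after a Primary crash we retransmit every block in those boxes plus the blocks already painted blue. Tracking per box rather than per block is our reading of the analogy; the function names and the box size of 4 blocks are made up for this example.

```python
def extent_of(block, blocks_per_extent):
    # Number of the box (activity-log extent) that holds this block.
    return block // blocks_per_extent


def blocks_to_resync_after_crash(in_flight_blocks, dirty_blocks,
                                 device_blocks, blocks_per_extent):
    # Boxes that may contain in-pocket peanuts (in-flight IO).
    active_extents = {extent_of(b, blocks_per_extent) for b in in_flight_blocks}
    # Blue peanuts (dirty bits) always have to be resent.
    blocks = set(dirty_blocks)
    # Everything in an active box is resent, since we only know the box.
    for ext in active_extents:
        start = ext * blocks_per_extent
        end = min(start + blocks_per_extent, device_blocks)
        blocks.update(range(start, end))
    return sorted(blocks)


result = blocks_to_resync_after_crash(
    in_flight_blocks={5}, dirty_blocks={12},
    device_blocks=16, blocks_per_extent=4)
print(result)  # [4, 5, 6, 7, 12]
```

Instead of all 16 blocks, only the one active box and the single dirty block are retransmitted. Some heap peanuts (blocks 4, 6 and 7) are resent as well, which does no harm.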
[4] Some do that anyway; call them Eh-i-oh and Silent Corruption ;)