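XFS Filesystem Structure and Algorithms Explained

"XFS Filesystem Structure"

XFS is a highly scalable, high-performance filesystem that is widely used on Linux. It was originally developed by Silicon Graphics, Inc. and has since been updated and refined through community effort. XFS is designed for large-scale storage: it can handle petabytes of data and filesystems containing billions of inodes.

The core algorithms and data structures of XFS are the foundation of its scalability. They allow on-disk structures and indexes to be traversed efficiently at the sizes that large filesystems require. This scalability also brings a challenge: validating the structure of the filesystem. As a filesystem grows, verifying its integrity and correctness becomes increasingly difficult, because checking hundreds of millions of inodes and their associated metadata is an extremely complex process.

Filesystem layout and management are central to XFS. For example, it uses B+trees to index inodes, which allows fast lookup, insertion and deletion, and it uses them to make allocating and freeing disk space efficient. XFS also uses journalling to keep filesystem transactions consistent, so the filesystem can be recovered to a consistent state even after a crash or unclean shutdown.

The real-time device feature of XFS is another highlight; it is aimed at applications that need predictable, low-latency I/O. The document also covers the XFS journal format, the serialised data structure that records changes to the filesystem and allows fast recovery after a restart or failure. XFS also pays particular attention to metadata integrity.

Another important part of XFS is the free inode B+tree, which tracks free inodes so that inode allocation and freeing are more efficient. The document also includes an index of magic numbers, the identifiers of specific XFS data structures, which helps developers understand and debug the filesystem.

Through its algorithms and data structures, XFS achieves high performance and reliability in large-scale storage environments. As filesystems grow, however, validating the correctness of their structure becomes an important problem, and it calls for smarter tools and methods so that XFS can continue to provide reliable service at very large scale.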

XFS Algorithms & Data Structures 8 / 184

{
    struct xfs_mount *mp = bp->b_target->bt_mount;

    /* Check the CRC only if CRCs are enabled, then verify the contents. */
    if ((xfs_sb_version_hascrc(&mp->m_sb) &&
         !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
                           XFS_FOO_CRC_OFF)) ||
        !xfs_foo_verify(bp)) {
        XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
        xfs_buf_ioerror(bp, EFSCORRUPTED);
    }
}

The code ensures that the CRC is only checked if the filesystem has CRCs enabled by checking the superblock for the feature bit, and then if the CRC verifies OK (or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on whether the magic number can be used to determine the format of the block. In the case it can't, the code is structured as follows:

static bool
xfs_foo_verify(
    struct xfs_buf *bp)
{
    struct xfs_mount *mp = bp->b_target->bt_mount;
    struct xfs_ondisk_hdr *hdr = bp->b_addr;

    if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
        return false;

    /* The self-describing fields only exist when CRCs are enabled. */
    if (xfs_sb_version_hascrc(&mp->m_sb)) {
        if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
            return false;
        if (bp->b_bn != be64_to_cpu(hdr->blkno))
            return false;
        if (hdr->owner == 0)
            return false;
    }

    /* object specific verification checks here */

    return true;
}

If there are dierent magic numbers for the dierent formats, the verier will look like:

static bool
xfs_foo_verify(
    struct xfs_buf *bp)
{
    struct xfs_mount *mp = bp->b_target->bt_mount;
    struct xfs_ondisk_hdr *hdr = bp->b_addr;

    if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
        if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
            return false;
        if (bp->b_bn != be64_to_cpu(hdr->blkno))
            return false;
        if (hdr->owner == 0)
            return false;
    } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
        return false;

    /* object specific verification checks here */

    return true;
}

Write veriers are very similar to the read veriers, they just do things in the opposite order to the read veriers. A

typical write verier:

static void
xfs_foo_write_verify(
    struct xfs_buf *bp)
{
    struct xfs_mount *mp = bp->b_target->bt_mount;
    struct xfs_buf_log_item *bip = bp->b_fspriv;

    if (!xfs_foo_verify(bp)) {
        XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
        xfs_buf_ioerror(bp, EFSCORRUPTED);
        return;
    }

    if (!xfs_sb_version_hascrc(&mp->m_sb))
        return;

    if (bip) {
        struct xfs_ondisk_hdr *hdr = bp->b_addr;
        hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
    }
    xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
}

This will verify the internal structure of the metadata before we go any further, detecting corruptions that have occurred as the metadata has been modified in memory. If the metadata verifies OK, and CRCs are enabled, we then update the LSN field (when it was last modified) and calculate the CRC on the metadata. Once this is done, we can issue the IO.
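
The ordering described above (verify, stamp the LSN, then checksum on write; checksum, then verify on read) can be tried out in user space. The following is only a minimal sketch under stated assumptions: `struct toy_hdr`, `TOY_MAGIC`, the 512-byte block size and every `toy_*` function are hypothetical names invented for illustration, fields are kept in host byte order for brevity, and the layout is not an XFS on-disk format. The only property borrowed from the text is that the CRC is computed over the block with the CRC field itself treated as zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAGIC      0x544f5931u  /* hypothetical magic number ("TOY1") */
#define TOY_BLOCK_SIZE 512          /* assumed metadata block size */

/* Hypothetical self-describing header; not an XFS on-disk structure. */
struct toy_hdr {
    uint32_t magic;
    uint32_t crc;    /* CRC32C of the block, computed with this field zeroed */
    uint64_t blkno;  /* where the block is supposed to live */
    uint64_t lsn;    /* when it was last modified */
    uint64_t owner;
};

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Checksum of a whole block with the crc field treated as zero. */
static uint32_t toy_block_crc(const uint8_t *block)
{
    uint8_t tmp[TOY_BLOCK_SIZE];
    memcpy(tmp, block, TOY_BLOCK_SIZE);
    memset(tmp + offsetof(struct toy_hdr, crc), 0, sizeof(uint32_t));
    return crc32c(tmp, TOY_BLOCK_SIZE);
}

/* Build a zero-filled block with a valid header (helper for experiments). */
static void toy_init(uint8_t *block, uint64_t blkno, uint64_t owner)
{
    struct toy_hdr hdr = { .magic = TOY_MAGIC, .blkno = blkno, .owner = owner };
    memset(block, 0, TOY_BLOCK_SIZE);
    memcpy(block, &hdr, sizeof(hdr));
}

/* Structure check shared by the read and write paths. */
static bool toy_verify(const uint8_t *block, uint64_t expected_blkno)
{
    struct toy_hdr hdr;
    memcpy(&hdr, block, sizeof(hdr));
    return hdr.magic == TOY_MAGIC && hdr.blkno == expected_blkno &&
           hdr.owner != 0;
}

/* Write side: verify the contents first, then stamp the LSN, then checksum. */
static bool toy_write_verify(uint8_t *block, uint64_t blkno, uint64_t lsn)
{
    struct toy_hdr hdr;
    if (!toy_verify(block, blkno))
        return false;               /* never write corrupt metadata */
    memcpy(&hdr, block, sizeof(hdr));
    hdr.lsn = lsn;
    memcpy(block, &hdr, sizeof(hdr));
    hdr.crc = toy_block_crc(block); /* the CRC covers the new LSN */
    memcpy(block, &hdr, sizeof(hdr));
    return true;
}

/* Read side: opposite order, checksum first and then the contents. */
static bool toy_read_verify(const uint8_t *block, uint64_t blkno)
{
    struct toy_hdr hdr;
    memcpy(&hdr, block, sizeof(hdr));
    if (hdr.crc != toy_block_crc(block))
        return false;
    return toy_verify(block, blkno);
}
```

A block prepared with `toy_init()` and passed through `toy_write_verify()` should then pass `toy_read_verify()`, while flipping any single bit afterwards should make the read side reject it.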

2.5 Inodes and Dquots

Inodes and dquots are special snowflakes. They have per-object CRC and self-identifiers, but they are packed so that there are multiple objects per buffer. Hence we do not use per-buffer verifiers to do the work of per-object verification and CRC calculations. The per-buffer verifiers simply perform basic identification of the buffer - that they contain inodes or dquots, and that there are magic numbers in all the expected spots. All further CRC and verification checks are done when each inode is read from or written back to the buffer.

The structure of the verifiers and the identifiers checks is very similar to the buffer code described above. The only difference is where they are called. For example, inode read verification is done in xfs_iread() when the inode is first read out of the buffer and the struct xfs_inode is instantiated. The inode is already extensively verified during writeback in xfs_iflush_int(), so the only addition here is to add the LSN and CRC to the inode as it is copied back into the buffer.


Chapter 3

Delayed Logging

3.1 Introduction to Re-logging in XFS

XFS logging is a combination of logical and physical logging. Some objects, such as inodes and dquots, are logged in logical format where the details logged are made up of the changes to in-core structures rather than on-disk structures. Other objects - typically buffers - have their physical changes logged. The reason for these differences is to reduce the amount of log space required for objects that are frequently logged. Some parts of inodes are more frequently logged than others, and inodes are typically more frequently logged than any other object (except maybe the superblock buffer) so keeping the amount of metadata logged low is of prime importance.

The reason that this is such a concern is that XFS allows multiple separate modifications to a single object to be carried in the log at any given time. This allows the log to avoid needing to flush each change to disk before recording a new change to the object. XFS does this via a method called "re-logging". Conceptually, this is quite simple - all it requires is that any new change to the object is recorded with a new copy of all the existing changes in the new transaction that is written to the log.

That is, if we have a sequence of changes A through to F, and the object was written to disk after change D, we would see in the log the following series of transactions, their contents and the log sequence number (LSN) of the transaction:

    Transaction     Contents        LSN
        A           A               X
        B           A+B             X+n
        C           A+B+C           X+n+m
        D           A+B+C+D         X+n+m+o
        E           E               Y (> X+n+m+o)
        F           E+F             Y+p

In other words, each time an object is relogged, the new transaction contains the aggregation of all the previous changes currently held only in the log.
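
The table can be reproduced with a few lines of bookkeeping. The sketch below is a toy model only: `struct relog_obj`, `relog_commit()` and `relog_writeback()` are hypothetical names, and each change is reduced to one bit in a mask so that a transaction's contents are simply the set of changes not yet written back to disk.

```c
#include <assert.h>
#include <stdint.h>

/* Map a change letter 'A'..'Z' to one bit (illustrative encoding). */
#define CHANGE(c) (1u << ((c) - 'A'))

/* Toy object: remembers which changes exist only in the log. */
struct relog_obj {
    uint32_t in_log_only;
};

/*
 * Commit a new change. The new transaction carries the new change plus
 * every earlier change that has not yet reached disk, which is what
 * relogging means. Returns the contents of the new transaction.
 */
static uint32_t relog_commit(struct relog_obj *obj, char change)
{
    obj->in_log_only |= CHANGE(change);
    return obj->in_log_only;
}

/* Writing the object back to disk means later transactions start afresh. */
static void relog_writeback(struct relog_obj *obj)
{
    obj->in_log_only = 0;
}
```

Committing A through D, calling `relog_writeback()`, and then committing E and F yields the contents column of the table: A, A+B, A+B+C, A+B+C+D, E, E+F.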

This relogging technique also allows objects to be moved forward in the log so that an object being relogged does not prevent the tail of the log from ever moving forward. This can be seen in the table above by the changing (increasing) LSN of each subsequent transaction - the LSN is effectively a direct encoding of the location in the log of the transaction.

This relogging is also used to implement long-running, multiple-commit transactions. These transactions are known as rolling transactions, and require a special log reservation known as a permanent transaction reservation. A typical example of a rolling transaction is the removal of extents from an inode which can only be done at a rate of two extents per transaction because of reservation size limitations. Hence a rolling extent removal transaction keeps relogging the inode and btree buffers as they get modified in each removal operation. This keeps them moving forward in the log as the operation progresses, ensuring that the current operation never gets blocked by itself if the log wraps around.
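
As a rough illustration of a rolling transaction, the sketch below removes extents two at a time and bumps a counter standing in for the inode's LSN on every commit. The limit of two extents per transaction comes from the text; `struct toy_inode`, `toy_remove_extents()` and the use of a simple incrementing counter as an LSN are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define EXTENTS_PER_TRANS 2u  /* limit quoted in the text */

/* Toy inode: an extent count and the LSN of its last logged change. */
struct toy_inode {
    unsigned extents;
    uint64_t lsn;
};

/*
 * Remove all extents with a rolling transaction. Each roll removes at most
 * EXTENTS_PER_TRANS extents and relogs the inode at a new, higher LSN, so
 * the inode keeps moving forward in the log. Returns the number of commits.
 */
static unsigned toy_remove_extents(struct toy_inode *ip, uint64_t *next_lsn)
{
    unsigned commits = 0;

    while (ip->extents > 0) {
        unsigned n = ip->extents < EXTENTS_PER_TRANS ?
                     ip->extents : EXTENTS_PER_TRANS;
        ip->extents -= n;
        ip->lsn = (*next_lsn)++;  /* relogged in the new transaction */
        commits++;
    }
    return commits;
}
```

Under these assumptions an inode with five extents needs three commits, and its LSN after the last commit is higher than after the first.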

Hence it can be seen that the relogging operation is fundamental to the correct working of the XFS journalling subsystem. From the above description, most people should be able to see why XFS metadata operations write so much to the log - repeated operations to the same objects write the same changes to the log over and over again. Worse is the fact that objects tend to get dirtier as they get relogged, so each subsequent transaction is writing more metadata into the log.

Another feature of the XFS transaction subsystem is that most transactions are asynchronous. That is, they don't commit to disk until either a log buffer is filled (a log buffer can hold multiple transactions) or a synchronous operation forces the log buffers holding the transactions to disk. This means that XFS is doing aggregation of transactions in memory - batching them, if you like - to minimise the impact of the log IO on transaction throughput.

The limitation on asynchronous transaction throughput is the number and size of log buffers made available by the log manager. By default there are 8 log buffers available and the size of each is 32kB - the size can be increased up to 256kB by use of a mount option.

Effectively, this gives us the maximum bound of outstanding metadata changes that can be made to the filesystem at any point in time - if all the log buffers are full and under IO, then no more transactions can be committed until the current batch completes. It is now common for a single current CPU core to be able to issue enough transactions to keep the log buffers full and under IO permanently. Hence the XFS journalling subsystem can be considered to be IO bound.
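
The bound follows directly from the buffer count and size quoted above. A small worked calculation, using only those figures (8 buffers, 32kB by default, 256kB at most), is sketched below; the function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on log data that can be in flight: buffers times buffer size. */
static uint64_t log_buffer_bound_bytes(unsigned nbufs, unsigned buf_kib)
{
    return (uint64_t)nbufs * buf_kib * 1024u;
}
```

With the defaults this gives 8 x 32kB = 256kB, and with the largest buffer size 8 x 256kB = 2MB, which is the ceiling on outstanding changes under the existing logging scheme.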

3.2 Delayed Logging Concepts

The key thing to note about the asynchronous logging combined with the relogging technique XFS uses is that we can be relogging changed objects multiple times before they are committed to disk in the log buffers. If we return to the previous relogging example, it is entirely possible that transactions A through D are committed to disk in the same log buffer.

That is, a single log buffer may contain multiple copies of the same object, but only one of those copies needs to be there - the last one "D", as it contains all the changes from the previous changes. In other words, we have one necessary copy in the log buffer, and three stale copies that are simply wasting space. When we are doing repeated operations on the same set of objects, these "stale objects" can be over 90% of the space used in the log buffers. It is clear that reducing the number of stale objects written to the log would greatly reduce the amount of metadata we write to the log, and this is the fundamental goal of delayed logging.
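
To get a feel for how quickly stale copies dominate, consider a deliberately simple model: an object is relogged n times into the same log buffer and every change adds one unit of dirty data, so the k-th copy is k units long. This model and the function names below are assumptions for illustration; real objects do not grow this uniformly.

```c
#include <assert.h>
#include <stdint.h>

/* Units written when copies of size 1, 2, ..., n all land in one buffer. */
static uint64_t relog_units_written(uint64_t n)
{
    return n * (n + 1) / 2;
}

/* Only the last copy, of size n, is actually needed. */
static uint64_t relog_units_needed(uint64_t n)
{
    return n;
}

/* Share of the written units that is stale, as a truncated percentage. */
static unsigned relog_stale_percent(uint64_t n)
{
    uint64_t total = relog_units_written(n);

    if (total == 0)
        return 0;
    return (unsigned)((total - relog_units_needed(n)) * 100 / total);
}
```

For the A through D example this gives 10 units written, 4 needed and 60% stale; in this model the stale share passes 90% once an object has been relogged 20 times in the same buffer.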

From a conceptual point of view, XFS is already doing relogging in memory (where memory == log buffer), only it is doing it extremely inefficiently. It is using logical to physical formatting to do the relogging because there is no infrastructure to keep track of logical changes in memory prior to physically formatting the changes in a transaction to the log buffer. Hence we cannot avoid accumulating stale objects in the log buffers.

Delayed logging is the name we've given to keeping and tracking transactional changes to objects in memory outside the log buffer infrastructure. Because of the relogging concept fundamental to the XFS journalling subsystem, this is actually relatively easy to do - all the changes to logged items are already tracked in the current infrastructure. The big problem is how to accumulate them and get them to the log in a consistent, recoverable manner. Describing the problems and how they have been solved is the focus of this document.

One of the key changes that delayed logging makes to the operation of the journalling subsystem is that it disassociates the amount of outstanding metadata changes from the size and number of log buffers available. In other words, instead of there only being a maximum of 2MB of transaction changes not written to the log at any point in time, there may be a much greater amount being accumulated in memory. Hence the potential for loss of metadata on a crash is much greater than for the existing logging mechanism.

It should be noted that this does not change the guarantee that log recovery will result in a consistent filesystem. What it does mean is that as far as the recovered filesystem is concerned, there may be many thousands of transactions that simply did not occur as a result of the crash. This makes it even more important that applications that care about their data use fsync() where they need to ensure application level data integrity is maintained.

It should be noted that delayed logging is not an innovative new concept that warrants rigorous proofs to determine whether it is correct or not. The method of accumulating changes in memory for some period before writing them to the log is used effectively in many filesystems including ext3 and ext4. Hence no time is spent in this document trying to convince the reader that the concept is sound. Instead it is simply considered a "solved problem" and as such implementing it in XFS is purely an exercise in software engineering.

The fundamental requirements for delayed logging in XFS are simple:

1. Reduce the amount of metadata written to the log by at least an order of magnitude.
2. Supply sufficient statistics to validate Requirement #1.
3. Supply sufficient new tracing infrastructure to be able to debug problems with the new code.
4. No on-disk format change (metadata or log format).
5. Enable and disable with a mount option.
6. No performance regressions for synchronous transaction workloads.

3.3 Delayed Logging Design

3.3.1 Storing Changes

The problem with accumulating changes at a logical level (i.e. just using the existing log item dirty region tracking) is that when it comes to writing the changes to the log buffers, we need to ensure that the object we are formatting is not changing while we do this. This requires locking the object to prevent concurrent modification. Hence flushing the logical changes to the log would require us to lock every object, format them, and then unlock them again.

This introduces lots of scope for deadlocks with transactions that are already running. For example, a transaction has object A locked and modified, but needs the delayed logging tracking lock to commit the transaction. However, the flushing thread has the delayed logging tracking lock already held, and is trying to get the lock on object A to flush it to the log buffer. This appears to be an unsolvable deadlock condition, and it was solving this problem that was the barrier to implementing delayed logging for so long.

The solution is relatively simple - it just took a long time to recognise it. Put simply, the current logging code formats the changes to each item into a vector array that points to the changed regions in the item. The log write code simply copies the memory these vectors point to into the log buffer during transaction commit while the item is locked in the transaction. Instead of using the log buffer as the destination of the formatting code, we can use an allocated memory buffer big enough to fit the formatted vector.

If we then copy the vector into the memory buffer and rewrite the vector to point to the memory buffer rather than the object itself, we now have a copy of the changes in a format that is compatible with the log buffer writing code, and that does not require us to lock the item to access. This formatting and rewriting can all be done while the object is locked during transaction commit, resulting in a vector that is transactionally consistent and can be accessed without needing to lock the owning item.
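
The copy-and-repoint step can be sketched in user space as follows. This is an illustration of the idea only: `struct toy_iovec` and `toy_format_to_buffer()` are hypothetical names, locking is left out entirely, and the caller is assumed to hold whatever lock protects the object while the function runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One changed region of an object: where it is and how long it is. */
struct toy_iovec {
    void   *addr;
    size_t  len;
};

/*
 * Copy every region the vectors describe into one allocated buffer and
 * rewrite each vector to point at its copy. Afterwards the vectors no
 * longer reference the object, so they can be read without locking it.
 * Returns the buffer (the caller frees it) or NULL on allocation failure.
 */
static void *toy_format_to_buffer(struct toy_iovec *vec, size_t nvecs)
{
    size_t total = 0;
    size_t off = 0;
    unsigned char *buf;

    for (size_t i = 0; i < nvecs; i++)
        total += vec[i].len;

    buf = malloc(total ? total : 1);
    if (!buf)
        return NULL;

    for (size_t i = 0; i < nvecs; i++) {
        memcpy(buf + off, vec[i].addr, vec[i].len);
        vec[i].addr = buf + off;  /* repoint the vector at the copy */
        off += vec[i].len;
    }
    return buf;
}
```

Once the vectors have been rewritten, later modifications to the original object do not affect the formatted copy, which is what makes it safe to write the copy to the log at some later time.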
