XFS Filesystem Usage Summary

2017-05-17 niao5929

These notes are a bit loosely organized and some of the translations may not be entirely accurate; apologies in advance.
1. Concepts and Background
1.1 XFS filesystem concepts
1.1.1 Filesystem journal (log)
1.1.2 Allocation groups
1.1.3 Striped allocation
1.1.4 Extent-based allocation
1.1.5 Variable block sizes
1.1.6 Delayed allocation
1.1.7 Sparse files
1.2 Common mkfs.xfs options:
 -b option, size=: the filesystem block size. The default block size is 4096 bytes (4 KB).

 -n option, size=: the directory block size. The directory block size must be at least as large as the filesystem block size.
 -l option: every XFS filesystem has a journal (log), which needs dedicated disk space. This space is not reported by df and cannot be accessed through a file name. The log can be internal or external: an external log lives on a separate device, while an internal log occupies a reserved region inside the filesystem; for an internal log the size is set with -l size=. The default log size grows with the filesystem size, up to a maximum of 128 MB on a 1 TB filesystem.
 For filesystems with very high transaction activity, a larger log is recommended, but avoid making the log too large: a large log can increase the time it takes to mount the filesystem after a crash.
 
 -d option: parameters for the data section, for example how allocation is laid out on a RAID device.
 For a RAID device, the default stripe unit is 0, indicating that the feature is disabled. You should configure the stripe unit and width sizes of RAID devices in order to avoid unexpected performance anomalies caused by the filesystem doing non-optimal I/O operations to the RAID unit. For example, if a block write is not aligned on a RAID stripe unit boundary and is not a full stripe unit, the RAID will be forced to do a read/modify/write cycle to write the data. This can have a significant performance impact. By setting the stripe unit size properly, XFS will avoid unaligned accesses.
 ## Example: creating an XFS filesystem on top of a RAID array:
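A minimal sketch (the device name /dev/md0 and the array layout, RAID 1+0 over 8 disks with a 64 KB stripe unit, are assumptions for illustration; substitute your own values):
# mkfs.xfs -d su=64k,sw=4 /dev/md0
Here su is the RAID stripe unit and sw is the number of data-bearing disks in the stripe (8 mirrored disks give 4); the reasoning behind these two values is worked through in section 2.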

1.3 Common XFS commands
xfs_admin: adjust various parameters of an XFS filesystem
xfs_copy: copy the contents of an XFS filesystem to one or more targets (in parallel)
xfs_db: debug or examine an XFS filesystem (e.g. inspect fragmentation)
xfs_check: check the consistency of an XFS filesystem
xfs_bmap: print the block mapping of a file
xfs_repair: attempt to repair a damaged XFS filesystem
xfs_fsr: defragment
xfs_quota: manage disk quotas on an XFS filesystem
xfs_metadump: copy XFS filesystem metadata to a file
xfs_mdrestore: restore metadata from a file onto an XFS filesystem
xfs_growfs: resize an XFS filesystem (grow only)
xfs_freeze: suspend (-f) and resume (-u) access to an XFS filesystem
xfs_logprint: print the log of an XFS filesystem
xfs_mkfile: create a preallocated file on an XFS filesystem
xfs_info: show detailed filesystem geometry information
xfs_ncheck: generate pathnames from i-numbers for XFS
xfs_rtcp: XFS realtime copy command
xfs_io: debug the XFS I/O path
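A couple of these in action (a sketch only; /dev/sdc1 and the mount point /mnt are the same placeholders used in the examples later in this article, and the filesystem must be unmounted when its label is changed):
# xfs_admin -L data01 /dev/sdc1
# xfs_admin -l /dev/sdc1
# xfs_info /mnt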


2. Planning an XFS filesystem
A worked example illustrates this best. It comes from a MySQL deployment (see the Percona post in the references) and shows how to do the calculations when allocating space.

2.1 Inspect the disk
#fdisk -ul
Disk /dev/sda: 438.5 GB, 438489317376 bytes
255 heads, 63 sectors/track, 53309 cylinders, total 856424448 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00051fe9
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     7813119     3905536   82  Linux swap / Solaris
Partition 1 does not end on cylinder boundary.
/dev/sda2   *     7813120    27344895     9765888   83  Linux
/dev/sda3        27344896   856422399   414538752   83  Linux
2.2 Calculate the block alignment
 We want to use mysql on /dev/sda3, but how can we ensure that it is aligned with the RAID stripes?  It takes a small amount of math:

    Start with your RAID stripe size.  Let's use 64k, which is a common default.  In this case 64K = 2^16 = 65536 bytes.
    Get your sector size from fdisk.  In this case 512 bytes.
    Calculate how many sectors fit in a RAID stripe: 65536 / 512 = 128 sectors per stripe.
    Get the start boundary of our mysql partition from fdisk: 27344896.
    See if the start boundary of our mysql partition falls on a stripe boundary by dividing the start sector of the partition by the sectors per stripe: 27344896 / 128 = 213632.  This is a whole number, so we are good.  If it had a remainder, the partition would not start on a RAID stripe boundary (see the quick shell check below).
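The same check can be repeated from the shell (the numbers are the stripe size, sector size and partition start sector from above):
# echo $((65536 / 512))
128
# echo $((27344896 % 128))
0
128 is the number of sectors per stripe; a remainder of 0 means the partition start is stripe-aligned.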
    
Create the Filesystem

XFS requires a little massaging (or a lot).  For a standard server, it’s fairly simple.  We need to know two things:

    RAID stripe size
    Number of unique, utilized disks in the RAID.  This is the same count used in the usable-capacity formulas:
        RAID 1+0:  is a set of mirrored drives, so the number here is num drives / 2.
        RAID 5: is striped drives plus one full drive of parity, so the number here is num drives – 1.
In our case, it is RAID 1+0 64k stripe with 8 drives.  Since those drives each have a mirror, there are really 4 sets of unique drives that are striped over the top.  Using these numbers, we set the ‘su’ and ‘sw’ options in mkfs.xfs with those two values respectively.
 
2.3 Format the filesystem
Putting the numbers from the example together, the command is: mkfs.xfs -d su=64k,sw=4 /dev/sda3
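To confirm that the geometry was applied, you can check the sunit/swidth values that xfs_info reports once the filesystem is mounted (a sketch; /data is an assumed mount point, and the values are reported in filesystem blocks, so with 4 KB blocks a 64 KB stripe unit should appear as sunit=16 and the 4-disk width as swidth=64):
# mount /dev/sda3 /data
# xfs_info /data | grep sunit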

3. Creating an XFS filesystem
3.1 With default options
#mkfs.xfs /dev/sdc1
meta-data=/dev/sdc1 isize=256    agcount=18, agsize=1048576 blks
data     =                       bsize=4096   blocks=17921788, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=0
naming   =version 2              bsize=4096  
log      =internal log           bsize=4096   blocks=2187, version=1
         =                       sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0

3.2 Specifying the block size and internal log size

# mkfs.xfs -b size=1k -l size=10m /dev/sdc1
meta-data=/dev/sdc1 isize=256    agcount=18, agsize=4194304 blks
data     =                       bsize=1024   blocks=71687152, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=0
naming   =version 2              bsize=4096  
log      =internal log           bsize=1024   blocks=10240, version=1
         =                       sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0
3.3 Using a separate volume as an external log
# mkfs.xfs -l logdev=/dev/sdh,size=65536b /dev/sdc1
meta-data=/dev/sdc1              isize=256    agcount=4, agsize=76433916 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=305735663, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =/dev/sdh               bsize=4096   blocks=65536, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

3.4 Directory block size

# mkfs.xfs -b size=2k -n size=4k /dev/sdc1
meta-data=/dev/sdc1              isize=256    agcount=4, agsize=152867832 blks
         =                       sectsz=512   attr=2
data     =                       bsize=2048   blocks=611471327, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=2048   blocks=298569, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

3.5 Growing the filesystem
Files already on the filesystem are untouched when space is added; the added space simply becomes available as additional file storage.
XVM volumes support growing an XFS filesystem.
# xfs_growfs /mnt
meta-data=/mnt                   isize=256    agcount=30, agsize=262144 blks
data     =                       bsize=4096   blocks=7680000, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=0
naming   =version 2              bsize=4096  
log      =internal               bsize=4096   blocks=1200 version=1
         =                       sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0
data blocks changed from 7680000 to 17921788
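In practice the underlying device usually has to be enlarged first, and the filesystem is then grown while mounted. A minimal sketch, assuming the filesystem sits on a hypothetical LVM volume /dev/vg0/data mounted at /mnt:
# lvextend -L +10G /dev/vg0/data
# xfs_growfs /mnt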

4. Filesystem maintenance
4.1 Defragmentation
Show the block map of a file: xfs_bmap -v file.tar.bz2
Show filesystem fragmentation: xfs_db -c frag -r /dev/sda1
Defragment: xfs_fsr /dev/sda1

4.2 Checking filesystem consistency
xfs_repair -n /dev/cciss/cpd0p
    xfs_repair -n (no-modify mode)
    xfs_check
Unlike fsck, neither xfs_check nor xfs_repair is invoked automatically at boot time. Run these commands when you suspect the filesystem has a problem.
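A typical dry run on an unmounted filesystem (using the /dev/sdc1 placeholder from the earlier examples):
# umount /dev/sdc1
# xfs_repair -n /dev/sdc1
With -n no changes are written; remove the option to actually repair the problems it reports.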

4.3 Repairing the filesystem
Repairing an inconsistent filesystem:
Run without the -n option, xfs_repair checks the consistency of the XFS filesystem and, where it finds problems, corrects them as far as possible. The filesystem being checked and repaired must be unmounted.
xfs_repair works in roughly seven phases, and its repair messages fall into roughly five categories.
If xfs_repair has placed files and directories in the lost+found directory and you have not moved them, the next run of xfs_repair temporarily disconnects the inodes of those files and directories; they are reconnected again before xfs_repair terminates. Because of the disconnected inodes in lost+found, you will see output like the following:

Phase 1 - find and verify superblock...
Phase 2 - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        ...
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - clear lost+found (if it exists) ...
        - clearing existing “lost+found” inode
        - deleting existing “lost+found” entry
        - check for inodes claiming duplicate blocks...
        - agno = 0
imap claims in-use inode 242000 is free, correcting imap
        - agno = 1
        - agno = 2
        ...
Phase 5 - rebuild AG headers and trees...
        - reset superblock counters...
Phase 6 - check inode connectivity...
        - ensuring existence of lost+found directory
        - traversing filesystem starting at / ...
        - traversal finished ...
        - traversing all unattached subtrees ...
        - traversals finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 242000, moving to lost+found    
Phase 7 - verify and correct link counts...
done
In this example, inode 242000 was an inode that was moved to lost+found during a previous xfs_repair run. This run of xfs_repair found that the filesystem is consistent. If the lost+found directory had been empty, in phase 4 only the messages about clearing and deleting the lost+found directory would have appeared. The imap claims and disconnected inode messages appear (one pair of messages per inode) if there are inodes in the lost+found directory.


References:

http://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/
http://blog.chinaunix.net/uid-451-id-3188926.html