The attached documents are enough to learn UBI and UBIFS in full. This note focuses on what they say about data consistency in UBI and UBIFS under unclean reboots and power cuts.
Both UBI and UBIFS are designed with tolerance to power-cuts in mind.
UBI has an internal debugging infrastructure that can emulate power failures for testing. The advantage of the emulation is that it emulates power failures at the critical points where control data structures are written to the device, whereas the probability of interrupting the system at those precise moments with physical power-cut testing is rather low.
UBI supports power-cut emulation for testing, which emulates power-cuts after a random number of writes. When a power-cut is emulated, UBI switches to read-only mode and disallows any further write to the UBI volume, thus emulating a power cut. The main idea of this mode is to emulate power cuts in interesting places, e.g. when writing the VID header.
Three debugfs files can be used to emulate an abnormal power cut in UBI:
/sys/kernel/debug/ubi/ubi0/
tst_emulate_power_cut
tst_emulate_power_cut_max
tst_emulate_power_cut_min
Emulation type | Flag value |
Allow power-cut to be emulated during EC header write | 1 |
Allow power-cut to be emulated during VID header write | 2 |
The min and max files specify the range for the number of writes that succeed before the power cut is emulated.
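As a rough illustration, the three files could be driven from a small script like the one below. This is a sketch, not something taken from the UBI documentation: the helper name configure_power_cut and its parameters are hypothetical, the flag values 1 and 2 come from the table above, and it assumes that each debugfs file accepts a decimal number written as text. The directory is a parameter so the helper can be tried against a scratch directory before pointing it at /sys/kernel/debug/ubi/ubi0, which needs root and a mounted debugfs.

```python
import os

# Flag values from the table above.
POWER_CUT_EC_WRITE = 1   # allow emulation during EC header write
POWER_CUT_VID_WRITE = 2  # allow emulation during VID header write


def configure_power_cut(ubi_debug_dir, flags, min_writes, max_writes):
    """Write the three tst_emulate_power_cut* files under ubi_debug_dir.

    Assumption: each file accepts a decimal number written as text.
    """
    if min_writes > max_writes:
        raise ValueError("min_writes must not exceed max_writes")
    settings = {
        "tst_emulate_power_cut_min": min_writes,
        "tst_emulate_power_cut_max": max_writes,
        # Written last so the range is in place before emulation is enabled.
        "tst_emulate_power_cut": flags,
    }
    for name, value in settings.items():
        with open(os.path.join(ubi_debug_dir, name), "w") as f:
            f.write(str(value))
    return settings
```

A call such as configure_power_cut("/sys/kernel/debug/ubi/ubi0", POWER_CUT_EC_WRITE | POWER_CUT_VID_WRITE, 10, 100) would enable both emulation types; the values 10 and 100 are arbitrary examples.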
tolerance to unclean reboots - UBIFS is a journaling file system and it tolerates sudden crashes and unclean reboots; UBIFS just replays the journal and recovers from the unclean reboot; mount time is a little slower in this case because the journal has to be replayed, but UBIFS does not need to scan the whole media, so mounting still takes only fractions of a second; note that the authors paid special attention to this aspect of UBIFS.
UBIFS has internal debugging infrastructure to emulate power failures, and the authors used it for extensive testing over a long period. The advantage of the emulation is that it emulates power failures even in situations that occur rarely, for example when the master node is updated or the log is changed. The probability of interrupting the system at those moments is very low in real life.
There is also a powerful user-space test program called integck, which performs a lot of random I/O operations and checks the integrity of the FS after remount. This test can also handle emulated power-cuts and check the FS integrity.
The write-buffer implementation is a little more complex, and we actually have several of them - one for each journal head. But this does not change the basic idea behind the write-buffer.
A few notes with regard to synchronization:
"sync()" also synchronizes all write-buffers;
"fsync(fd)" also synchronizes all write-buffers which contain pieces of "fd";
synchronous files, as well as files opened with "O_SYNC", bypass write-buffers, so the I/O is indeed synchronous for these files;
write-buffers are also bypassed if the file-system is mounted with the "-o sync" mount option.
Take into account that write-buffers delay the data synchronization timeout defined by "dirty_expire_centisecs" by 3-5 seconds. However, since write-buffers are small, only a small amount of data is delayed.
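JFFS2 stores the meta-data in the headers of its data nodes, so once JFFS2 has scanned to the latest node it knows the meta-data. If a power cut happens during a sequential write, only some data at the end of the file is lost.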
In JFFS2 all the meta-data (like inode atime/mtime/ctime, inode size, UID/GID, etc) are stored in the data node headers. Data nodes carry 4KiB of (compressed) data. This means that the meta-data information is duplicated in many places, but this also means that every time JFFS2 writes a data node to the flash media, it updates inode size as well. So when JFFS2 mounts it scans the flash media, finds the latest data node, and fetches the inode size from there.
In practice this means that JFFS2 will write these 10MiB of data sequentially, from the beginning to the end. And if you have a power cut, you will just lose some amount of data at the end of the inode. For example, if JFFS2 starts writing those 10MiB of data, writes 5MiB, and a power cut happens, you will end up with a 5MiB f.dat file. You lose only the last 5MiB.
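The UBIFS case is a bit more complicated, because UBIFS keeps the meta-data in separate inode nodes. The UBIFS policy is:
a data node written to flash must never lie beyond the size recorded in the on-flash inode, although it may lie beyond that size while the larger, up-to-date size exists only in the in-memory inode. When that happens, UBIFS first writes the inode node and then writes the data nodes. If the data nodes are then lost to a power cut, the file ends up with holes at its end.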
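The difference between the two recovery outcomes can be shown with a toy model. This is only a sketch of the behaviour described above, not how either file system is implemented: the function names are hypothetical, nodes are reduced to (offset, length) pairs, and a 4KiB block size is assumed for both cases.

```python
BLOCK = 4096  # data node payload size used in the JFFS2 description above


def jffs2_size_after_cut(data_nodes):
    """JFFS2-style: every data node header carries the inode size, so the
    latest surviving node determines the file size."""
    return max((offset + length for offset, length in data_nodes), default=0)


def ubifs_state_after_cut(on_flash_inode_size, data_nodes):
    """UBIFS-style: the size comes from the separate inode node; blocks
    below that size with no surviving data node read back as holes."""
    present = {offset for offset, _ in data_nodes}
    holes = [off for off in range(0, on_flash_inode_size, BLOCK)
             if off not in present]
    return on_flash_inode_size, holes


# Ten blocks were meant to be written; the power cut hits after five.
survived = [(i * BLOCK, BLOCK) for i in range(5)]
print(jffs2_size_after_cut(survived))               # 20480
print(ubifs_state_after_cut(10 * BLOCK, survived))  # (40960, [20480, 24576, 28672, 32768, 36864])
```

In the JFFS2-style case the file simply ends after the fifth block, while in the UBIFS-style case the inode already records the full size and the last five blocks are holes.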
Every piece of information UBIFS writes to the media has a CRC-32 checksum. UBIFS protects both data and meta-data with CRC. Every time the meta-data is read, the CRC checksum is verified.
The data CRC is not verified by default. We do this to improve the default file-system read speed.
But UBIFS allows data verification to be switched on with the chk_data_crc mount option.
Note, currently UBIFS cannot disable CRC-32 calculations on write, because the UBIFS recovery process depends on it. When recovering from an unclean reboot and re-playing the journal, UBIFS has to be able to detect broken and half-written UBIFS nodes and drop them, and UBIFS depends on the CRC-32 checksum here.
In other words, if you use UBIFS with data CRC-32 checking disabled, you still have the CRC-32 checksum attached to each piece of data, and you may mount UBIFS with the chk_data_crc option to enable CRC-32 checking at any time.
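To summarize: a CRC is always written for both meta-data and data nodes, but only the meta-data CRC is verified on read by default. Data nodes are not CRC-checked by default, although the check can be enabled with a mount option.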
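The role of the checksum during journal replay can be sketched as follows. This is a simplified illustration, not the real UBIFS node format: the 8-byte header (CRC-32 plus payload length) and the function names are hypothetical, and zlib's CRC-32 stands in for whatever the kernel computes.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # hypothetical layout: CRC-32, payload length


def make_node(payload):
    """Build a node: header with CRC-32 and length, followed by the payload."""
    return HEADER.pack(zlib.crc32(payload), len(payload)) + payload


def node_is_valid(raw):
    """Return True only if the node is complete and its CRC-32 matches."""
    if len(raw) < HEADER.size:
        return False
    crc, length = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    return len(payload) == length and zlib.crc32(payload) == crc


def replay(nodes):
    """Keep intact nodes and drop broken or half-written ones."""
    return [raw for raw in nodes if node_is_valid(raw)]
```

A node cut short by a power failure fails the length check, and a node whose payload was only partly programmed fails the CRC comparison, so both are dropped during replay in this model.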
Changing a file atomically means changing its contents in a way that unclean reboots could not lead to any corruption or inconsistency in the file.
The only reliable way to do this in UBIFS (and in most other file-systems, e.g. JFFS2 or ext3) is the following:
make a copy of the file;
change the copy;
synchronize the copy;
re-name the copy to the file (using the rename() libc function or the mv utility).
Note, if a power-cut happens during the re-naming, the original file will be intact because the re-name operation is atomic. This is a POSIX requirement and UBIFS satisfies it.
OpenWrt UCI uses this approach: on "uci commit" it first writes to a separate file and renames it at the end.
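The four steps listed above can be sketched as a small helper. This is a minimal illustration under stated assumptions: the function name atomic_update and the ".tmp" suffix for the copy are hypothetical, and the whole file is assumed to fit in memory. os.replace is used for the final step; on POSIX systems it performs a rename().

```python
import os
import shutil


def atomic_update(path, change):
    """Apply change(bytes) -> bytes to the file at path via copy + rename."""
    tmp = path + ".tmp"            # hypothetical name for the copy
    shutil.copyfile(path, tmp)     # 1. make a copy of the file
    with open(tmp, "r+b") as f:
        data = change(f.read())    # 2. change the copy
        f.seek(0)
        f.write(data)
        f.truncate()
        f.flush()                  # push Python's buffer to the kernel
        os.fsync(f.fileno())       # 3. synchronize the copy
    os.replace(tmp, path)          # 4. re-name the copy to the file
```

For example, atomic_update("config.txt", lambda old: old + b"option=1\n") appends a line; if power is lost before the rename, config.txt still holds its previous contents.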
Zero-length files are a special case of corruption which happens when an application first truncates a file, then updates it. The truncation is synchronous in UBIFS, so it is written to the media straight away. But when the data are written, they go to the page cache, not to the flash media. So when an unclean reboot happens, the file becomes empty (truncated) because the data are lost.
Zero-length files also appear when an application creates a new file, then writes to the file, and a power cut happens. The reason is similar - file creation is a synchronous operation, data writing is not.
Well, the description is a bit simplified. Actually, when a file is created or truncated, the creation/truncation UBIFS information is written to the write-buffer, not straight to the media. So if a power cut happens before the write-buffer is synchronized, the file will disappear (creation case) or stay intact (truncation case). But since the write-buffer is small and all UBIFS writes go there, it is usually synchronized very soon. After this point the file is created/truncated for real.
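Given the synchronization notes earlier, one way an application can narrow this window is to synchronize the data itself before it relies on the file. The sketch below shows two variants: calling fsync() after writing, and opening the file with O_SYNC. The function names are hypothetical, and the code only demonstrates the calls; it does not claim anything about timing on a particular UBIFS setup.

```python
import os


def write_and_fsync(path, data):
    """Create or truncate the file, write the data, then fsync it."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to write the file out


def write_with_o_sync(path, data):
    """Open with O_SYNC so that each write() is synchronous."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        offset = 0
        while offset < len(data):
            # os.write may write fewer bytes than requested, so loop.
            offset += os.write(fd, data[offset:])
    finally:
        os.close(fd)
```

Both variants still truncate the existing file first, so when the old contents must survive a power cut, the copy-and-rename approach described earlier remains the safer choice.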
References
【1】http://www.linux-mtd.infradead.org/doc/ubi.html
【2】Thomas Gleixner, Frank Haverkamp, Artem Bityutskiy. UBI - Unsorted Block Images.
http://www.linux-mtd.infradead.org/doc/ubidesign/ubidesign.pdf
【3】https://www.linux-mtd.infradead.org/doc/ubifs.html
【4】Adrian Hunter, Artem Bityutskiy. UBIFS file system, NOKIA.
http://www.linux-mtd.infradead.org/doc/ubifs.pdf
【5】Adrian Hunter. A Brief Introduction to the Design of UBIFS. 2008.
http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf
【6】UBI FAQ and HOWTO
http://www.linux-mtd.infradead.org/faq/ubi.html
【7】UBIFS FAQ and HOWTO
http://www.linux-mtd.infradead.org/faq/ubifs.html
【8】https://www.kernel.org/doc/html/latest/filesystems/ubifs.html
【9】Katsuki. Evaluation of UBI and UBIFS. TOSHIBA. 2009
https://elinux.org/images/f/f8/CELFJamboree30-UBIFS_update.pdf
【10】Theodore Ts'o. Delayed allocation and the zero-length file problem. 2009
https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
【11】Theodore Ts'o. Don’t fear the fsync! 2009
https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/