diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2013-05-09 13:07:40 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2013-05-09 13:07:40 -0700 |
commit | 983a5f84a4a11c8706ca70615125db711336b684 (patch) | |
tree | bbf16b836903aaf523e7c637a0d7191ba8fa172d | |
parent | 8769e078a9a2bce13d39c08e0e5a513f5320e1de (diff) | |
parent | 667e7d94a1683661cff5fe9a0fa0d7f8fdd2c007 (diff) |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
"These are mostly fixes. The biggest exceptions are Josef's skinny
extents and Jan Schmidt's code to rebuild our quota indexes if they
get out of sync (or you enable quotas on an existing filesystem).
The skinny extents are off by default because they are a new variation
on the extent allocation tree format. btrfstune -x enables them, and
the new format makes the extent allocation tree about 30% smaller.
I rebased this a few days ago to rework Dave Sterba's crc checks on
the super block, but almost all of these go back to rc6, since I
though 3.9 was due any minute.
The biggest missing fix is the tracepoint bug that was hit late in
3.9. I ran into problems with that in overnight testing and I'm still
tracking it down. I'll definitely have that fixed for rc2."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits)
Btrfs: allow superblock mismatch from older mkfs
btrfs: enhance superblock checks
btrfs: fix misleading variable name for flags
btrfs: use unsigned long type for extent state bits
Btrfs: improve the loop of scrub_stripe
btrfs: read entire device info under lock
btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
btrfs: handle errors returned from get_tree_block_key
btrfs: make static code static & remove dead code
Btrfs: deal with errors in write_dev_supers
Btrfs: remove almost all of the BUG()'s from tree-log.c
Btrfs: deal with free space cache errors while replaying log
Btrfs: automatic rescan after "quota enable" command
Btrfs: rescan for qgroups
Btrfs: split btrfs_qgroup_account_ref into four functions
Btrfs: allocate new chunks if the space is not enough for global rsv
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
btrfs: move leak debug code to functions
Btrfs: return free space in cow error path
Btrfs: set UUID in root_item for created trees
...
50 files changed, 3418 insertions, 1911 deletions
diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt index 7671352216f..b349d57b76e 100644 --- a/Documentation/filesystems/btrfs.txt +++ b/Documentation/filesystems/btrfs.txt @@ -1,8 +1,8 @@ - BTRFS - ===== +BTRFS +===== -Btrfs is a new copy on write filesystem for Linux aimed at +Btrfs is a copy on write filesystem for Linux aimed at implementing advanced features while focusing on fault tolerance, repair and easy administration. Initially developed by Oracle, Btrfs is licensed under the GPL and open for contribution from anyone. @@ -34,9 +34,175 @@ The main Btrfs features include: * Online filesystem defragmentation +Mount Options +============= - MAILING LIST - ============ +When mounting a btrfs filesystem, the following option are accepted. +Unless otherwise specified, all options default to off. + + alloc_start=<bytes> + Debugging option to force all block allocations above a certain + byte threshold on each block device. The value is specified in + bytes, optionally with a K, M, or G suffix, case insensitive. + Default is 1MB. + + autodefrag + Detect small random writes into files and queue them up for the + defrag process. Works best for small files; Not well suited for + large database workloads. + + check_int + check_int_data + check_int_print_mask=<value> + These debugging options control the behavior of the integrity checking + module (the BTRFS_FS_CHECK_INTEGRITY config option required). + + check_int enables the integrity checker module, which examines all + block write requests to ensure on-disk consistency, at a large + memory and CPU cost. + + check_int_data includes extent data in the integrity checks, and + implies the check_int option. + + check_int_print_mask takes a bitmask of BTRFSIC_PRINT_MASK_* values + as defined in fs/btrfs/check-integrity.c, to control the integrity + checker module behavior. + + See comments at the top of fs/btrfs/check-integrity.c for more info. + + compress + compress=<type> + compress-force + compress-force=<type> + Control BTRFS file data compression. Type may be specified as "zlib" + "lzo" or "no" (for no compression, used for remounting). If no type + is specified, zlib is used. If compress-force is specified, + all files will be compressed, whether or not they compress well. + If compression is enabled, nodatacow and nodatasum are disabled. + + degraded + Allow mounts to continue with missing devices. A read-write mount may + fail with too many devices missing, for example if a stripe member + is completely missing. + + device=<devicepath> + Specify a device during mount so that ioctls on the control device + can be avoided. Especialy useful when trying to mount a multi-device + setup as root. May be specified multiple times for multiple devices. + + discard + Issue frequent commands to let the block device reclaim space freed by + the filesystem. This is useful for SSD devices, thinly provisioned + LUNs and virtual machine images, but may have a significant + performance impact. (The fstrim command is also available to + initiate batch trims from userspace). + + enospc_debug + Debugging option to be more verbose in some ENOSPC conditions. + + fatal_errors=<action> + Action to take when encountering a fatal error: + "bug" - BUG() on a fatal error. This is the default. + "panic" - panic() on a fatal error. + + flushoncommit + The 'flushoncommit' mount option forces any data dirtied by a write in a + prior transaction to commit as part of the current commit. This makes + the committed state a fully consistent view of the file system from the + application's perspective (i.e., it includes all completed file system + operations). This was previously the behavior only when a snapshot is + created. + + inode_cache + Enable free inode number caching. Defaults to off due to an overflow + problem when the free space crcs don't fit inside a single page. + + max_inline=<bytes> + Specify the maximum amount of space, in bytes, that can be inlined in + a metadata B-tree leaf. The value is specified in bytes, optionally + with a K, M, or G suffix, case insensitive. In practice, this value + is limited by the root sector size, with some space unavailable due + to leaf headers. For a 4k sectorsize, max inline data is ~3900 bytes. + + metadata_ratio=<value> + Specify that 1 metadata chunk should be allocated after every <value> + data chunks. Off by default. + + noacl + Disable support for Posix Access Control Lists (ACLs). See the + acl(5) manual page for more information about ACLs. + + nobarrier + Disables the use of block layer write barriers. Write barriers ensure + that certain IOs make it through the device cache and are on persistent + storage. If used on a device with a volatile (non-battery-backed) + write-back cache, this option will lead to filesystem corruption on a + system crash or power loss. + + nodatacow + Disable data copy-on-write for newly created files. Implies nodatasum, + and disables all compression. + + nodatasum + Disable data checksumming for newly created files. + + notreelog + Disable the tree logging used for fsync and O_SYNC writes. + + recovery + Enable autorecovery attempts if a bad tree root is found at mount time. + Currently this scans a list of several previous tree roots and tries to + use the first readable. + + skip_balance + Skip automatic resume of interrupted balance operation after mount. + May be resumed with "btrfs balance resume." + + space_cache (*) + Enable the on-disk freespace cache. + nospace_cache + Disable freespace cache loading without clearing the cache. + clear_cache + Force clearing and rebuilding of the disk space cache if something + has gone wrong. + + ssd + nossd + ssd_spread + Options to control ssd allocation schemes. By default, BTRFS will + enable or disable ssd allocation heuristics depending on whether a + rotational or nonrotational disk is in use. The ssd and nossd options + can override this autodetection. + + The ssd_spread mount option attempts to allocate into big chunks + of unused space, and may perform better on low-end ssds. ssd_spread + implies ssd, enabling all other ssd heuristics as well. + + subvol=<path> + Mount subvolume at <path> rather than the root subvolume. <path> is + relative to the top level subvolume. + + subvolid=<ID> + Mount subvolume specified by an ID number rather than the root subvolume. + This allows mounting of subvolumes which are not in the root of the mounted + filesystem. + You can use "btrfs subvolume list" to see subvolume ID numbers. + + subvolrootid=<objectid> (deprecated) + Mount subvolume specified by <objectid> rather than the root subvolume. + This allows mounting of subvolumes which are not in the root of the mounted + filesystem. + You can use "btrfs subvolume show " to see the object ID for a subvolume. + + thread_pool=<number> + The number of worker threads to allocate. The default number is equal + to the number of CPUs + 2, or 8, whichever is smaller. + + user_subvol_rm_allowed + Allow subvolumes to be deleted by a non-root user. Use with caution. + +MAILING LIST +============ There is a Btrfs mailing list hosted on vger.kernel.org. You can find details on how to subscribe here: @@ -49,8 +215,8 @@ http://dir.gmane.org/gmane.comp.file-systems.btrfs - IRC - === +IRC +=== Discussion of Btrfs also occurs on the #btrfs channel of the Freenode IRC network. diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig index 9a8622a5b86..2b3b8329697 100644 --- a/fs/btrfs/Kconfig +++ b/fs/btrfs/Kconfig @@ -1,5 +1,5 @@ config BTRFS_FS - tristate "Btrfs filesystem Unstable disk format" + tristate "Btrfs filesystem support" select LIBCRC32C select ZLIB_INFLATE select ZLIB_DEFLATE @@ -52,3 +52,23 @@ config BTRFS_FS_CHECK_INTEGRITY In most cases, unless you are a btrfs developer who needs to verify the integrity of (super)-block write requests during the run of a regression test, say N + +config BTRFS_FS_RUN_SANITY_TESTS + bool "Btrfs will run sanity tests upon loading" + depends on BTRFS_FS + help + This will run some basic sanity tests on the free space cache + code to make sure it is acting as it should. These are mostly + regression tests and are only really interesting to btrfs devlopers. + + If unsure, say N. + +config BTRFS_DEBUG + bool "Btrfs debugging support" + depends on BTRFS_FS + help + Enable run-time debugging support for the btrfs filesystem. This may + enable additional and expensive checks with negative impact on + performance, or export extra information via sysfs. + + If unsure, say N. diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index bd605c87adf..b4fb4155811 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -352,6 +352,8 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info, err = __resolve_indirect_ref(fs_info, search_commit_root, time_seq, ref, parents, extent_item_pos); + if (err == -ENOMEM) + goto out; if (err) continue; @@ -367,7 +369,7 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info, new_ref = kmalloc(sizeof(*new_ref), GFP_NOFS); if (!new_ref) { ret = -ENOMEM; - break; + goto out; } memcpy(new_ref, ref, sizeof(*ref)); new_ref->parent = node->val; @@ -377,7 +379,7 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info, } ulist_reinit(parents); } - +out: ulist_free(parents); return ret; } @@ -421,7 +423,10 @@ static int __add_missing_keys(struct btrfs_fs_info *fs_info, BUG_ON(!ref->wanted_disk_byte); eb = read_tree_block(fs_info->tree_root, ref->wanted_disk_byte, fs_info->tree_root->leafsize, 0); - BUG_ON(!eb); + if (!eb || !extent_buffer_uptodate(eb)) { + free_extent_buffer(eb); + return -EIO; + } btrfs_tree_read_lock(eb); if (btrfs_header_level(eb) == 0) btrfs_item_key_to_cpu(eb, &ref->key_for_search, 0); @@ -443,7 +448,7 @@ static int __add_missing_keys(struct btrfs_fs_info *fs_info, * having a parent). * mode = 2: merge identical parents */ -static int __merge_refs(struct list_head *head, int mode) +static void __merge_refs(struct list_head *head, int mode) { struct list_head *pos1; @@ -489,7 +494,6 @@ static int __merge_refs(struct list_head *head, int mode) } } - return 0; } /* @@ -582,7 +586,8 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq, default: WARN_ON(1); } - BUG_ON(ret); + if (ret) + return ret; } return 0; @@ -680,7 +685,8 @@ static int __add_inline_refs(struct btrfs_fs_info *fs_info, default: WARN_ON(1); } - BUG_ON(ret); + if (ret) + return ret; ptr += btrfs_extent_inline_ref_size(type); } @@ -762,7 +768,9 @@ static int __add_keyed_refs(struct btrfs_fs_info *fs_info, default: WARN_ON(1); } - BUG_ON(ret); + if (ret) + return ret; + } return ret; @@ -880,18 +888,14 @@ again: if (ret) goto out; - ret = __merge_refs(&prefs, 1); - if (ret) - goto out; + __merge_refs(&prefs, 1); ret = __resolve_indirect_refs(fs_info, search_commit_root, time_seq, &prefs, extent_item_pos); if (ret) goto out; - ret = __merge_refs(&prefs, 2); - if (ret) - goto out; + __merge_refs(&prefs, 2); while (!list_empty(&prefs)) { ref = list_first_entry(&prefs, struct __prelim_ref, list); @@ -900,7 +904,8 @@ again: if (ref->count && ref->root_id && ref->parent == 0) { /* no parent == root of tree */ ret = ulist_add(roots, ref->root_id, 0, GFP_NOFS); - BUG_ON(ret < 0); + if (ret < 0) + goto out; } if (ref->count && ref->parent) { struct extent_inode_elem *eie = NULL; @@ -911,7 +916,10 @@ again: info_level); eb = read_tree_block(fs_info->extent_root, ref->parent, bsz, 0); - BUG_ON(!eb); + if (!eb || !extent_buffer_uptodate(eb)) { + free_extent_buffer(eb); + return -EIO; + } ret = find_extent_in_eb(eb, bytenr, *extent_item_pos, &eie); ref->inode_list = eie; @@ -920,6 +928,8 @@ again: ret = ulist_add_merge(refs, ref->parent, (uintptr_t)ref->inode_list, (u64 *)&eie, GFP_NOFS); + if (ret < 0) + goto out; if (!ret && extent_item_pos) { /* * we've recorded that parent, so we must extend @@ -930,7 +940,6 @@ again: eie = eie->next; eie->next = ref->inode_list; } - BUG_ON(ret < 0); } kfree(ref); } @@ -1180,6 +1189,20 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid, return ret; } +/* + * this iterates to turn a name (from iref/extref) into a full filesystem path. + * Elements of the path are separated by '/' and the path is guaranteed to be + * 0-terminated. the path is only given within the current file system. + * Therefore, it never starts with a '/'. the caller is responsible to provide + * "size" bytes in "dest". the dest buffer will be filled backwards. finally, + * the start point of the resulting string is returned. this pointer is within + * dest, normally. + * in case the path buffer would overflow, the pointer is decremented further + * as if output was written to the buffer, though no more output is actually + * generated. that way, the caller can determine how much space would be + * required for the path to fit into the buffer. in that case, the returned + * value will be smaller than dest. callers must check this! + */ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path, u32 name_len, unsigned long name_off, struct extent_buffer *eb_in, u64 parent, @@ -1249,32 +1272,6 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path, } /* - * this iterates to turn a btrfs_inode_ref into a full filesystem path. elements - * of the path are separated by '/' and the path is guaranteed to be - * 0-terminated. the path is only given within the current file system. - * Therefore, it never starts with a '/'. the caller is responsible to provide - * "size" bytes in "dest". the dest buffer will be filled backwards. finally, - * the start point of the resulting string is returned. this pointer is within - * dest, normally. - * in case the path buffer would overflow, the pointer is decremented further - * as if output was written to the buffer, though no more output is actually - * generated. that way, the caller can determine how much space would be - * required for the path to fit into the buffer. in that case, the returned - * value will be smaller than dest. callers must check this! - */ -char *btrfs_iref_to_path(struct btrfs_root *fs_root, - struct btrfs_path *path, - struct btrfs_inode_ref *iref, - struct extent_buffer *eb_in, u64 parent, - char *dest, u32 size) -{ - return btrfs_ref_to_path(fs_root, path, - btrfs_inode_ref_name_len(eb_in, iref), - (unsigned long)(iref + 1), - eb_in, parent, dest, size); -} - -/* * this makes the path point to (logical EXTENT_ITEM *) * returns BTRFS_EXTENT_FLAG_DATA for data, BTRFS_EXTENT_FLAG_TREE_BLOCK for * tree blocks and <0 on error. @@ -1461,8 +1458,6 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info, iterate_extent_inodes_t *iterate, void *ctx) { int ret; - struct list_head data_refs = LIST_HEAD_INIT(data_refs); - struct list_head shared_refs = LIST_HEAD_INIT(shared_refs); struct btrfs_trans_handle *trans; struct ulist *refs = NULL; struct ulist *roots = NULL; @@ -1508,11 +1503,9 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info, iterate, ctx); } ulist_free(roots); - roots = NULL; } free_leaf_list(refs); - ulist_free(roots); out: if (!search_commit_root) { btrfs_put_tree_mod_seq(fs_info, &tree_mod_seq_elem); diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h index 310a7f6d09b..0f446d7ca2c 100644 --- a/fs/btrfs/backref.h +++ b/fs/btrfs/backref.h @@ -59,9 +59,6 @@ int paths_from_inode(u64 inum, struct inode_fs_paths *ipath); int btrfs_find_all_roots(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info, u64 bytenr, u64 time_seq, struct ulist **roots); -char *btrfs_iref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path, - struct btrfs_inode_ref *iref, struct extent_buffer *eb, - u64 parent, char *dest, u32 size); char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path, u32 name_len, unsigned long name_off, struct extent_buffer *eb_in, u64 parent, diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index d9b97d4960e..08b286b2a2c 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -93,7 +93,7 @@ struct btrfs_inode { unsigned long runtime_flags; - /* Keep track of who's O_SYNC/fsycing currently */ + /* Keep track of who's O_SYNC/fsyncing currently */ atomic_t sync_writers; /* full 64 bit generation number, struct vfs_inode doesn't have a big diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 15b94089abc..b189bd1e7a3 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -82,6 +82,10 @@ struct compressed_bio { u32 sums; }; +static int btrfs_decompress_biovec(int type, struct page **pages_in, + u64 disk_start, struct bio_vec *bvec, + int vcnt, size_t srclen); + static inline int compressed_bio_size(struct btrfs_root *root, unsigned long disk_size) { @@ -106,7 +110,6 @@ static int check_compressed_csum(struct inode *inode, u64 disk_start) { int ret; - struct btrfs_root *root = BTRFS_I(inode)->root; struct page *page; unsigned long i; char *kaddr; @@ -121,7 +124,7 @@ static int check_compressed_csum(struct inode *inode, csum = ~(u32)0; kaddr = kmap_atomic(page); - csum = btrfs_csum_data(root, kaddr, csum, PAGE_CACHE_SIZE); + csum = btrfs_csum_data(kaddr, csum, PAGE_CACHE_SIZE); btrfs_csum_final(csum, (char *)&csum); kunmap_atomic(kaddr); @@ -739,7 +742,7 @@ static int comp_num_workspace[BTRFS_COMPRESS_TYPES]; static atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES]; static wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]; -struct btrfs_compress_op *btrfs_compress_op[] = { +static struct btrfs_compress_op *btrfs_compress_op[] = { &btrfs_zlib_compress, &btrfs_lzo_compress, }; @@ -910,8 +913,9 @@ int btrfs_compress_pages(int type, struct address_space *mapping, * be contiguous. They all correspond to the range of bytes covered by * the compressed extent. */ -int btrfs_decompress_biovec(int type, struct page **pages_in, u64 disk_start, - struct bio_vec *bvec, int vcnt, size_t srclen) +static int btrfs_decompress_biovec(int type, struct page **pages_in, + u64 disk_start, struct bio_vec *bvec, + int vcnt, size_t srclen) { struct list_head *workspace; int ret; diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index 9afb0a62ae8..0c803b4fbf9 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -30,8 +30,6 @@ int btrfs_compress_pages(int type, struct address_space *mapping, unsigned long *total_in, unsigned long *total_out, unsigned long max_out); -int btrfs_decompress_biovec(int type, struct page **pages_in, u64 disk_start, - struct bio_vec *bvec, int vcnt, size_t srclen); int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page, unsigned long start_byte, size_t srclen, size_t destlen); int btrfs_decompress_buf2page(char *buf, unsigned long buf_start, diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index ca9d8f1a3bb..de6de8e60b4 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -37,16 +37,11 @@ static int balance_node_right(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *dst_buf, struct extent_buffer *src_buf); -static void del_ptr(struct btrfs_trans_handle *trans, struct btrfs_root *root, - struct btrfs_path *path, int level, int slot); +static void del_ptr(struct btrfs_root *root, struct btrfs_path *path, + int level, int slot); static void tree_mod_log_free_eb(struct btrfs_fs_info *fs_info, struct extent_buffer *eb); -struct extent_buffer *read_old_tree_block(struct btrfs_root *root, u64 bytenr, - u32 blocksize, u64 parent_transid, - u64 time_seq); -struct extent_buffer *btrfs_find_old_tree_block(struct btrfs_root *root, - u64 bytenr, u32 blocksize, - u64 time_seq); +static int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path); struct btrfs_path *btrfs_alloc_path(void) { @@ -208,7 +203,7 @@ struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root) * tree until you end up with a lock on the root. A locked buffer * is returned, with a reference held. */ -struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root) +static struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root) { struct extent_buffer *eb; @@ -361,6 +356,44 @@ static inline void tree_mod_log_write_unlock(struct btrfs_fs_info *fs_info) } /* + * Increment the upper half of tree_mod_seq, set lower half zero. + * + * Must be called with fs_info->tree_mod_seq_lock held. + */ +static inline u64 btrfs_inc_tree_mod_seq_major(struct btrfs_fs_info *fs_info) +{ + u64 seq = atomic64_read(&fs_info->tree_mod_seq); + seq &= 0xffffffff00000000ull; + seq += 1ull << 32; + atomic64_set(&fs_info->tree_mod_seq, seq); + return seq; +} + +/* + * Increment the lower half of tree_mod_seq. + * + * Must be called with fs_info->tree_mod_seq_lock held. The way major numbers + * are generated should not technically require a spin lock here. (Rationale: + * incrementing the minor while incrementing the major seq number is between its + * atomic64_read and atomic64_set calls doesn't duplicate sequence numbers, it + * just returns a unique sequence number as usual.) We have decided to leave |