diff options
-rw-r--r-- | Documentation/filesystems/ntfs.txt | 42 | ||||
-rw-r--r-- | fs/ntfs/ChangeLog | 38 | ||||
-rw-r--r-- | fs/ntfs/Makefile | 2 | ||||
-rw-r--r-- | fs/ntfs/file.c | 2247 |
4 files changed, 2280 insertions, 49 deletions
diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt index a5fbc8e897f..614de312490 100644 --- a/Documentation/filesystems/ntfs.txt +++ b/Documentation/filesystems/ntfs.txt @@ -50,9 +50,14 @@ userspace utilities, etc. Features ======== -- This is a complete rewrite of the NTFS driver that used to be in the kernel. - This new driver implements NTFS read support and is functionally equivalent - to the old ntfs driver. +- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and + earlier kernels. This new driver implements NTFS read support and is + functionally equivalent to the old ntfs driver and it also implements limited + write support. The biggest limitation at present is that files/directories + cannot be created or deleted. See below for the list of write features that + are so far supported. Another limitation is that writing to compressed files + is not implemented at all. Also, neither read nor write access to encrypted + files is so far implemented. - The new driver has full support for sparse files on NTFS 3.x volumes which the old driver isn't happy with. - The new driver supports execution of binaries due to mmap() now being @@ -78,7 +83,20 @@ Features - The new driver supports fsync(2), fdatasync(2), and msync(2). - The new driver supports readv(2) and writev(2). - The new driver supports access time updates (including mtime and ctime). - +- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present + only very limited support for highly fragmented files, i.e. ones which have + their data attribute split across multiple extents, is included. Another + limitation is that at present truncate(2) will never create sparse files, + since to mark a file sparse we need to modify the directory entry for the + file and we do not implement directory modifications yet. +- The new driver supports write(2) which can both overwrite existing data and + extend the file size so that you can write beyond the existing data. Also, + writing into sparse regions is supported and the holes are filled in with + clusters. But at present only limited support for highly fragmented files, + i.e. ones which have their data attribute split across multiple extents, is + included. Another limitation is that write(2) will never create sparse + files, since to mark a file sparse we need to modify the directory entry for + the file and we do not implement directory modifications yet. Supported mount options ======================= @@ -439,6 +457,22 @@ ChangeLog Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. +2.1.25: + - Write support is now extended with write(2) being able to both + overwrite existing file data and to extend files. Also, if a write + to a sparse region occurs, write(2) will fill in the hole. Note, + mmap(2) based writes still do not support writing into holes or + writing beyond the initialized size. + - Write support has a new feature and that is that truncate(2) and + open(2) with O_TRUNC are now implemented thus files can be both made + smaller and larger. + - Note: Both write(2) and truncate(2)/open(2) with O_TRUNC still have + limitations in that they + - only provide limited support for highly fragmented files. + - only work on regular, i.e. uncompressed and unencrypted files. + - never create sparse files although this will change once directory + operations are implemented. + - Lots of bug fixes and enhancements across the board. 2.1.24: - Support journals ($LogFile) which have been modified by chkdsk. This means users can boot into Windows after we marked the volume dirty. diff --git a/fs/ntfs/ChangeLog b/fs/ntfs/ChangeLog index 3b8ff231808..03015c7b236 100644 --- a/fs/ntfs/ChangeLog +++ b/fs/ntfs/ChangeLog @@ -1,16 +1,15 @@ ToDo/Notes: - Find and fix bugs. - - In between ntfs_prepare/commit_write, need exclusion between - simultaneous file extensions. This is given to us by holding i_sem - on the inode. The only places in the kernel when a file is resized - are prepare/commit write and ntfs_truncate() for both of which i_sem - is held. Just have to be careful in read-/writepage and other helpers + - The only places in the kernel where a file is resized are + ntfs_file_write*() and ntfs_truncate() for both of which i_sem is + held. Just have to be careful in read-/writepage and other helpers not running under i_sem that we play nice... Also need to be careful - with initialized_size extention in ntfs_prepare_write and writepage. - UPDATE: The only things that need to be checked are - prepare/commit_write as well as the compressed write and the other - attribute resize/write cases like index attributes, etc. For now - none of these are implemented so are safe. + with initialized_size extension in ntfs_file_write*() and writepage. + UPDATE: The only things that need to be checked are the compressed + write and the other attribute resize/write cases like index + attributes, etc. For now none of these are implemented so are safe. + - Implement filling in of holes in aops.c::ntfs_writepage() and its + helpers. - Implement mft.c::sync_mft_mirror_umount(). We currently will just leave the volume dirty on umount if the final iput(vol->mft_ino) causes a write of any mirrored mft records due to the mft mirror @@ -20,7 +19,7 @@ ToDo/Notes: - Enable the code for setting the NT4 compatibility flag when we start making NTFS 1.2 specific modifications. -2.1.25-WIP +2.1.25 - (Almost) fully implement write(2) and truncate(2). - Change ntfs_map_runlist_nolock(), ntfs_attr_find_vcn_nolock() and {__,}ntfs_cluster_free() to also take an optional attribute search @@ -49,7 +48,12 @@ ToDo/Notes: extend the allocation of an attributes. Optionally, the data size, but not the initialized size can be extended, too. - Implement fs/ntfs/inode.[hc]::ntfs_truncate(). It only supports - uncompressed and unencrypted files. + uncompressed and unencrypted files and it never creates sparse files + at least for the moment (making a file sparse requires us to modify + its directory entries and we do not support directory operations at + the moment). Also, support for highly fragmented files, i.e. ones + whose data attribute is split across multiple extents, is severly + limited. When such a case is encountered, EOPNOTSUPP is returned. - Enable ATTR_SIZE attribute changes in ntfs_setattr(). This completes the initial implementation of file truncation. Now both open(2)ing a file with the O_TRUNC flag and the {,f}truncate(2) system calls @@ -61,6 +65,16 @@ ToDo/Notes: and cond_resched() in the main loop as we could be dirtying a lot of pages and this ensures we play nice with the VM and the system as a whole. + - Implement file operations ->write, ->aio_write, ->writev for regular + files. This replaces the old use of generic_file_write(), et al and + the address space operations ->prepare_write and ->commit_write. + This means that both sparse and non-sparse (unencrypted and + uncompressed) files can now be extended using the normal write(2) + code path. There are two limitations at present and these are that + we never create sparse files and that we only have limited support + for highly fragmented files, i.e. ones whose data attribute is split + across multiple extents. When such a case is encountered, + EOPNOTSUPP is returned. 2.1.24 - Lots of bug fixes and support more clean journal states. diff --git a/fs/ntfs/Makefile b/fs/ntfs/Makefile index a3ce2c0e7dd..d0d45d1c853 100644 --- a/fs/ntfs/Makefile +++ b/fs/ntfs/Makefile @@ -6,7 +6,7 @@ ntfs-objs := aops.o attrib.o collate.o compress.o debug.o dir.o file.o \ index.o inode.o mft.o mst.o namei.o runlist.o super.o sysctl.o \ unistr.o upcase.o -EXTRA_CFLAGS = -DNTFS_VERSION=\"2.1.25-WIP\" +EXTRA_CFLAGS = -DNTFS_VERSION=\"2.1.25\" ifeq ($(CONFIG_NTFS_DEBUG),y) EXTRA_CFLAGS += -DDEBUG diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c index be9fd1dd423..cf2a0e2330d 100644 --- a/fs/ntfs/file.c +++ b/fs/ntfs/file.c @@ -19,11 +19,24 @@ * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ -#include <linux/pagemap.h> #include <linux/buffer_head.h> +#include <linux/pagemap.h> +#include <linux/pagevec.h> +#include <linux/sched.h> +#include <linux/swap.h> +#include <linux/uio.h> +#include <linux/writeback.h> +#include <asm/page.h> +#include <asm/uaccess.h> + +#include "attrib.h" +#include "bitmap.h" #include "inode.h" #include "debug.h" +#include "lcnalloc.h" +#include "malloc.h" +#include "mft.h" #include "ntfs.h" /** @@ -56,6 +69,2176 @@ static int ntfs_file_open(struct inode *vi, struct file *filp) #ifdef NTFS_RW /** + * ntfs_attr_extend_initialized - extend the initialized size of an attribute + * @ni: ntfs inode of the attribute to extend + * @new_init_size: requested new initialized size in bytes + * @cached_page: store any allocated but unused page here + * @lru_pvec: lru-buffering pagevec of the caller + * + * Extend the initialized size of an attribute described by the ntfs inode @ni + * to @new_init_size bytes. This involves zeroing any non-sparse space between + * the old initialized size and @new_init_size both in the page cache and on + * disk (if relevant complete pages are zeroed in the page cache then these may + * simply be marked dirty for later writeout). There is one caveat and that is + * that if any uptodate page cache pages between the old initialized size and + * the smaller of @new_init_size and the file size (vfs inode->i_size) are in + * memory, these need to be marked dirty without being zeroed since they could + * be non-zero due to mmap() based writes. + * + * As a side-effect, the file size (vfs inode->i_size) may be incremented as, + * in the resident attribute case, it is tied to the initialized size and, in + * the non-resident attribute case, it may not fall below the initialized size. + * + * Note that if the attribute is resident, we do not need to touch the page + * cache at all. This is because if the page cache page is not uptodate we + * bring it uptodate later, when doing the write to the mft record since we + * then already have the page mapped. And if the page is uptodate, the + * non-initialized region will already have been zeroed when the page was + * brought uptodate and the region may in fact already have been overwritten + * with new data via mmap() based writes, so we cannot just zero it. And since + * POSIX specifies that the behaviour of resizing a file whilst it is mmap()ped + * is unspecified, we choose not to do zeroing and thus we do not need to touch + * the page at all. For a more detailed explanation see ntfs_truncate() which + * is in fs/ntfs/inode.c. + * + * @cached_page and @lru_pvec are just optimisations for dealing with multiple + * pages. + * + * Return 0 on success and -errno on error. In the case that an error is + * encountered it is possible that the initialized size will already have been + * incremented some way towards @new_init_size but it is guaranteed that if + * this is the case, the necessary zeroing will also have happened and that all + * metadata is self-consistent. + * + * Locking: This function locks the mft record of the base ntfs inode and + * maintains the lock throughout execution of the function. This is required + * so that the initialized size of the attribute can be modified safely. + */ +static int ntfs_attr_extend_initialized(ntfs_inode *ni, const s64 new_init_size, + struct page **cached_page, struct pagevec *lru_pvec) +{ + s64 old_init_size; + loff_t old_i_size; + pgoff_t index, end_index; + unsigned long flags; + struct inode *vi = VFS_I(ni); + ntfs_inode *base_ni; + MFT_RECORD *m = NULL; + ATTR_RECORD *a; + ntfs_attr_search_ctx *ctx = NULL; + struct address_space *mapping; + struct page *page = NULL; + u8 *kattr; + int err; + u32 attr_len; + + read_lock_irqsave(&ni->size_lock, flags); + old_init_size = ni->initialized_size; + old_i_size = i_size_read(vi); + BUG_ON(new_init_size > ni->allocated_size); + read_unlock_irqrestore(&ni->size_lock, flags); + ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " + "old_initialized_size 0x%llx, " + "new_initialized_size 0x%llx, i_size 0x%llx.", + vi->i_ino, (unsigned)le32_to_cpu(ni->type), + (unsigned long long)old_init_size, + (unsigned long long)new_init_size, old_i_size); + if (!NInoAttr(ni)) + base_ni = ni; + else + base_ni = ni->ext.base_ntfs_ino; + /* Use goto to reduce indentation and we need the label below anyway. */ + if (NInoNonResident(ni)) + goto do_non_resident_extend; + BUG_ON(old_init_size != old_i_size); + m = map_mft_record(base_ni); + if (IS_ERR(m)) { + err = PTR_ERR(m); + m = NULL; + goto err_out; + } + ctx = ntfs_attr_get_search_ctx(base_ni, m); + if (unlikely(!ctx)) { + err = -ENOMEM; + goto err_out; + } + err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, + CASE_SENSITIVE, 0, NULL, 0, ctx); + if (unlikely(err)) { + if (err == -ENOENT) + err = -EIO; + goto err_out; + } + m = ctx->mrec; + a = ctx->attr; + BUG_ON(a->non_resident); + /* The total length of the attribute value. */ + attr_len = le32_to_cpu(a->data.resident.value_length); + BUG_ON(old_i_size != (loff_t)attr_len); + /* + * Do the zeroing in the mft record and update the attribute size in + * the mft record. + */ + kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset); + memset(kattr + attr_len, 0, new_init_size - attr_len); + a->data.resident.value_length = cpu_to_le32((u32)new_init_size); + /* Finally, update the sizes in the vfs and ntfs inodes. */ + write_lock_irqsave(&ni->size_lock, flags); + i_size_write(vi, new_init_size); + ni->initialized_size = new_init_size; + write_unlock_irqrestore(&ni->size_lock, flags); + goto done; +do_non_resident_extend: + /* + * If the new initialized size @new_init_size exceeds the current file + * size (vfs inode->i_size), we need to extend the file size to the + * new initialized size. + */ + if (new_init_size > old_i_size) { + m = map_mft_record(base_ni); + if (IS_ERR(m)) { + err = PTR_ERR(m); + m = NULL; + goto err_out; + } + ctx = ntfs_attr_get_search_ctx(base_ni, m); + if (unlikely(!ctx)) { + err = -ENOMEM; + goto err_out; + } + err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, + CASE_SENSITIVE, 0, NULL, 0, ctx); + if (unlikely(err)) { + if (err == -ENOENT) + err = -EIO; + goto err_out; + } + m = ctx->mrec; + a = ctx->attr; + BUG_ON(!a->non_resident); + BUG_ON(old_i_size != (loff_t) + sle64_to_cpu(a->data.non_resident.data_size)); + a->data.non_resident.data_size = cpu_to_sle64(new_init_size); + flush_dcache_mft_record_page(ctx->ntfs_ino); + mark_mft_record_dirty(ctx->ntfs_ino); + /* Update the file size in the vfs inode. */ + i_size_write(vi, new_init_size); + ntfs_attr_put_search_ctx(ctx); + ctx = NULL; + unmap_mft_record(base_ni); + m = NULL; + } + mapping = vi->i_mapping; + index = old_init_size >> PAGE_CACHE_SHIFT; + end_index = (new_init_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + do { + /* + * Read the page. If the page is not present, this will zero + * the uninitialized regions for us. + */ + page = read_cache_page(mapping, index, + (filler_t*)mapping->a_ops->readpage, NULL); + if (IS_ERR(page)) { + err = PTR_ERR(page); + goto init_err_out; + } + wait_on_page_locked(page); + if (unlikely(!PageUptodate(page) || PageError(page))) { + page_cache_release(page); + err = -EIO; + goto init_err_out; + } + /* + * Update the initialized size in the ntfs inode. This is + * enough to make ntfs_writepage() work. + */ + write_lock_irqsave(&ni->size_lock, flags); + ni->initialized_size = (index + 1) << PAGE_CACHE_SHIFT; + if (ni->initialized_size > new_init_size) + ni->initialized_size = new_init_size; + write_unlock_irqrestore(&ni->size_lock, flags); + /* Set the page dirty so it gets written out. */ + set_page_dirty(page); + page_cache_release(page); + /* + * Play nice with the vm and the rest of the system. This is + * very much needed as we can potentially be modifying the + * initialised size from a very small value to a really huge + * value, e.g. + * f = open(somefile, O_TRUNC); + * truncate(f, 10GiB); + * seek(f, 10GiB); + * write(f, 1); + * And this would mean we would be marking dirty hundreds of + * thousands of pages or as in the above example more than + * two and a half million pages! + * + * TODO: For sparse pages could optimize this workload by using + * the FsMisc / MiscFs page bit as a "PageIsSparse" bit. This + * would be set in readpage for sparse pages and here we would + * not need to mark dirty any pages which have this bit set. + * The only caveat is that we have to clear the bit everywhere + * where we allocate any clusters that lie in the page or that + * contain the page. + * + * TODO: An even greater optimization would be for us to only + * call readpage() on pages which are not in sparse regions as + * determined from the runlist. This would greatly reduce the + * number of pages we read and make dirty in the case of sparse + * files. + */ + balance_dirty_pages_ratelimited(mapping); + cond_resched(); + } while (++index < end_index); + read_lock_irqsave(&ni->size_lock, flags); + BUG_ON(ni->initialized_size != new_init_size); + read_unlock_irqrestore(&ni->size_lock, flags); + /* Now bring in sync the initialized_size in the mft record. */ + m = map_mft_record(base_ni); + if (IS_ERR(m)) { + err = PTR_ERR(m); + m = NULL; + goto init_err_out; + } + ctx = ntfs_attr_get_search_ctx(base_ni, m); + if (unlikely(!ctx)) { + err = -ENOMEM; + goto init_err_out; + } + err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, + CASE_SENSITIVE, 0, NULL, 0, ctx); + if (unlikely(err)) { + if (err == -ENOENT) + err = -EIO; + goto init_err_out; + } + m = ctx->mrec; + a = ctx->attr; + BUG_ON(!a->non_resident); + a->data.non_resident.initialized_size = cpu_to_sle64(new_init_size); +done: + flush_dcache_mft_record_page(ctx->ntfs_ino); + mark_mft_record_dirty(ctx->ntfs_ino); + if (ctx) + ntfs_attr_put_search_ctx(ctx); + if (m) + unmap_mft_record(base_ni); + ntfs_debug("Done, initialized_size 0x%llx, i_size 0x%llx.", + (unsigned long long)new_init_size, i_size_read(vi)); + return 0; +init_err_out: + write_lock_irqsave(&ni->size_lock, flags); + ni->initialized_size = old_init_size; + write_unlock_irqrestore(&ni->size_lock, flags); +err_out: + if (ctx) + ntfs_attr_put_search_ctx(ctx); + if (m) + unmap_mft_record(base_ni); + ntfs_debug("Failed. Returning error code %i.", err); + return err; +} + +/** + * ntfs_fault_in_pages_readable - + * + * Fault a number of userspace pages into pagetables. + * + * Unlike include/linux/pagemap.h::fault_in_pages_readable(), this one copes + * with more than two userspace pages as well as handling the single page case + * elegantly. + * + * If you find this difficult to understand, then think of the while loop being + * the following code, except that we do without the integer variable ret: + * + * do { + * ret = __get_user(c, uaddr); + * uaddr += PAGE_SIZE; + * } while (!ret && uaddr < end); + * + * Note, the final __get_user() may well run out-of-bounds of the user buffer, + * but _not_ out-of-bounds of the page the user buffer belongs to, and since + * this is only a read and not a write, and since it is still in the same page, + * it should not matter and this makes the code much simpler. + */ +static inline void ntfs_fault_in_pages_readable(const char __user *uaddr, + int bytes) +{ + const char __user *end; + volatile char c; + + /* Set @end to the first byte outside the last page we care about. */ + end = (const char __user*)PAGE_ALIGN((ptrdiff_t __user)uaddr + bytes); + + while (!__get_user(c, uaddr) && (uaddr += PAGE_SIZE, uaddr < end)) + ; +} + +/** + * ntfs_fault_in_pages_readable_iovec - + * + * Same as ntfs_fault_in_pages_readable() but operates on an array of iovecs. + */ +static inline void ntfs_fault_in_pages_readable_iovec(const struct iovec *iov, + size_t iov_ofs, int bytes) +{ + do { + const char __user *buf; + unsigned len; + + buf = iov->iov_base + iov_ofs; + len = iov->iov_len - iov_ofs; + if (len > bytes) + len = bytes; + ntfs_fault_in_pages_readable(buf, len); + bytes -= len; + iov++; + iov_ofs = 0; + } while (bytes); +} + +/** + * __ntfs_grab_cache_pages - obtain a number of locked pages + * @mapping: address space mapping from which to obtain page cache pages + * @index: starting index in @mapping at which to begin obtaining pages + * @nr_pages: number of page cache pages to obtain + * @pages: array of pages in which to return the obtained page cache pages + * @cached_page: allocated but as yet unused page + * @lru_pvec: lru-buffering pagevec of caller + * + * Obtain @nr_pages locked page cache pages from the mapping @maping and + * starting at index @index. + * + * If a page is newly created, increment its refcount and add it to the + * caller's lru-buffering pagevec @lru_pvec. + * + * This is the same as mm/filemap.c::__grab_cache_page(), except that @nr_pages + * are obtained at once instead of just one page and that 0 is returned on + * success and -errno on error. + * + * Note, the page locks are obtained in ascending page index order. + */ +static inline int __ntfs_grab_cache_pages(struct address_space *mapping, + pgoff_t index, const unsigned nr_pages, struct page **pages, + struct page **cached_page, struct pagevec *lru_pvec) +{ + int err, nr; + + BUG_ON(!nr_pages); + err = nr = 0; + do { + pages[nr] = find_lock_page(mapping, index); + if (!pages[nr]) { + if (!*cached_page) { + *cached_page = page_cache_alloc(mapping); + if (unlikely(!*cached_page)) { + err = -ENOMEM; + goto err_out; + } + } + err = add_to_page_cache(*cached_page, mapping, index, + GFP_KERNEL); + if (unlikely(err)) { + if (err == -EEXIST) + continue; + goto err_out; + } + pages[nr] = *cached_page; + page_cache_get(*cached_page); + if (unlikely(!pagevec_add(lru_pvec, *cached_page))) + __pagevec_lru_add(lru_pvec); + *cached_page = NULL; + } + index++; + nr++; + } while (nr < nr_pages); +out: + return err; +err_out: + while (nr > 0) { + unlock_page(pages[--nr]); + page_cache_release(pages[nr]); + } + goto out; +} + +static inline int ntfs_submit_bh_for_read(struct buffer_head *bh) +{ + lock_buffer(bh); + get_bh(bh); + bh->b_end_io = end_buffer_read_sync; + return submit_bh(READ, bh); +} + +/** + * ntfs_prepare_pages_for_non_resident_write - prepare pages for receiving data + * @pages: array of destination pages + * @nr_pages: number of pages in @pages + * @pos: byte position in file at which the write begins + * @bytes: number of bytes to be written + * + * This is called for non-resident attributes from ntfs_file_buffered_write() + * with i_sem held on the inode (@pages[0]->mapping->host). There are + * @nr_pages pages in @pages which are locked but not kmap()ped. The source + * data has not yet been copied into the @pages. + * + * Need to fill any holes with actual clusters, allocate buffers if necessary, + * ensure all the buffers are mapped, and bring uptodate any buffers that are + * only partially being written to. + * + * If @nr_pages is greater than one, we are guaranteed that the cluster size is + * greater than PAGE_CACHE_SIZE, that all pages in @pages are entirely inside + * the same cluster and that they are the entirety of that cluster, and that + * the cluster is sparse, i.e. we need to allocate a cluster to fill the hole. + * + * i_size is not to be modified yet. + * + * Return 0 on success or -errno on error. + */ +static int ntfs_prepare_pages_for_non_resident_write(struct page **pages, + unsigned nr_pages, s64 pos, size_t bytes) +{ + VCN vcn, highest_vcn = 0, cpos, cend, bh_cpos, bh_cend; + LCN lcn; + s64 bh_pos, vcn_len, end, initialized_size; + sector_t lcn_block; + struct page *page; + struct inode *vi; + ntfs_inode *ni, *base_ni = NULL; + ntfs_volume *vol; + runlist_element *rl, *rl2; + struct buffer_head *bh, *head, *wait[2], **wait_bh = wait; + ntfs_attr_search_ctx *ctx = NULL; + MFT_RECORD *m = NULL; + ATTR_RECORD *a = NULL; + unsigned long flags; + u32 attr_rec_len = 0; + unsigned blocksize, u; + int err, mp_size; + BOOL rl_write_locked, was_hole, is_retry; + unsigned char blocksize_bits; + struct { + u8 runlist_merged:1; + u8 mft_attr_mapped:1; + u8 mp_rebuilt:1; + u8 attr_switched:1; + } status = { 0, 0, 0, 0 }; + + BUG_ON(!nr_pages); + BUG_ON(!pages); + BUG_ON(!*pages); + vi = pages[0]->mapping->host; + ni = NTFS_I(vi); + vol = ni->vol; + ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page " + "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%x.", + vi->i_ino, ni->type, pages[0]->index, nr_pages, + (long long)pos, bytes); + blocksize_bits = vi->i_blkbits; + blocksize = 1 << blocksize_bits; + u = 0; + do { + struct page *page = pages[u]; + /* + * create_empty_buffers() will create uptodate/dirty buffers if + * the page is uptodate/dirty. + */ + if (!page_has_buffers(page)) { + create_empty_buffers(page, blocksize, 0); + if (unlikely(!page_has_buffers(page))) + return -ENOMEM; + } + } while (++u < nr_pages); + rl_write_locked = FALSE; + rl = NULL; + err = 0; + vcn = lcn = -1; + vcn_len = 0; + lcn_block = -1; + was_hole = FALSE; + cpos = pos >> vol->cluster_size_bits; + end = pos + bytes; + cend = (end + vol->cluster_size - 1) >> vol->cluster_size_bits; + /* + * Loop over each page and for each page over each buffer. Use goto to + * reduce indentation. + */ + u = 0; +do_next_page: + page = pages[u]; + bh_pos = (s64)page->index << PAGE_CACHE_SHIFT; + bh = head = page_buffers(page); + do { + VCN cdelta; + s64 bh_end; + unsigned bh_cofs; + + /* Clear buffer_new on all buffers to reinitialise state. */ + if (buffer_new(bh)) + clear_buffer_new(bh); + bh_end = bh_pos + blocksize; + bh_cpos = bh_pos >> vol->cluster_size_bits; + bh_cofs = bh_pos & vol->cluster_size_mask; + if (buffer_mapped(bh)) { + /* + * The buffer is already mapped. If it is uptodate, + * ignore it. + */ + if (buffer_uptodate(bh)) + continue; + /* + * The buffer is not uptodate. If the page is uptodate + * set the buffer uptodate and otherwise ignore it. + */ + if (PageUptodate(page)) { + set_buffer_uptodate(bh); + continue; + } + /* + * Neither the page nor the buffer are uptodate. If + * the buffer is only partially being written to, we + * need to read it in before the write, i.e. now. + */ + if ((bh_pos < pos && bh_end > pos) || + (bh_pos < end && bh_end > end)) { + /* + * If the buffer is fully or partially within + * the initialized size, do an actual read. + * Otherwise, simply zero the buffer. + */ + read_lock_irqsave(&ni->size_lock, flags); + initialized_size = ni->initialized_size; + read_unlock_irqrestore(&ni->size_lock, flags); + if (bh_pos < initialized_size) { + ntfs_submit_bh_for_read(bh); + *wait_bh++ = bh; + } else { + u8 *kaddr = kmap_atomic(page, KM_USER0); + memset(kaddr + bh_offset(bh), 0, + blocksize); + kunmap_atomic(kaddr, KM_USER0); + flush_dcache_page(page); + set_buffer_uptodate(bh); + } + } + continue; + } + /* Unmapped buffer. Need to map it. */ + bh->b_bdev = vol->sb->s_bdev; + /* + * If the current buffer is in the same clusters as the map + * cache, there is no need to check the runlist again. The + * map cache is made up of @vcn, which is the first cached file + * cluster, @vcn_len which is the number of cached file + * clusters, @lcn is the device cluster corresponding to @vcn, + * and @lcn_block is the block number corresponding to @lcn. + */ + cdelta = bh_cpos - vcn; + if (likely(!cdelta || (cdelta > 0 && cdelta < vcn_len))) { +map_buffer_cached: + BUG_ON(lcn < 0); + bh->b_blocknr = lcn_block + + (cdelta << (vol->cluster_size_bits - + blocksize_bits)) + + (bh_cofs >> blocksize_bits); + set_buffer_mapped(bh); + /* + * If the page is uptodate so is the buffer. If the + * buffer is fully outside the write, we ignore it if + * it was already allocated and we mark it dirty so it + * gets written out if we allocated it. On the other + * hand, if we allocated the buffer but we are not + * marking it dirty we set buffer_new so we can do + * error recovery. + */ + if (PageUptodate(page)) { + if (!buffer_uptodate(bh)) + set_buffer_uptodate(bh); + if (unlikely(was_hole)) { + /* We allocated the buffer. */ + unmap_underlying_metadata(bh->b_bdev, + bh->b_blocknr); + if (bh_end <= pos || bh_pos >= end) + mark_buffer_dirty(bh); + else + set_buffer_new(bh); + } + continue; + } + /* Page is _not_ uptodate. */ + if (likely(!was_hole)) { + /* + * Buffer was already allocated. If it is not + * uptodate and is only partially being written + * to, we need to read it in before the write, + * i.e. now. + */ + if (!buffer_uptodate(bh) && ((bh_pos < pos && + bh_end > pos) || + (bh_end > end && + bh_end > end))) { + /* + * If the buffer is fully or partially + * within the initialized size, do an + * actual read. Otherwise, simply zero + * the buffer. + */ + read_lock_irqsave(&ni->size_lock, + flags); + initialized_size = ni->initialized_size; + read_unlock_irqrestore(&ni->size_lock, + flags); + if (bh_pos < initialized_size) { + ntfs_submit_bh_for_read(bh); + *wait_bh++ = bh; + } else { + u8 *kaddr = kmap_atomic(page, + KM_USER0); + memset(kaddr + bh_offset(bh), + 0, blocksize); + kunmap_atomic(kaddr, KM_USER0); + flush_dcache_page(page); + set_buffer_uptodate(bh); + } + } + continue; + } + /* We allocated the buffer. */ + unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); + /* + * If the buffer is fully outside the write, zero it, + * set it uptodate, and mark it dirty so it gets + * written out. If it is partially being written to, + * zero region surrounding the write but leave it to + * commit write to do anything else. Finally, if the + * buffer is fully being overwritten, do nothing. + */ + if (bh_end <= pos || bh_pos >= end) { + if (!buffer_uptodate(bh)) { + u8 *kaddr = kmap_atomic(page, KM_USER0); + memset(kaddr + bh_offset(bh), 0, + blocksize); + kunmap_atomic(kaddr, KM_USER0); + flush_dcache_page(page); + set_buffer_uptodate(bh); + } + mark_buffer_dirty(bh); + continue; + } + set_buffer_new(bh); + if (!buffer_uptodate(bh) && + (bh_pos < pos || bh_end > end)) { + u8 *kaddr; + unsigned pofs; + + kaddr = kmap_atomic(page, KM_USER0); + if (bh_pos < pos) { + pofs = bh_pos & ~PAGE_CACHE_MASK; + memset(kaddr + pofs, 0, pos - bh_pos); + } + if (bh_end > end) { + pofs = end & ~PAGE_CACHE_MASK; + memset(kaddr + pofs, 0, bh_end - end); + } + kunmap_atomic(kaddr, KM_USER0); + flush_dcache_page(page); + } + continue; + } + /* + * Slow path: this is the first buffer in the cluster. If it + * is outside allocated size and is not uptodate, zero it and + * set it uptodate. + */ + read_lock_irqsave(&ni->size_lock, flags); + initialized_size = ni->allocated_size; + read_unlock_irqrestore(&ni->size_lock, flags); + if (bh_pos > initialized_size) { + if (PageUptodate(page)) { + if (!buffer_uptodate(bh)) + set_buffer_uptodate(bh); + } else if (!buffer_uptodate(bh)) { + u8 *kaddr = kmap_atomic(page, KM_USER0); + memset(kaddr + bh_offset(bh), 0, blocksize); + kunmap_atomic(kaddr, KM_USER0); + flush_dcache_page(page); + set_buffer_uptodate(bh); + } + continue; + } + is_retry = FALSE; + if (!rl) { + down_read(&ni->runlist.lock); +retry_remap: + rl = ni->runlist.rl; + } + if (likely(rl != NULL)) { + /* Seek to element containing target cluster. */ + while (rl->length && rl[1].vcn <= bh_cpos) + rl++; + lcn = ntfs_rl_vcn_to_lcn(rl, bh_cpos); + if (likely(lcn >= 0)) { + /* + * Successful remap, setup the map cache and + * use that to deal with the buffer. + */ + was_hole = FALSE; + vcn = bh_cpos; + vcn_len = rl[1].vcn - vcn; + lcn_block = lcn << (vol->cluster_size_bits - + blocksize_bits); + /* + * If the number of remaining clusters in the + * @pages is smaller or equal to the number of + * cached clusters, unlock the runlist as the + * map cache will be used from now on. + */ + if (likely(vcn + vcn_len >= cend)) { + if (rl_write_locked) { + up_write(&ni->runlist.lock); + rl_write_locked = FALSE; + } else + up_read(&ni->runlist.lock); + rl = NULL; + } + goto map_buffer_cached; + } + } else + lcn = LCN_RL_NOT_MAPPED; + /* + * If it is not a hole and not out of bounds, the runlist is + * probably unmapped so try to map it now. + */ + if (unlikely(lcn != LCN_HOLE && lcn != LCN_ENOENT)) { + if (likely(!is_retry && lcn == LCN_RL_NOT_MAPPED)) { + /* Attempt to map runlist. */ + if (!rl_write_locked) { + /* + * We need the runlist locked for + * writing, so if it is locked for + * reading relock it now and retry in + * case it changed whilst we dropped + * the lock. + */ + up_read(&ni->runlist.lock); + down_write(&ni->runlist.lock); + rl_write_locked = TRUE; + goto retry_remap; + } + err = ntfs_map_runlist_nolock(ni, bh_cpos, + NULL); + if (likely(!err)) { + is_retry = TRUE; + goto retry_remap; + } + /* + * If @vcn is out of bounds, pretend @lcn is + * LCN_ENOENT. As long as the buffer is out + * of bounds this will work fine. + */ + if (err == -ENOENT) { + lcn = LCN_ENOENT; + err = 0; + goto rl_not_mapped_enoent; + } + } else + err = -EIO; + /* Failed to map the buffer, even after retrying. */ + bh->b_blocknr = -1; + ntfs_error(vol->sb, "Failed to write to inode 0x%lx, " + "attribute type 0x%x, vcn 0x%llx, " + "vcn offset 0x%x, because its " + "location on disk could not be " + "determined%s (error code %i).", + ni->mft_no, ni->type, + (unsigned long long)bh_cpos, + (unsigned)bh_pos & + vol->cluster_size_mask, + is_retry ? " even after retrying" : "", + err); + break; + } +rl_not_mapped_enoent: + /* + * The buffer is in a hole or out of bounds. We need to fill + * the hole, unless the buffer is in a cluster which is not + * touched by the write, in which case we just leave the buffer + * unmapped. This can only happen when the cluster size is + * less than the page cache size. + */ + if (unlikely(vol->cluster_size < PAGE_CACHE_SIZE)) { + bh_cend = (bh_end + vol->cluster_size - 1) >> + vol->cluster_size_bits; + if ((bh_cend <= cpos || bh_cpos >= cend)) { + bh->b_blocknr = -1; + /* + * If the buffer is uptodate we skip it. If it + * is not but the page is uptodate, we can set + * the buffer uptodate. If the page is not + * uptodate, we can clear the buffer and set it + * uptodate. Whether this is worthwhile is + * debatable and this could be removed. + */ + if (PageUptodate(page)) { + if (!buffer_uptodate(bh)) + set_buffer_uptodate(bh); + } else if (!buffer_uptodate(bh)) { + u8 *kaddr = kmap_atomic(page, KM_USER0); + memset(kaddr + bh_offset(bh), 0, + blocksize); + kunmap_atomic(kaddr, KM_USER0); + flush_dcache_page(page); + set_buffer_uptodate(bh); + } + continue; + } + } + /* + * Out of bounds buffer is invalid if it was not really out of + * bounds. |