  1. 10 Dec, 2012 1 commit
  2. 28 Nov, 2012 3 commits
    • ext4: rationalize ext4_extents.h inclusion · 4a092d73
      Theodore Ts'o authored
      Previously, ext4_extents.h was being included at the end of ext4.h,
      which was bad for a number of reasons: (a) it was not being included
      in the expected place, and (b) it caused the header to be included
      multiple times.  There were #ifdef's to prevent this from causing any
      problems, but it still was unnecessary.
      By moving the function declarations that were in ext4_extents.h to
      ext4.h, which is standard practice for where the function declarations
      for the rest of ext4.h can be found, we can remove ext4_extents.h from
      being included in ext4.h at all, and then we can only include
      ext4_extents.h where it is needed in ext4's source files.
      It should be possible to move a few more things into ext4.h, and
      further reduce the number of source files that need to #include
      ext4_extents.h, but that's a cleanup for another day.
      Reported-by: Sachin Kamat <>
      Reported-by: Wei Yongjun <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: simple cleanup in fiemap codepath · 06348679
      Lukas Czerner authored
      This commit is a simple cleanup of the fiemap codepath which was not
      included in the previous commit, to keep those changes clearer. In this
      commit we rename the cbex variable to newex in ext4_fill_fiemap_extents()
      because the callback is no longer present.
      Signed-off-by: Lukas Czerner <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: prevent race while walking extent tree for fiemap · 91dd8c11
      Lukas Czerner authored
      Currently ext4_ext_walk_space() only takes i_data_sem for read when
      searching for the extent at given block with ext4_ext_find_extent().
      Then it drops the lock and the extent tree can be changed at will.
      However later on we're searching for the 'next' extent, but the extent
      tree might already have changed, so the information might not be
      accurate.
      In fact we can hit BUG_ON(end <= start) if the extent got inserted into
      the tree after the one we found and before the block we were searching
      for. This has been reproduced by running xfstests 225 in a loop on the
      s390x architecture. Theoretically we could hit this on any other
      architecture as well, though probably not as often.
      Moreover the extent currently in delayed allocation might be allocated
      after we search the extent tree and before we search extent status tree
      delayed buffers resulting in those delayed buffers being completely
      missed, even though completely written and allocated.
      We fix all those problems in several steps:
       1. remove unnecessary callback indirection
       2. rename functions
              ext4_ext_walk_space -> ext4_fill_fiemap_extents
              ext4_ext_fiemap_cb -> ext4_find_delayed_extent
       3. move fiemap_fill_next_extent() into ext4_fill_fiemap_extents()
       4. hold the i_data_sem for:
       5. call fiemap_fill_next_extent after releasing the i_data_sem
       6. move path reinitialization into the critical section.
      Signed-off-by: Lukas Czerner <>
      Signed-off-by: "Theodore Ts'o" <>
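The locking change in steps 4 and 5 can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `toy_lock` merely records whether it is held (a stand-in for i_data_sem), `toy_tree` is a sorted array instead of a real extent tree, and every `toy_*` name and `snapshot_extent()` are hypothetical. The point it illustrates is that the covering extent and the start of the next extent are looked up inside one critical section and copied out, so the reporting step can run after the lock is dropped.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for i_data_sem: only records whether it is held. */
struct toy_lock { bool held; };

struct toy_extent { unsigned start; unsigned len; };

/* Sorted, non-overlapping extents; a stand-in for the extent tree. */
struct toy_tree {
    struct toy_lock lock;
    const struct toy_extent *ext;
    size_t count;
};

struct toy_result {
    bool found;             /* an extent covers the requested block */
    struct toy_extent ex;   /* copy of the covering extent, if found */
    unsigned next;          /* start of the following extent, or UINT_MAX */
};

static void toy_lock_read(struct toy_lock *l)   { l->held = true; }
static void toy_unlock_read(struct toy_lock *l) { l->held = false; }

/*
 * Look up the extent covering `block` and the start of the next extent
 * in a single critical section, copying both out before unlocking.
 */
static struct toy_result snapshot_extent(struct toy_tree *t, unsigned block)
{
    struct toy_result r = { false, { 0, 0 }, UINT_MAX };

    toy_lock_read(&t->lock);
    for (size_t i = 0; i < t->count; i++) {
        const struct toy_extent *e = &t->ext[i];

        if (block >= e->start && block < e->start + e->len) {
            r.found = true;
            r.ex = *e;
        } else if (e->start > block) {
            r.next = e->start;
            break;
        }
    }
    toy_unlock_read(&t->lock);

    /* The caller reports r (fiemap_fill_next_extent() in the commit) unlocked. */
    return r;
}
```

Because both values come from the same locked pass, an extent inserted afterwards cannot make the "next" position inconsistent with the extent that was found.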
  3. 09 Nov, 2012 3 commits
  4. 08 Nov, 2012 3 commits
  5. 10 Oct, 2012 1 commit
  6. 05 Oct, 2012 2 commits
    • ext4: serialize fallocate with ext4_convert_unwritten_extents · 60d4616f
      Dmitry Monakhov authored
      Fallocate should wait for pending ext4_convert_unwritten_extents(),
      otherwise the following race may happen:
      ftruncate( ,12288);
      fallocate( ,0, 4096)
      io_submit( ,0, 4096); /* Write to fallocated area, split extent if needed */
      fallocate( ,0, 8192); /* Grow extent and break the assumption about the extent */
      Later kwork completion will do:
       ->ext4_convert_unwritten_extents (0, 4096)
         ->ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_IO_CONVERT_EXT);
          ->ext4_ext_map_blocks() /* Will find new extent:  ex = [0,2] !!!!!! */
              /* convert [0,2] extent to initialized, but only[0,1] was written */
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: fix ext4_flush_completed_IO wait semantics · c278531d
      Dmitry Monakhov authored
      BUG #1) All places where we call ext4_flush_completed_IO are broken
          because buffered io and DIO/AIO go through three stages:
          1) submitted io,
          2) completed io (in i_completed_io_list), conversion pending
          3) finished io (conversion done)
          By calling ext4_flush_completed_IO we flush only the
          requests which were in stage (2), which is wrong because:
           1) punch_hole and truncate _must_ wait for all outstanding unwritten io
            regardless of its state.
           2) fsync and nolock_dio_read should also wait because there is
              a time window between end_page_writeback() and ext4_add_complete_io().
              As a result integrity fsync is broken in case of a buffered write
              to a fallocated region:
              fsync                                      blkdev_completion
                <-- filemap_write_and_wait_range return
                sees empty i_completed_io_list but pending
                conversion still exists
      BUG #2) The race window becomes wider due to the 'ext4: completed_io
      locking cleanup V4' patch series.
      This patch makes the following changes:
      1) ext4_flush_completed_io() now first tries to flush completed io and then
         waits for any outstanding unwritten io via ext4_unwritten_wait()
      2) Rename the function to a more appropriate name.
      3) Assert that all callers of ext4_flush_unwritten_io hold i_mutex to
         prevent an endless wait
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
      Reviewed-by: Jan Kara <>
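The difference between the old and new wait semantics can be shown with a toy model of the three stages. This is a sketch under assumptions, not the kernel implementation: the per-stage counters and all `toy_*` names are hypothetical, and actual waiting is reduced to a predicate that says whether the caller would still have to wait.

```c
#include <stdbool.h>

/* Counts of io requests in each of the three stages named above. */
struct toy_inode {
    int submitted;   /* stage 1: in flight, not yet on the completed list */
    int completed;   /* stage 2: on the completed list, conversion pending */
    int finished;    /* stage 3: conversion done */
};

/* Old semantics: convert only what is already on the completed list. */
static void toy_flush_completed(struct toy_inode *ino)
{
    ino->finished += ino->completed;
    ino->completed = 0;
}

/* Unwritten io that has not reached stage 3 yet. */
static int toy_unwritten_outstanding(const struct toy_inode *ino)
{
    return ino->submitted + ino->completed;
}

/* New semantics: flush first, then report whether a wait is still needed. */
static bool toy_flush_then_must_wait(struct toy_inode *ino)
{
    toy_flush_completed(ino);
    return toy_unwritten_outstanding(ino) > 0;
}
```

With one request still in stage 1, the old flush returns with work outstanding, which is the window that breaks fsync, truncate and punch_hole in the description above.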
  7. 01 Oct, 2012 2 commits
    • ext4: fix ext_remove_space for punch_hole case · 6f2080e6
      Dmitry Monakhov authored
      An inode is allowed to have an empty leaf only if it is a blockless inode.
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: punch_hole should wait for DIO writers · 02d262df
      Dmitry Monakhov authored
      punch_hole is the place where we have to wait for all existing writers
      (writeback, aio, dio), but currently we simply flush pending end_io requests,
      which is not sufficient. Another issue is that punch_hole is performed without
      i_mutex held, which obviously results in dangerous data corruption due to
      write-after-free.
      This patch performs the following changes:
      - Guard punch_hole with i_mutex
      - Recheck inode flags under i_mutex
      - Block all new dio readers in order to prevent an information leak caused by
        the read-after-free pattern.
      - punch_hole now waits for all writers in flight
        NOTE: XXX write-after-free race is still possible because new dirty pages
        may appear due to mmap(), and currently there is no easy way to stop
        writeback while punch_hole is in progress.
      [ Fixed error return from ext4_ext_punch_hole() to make sure that we
        release i_mutex before returning EPERM or ETXTBSY -- Ted ]
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
  8. 29 Sep, 2012 3 commits
    • ext4: completed_io locking cleanup · 28a535f9
      Dmitry Monakhov authored
      The current unwritten extent conversion state machine is very fuzzy.
      - For an unknown reason it performs conversion under i_mutex. What for?
        My diagnosis:
        We already protect the extent tree with i_data_sem, and truncate and
        punch_hole should wait for DIO, so the only data we have to protect is
        the end_io->flags modification. Only flush_completed_IO and end_io_work
        modify these flags, and we can serialize them via i_completed_io_lock.
        Currently all these games with mutex_trylock result in the following deadlock:
         truncate:                          kworker:
          ext4_setattr                       ext4_end_io_work
          inode_dio_wait(inode)  ->BLOCK
                                   DEADLOCK<- mutex_trylock()
        unlink $MNT/file
        fallocate -l $((1024*1024*1024)) $MNT/file
        aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
        sleep 2
        truncate -s 0 $MNT/file
      Or use xfstests test 286.
      This patch makes the state machine simple and clean:
      (1) xxx_end_io schedules final extent conversion simply by calling
          ext4_add_complete_io(), which appends it to ei->i_completed_io_list
          NOTE1: because of (2A) work should be queued only if
          ->i_completed_io_list was empty, otherwise the work is scheduled already.
      (2) ext4_flush_completed_IO is responsible for handling all pending
          end_io from ei->i_completed_io_list
          The flushing sequence consists of the following stages:
          A) LOCKED: Atomically drain completed_io_list to local_list
          B) Perform extents conversion
          C) LOCKED: move converted io's to to_free list for final deletion
             This logic depends on the context we were called from.
          D) Final end_io context destruction
          NOTE1: i_mutex is no longer required because end_io->flags modification
          is protected by ei->ext4_complete_io_lock
      Full list of changes:
      - Move all completion end_io related routines to page-io.c in order to improve
        logic locality
      - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
      - remove EXT4_IO_END_FSYNC
      - Improve SMP scalability by removing useless i_mutex which does not
        protect io->flags anymore.
      - Reduce lock contention on i_completed_io_lock by optimizing list walk.
      - Rename ext4_end_io_nolock to ext4_end_io and make it static
      - Check flush completion status in ext4_ext_punch_hole(), because it is
        not a good idea to punch blocks from a corrupted inode.
      Changes since V3 (in response to Jan's comments):
        Fall back to active flush_completed_IO() approach in order to prevent
        performance issues with nolocked DIO reads.
      Changes since V2:
        Fix use-after-free caused by race truncate vs end_io_work
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
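Stages (1), (2A) and (2B) of the state machine above follow a common "drain under the lock, process outside it" pattern, which can be sketched in user space. This is a minimal model under assumptions, not the kernel code: `toy_lock` only records whether it is held, the list is a plain singly linked list that prepends (so ordering is not preserved), stages C and D are omitted, and all `toy_*` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for i_completed_io_lock: only records whether it is held. */
struct toy_lock { bool held; };

static void toy_lock_acquire(struct toy_lock *l) { l->held = true; }
static void toy_lock_release(struct toy_lock *l) { l->held = false; }

struct toy_io_end {
    struct toy_io_end *next;
    bool converted;
};

struct toy_inode {
    struct toy_lock completed_io_lock;      /* models i_completed_io_lock */
    struct toy_io_end *completed_io_list;   /* models i_completed_io_list */
};

/*
 * Stage (1): queue a completed io. Returns true when the list was empty,
 * which is the only case where the worker needs to be scheduled (NOTE1).
 */
static bool toy_add_complete_io(struct toy_inode *ino, struct toy_io_end *io)
{
    bool was_empty;

    toy_lock_acquire(&ino->completed_io_lock);
    was_empty = (ino->completed_io_list == NULL);
    io->next = ino->completed_io_list;
    ino->completed_io_list = io;
    toy_lock_release(&ino->completed_io_lock);
    return was_empty;
}

/*
 * Stage (2): A) detach the whole list while locked, B) convert each
 * entry with the lock dropped. Returns the number of entries handled.
 */
static int toy_flush_completed_io(struct toy_inode *ino)
{
    struct toy_io_end *local, *io;
    int n = 0;

    toy_lock_acquire(&ino->completed_io_lock);
    local = ino->completed_io_list;
    ino->completed_io_list = NULL;
    toy_lock_release(&ino->completed_io_lock);

    for (io = local; io != NULL; io = io->next) {
        io->converted = true;   /* stands in for the extent conversion */
        n++;
    }
    return n;
}
```

Detaching the list in one locked step keeps the critical section short, which is in the spirit of the reduced contention on i_completed_io_lock that the change list mentions.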
    • ext4: fix unwritten counter leakage · 82e54229
      Dmitry Monakhov authored
      ext4_set_io_unwritten_flag() will increment the i_unwritten counter, so
      once we mark end_io with EXT4_IO_END_UNWRITTEN we have to revert it
      on the error path.
       - add missing error checks to prevent counter leakage
       - ext4_end_io_nolock() will clear the EXT4_IO_END_UNWRITTEN flag to signal
         that conversion finished.
       - add BUG_ON to ext4_free_end_io() to prevent similar leakage in future.
      A visible effect of this bug is that unaligned aio_stress may deadlock.
      Reviewed-by: Jan Kara <>
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
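The pairing of flag and counter described above can be sketched as follows. This is an illustrative model under assumptions rather than the kernel code: the structures and all `toy_*` names are hypothetical, and `assert()` stands in for the BUG_ON added to the free path.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_inode { int unwritten; };   /* models the i_unwritten counter */

struct toy_io_end {
    struct toy_inode *inode;
    bool unwritten_flag;               /* models the "unwritten" end_io flag */
};

/* Setting the flag takes a reference on the counter, once per io_end. */
static void toy_set_unwritten(struct toy_io_end *io)
{
    if (!io->unwritten_flag) {
        io->unwritten_flag = true;
        io->inode->unwritten++;
    }
}

/* Clearing the flag drops that reference again. */
static void toy_clear_unwritten(struct toy_io_end *io)
{
    if (io->unwritten_flag) {
        io->unwritten_flag = false;
        io->inode->unwritten--;
    }
}

/* Submission path: on failure the counter must be reverted before returning. */
static int toy_submit(struct toy_io_end *io, bool fail)
{
    toy_set_unwritten(io);
    if (fail) {
        toy_clear_unwritten(io);   /* the error-path revert the fix adds */
        return -1;
    }
    return 0;
}

/* Freeing an io_end that still holds the flag would leak the counter. */
static void toy_free_io_end(struct toy_io_end *io)
{
    assert(!io->unwritten_flag);
}
```

Anything that waits for the counter to reach zero would block forever if an error path skipped the revert, which matches the deadlock symptom mentioned in the commit message.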
    • ext4: ext4_inode_info diet · f45ee3a1
      Dmitry Monakhov authored
      The generic inode has an unused i_private pointer which may be used to
      store cur_aio_dio.
      TODO: If cur_aio_dio is passed as an argument to get_block_t, this would
            allow concurrent AIO_DIO requests.
      Reviewed-by: Zheng Liu <>
      Reviewed-by: Jan Kara <>
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
  9. 27 Sep, 2012 2 commits
  10. 19 Sep, 2012 1 commit
    • ext4: speed up truncate/unlink by not using bforget() unless needed · 18888cf0
      Andrey Sidorov authored
      Do not iterate over data blocks scanning for bh's to forget as they
      never exist. This improves the time taken by the unlink / truncate syscalls.
      Tested by continuously truncating file that is being written by dd.
      Another test is rm -rf of linux tree while tar unpacks it. With
      ordered data mode condition unlikely(!tbh) was always met in
      ext4_free_blocks. With journal data mode tbh was found only few times,
      so optimisation is also possible.
      Unlinking fallocated 60G file after doing sync && echo 3 >
      /proc/sys/vm/drop_caches && time rm --help
      X86 before (linux 3.6-rc4):
      # time rm -f test1
      real    0m2.710s
      user    0m0.000s
      sys     0m1.530s
      X86 after:
      # time rm -f test1
      real    0m0.644s
      user    0m0.003s
      sys     0m0.060s
      MIPS before (linux 2.6.37):
      # time rm -f test1
      real    0m 4.93s
      user    0m 0.00s
      sys     0m 4.61s
      MIPS after:
      # time rm -f test1
      real    0m 0.16s
      user    0m 0.00s
      sys     0m 0.06s
      Signed-off-by: "Theodore Ts'o" <>
      Signed-off-by: Andrey Sidorov <>
  11. 19 Aug, 2012 2 commits
  12. 17 Aug, 2012 3 commits
    • ext4: make the zero-out chunk size tunable · 67a5da56
      Zheng Liu authored
      Currently in ext4 the length of the zero-out chunk is set to 7 file system
      blocks.  But if an inode has uninitialized extents from using
      fallocate to preallocate space, and the workload issues many random
      writes, this can fragment the extent tree and make it grow
      unnecessarily.
      So create a new sysfs tunable, extent_max_zeroout_kb, which controls
      the maximum size where blocks will be zeroed out instead of creating a
      new uninitialized extent.  The default of this has been set to 32kb.
      CC: Zach Brown <>
      CC: Andreas Dilger <>
      Signed-off-by: Zheng Liu <>
      Signed-off-by: "Theodore Ts'o" <>
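Since the tunable is expressed in kilobytes while extents are measured in file system blocks, the limit has to be converted per file system. The helper below is a sketch of that arithmetic; the name `max_zeroout_blocks()` is hypothetical, and the shift-based conversion is an assumption about how such a conversion is typically done, not a quote of the patch.

```c
/*
 * Convert a zero-out limit in KB into file system blocks.
 * blocksize_bits is log2 of the block size in bytes and is assumed
 * to be at least 10 (1 KB blocks), so the shift count is non-negative.
 */
static unsigned max_zeroout_blocks(unsigned max_zeroout_kb,
                                   unsigned blocksize_bits)
{
    return max_zeroout_kb >> (blocksize_bits - 10);
}
```

Under this model the 32 KB default corresponds to 8 blocks on a 4 KB-block file system and 32 blocks on a 1 KB-block file system, compared with the previous fixed value of 7 blocks.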
    • ext4: collapse a single extent tree block into the inode if possible · ecb94f5f
      Theodore Ts'o authored
      If an inode has more than 4 extents, but then later some of the
      extents are merged together, we can optimize the file system by moving
      the extents up into the inode, and discarding the extent tree block.
      This is important, because if there are a large number of inodes with
      external extent tree blocks where the contents could fit in the
      inode, this can significantly increase the fsck time of the file
      system.
      Google-Bug-Id: 6801242
      Signed-off-by: "Theodore Ts'o" <>
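The "more than 4 extents" threshold follows from the on-disk sizes: the inode's i_block area is 60 bytes, and the extent header and each extent entry take 12 bytes. The check below is a sketch of when a collapse is possible under those sizes; `inode_extent_capacity()` and `can_collapse()` are hypothetical names, not functions from the patch.

```c
#include <stdbool.h>

enum {
    TOY_I_BLOCK_BYTES = 60,   /* 15 32-bit words in the inode's i_block */
    TOY_HEADER_BYTES  = 12,   /* extent header */
    TOY_EXTENT_BYTES  = 12    /* one extent entry */
};

/* Number of extents that fit directly in the inode. */
static unsigned inode_extent_capacity(void)
{
    return (TOY_I_BLOCK_BYTES - TOY_HEADER_BYTES) / TOY_EXTENT_BYTES;
}

/*
 * A tree can be collapsed when it has exactly one level below the root,
 * the root points at a single leaf block, and that leaf's entries fit
 * in the inode.
 */
static bool can_collapse(unsigned depth, unsigned root_entries,
                         unsigned leaf_entries)
{
    return depth == 1 && root_entries == 1 &&
           leaf_entries <= inode_extent_capacity();
}
```

For example, a file that grew to 6 extents and was later merged down to 3 would satisfy the check and no longer need its external tree block.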
    • ext4: fix kernel BUG on large-scale rm -rf commands · 89a4e48f
      Theodore Ts'o authored
      Commit 968dee77: "ext4: fix hole punch failure when depth is greater
      than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel
      crashes when users ran "rm -rf" on a large directory hierarchy on
      ext4 filesystems on RAID devices:
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
          Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710)
          Call Trace:
           [<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110
           [<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0
           [<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0
           [<ffffffff81207e05>] ext4_truncate+0xf5/0x100
           [<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490
           [<ffffffff811a1312>] evict+0xa2/0x1a0
           [<ffffffff811a1513>] iput+0x103/0x1f0
           [<ffffffff81196d84>] do_unlinkat+0x154/0x1c0
           [<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40
           [<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50
           [<ffffffff816135e9>] system_call_fastpath+0x16/0x1b
          Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00
          RIP  [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0
      This could be reproduced as follows:
      The problem in commit 968dee77 was that it caused the variable 'i' to
      be left uninitialized if the truncate required more space than was
      available in the journal.  This resulted in the function
      ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused
      ext4_ext_remove_space() to restart the truncate operation after
      starting a new jbd2 handle.
      Reported-by: Maciej Żenczykowski <>
      Reported-by: Marti Raudsepp <>
      Tested-by: Fengguang Wu <>
      Signed-off-by: "Theodore Ts'o" <>
  13. 23 Jul, 2012 1 commit
    • ext4: fix hole punch failure when depth is greater than 0 · 968dee77
      Ashish Sangwan authored
      Whether to continue removing extents or not is decided by the return
      value of function ext4_ext_more_to_rm() which checks 2 conditions:
      a) if there are no more indexes to process.
      b) if the number of entries is decreased in the header of "depth - 1".
      In case of hole punch, if the last block to be removed is not part of
      the last extent index then this index will not be deleted, hence the
      number of valid entries in the extent header of "depth - 1" will
      remain as it is and ext4_ext_more_to_rm will return 0 although the
      required blocks are not yet removed.
      This patch fixes the above mentioned problem: instead of removing
      the extents from the end of file, it starts removing the blocks from
      the particular extent from which removing blocks is actually required
      and continues backward until done.
      Signed-off-by: Ashish Sangwan <>
      Signed-off-by: Namjae Jeon <>
      Reviewed-by: Lukas Czerner <>
  14. 09 Jul, 2012 1 commit
  15. 30 Jun, 2012 1 commit
  16. 01 Jun, 2012 1 commit
    • ext4: hole-punch use truncate_pagecache_range · 5e44f8c3
      Hugh Dickins authored
      When truncating a file, we unmap pages from userspace first, as that's
      usually more efficient than relying, page by page, on the fallback in
      truncate_inode_page() - particularly if the file is mapped many times.
      Do the same when punching a hole: 3.4 added truncate_pagecache_range()
      to do the unmap and trunc, so use it in ext4_ext_punch_hole(), instead
      of calling truncate_inode_pages_range() directly.
      Signed-off-by: Hugh Dickins <>
      Signed-off-by: "Theodore Ts'o" <>
  17. 28 May, 2012 1 commit
  18. 29 Apr, 2012 2 commits
  19. 16 Apr, 2012 1 commit
  20. 13 Apr, 2012 1 commit
  21. 22 Mar, 2012 1 commit
    • ext4: remove restrictive checks for EOFBLOCKS_FL · afcff5d8
      Lukas Czerner authored
      We are going to remove the EOFBLOCKS_FL flag in the future, so this is
      the first part of the removal. We cannot remove it entirely just now,
      since e2fsck is still checking for it and it might cause headaches for
      some people. Instead, remove the restrictive checks now and the rest
      later, when the new e2fsck code is out and common enough.
      This is also needed because punch hole already breaks the EOFBLOCKS_FL
      semantics, so it might cause some trouble. So simply remove it.
      Signed-off-by: Lukas Czerner <>
      Signed-off-by: "Theodore Ts'o" <>
  22. 20 Mar, 2012 4 commits
    • ext4: give more helpful error message in ext4_ext_rm_leaf() · dc1841d6
      Lukas Czerner authored
      The error message produced by ext4_ext_rm_leaf() when we are
      removing blocks which accidentally end up inside an existing extent
      is not very helpful, because we would also like to know which extent
      we collided with.
      This commit changes the error message to also give us the information
      about the extent we are colliding with.
      Signed-off-by: Lukas Czerner <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: remove unused code from ext4_ext_map_blocks() · 7877191c
      Lukas Czerner authored
      Since the commit 'Rewrite punch hole to use ext4_ext_remove_space()'
      reworked the punch hole implementation to use ext4_ext_remove_space()
      instead of ext4_ext_map_blocks(), we can remove the code which is no
      longer needed from the ext4_ext_map_blocks().
      Signed-off-by: Lukas Czerner <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: rewrite punch hole to use ext4_ext_remove_space() · 5f95d21f
      Lukas Czerner authored
      This commit rewrites the ext4 punch hole implementation to use
      ext4_ext_remove_space() instead of its home grown way of doing this via
      ext4_ext_map_blocks(). There are several reasons for changing this.
      Firstly it is quite non-obvious that punching a hole needs to call
      ext4_ext_map_blocks(), especially given that this
      function should map blocks, not unmap them. It also required a lot of new
      code in ext4_ext_map_blocks().
      Secondly the design of it is not very effective. The reason is that we
      are trying to punch out blocks in ext4_ext_punch_hole() in opposite
      direction than in ext4_ext_rm_leaf() which causes the ext4_ext_rm_leaf()
      to iterate through the whole tree from the end to the start to find the
      requested extent for every extent we are going to punch out.
      And finally the current implementation does not use the existing code,
      but brings a lot of new code, which is IMO unnecessary since there
      already is some infrastructure we can use. Specifically,
      ext4_ext_remove_space().
      This commit changes ext4_ext_remove_space() to accept 'end' parameter so
      we can not only truncate to the end of file, but also remove the space
      in the middle of the file (punch a hole). Moreover, because the last
      block to punch out might be in the middle of the extent, we have to
      split the extent at 'end + 1' so ext4_ext_rm_leaf() can easily either
      remove the whole first part of the split extent, or change its size.
      ext4_ext_remove_space() is then used to actually remove the space
      (extents) from within the hole, instead of ext4_ext_map_blocks().
      Note that this also fixes the issue with punch hole, where we would forget
      to remove empty index blocks from the extent tree, resulting in a double
      free block error and file system corruption. This is simply because we
      now use a different code path, where this problem does not exist.
      This has been tested with fsx running for several days and xfstests,
      plus xfstest #251 with '-o discard' run on the loop image (which
      converts discard requests into punch hole to the backing file). All of
      it on 1K and 4K file system block size.
      Signed-off-by: Lukas Czerner <>
      Signed-off-by: "Theodore Ts'o" <>
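The effect of splitting at 'end + 1' and then removing the range can be illustrated on a single extent. This is a user-space sketch under assumptions, not the kernel algorithm: `toy_extent` and `toy_punch()` are hypothetical names, block numbers are plain unsigned integers, and the extent is assumed to have a non-zero length.

```c
struct toy_extent { unsigned start; unsigned len; };

/*
 * Remove blocks [first, last] (inclusive) from one extent.
 * Writes the surviving pieces to out[] and returns how many there are:
 * 0 if the whole extent is removed, 1 if one side survives (or the
 * range misses the extent entirely), 2 if the hole is in the middle.
 */
static int toy_punch(struct toy_extent ex, unsigned first, unsigned last,
                     struct toy_extent out[2])
{
    unsigned ex_end = ex.start + ex.len - 1;   /* last block of the extent */
    int n = 0;

    if (last < ex.start || first > ex_end) {   /* no overlap: keep as is */
        out[n++] = ex;
        return n;
    }
    if (first > ex.start) {                    /* piece before the hole */
        out[n].start = ex.start;
        out[n].len = first - ex.start;
        n++;
    }
    if (last < ex_end) {                       /* piece from 'last + 1' on */
        out[n].start = last + 1;
        out[n].len = ex_end - last;
        n++;
    }
    return n;
}
```

The second surviving piece starts exactly at `last + 1`, which corresponds to the split point the commit message describes.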