  1. 10 Dec, 2012 2 commits
  2. 28 Nov, 2012 1 commit
    • ext4: rationalize ext4_extents.h inclusion · 4a092d73
      Theodore Ts'o authored
      Previously, ext4_extents.h was being included at the end of ext4.h,
      which was bad for a number of reasons: (a) it was not being included
      in the expected place, and (b) it caused the header to be included
      multiple times.  There were #ifdef's to prevent this from causing any
      problems, but it still was unnecessary.
      By moving the function declarations that were in ext4_extents.h to
      ext4.h, which is standard practice for where the function declarations
      for the rest of ext4.h can be found, we can remove ext4_extents.h from
      being included in ext4.h at all, and then we can only include
      ext4_extents.h where it is needed in ext4's source files.
      It should be possible to move a few more things into ext4.h, and
      further reduce the number of source files that need to #include
      ext4_extents.h, but that's a cleanup for another day.
      Reported-by: Sachin Kamat <>
      Reported-by: Wei Yongjun <>
      Signed-off-by: "Theodore Ts'o" <>
  3. 09 Nov, 2012 1 commit
  4. 08 Nov, 2012 1 commit
  5. 22 Oct, 2012 1 commit
  6. 10 Oct, 2012 1 commit
    • ext4: fix metadata checksum calculation for the superblock · 06db49e6
      Theodore Ts'o authored
      The function ext4_handle_dirty_super() was calculating the superblock
      checksum on the wrong block data.  As a result, when the superblock is
      modified while the file system is mounted (most commonly, when inodes are
      added to or removed from the orphan list), the superblock checksum would
      be wrong.  We didn't notice because the superblock checksum *was* being
      correctly calculated in ext4_commit_super(), and this would get called
      when the file system was unmounted.  So the problem only became obvious
      if the system crashed while the file system was mounted.
      Fix this by removing the poorly designed function signature for
      ext4_superblock_csum_set(); if it only took a single argument, the
      pointer to a struct super_block, the ambiguity which caused this
      mistake would have been impossible.
      Reported-by: George Spelvin <>
      Signed-off-by: "Theodore Ts'o" <>
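The signature problem described in the commit above can be illustrated with a minimal sketch. All names here (`demo_super_block`, `demo_raw_super`, `demo_csum`) are hypothetical stand-ins, and the checksum is a toy byte sum rather than the one ext4 uses. The point is only that a two-argument form lets the caller pass a buffer unrelated to the superblock, while a one-argument form cannot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the on-disk superblock image. */
struct demo_raw_super {
    uint8_t  payload[16];
    uint32_t checksum;
};

/* Hypothetical stand-in for the in-memory superblock. */
struct demo_super_block {
    struct demo_raw_super *raw;   /* the one authoritative on-disk image */
};

/* Toy checksum (an assumption for illustration): sum of payload bytes. */
static uint32_t demo_csum(const struct demo_raw_super *es)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < sizeof(es->payload); i++)
        sum += es->payload[i];
    return sum;
}

/* Error-prone shape: nothing ties 'es' to 'sb', so a caller can
 * checksum a buffer that is not the live superblock image. */
static void demo_csum_set_two_args(struct demo_super_block *sb,
                                   struct demo_raw_super *es)
{
    (void)sb;
    es->checksum = demo_csum(es);
}

/* Safer shape: the only possible data source is sb->raw. */
static void demo_csum_set(struct demo_super_block *sb)
{
    sb->raw->checksum = demo_csum(sb->raw);
}
```

With the two-argument form, passing a stale buffer updates that buffer and leaves the live image's checksum untouched; the one-argument form leaves no such choice to the caller.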
  7. 05 Oct, 2012 1 commit
    • ext4: fix ext4_flush_completed_IO wait semantics · c278531d
      Dmitry Monakhov authored
      BUG #1) All places where we call ext4_flush_completed_IO are broken
          because buffered io and DIO/AIO go through three stages:
          1) submitted io,
          2) completed io (in i_completed_io_list), conversion pending
          3) finished io (conversion done)
          By calling ext4_flush_completed_IO we flush only the requests
          which are in stage (2), which is wrong because:
           1) punch_hole and truncate _must_ wait for all outstanding unwritten io
              regardless of its state.
           2) fsync and nolock_dio_read should also wait because there is
              a time window between end_page_writeback() and ext4_add_complete_io().
              As a result, integrity fsync is broken in the case of a buffered write
              to a fallocated region:
              fsync                                      blkdev_completion
                <-- filemap_write_and_wait_range return
                sees empty i_completed_io_list but pending
                conversion still exists
      BUG #2) The race window becomes wider due to the 'ext4: completed_io
      locking cleanup V4' patch series
      This patch makes the following changes:
      1) ext4_flush_completed_io() now first tries to flush completed io and then
         waits for any outstanding unwritten io via ext4_unwritten_wait()
      2) Rename the function to a more appropriate name.
      3) Assert that all callers of ext4_flush_unwritten_io hold i_mutex to
         prevent an endless wait
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
      Reviewed-by: Jan Kara <>
  8. 29 Sep, 2012 4 commits
    • ext4: serialize dio nonlocked reads with defrag workers · 17335dcc
      Dmitry Monakhov authored
      Inode's block defrag and ext4_change_inode_journal_flag() may
      affect the result of nonlocked DIO reads, so proper synchronization
      is required:
      - Add missed inode_dio_wait() calls where appropriate
      - Check inode state under extra i_dio_count reference.
      Reviewed-by: Jan Kara <>
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: completed_io locking cleanup · 28a535f9
      Dmitry Monakhov authored
      The current unwritten extent conversion state machine is very fuzzy.
      - For an unknown reason it performs conversion under i_mutex. What for?
        My diagnosis:
        We already protect the extent tree with i_data_sem, and truncate and
        punch_hole should wait for DIO, so the only data we have to protect is
        the end_io->flags modification. But only flush_completed_IO and
        end_io_work modify these flags, and we can serialize them via
        i_completed_io_lock.
        Currently all these games with mutex_trylock result in the following deadlock:
         truncate:                          kworker:
          ext4_setattr                       ext4_end_io_work
          inode_dio_wait(inode)  ->BLOCK
                                   DEADLOCK<- mutex_trylock()
        unlink $MNT/file
        fallocate -l $((1024*1024*1024)) $MNT/file
        aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
        sleep 2
        truncate -s 0 $MNT/file
      Or use xfstests test 286
      This patch makes the state machine simple and clean:
      (1) xxx_end_io schedules the final extent conversion simply by calling
          ext4_add_complete_io(), which appends it to ei->i_completed_io_list
          NOTE1: because of (2A), work should be queued only if
          ->i_completed_io_list was empty; otherwise the work is already scheduled.
      (2) ext4_flush_completed_IO is responsible for handling all pending
          end_io from ei->i_completed_io_list
          The flushing sequence consists of the following stages:
          A) LOCKED: Atomically drain completed_io_list to local_list
          B) Perform extents conversion
          C) LOCKED: move converted io's to the to_free list for final deletion
             This logic depends on the context we were called from.
          D) Final end_io context destruction
          NOTE2: i_mutex is no longer required because the end_io->flags modification
          is protected by ei->i_completed_io_lock
      Full list of changes:
      - Move all completion end_io related routines to page-io.c in order to improve
        logic locality
      - Move open-coded logic from various xx_end_xx routines to ext4_add_complete_io()
      - Remove EXT4_IO_END_FSYNC
      - Improve SMP scalability by removing the useless i_mutex, which does not
        protect io->flags anymore.
      - Reduce lock contention on i_completed_io_lock by optimizing the list walk.
      - Rename ext4_end_io_nolock to ext4_end_io and make it static
      - Check the flush completion status in ext4_ext_punch_hole(), because it is
        not a good idea to punch blocks from a corrupted inode.
      Changes since V3 (in response to Jan's comments):
        Fall back to the active flush_completed_IO() approach in order to prevent
        performance issues with nolocked DIO reads.
      Changes since V2:
        Fix use-after-free caused by a race between truncate and end_io_work
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
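The add/flush scheme from the commit above (queue work only when the list was empty, drain the list under the lock, convert outside the lock) can be sketched in user space as follows. This is a minimal illustration under stated assumptions, not the kernel code: the `demo_*` types and functions are hypothetical, a pthread mutex stands in for the spinlock, the list is pushed at the head so ordering is not preserved, and stages C and D (moving to a to_free list and destroying the end_io) are omitted.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a completed end_io request. */
struct demo_io_end {
    struct demo_io_end *next;
    int converted;                       /* set once "conversion" is done */
};

/* Hypothetical stand-in for the per-inode state. */
struct demo_inode {
    pthread_mutex_t completed_lock;      /* plays the role of i_completed_io_lock */
    struct demo_io_end *completed_list;  /* plays the role of i_completed_io_list */
};

/* Stage (1): add a completed io to the list.
 * Returns 1 if the list was empty, i.e. the caller should queue work;
 * otherwise work is assumed to be scheduled already. */
static int demo_add_complete_io(struct demo_inode *inode,
                                struct demo_io_end *io)
{
    int was_empty;

    pthread_mutex_lock(&inode->completed_lock);
    was_empty = (inode->completed_list == NULL);
    io->next = inode->completed_list;
    inode->completed_list = io;
    pthread_mutex_unlock(&inode->completed_lock);
    return was_empty;
}

/* Stage (2): flush all pending io. Returns the number converted. */
static int demo_flush_completed_io(struct demo_inode *inode)
{
    struct demo_io_end *local, *io;
    int n = 0;

    /* A) LOCKED: atomically drain the shared list to a local list. */
    pthread_mutex_lock(&inode->completed_lock);
    local = inode->completed_list;
    inode->completed_list = NULL;
    pthread_mutex_unlock(&inode->completed_lock);

    /* B) Perform the conversion without holding the lock. */
    for (io = local; io != NULL; io = io->next) {
        io->converted = 1;
        n++;
    }
    return n;
}
```

Draining to a local list keeps the locked section short, which is consistent with the stated goal of reducing contention on the completed-io lock.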
    • ext4: give i_aiodio_unwritten a more appropriate name · e27f41e1
      Dmitry Monakhov authored
      The AIO/DIO prefix is wrong because it accounts for unwritten extents,
      which may also be scheduled from buffered write endio
      Reviewed-by: Jan Kara <>
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: ext4_inode_info diet · f45ee3a1
      Dmitry Monakhov authored
      The generic inode has an unused i_private pointer which may be used as cur_aio_dio
      TODO: If cur_aio_dio were passed as an argument to get_block_t, this would
            allow concurrent AIO_DIO requests.
      Reviewed-by: Zheng Liu <>
      Reviewed-by: Jan Kara <>
      Signed-off-by: Dmitry Monakhov <>
      Signed-off-by: "Theodore Ts'o" <>
  9. 05 Sep, 2012 2 commits
    • ext4: grow the s_group_info array as needed · 28623c2f
      Theodore Ts'o authored
      Previously we allocated the s_group_info array with enough space for
      any future possible growth of the file system via online resize.  This
      is unfortunate because it wastes memory, and it doesn't work for the
      meta_bg scheme, since there is no limit based on the number of
      reserved gdt blocks.  So add the code to grow the s_group_info array
      as needed.
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: grow the s_flex_groups array as needed when resizing · 117fff10
      Theodore Ts'o authored
      Previously, we allocated the s_flex_groups array to the maximum size
      that the file system could be resized to.  There were two problems with
      this approach.  First, it wasted memory in the common case where the
      file system was not resized.  Secondly, once we start allowing online
      resizing using the meta_bg scheme, there is no maximum size that the
      file system can be resized to.  So instead, we need to grow the
      s_flex_groups array at online resize time.
      Signed-off-by: "Theodore Ts'o" <>
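The "grow as needed" approach from the two commits above can be sketched as a small user-space helper. The names (`demo_table`, `demo_table_grow`) are hypothetical, and rounding the capacity up to a power of two is an assumption of this sketch; the commit messages do not state the growth policy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-group array such as s_flex_groups. */
struct demo_table {
    int    *slots;
    size_t  capacity;   /* number of allocated entries */
};

/* Smallest power of two >= n (assumed growth policy). */
static size_t demo_roundup_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Make sure the table can hold 'needed' entries.
 * Returns 0 on success, -1 if allocation fails (table left unchanged). */
static int demo_table_grow(struct demo_table *t, size_t needed)
{
    size_t new_cap;
    int *new_slots;

    if (needed <= t->capacity)
        return 0;                       /* already large enough */

    new_cap = demo_roundup_pow2(needed);
    new_slots = calloc(new_cap, sizeof(*new_slots));
    if (new_slots == NULL)
        return -1;

    /* Copy the existing entries; the new tail stays zeroed. */
    if (t->slots != NULL)
        memcpy(new_slots, t->slots, t->capacity * sizeof(*new_slots));
    free(t->slots);
    t->slots = new_slots;
    t->capacity = new_cap;
    return 0;
}
```

A table sized this way costs nothing extra when the file system is never resized, and it has no built-in upper bound, which matches the two problems the commit message names.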
  10. 17 Aug, 2012 2 commits
    • ext4: make the zero-out chunk size tunable · 67a5da56
      Zheng Liu authored
      Currently in ext4 the length of the zero-out chunk is set to 7 file system
      blocks.  But if an inode has uninitialized extents from using
      fallocate to preallocate space, and the workload issues many random
      writes, this can fragment the extent tree and make it grow
      unnecessarily.
      So create a new sysfs tunable, extent_max_zeroout_kb, which controls
      the maximum size where blocks will be zeroed out instead of creating a
      new uninitialized extent.  The default has been set to 32kb.
      CC: Zach Brown <>
      CC: Andreas Dilger <>
      Signed-off-by: Zheng Liu <>
      Signed-off-by: "Theodore Ts'o" <>
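To make the units of extent_max_zeroout_kb concrete, here is a minimal sketch of converting the tunable from kilobytes to file system blocks and using it as a threshold. The function names are hypothetical, and the `<=` comparison is an assumption; the commit message only says the tunable is the maximum size at which blocks are zeroed out.

```c
#include <assert.h>

/* Convert a limit in KiB to a limit in blocks.
 * blocksize_bits is log2 of the block size and is assumed to be >= 10. */
static unsigned int demo_zeroout_blocks(unsigned int max_zeroout_kb,
                                        unsigned int blocksize_bits)
{
    return max_zeroout_kb >> (blocksize_bits - 10);
}

/* Returns 1 if a range of this many blocks would be zeroed out,
 * 0 if a new uninitialized extent would be created instead. */
static int demo_should_zeroout(unsigned int len_blocks,
                               unsigned int max_zeroout_kb,
                               unsigned int blocksize_bits)
{
    return len_blocks <= demo_zeroout_blocks(max_zeroout_kb, blocksize_bits);
}
```

With 4 KiB blocks (blocksize_bits = 12), the default of 32 KiB corresponds to 8 blocks, close to the previous fixed value of 7.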
    • ext4: add max_dir_size_kb mount option · df981d03
      Theodore Ts'o authored
      Very large directories can cause significant performance problems, or
      perhaps even invoke the OOM killer, if the process is running in a
      highly constrained memory environment (whether it is VMs with a small
      amount of memory or in a small memory cgroup).
      So it is useful, in cloud server/data center environments, to be able
      to set a filesystem-wide cap on the maximum size of a directory, to
      ensure that directories never get larger than a sane size.  We do this
      via a new mount option, max_dir_size_kb.  If there is an attempt to
      grow the directory larger than max_dir_size_kb, the system call will
      return ENOSPC instead.
      Google-Bug-Id: 6863013
      Signed-off-by: "Theodore Ts'o" <>
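The directory size cap described above amounts to a single check before a directory is extended. The sketch below is an illustration only: `demo_check_dir_growth` is a hypothetical name, and treating a value of 0 as "no limit" is an assumption not stated in the commit message.

```c
#include <assert.h>
#include <errno.h>

/* Decide whether a directory of the given size may grow further.
 * Returns 0 if growth is allowed, -ENOSPC if the cap is reached.
 * max_dir_size_kb == 0 is treated as "no limit" (an assumption). */
static int demo_check_dir_growth(unsigned long long dir_size_bytes,
                                 unsigned long max_dir_size_kb)
{
    if (max_dir_size_kb != 0 &&
        (dir_size_bytes >> 10) >= max_dir_size_kb)
        return -ENOSPC;
    return 0;
}
```

Returning ENOSPC means callers that create files see the same error they would get from a full file system, so existing error handling applies unchanged.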
  11. 23 Jul, 2012 3 commits
    • ext4: convert last user of ext4_mark_super_dirty() to ext4_handle_dirty_super() · 044ce47f
      Jan Kara authored
      The last user of ext4_mark_super_dirty(), in ext4_file_open(), is so
      rare that it can just as well modify the superblock properly by
      journalling the change.  Change it and get rid of
      ext4_mark_super_dirty() as it's not needed anymore.
      Artem: small amendments.
      Artem: tested using xfstests for both journalled and non-journalled ext4.
      Signed-off-by: Jan Kara <>
      Signed-off-by: Artem Bityutskiy <>
      Signed-off-by: "Theodore Ts'o" <>
      Tested-by: Artem Bityutskiy <>
    • ext4: remove dynamic array size in ext4_chksum() · 3108b54b
      Theodore Ts'o authored
      The ext4_chksum() inline function was using a dynamic array size,
      which is not legal C.  (It is a gcc extension.)
      Remove it.
      Cc: "Darrick J. Wong" <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: make quota as first class supported feature · 7c319d32
      Aditya Kali authored
      This patch adds support for quotas as a first class feature in ext4;
      which is to say, the quota files are stored in hidden inodes as file
      system metadata, instead of as separate files visible in the file system
      directory hierarchy.
      It is based on the proposal at:
      This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
      which, when turned on, enables quota accounting at mount time
      itself. Also, the quota inodes are stored in two additional superblock
      fields.  Some changes introduced by this patch that should be pointed
      out are:
      1) Two new ext4-superblock fields - s_usr_quota_inum and
         s_grp_quota_inum for storing the quota inodes in use.
      2) Default quota inodes are: inode#3 for tracking userquota and inode#4
         for tracking group quota. The superblock fields can be set to use
         other inodes as well.
      3) If the QUOTA feature and corresponding quota inodes are set in
         superblock, the quota usage tracking is turned on at mount time. On
         'quotaon' ioctl, the quota limits enforcement is turned
         on. 'quotaoff' ioctl turns off only the limits enforcement in this
         case.
      4) When QUOTA feature is in use, the quota mount options 'quota',
         'usrquota', 'grpquota' are ignored by the kernel.
      5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
         quota inodes. The default reserved inodes will not be visible to user
         as regular files.
      6) The quota-tools will need to be modified to support hidden quota
         files on ext4. E2fsprogs will also include support for creating and
         fixing quota files.
      7) Support is only for the new V2 quota file format.
      Tested-by: Jan Kara <>
      Reviewed-by: Jan Kara <>
      Reviewed-by: Johann Lombardi <>
      Signed-off-by: Aditya Kali <>
      Signed-off-by: "Theodore Ts'o" <>
  12. 09 Jul, 2012 2 commits
    • ext4: add a new nolock flag in ext4_map_blocks · 729f52c6
      Zheng Liu authored
      The EXT4_GET_BLOCKS_NO_LOCK flag is added to indicate that we don't need
      to acquire the i_data_sem lock in ext4_map_blocks.  Meanwhile, it changes
      ext4_get_block() to not start a new journal because when we do an
      overwrite dio, there is no metadata that needs to be modified.
      We define a new function called ext4_get_block_write_nolock, which is
      used in dio overwrite nolock.  In this function, it doesn't try to
      acquire i_data_sem lock and doesn't start a new journal as it does a
      CC: Tao Ma <>
      CC: Eric Sandeen <>
      CC: Robin Dong <>
      Signed-off-by: Zheng Liu <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: fix overhead calculation used by ext4_statfs() · 952fc18e
      Theodore Ts'o authored
      Commit f975d6bc introduced a bug which caused ext4_statfs() to
      miscalculate the number of file system overhead blocks.  This causes
      the f_blocks field in the statfs structure to be larger than it should
      be.  This would in turn cause the "df" output to show the number of
      data blocks in the file system and the number of data blocks used to
      be larger than they should be.
      Signed-off-by: "Theodore Ts'o" <>
  13. 30 Jun, 2012 1 commit
  14. 31 May, 2012 1 commit
  15. 28 May, 2012 1 commit
  16. 27 May, 2012 1 commit
  17. 15 May, 2012 1 commit
  18. 29 Apr, 2012 9 commits
  19. 16 Apr, 2012 1 commit
  20. 20 Mar, 2012 1 commit
  21. 19 Mar, 2012 1 commit
  22. 05 Mar, 2012 2 commits
    • ext4: add comments to definition of ext4_io_end_t · 4188188b
      Curt Wohlgemuth authored
      This should make it more clear what this structure is used
      for, and how some of the (mutually exclusive) fields are
      used to keep page cache references.
      Signed-off-by: Curt Wohlgemuth <>
      Signed-off-by: "Theodore Ts'o" <>
    • ext4: fix race between sync and completed io work · 491caa43
      Jeff Moyer authored
      The following command line will leave the aio-stress process unkillable
      on an ext4 file system (in my case, mounted on /mnt/test):
      aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2
      This is using the aio-stress program from the xfstests test suite.
      That particular command line tells aio-stress to do random writes to
      20 files from 20 threads (one thread per file).  The files are NOT
      preallocated, so you will get writes to random offsets within the
      file, thus creating holes and extending i_size.  It also opens the
      file with O_DIRECT and O_SYNC.
      On to the problem.  When an I/O requires unwritten extent conversion,
      it is queued onto the completed_io_list for the ext4 inode.  Two code
      paths will pull work items from this list.  The first is the
      ext4_end_io_work routine, and the second is ext4_flush_completed_IO,
      which is called via the fsync path (and O_SYNC handling, as well).
      There are two issues I've found in these code paths.  First, if the
      fsync path beats the work routine to a particular I/O, the work
      routine will free the io_end structure!  It does not take into account
      the fact that the io_end may still be in use by the fsync path.  I've
      fixed this issue by adding yet another IO_END flag, indicating that
      the io_end is being processed by the fsync path.
      The second problem is that the work routine will make an assignment to
      io->flag outside of the lock.  I have witnessed this result in a hang
      at umount.  Moving the flag setting inside the lock resolved that
      problem.
      The problem was introduced by commit b82e384c ("ext4: optimize
      locking for end_io extent conversion"), which first appeared in 3.2.
      As such, the fix should be backported to that release (probably along
      with the unwritten extent conversion race fix).
      Signed-off-by: Jeff Moyer <>
      Signed-off-by: "Theodore Ts'o" <>