This project is mirrored from https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git.
  1. 17 Dec, 2010 1 commit
  2. 16 Dec, 2010 1 commit
  3. 14 Dec, 2010 1 commit
    • ext4: Turn off multiple page-io submission by default · 1449032b
      Theodore Ts'o authored
      
      
      Jon Nelson has found a test case which causes postgresql to fail with
      the error:
      
      psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
      
      Under memory pressure, it looks like part of a file can end up getting
      replaced by zeros.  Until we can figure out the cause, we'll roll
      back the change and use block_write_full_page() instead of
      ext4_bio_write_page().  The new, more efficient writing function can
      be used via the mount option mblk_io_submit, so we can test and fix
      the new page I/O code.
      
      To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
      memory such that the system is just at the point of triggering
      writeback before running the following sql script:
      
      begin;
      create temporary table foo as select x as a, ARRAY[x] as b FROM
      generate_series(1, 10000000 ) AS x;
      create index foo_a_idx on foo (a);
      create index foo_b_idx on foo USING GIN (b);
      rollback;
      
      If the temporary table is created on a hard drive partition which is
      encrypted using dm_crypt, then under memory pressure, approximately
      30-40% of the time, pgsql will issue the above failure.
      
      This patch should fix this problem, and the problem will come back if
      the file system is mounted with the mblk_io_submit mount option.
      
      Reported-by: Jon Nelson <jnelson@jamponi.net>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  4. 08 Nov, 2010 1 commit
  5. 02 Nov, 2010 1 commit
  6. 28 Oct, 2010 18 commits
    • ext4: BUG_ON fix: check if page has buffers before calling page_buffers() · b1142e8f
      Theodore Ts'o authored
      
      
      We need to check whether a page has buffers by calling
      page_has_buffers(page) before calling page_buffers(page) in
      ext4_writepage().  Otherwise page_buffers() could trigger a BUG_ON.
      
      Thanks also to Markus Trippelsdorf and Avinash Kurup who also reported
      the problem.
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reported-by: Sedat Dilek <sedat.dilek@googlemail.com>
      Tested-by: Sedat Dilek <sedat.dilek@googlemail.com>
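The guard described in this commit can be illustrated with a small userspace sketch. The `toy_` types and functions below are hypothetical stand-ins, not the kernel's `struct page` or its helpers; the `assert()` plays the role that BUG_ON plays inside `page_buffers()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a buffer_head and a page. */
struct toy_buffer { int blocknr; };
struct toy_page { struct toy_buffer *private; };

/* Analogue of page_has_buffers(): true when buffers are attached. */
static int toy_page_has_buffers(const struct toy_page *page)
{
	return page->private != NULL;
}

/* Analogue of page_buffers(): aborts if called on a page without buffers. */
static struct toy_buffer *toy_page_buffers(const struct toy_page *page)
{
	assert(toy_page_has_buffers(page));
	return page->private;
}

/* Checks first, so the assertion above can never fire from this caller.
 * Returns the first block number, or -1 when no buffers are attached. */
static int toy_first_block(const struct toy_page *page)
{
	if (!toy_page_has_buffers(page))
		return -1;
	return toy_page_buffers(page)->blocknr;
}
```

Calling `toy_first_block()` on a page whose `private` pointer is NULL returns -1 instead of aborting, which is the behaviour the check is meant to provide.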
    • ext4: optimize orphan_list handling for ext4_setattr · 3d287de3
      Dmitry Monakhov authored
      
      
      Surprisingly, chown() on ext4 is not an SMP-scalable operation.
      The unconditional orphan_del(NULL, inode) in ext4_setattr()
      results in significant performance overhead because of the global
      orphan mutex, especially in no-journal mode (where orphan_add() is
      a no-op). The explicit orphan_del can be skipped when it is not needed.
      Results of fchown() micro-benchmark in no-journal mode:
      while (1) {
         iteration++;
         fchown(fd, uid, gid);
         fchown(fd, uid + 1, gid + 1);
      }
      measured: iterations per millisecond
      | nr_tasks | w/o patch | with patch |
      |        1 |       142 |        185 |
      |        4 |       109 |        642 |
      
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
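The loop quoted in this commit message could be turned into a compilable single-task harness along the following lines. This is only a sketch: the function names, the scratch-file path, and the time-based stop condition are assumptions, and the original benchmark program is not shown in the commit. Alternating between `uid` and `uid + 1` as in the original normally requires root, so the helper takes both ownership pairs as parameters.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: create an unlinked scratch file and return its fd. */
static int open_scratch_file(void)
{
	char path[] = "/tmp/fchown-bench-XXXXXX";
	int fd = mkstemp(path);

	if (fd >= 0)
		unlink(path);
	return fd;
}

/* Alternate fchown() between two owner/group pairs for about `millis` ms.
 * Returns the number of loop iterations, or -1 if a call fails. */
static long fchown_iterations(int fd, uid_t uid_a, gid_t gid_a,
			      uid_t uid_b, gid_t gid_b, long millis)
{
	struct timespec start, now;
	long iterations = 0;

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1;
	for (;;) {
		long elapsed_ms;

		iterations++;
		if (fchown(fd, uid_a, gid_a) != 0 || fchown(fd, uid_b, gid_b) != 0)
			return -1;
		if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
			return -1;
		elapsed_ms = (now.tv_sec - start.tv_sec) * 1000L +
			     (now.tv_nsec - start.tv_nsec) / 1000000L;
		if (elapsed_ms >= millis)
			break;
	}
	return iterations;
}
```

Dividing the return value by `millis` gives iterations per millisecond, the unit used in the table above. To measure the effect described in the commit, the scratch file would have to live on an ext4 filesystem and several tasks would need to run the loop concurrently.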
    • ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static · 4a873a47
      Theodore Ts'o authored
      
      
      Fix a namespace leak by moving the function to the file where it is
      used and making it static.
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: make various ext4 functions be static · 1f109d5a
      Theodore Ts'o authored
      
      
      These functions have no need to be exported beyond file context.
      
      No functions needed to be moved for this commit; just some function
      declarations changed to be static and removed from header files.
      
      (A similar patch was submitted by Eric Sandeen, but I wanted to handle
      code movement in separate patches to make sure code changes didn't
      accidentally get dropped.)
      
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: update writeback_index based on last page scanned · 72f84e65
      Eric Sandeen authored
      
      
      As pointed out in a prior patch, updating the mapping's
      writeback_index based on pages written isn't quite right;
      what the writeback index is really supposed to reflect is
      the next page which should be scanned for writeback during
      periodic flush.
      
      As in write_cache_pages(), write_cache_pages_da() does
      this scanning for us as we assemble the mpd for later
      writeout.  If we keep track of the next page after the
      current scan, we can easily update writeback_index without
      worrying about pages written vs. pages skipped, etc.
      
      Without this, an fsync will reset writeback_index to
      0 (its starting index) + however many pages it wrote, which
      can mess up the progress of periodic flush.
      
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
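The difference between "start plus pages written" and "next page to scan" can be seen in a toy model. The sketch below is an assumption-laden simplification: a plain array of dirty flags stands in for the page cache, and `scan_and_write()` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct scan_result {
	size_t written;     /* pages actually written */
	size_t next_index;  /* first page the next scan should look at */
};

/* Scan forward from `start`, writing dirty pages until the budget
 * `nr_to_write` is used up or the end of the array is reached. */
static struct scan_result scan_and_write(bool *dirty, size_t npages,
					 size_t start, size_t nr_to_write)
{
	struct scan_result r = { 0, start };
	size_t i;

	for (i = start; i < npages && r.written < nr_to_write; i++) {
		if (dirty[i]) {
			dirty[i] = false;  /* "write" the page */
			r.written++;
		}
	}
	r.next_index = i;  /* tracks pages scanned, not pages written */
	return r;
}
```

With pages 5 and 6 dirty and a scan starting at 0, two pages are written and `next_index` is 7, whereas `start + written` would give 2 and point the next periodic flush back at pages that were already scanned.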
    • ext4: implement writeback livelock avoidance using page tagging · 5b41d924
      Eric Sandeen authored
      
      
      This is analogous to Jan Kara's commit f446daae,
      
      mm: implement writeback livelock avoidance using page tagging
      
      but since we forked write_cache_pages, we need to reimplement
      it there (and in ext4_da_writepages, since range_cyclic handling
      was moved there).
      
      If you start a large buffered IO to a file, and then call
      fsync after it, you'll find that fsync does not complete
      until the other IO stops.
      
      If you continue re-dirtying the file (say, putting dd
      with conv=notrunc in a loop), when fsync finally completes
      (after all IO is done), it reports via tracing that
      it has written many more pages than the file contains;
      in other words it has synced and re-synced pages in
      the file multiple times.
      
      This then leads to problems with our writeback_index
      update, since it advances it by pages written, and
      essentially sets writeback_index off the end of the
      file...
      
      With the following patch, we only sync as much as was
      dirty at the time of the sync.
      
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
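The tagging idea can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel implementation: `struct toy_page`, its `towrite` flag, and the `redirty` callback (which simulates a concurrent writer) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool dirty;    /* page needs writing */
	bool towrite;  /* page was dirty when the current sync started */
};

/* Phase 1: tag every page that is dirty right now. */
static void tag_pages_for_sync(struct toy_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pages[i].towrite = pages[i].dirty;
}

/* Phase 2: write only tagged pages and return how many were written.
 * If `redirty` is non-NULL it runs after each write to simulate another
 * task dirtying pages while the sync is in progress. */
static size_t write_tagged_pages(struct toy_page *pages, size_t n,
				 void (*redirty)(struct toy_page *, size_t))
{
	size_t written = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i].towrite)
			continue;
		pages[i].towrite = false;
		pages[i].dirty = false;
		written++;
		if (redirty)
			redirty(pages, n);
	}
	return written;
}

/* Simulated concurrent writer: dirties every page again. */
static void redirty_all(struct toy_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pages[i].dirty = true;
}
```

Because the second phase looks only at the tag, pages dirtied after the sync began are left for a later pass, so the amount of work is bounded by what was dirty when the sync started.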
    • ext4: tidy up a void argument in inode.c · bbd08344
      Eric Sandeen authored
      
      
      This doesn't fix anything at all, it just removes a vestige
      of prior use from __mpage_da_writepage()
      
      __mpage_da_writepage() had a void * argument left over from
      its previous life as a callback; make it reflect the actual type.
      
      Fixing this up makes it slightly more obvious to read, and 
      enables proper typechecking.
      
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Check return value of sb_getblk() and friends · 87783690
      Namhyung Kim authored
      
      
      Fail block allocation if sb_getblk() returns NULL. In that case,
      sb_find_get_block() is also likely to fail, so calling
      ext4_forget() should be skipped.
      
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: use bio layer instead of buffer layer in mpage_da_submit_io · bd2d0210
      Theodore Ts'o authored
      
      
      Call the block I/O layer directly instead of going through the buffer
      layer.  This should give us much better performance and scalability,
      as well as lowering our CPU utilization when doing buffered writeback.
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io() · 1de3e3df
      Theodore Ts'o authored
      
      
      This massively simplifies the ext4_da_writepages() code path by
      completely removing mpage_put_bnr_to_bhs(), which is almost 100 lines of
      code iterating over a set of pages using pagevec_lookup(), and folds
      that functionality into mpage_da_submit_io()'s existing
      pagevec_lookup() loop.
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: inline walk_page_buffers() into mpage_da_submit_io · 3ecdb3a1
      Theodore Ts'o authored
      
      
      Expand the call:
      
        if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
                              ext4_bh_delay_or_unwritten))
                goto redirty_page;
      
      into mpage_da_submit_io().
      
      This will allow us to merge in mpage_put_bnr_to_bhs() in the next
      patch.
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: inline ext4_writepage() into mpage_da_submit_io() · cb20d518
      Theodore Ts'o authored
      
      
      As a preparatory step to switching to bio_submit, inline
      ext4_writepage() into mpage_da_submit_io() and then simplify things a
      bit.  This makes it clearer what mpage_da_submit_io() needs to do.
      
      Also, move the ClearPageChecked(page) call into
      __ext4_journalled_writepage(), as a minor bit of cleanup refactoring.
      
      This also allows us to pull i_size_read() and
      ext4_should_journal_data() out of the loop, which should be a very
      minor CPU savings.
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: simplify ext4_writepage() · a42afc5f
      Theodore Ts'o authored
      
      
      The actual code in ext4_writepage() is unnecessarily convoluted.
      Simplify it so it is easier to understand, but otherwise logically
      equivalent.
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: call mpage_da_submit_io() from mpage_da_map_blocks() · 5a87b7a5
      Theodore Ts'o authored
      
      
      Eventually we need to completely reorganize the ext4 writepage
      callpath, but for now, we simplify things a little by calling
      mpage_da_submit_io() from mpage_da_map_blocks(), since all of the
      places where we call mpage_da_map_blocks() it is followed up by a call
      to mpage_da_submit_io().
      
      We're also a wee bit better with respect to error handling, but there
      are still a number of issues where it's not clear what the right thing
      to do is when ext4 functions deep in the writeback codepath fail.
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: queue conversion after adding to inode's completed IO list · c999af2b
      Eric Sandeen authored
      
      
      By queuing the io end on the unwritten workqueue before adding it
      to our inode's list of completed IOs, I think we run the risk
      of the work getting completed, and the IO freed, before we try
      to add it to the inode's i_completed_io_list.
      
      It should be safe to add it to the inode's list of completed
      IOs, and -then- queue it for completion, I think.
      
      Thanks to Dave Chinner for pointing out the race.
      
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Jiaying Zhang <jiayingz@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix potential infinite loop in ext4_da_writepages() · 0c9169cc
      Toshiyuki Okajima authored
      
      
      On linux-2.6.36-rc2, if we execute the following script, we can hang
      the system when the /bin/sync command is executed:
      
      ========================================================================
      #!/bin/sh
      
      echo -n "HANG UP TEST: "
      /bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
      /sbin/mkfs.ext4 -Fq /tmp/img
      /bin/mount -o loop -t ext4 /tmp/img /mnt
      /bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
      seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
      /bin/sync
      /bin/umount /mnt
      echo "DONE"
      exit 0
      ========================================================================
      
      We can see the following backtrace if we get the kdump when this
      hangup occurs:
      
      ======================================================================
      kthread()
      => bdi_writeback_thread()
         => wb_do_writeback()
            => wb_writeback()
               => writeback_inodes_wb()
                  => writeback_sb_inodes()
                     => writeback_single_inode()
                        => ext4_da_writepages()  ---+ 
                                      ^ infinite    |
                                      |   loop      |
                                      +-------------+
      ======================================================================
      
      The reason why this hangup happens is described as follows:
      1) We write the last extent block of a file whose size is the
         filesystem maximum size.
      2) The "BH_Delay" flag is set on the buffer_head of that block.
      3) - the member "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
         - the member "m_len" of struct mpage_da_data is 1
         mpage_put_bnr_to_bhs(), which is called via ext4_da_writepages(),
         cannot clear the "BH_Delay" flag of the buffer_head because
         m_lblk has type ext4_lblk_t, so m_lblk + m_len overflows.

         Therefore an infinite loop occurs: ext4_da_writepages() cannot
         write the page (which corresponds to the block) because the
         "BH_Delay" flag is never cleared.
      ----------------------------------------------------------------------
      static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
      				struct ext4_map_blocks *map)
      {
      ...
      	int blocks = map->m_len;
      ...
      		do {
      			// cur_logical = 4294967295
      			// map->m_lblk = 4294967295
      			// blocks = 1
      			// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
      			// (cur_logical >= map->m_lblk + blocks) => true
      			if (cur_logical >= map->m_lblk + blocks)
      				break;
      ----------------------------------------------------------------------
      
      NOTE: Mounting with the nodelalloc option will avoid this codepath,
      and thus avoid this hang.
      
      Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
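The wraparound in the comparison quoted above can be reproduced outside the kernel. In the sketch below, `lblk_t` is a stand-in for the 32-bit unsigned `ext4_lblk_t`, and both function names are hypothetical. The widened variant is just one way to avoid the wrap and is not necessarily the change this patch makes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for ext4_lblk_t, a 32-bit unsigned logical block number. */
typedef uint32_t lblk_t;

/* Mirrors the quoted check: the sum is truncated to 32 bits, so
 * UINT32_MAX + 1 wraps around to 0. */
static bool past_extent_wrapping(lblk_t cur_logical, lblk_t m_lblk,
				 uint32_t blocks)
{
	return cur_logical >= (lblk_t)(m_lblk + blocks);
}

/* Same check with the sum computed in 64 bits, so it cannot wrap. */
static bool past_extent_widened(lblk_t cur_logical, lblk_t m_lblk,
				uint32_t blocks)
{
	return (uint64_t)cur_logical >= (uint64_t)m_lblk + blocks;
}
```

For `cur_logical = m_lblk = 4294967295` and `blocks = 1`, the wrapping version reports that the extent has already been passed, which is why the loop breaks before the delayed flag is cleared; the widened version does not.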
    • ext4: don't bump up LONG_MAX nr_to_write by a factor of 8 · b443e733
      Eric Sandeen authored
      
      
      I'm uneasy with lots of stuff going on in ext4_da_writepages(),
      but bumping nr_to_write from LONG_MAX to -8 clearly isn't
      making anything better, so avoid the multiplier in that case.
      
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
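A guarded multiply along the lines described here might look like the following. This is a minimal sketch: `bump_nr_to_write()` is a hypothetical helper, it assumes a non-negative budget and a positive factor, and leaving the value unchanged whenever the product would not fit is a design choice of the sketch rather than something stated in the commit.

```c
#include <assert.h>
#include <limits.h>

/* Scale a writeback budget by `factor`, but leave it untouched when it is
 * LONG_MAX ("no limit") or when the product would overflow a long.
 * Assumes nr_to_write >= 0 and factor > 0. */
static long bump_nr_to_write(long nr_to_write, long factor)
{
	if (nr_to_write == LONG_MAX || nr_to_write > LONG_MAX / factor)
		return nr_to_write;
	return nr_to_write * factor;
}
```

Checking against `LONG_MAX / factor` before multiplying avoids signed overflow, which is undefined behaviour in C, instead of trying to detect a wrapped result afterwards.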
    • ext4: stop looping in ext4_num_dirty_pages when max_pages reached · 659c6009
      Eric Sandeen authored
      
      
      Today we simply break out of the inner loop when we have accumulated
      max_pages; this keeps scanning forward and doing pagevec_lookup_tag()
      in the while (!done) loop, which potentially does a lot of work
      with no net effect.
      
      When we have accumulated max_pages, just clean up and return.
      
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
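The control-flow change can be sketched with a toy batch scanner. Everything here is an assumption made for illustration: the array of dirty flags, the `BATCH` size standing in for a pagevec, and the `lookups` counter that makes the saved work visible. It does not reproduce the real logic of ext4_num_dirty_pages().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH 4  /* toy stand-in for the number of pages per lookup */

/* Count dirty pages from `start`, stopping as soon as max_pages (> 0) have
 * been found. *lookups records how many batches were examined. */
static size_t count_dirty_pages(const bool *dirty, size_t npages,
				size_t start, size_t max_pages,
				size_t *lookups)
{
	size_t num = 0;
	size_t index = start;

	*lookups = 0;
	while (index < npages) {
		size_t end = index + BATCH < npages ? index + BATCH : npages;

		(*lookups)++;
		for (; index < end; index++) {
			if (!dirty[index])
				continue;
			if (++num >= max_pages)
				return num;  /* return, not break: no more batches */
		}
	}
	return num;
}
```

With twelve dirty pages and `max_pages` of 3, the function returns after a single batch; a `break` in the inner loop would have left the outer loop to examine the remaining batches without changing the result.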
  7. 26 Oct, 2010 1 commit
  8. 09 Aug, 2010 4 commits
    • convert ext4 to ->evict_inode() · 0930fcc1
      Al Viro authored
      
      
      pretty much brute-force...
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • remove inode_setattr · 1025774c
      Christoph Hellwig authored
      
      
      Replace inode_setattr with opencoded variants of it in all callers.  This
      moves the remaining call to vmtruncate into the filesystem methods where it
      can be replaced with the proper truncate sequence.
      
      In a few cases it was obvious that we would never end up calling vmtruncate
      so it was left out in the opencoded variant:
      
       spufs: explicitly checks for ATTR_SIZE earlier
       btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
       ufs: contains an opencoded simple_setattr + truncate that sets the filesize just above
      
      In addition to that ncpfs called inode_setattr with handcrafted iattrs,
      which allowed trimming down the opencoded variant.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • introduce __block_write_begin · 6e1db88d
      Christoph Hellwig authored
      
      
      Split up the block_write_begin implementation - __block_write_begin is a new
      trivial wrapper for block_prepare_write that always takes an already
      allocated page and can be either called from block_write_begin or filesystem
      code that already has a page allocated.  Remove the handling of already
      allocated pages from block_write_begin after switching all callers that
      do it to __block_write_begin.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • sort out blockdev_direct_IO variants · eafdc7d1
      Christoph Hellwig authored
      
      
      Move the call to vmtruncate to get rid of excessive blocks to the callers
      in preparation for the new truncate calling sequence.  This was only done
      for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
      was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
      its _newtrunc variant while at it as just opencoding the two additional
      parameters is shorter than the name suffix.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 05 Aug, 2010 1 commit
    • ext4: Fix dirtying of journalled buffers in data=journal mode · 56d35a4c
      Jan Kara authored
      
      
      In data=journal mode, we still use block_write_begin() to prepare
      a page for writing. This function can occasionally mark a buffer dirty,
      which violates journalling assumptions: when a buffer is part of
      a transaction, it should not be dirty, and a buffer can already be part
      of a forget list of some transaction when block_write_begin()
      gets called. This violation of journalling assumptions then results
      in "JBD: Spotted dirty metadata buffer..." warnings.
      
      In fact, temporarily dirtying the buffer while the page is still locked
      does not really cause problems for the journalling because we won't write
      the buffer until the page gets unlocked. So we just have to make sure
      to clear dirty bits before unlocking the page.
      
      Signed-off-by: Jan Kara <jack@suse.cz>
  10. 04 Aug, 2010 1 commit
  11. 29 Jul, 2010 1 commit
  12. 27 Jul, 2010 8 commits
  13. 26 Jul, 2010 1 commit